51
Liu ZP, Wu LY, Wang Y, Zhang XS, Chen L. Bridging protein local structures and protein functions. Amino Acids 2008; 35:627-50. [PMID: 18421562 PMCID: PMC7088341 DOI: 10.1007/s00726-008-0088-8] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.1] [Received: 02/21/2008] [Accepted: 03/10/2008] [Indexed: 12/11/2022]
Abstract
One of the major goals of molecular and evolutionary biology is to understand the functions of proteins by extracting functional information from protein sequences, structures and interactions. In this review, we summarize the repertoire of methods currently being applied and report recent progress in the field of in silico annotation of protein function based on the accumulation of vast amounts of sequence and structure data. In particular, we emphasize the newly developed structure-based methods, which are able to identify local structural motifs and reveal their relationship with protein functions. These methods include computational tools to identify the structural motifs and reveal the strong relationship between these pre-computed local structures and protein functions. We also discuss remaining problems and possible directions for this exciting and challenging area.
Affiliation(s)
- Zhi-Ping Liu
- Academy of Mathematics and Systems Science, Chinese Academy of Sciences, 100080, Beijing, China
52
Rangwala H, Karypis G. fRMSDPred: Predicting local RMSD between structural fragments using sequence information. Proteins 2008; 72:1005-18. [DOI: 10.1002/prot.21998] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.9] [Indexed: 11/09/2022]
53
Sam V, Tai CH, Garnier J, Gibrat JF, Lee B, Munson PJ. Towards an automatic classification of protein structural domains based on structural similarity. BMC Bioinformatics 2008; 9:74. [PMID: 18237410 PMCID: PMC2267780 DOI: 10.1186/1471-2105-9-74] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.1] [Received: 07/18/2007] [Accepted: 01/31/2008] [Indexed: 11/10/2022]
Abstract
Background Formal classification of a large collection of protein structures aids the understanding of evolutionary relationships among them. Classifications involving manual steps, such as SCOP and CATH, face the challenge of an increasing volume of available structures. Automatic methods such as FSSP or Dali Domain Dictionary yield divergent classifications, for reasons not yet fully investigated. One possible reason is that the pairwise similarity scores used in automatic classification do not adequately reflect the judgments made in manual classification. Another possibility is the difference between manual and automatic classification procedures. We explore the degree to which these two factors might affect the final classification. Results We use DALI, SHEBA and VAST pairwise scores on the SCOP C class domains to investigate a variety of hierarchical clustering procedures. The constructed dendrogram is cut in a variety of ways to produce a partition, which is compared to the SCOP fold classification. Ward's method dendrograms led to partitions closest to the SCOP fold classification. Dendrogram- or tree-cutting strategies fell into four categories according to the similarity of resulting partitions to the SCOP fold partition. Two strategies that optimize similarity to SCOP gave an average true positive rate (TPR) of 72% at a 1% false positive rate. Cutting the largest cluster at each step gave an average TPR of 61%, one of the best results among strategies that make no use of prior knowledge of SCOP. Cutting the longest branch at each step was one of the worst strategies. We also developed a method to detect irreducible differences between the best possible automatic partitions and SCOP, regardless of the cutting strategy. These differences are substantial. Visual examination of hard-to-classify proteins confirms our previous finding that global structural similarity of domains is not the only criterion used in the SCOP classification.
Conclusion Different clustering procedures give rise to different levels of agreement between automatic and manual protein classifications. None of the tested procedures completely eliminates the divergence between automatic and manual protein classifications. Achieving full agreement between these two approaches would apparently require additional information.
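The true and false positive rates quoted in this abstract compare an automatic partition with the SCOP fold partition. One common way to compute such rates, assumed here and not confirmed by the abstract, is to count pairs of domains: a pair is "positive" when the reference places both domains in the same fold. The function name `pair_rates` and the toy labels below are hypothetical.

```python
from itertools import combinations

def pair_rates(predicted, reference):
    """Return (TPR, FPR) computed over all unordered pairs of items.

    predicted, reference: dicts mapping item -> cluster label.
    A pair counts as positive when the reference groups both items together.
    This pair-counting definition is an assumption, not the paper's stated one.
    """
    tp = fp = fn = tn = 0
    for a, b in combinations(sorted(reference), 2):
        same_ref = reference[a] == reference[b]
        same_pred = predicted[a] == predicted[b]
        if same_ref and same_pred:
            tp += 1
        elif same_ref:
            fn += 1
        elif same_pred:
            fp += 1
        else:
            tn += 1
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return tpr, fpr

# Toy data: five domains, two reference folds, one automatic partition.
reference = {"d1": "foldA", "d2": "foldA", "d3": "foldA", "d4": "foldB", "d5": "foldB"}
predicted = {"d1": 0, "d2": 0, "d3": 1, "d4": 1, "d5": 1}
tpr, fpr = pair_rates(predicted, reference)
print(f"TPR={tpr:.2f} FPR={fpr:.2f}")
```

In the toy data the automatic partition splits fold A, so half of the same-fold pairs are missed while a third of the different-fold pairs are wrongly merged.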
Affiliation(s)
- Vichetra Sam
- Mathematical and Statistical Computing Laboratory, DCB, CIT, NIH, DHHS, Bethesda, MD, USA.
54
Zemla AT, Zhou CLE. Structural Re-Alignment in an Immunogenic Surface Region of Ricin A Chain. Bioinform Biol Insights 2008; 2:5-13. [PMID: 19812763 PMCID: PMC2735970 DOI: 10.4137/bbi.s437] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/24/2022]
Abstract
We compared structure alignments generated by several protein structure comparison programs to determine whether existing methods would satisfactorily align residues at a highly conserved position within an immunogenic loop in ribosome inactivating proteins (RIPs). Using default settings, structure alignments generated by several programs (CE, DaliLite, FATCAT, LGA, MAMMOTH, MATRAS, SHEBA, SSM) failed to align the respective conserved residues, although LGA reported correct residue-residue (R-R) correspondences when the beta-carbon (Cb) position was used as the point of reference in the alignment calculations. Further tests using variable points of reference indicated that points distal from the beta carbon along a vector connecting the alpha and beta carbons yielded rigid structural alignments in which residues known to be highly conserved in RIPs were reported as corresponding residues in structural comparisons between ricin A chain, abrin-A, and other RIPs. Results suggest that approaches to structure alignment employing alternate point representations corresponding to side chain position may yield structure alignments that are more consistent with observed conservation of functional surface residues than do standard alignment programs, which apply uniform criteria for alignment (i.e. alpha carbon (Ca) as point of reference) along the entirety of the peptide chain. We present the results of tests that suggest the utility of allowing user-specified points of reference in generating alternate structural alignments, and we present a web server for automatically generating such alignments: http://as2ts.llnl.gov/AS2TS/LGA/lga_pdblist_plots.html.
Affiliation(s)
- Adam T. Zemla
- Computational Biology for Countermeasures Group, Lawrence Livermore National Laboratory, Livermore, CA, U.S.A. 94550
- Carol L. Ecale Zhou
- Computational Biology for Countermeasures Group, Lawrence Livermore National Laboratory, Livermore, CA, U.S.A. 94550
55
Menke M, Berger B, Cowen L. Matt: local flexibility aids protein multiple structure alignment. PLoS Comput Biol 2008; 4:e10. [PMID: 18193941 PMCID: PMC2186361 DOI: 10.1371/journal.pcbi.0040010] [Citation(s) in RCA: 172] [Impact Index Per Article: 10.8] [Received: 06/12/2007] [Accepted: 12/06/2007] [Indexed: 11/20/2022]
Abstract
Even when there is agreement on what measure a protein multiple structure alignment should be optimizing, finding the optimal alignment is computationally prohibitive. One approach used by many previous methods is aligned fragment pair chaining, where short structural fragments from all the proteins are aligned against each other optimally, and the final alignment chains these together in geometrically consistent ways. Ye and Godzik have recently suggested that adding geometric flexibility may help better model protein structures in a variety of contexts. We introduce the program Matt (Multiple Alignment with Translations and Twists), an aligned fragment pair chaining algorithm that, in intermediate steps, allows local flexibility between fragments: small translations and rotations are temporarily allowed to bring sets of aligned fragments closer, even if they are physically impossible under rigid body transformations. After a dynamic programming assembly guided by these "bent" alignments, geometric consistency is restored in the final step before the alignment is output. Matt is tested against other recent multiple protein structure alignment programs on the popular Homstrad and SABmark benchmark datasets. Matt's global performance is competitive with the other programs on Homstrad, but outperforms the other programs on SABmark, a benchmark of multiple structure alignments of proteins with more distant homology. On both datasets, Matt demonstrates an ability to better align the ends of alpha-helices and beta-strands, an important characteristic of any structure alignment program intended to help construct a structural template library for threading approaches to the inverse protein-folding problem. The related question of whether Matt alignments can be used to distinguish distantly homologous structure pairs from pairs of proteins that are not homologous is also considered. 
For this purpose, a p-value score based on the length of the common core and average root mean squared deviation (RMSD) of Matt alignments is shown to largely separate decoys from homologous protein structures in the SABmark benchmark dataset. We postulate that Matt's strong performance comes from its ability to model proteins in different conformational states and, perhaps even more important, its ability to model backbone distortions in more distantly related proteins.
Affiliation(s)
- Matthew Menke
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts, United States of America
- Bonnie Berger
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts, United States of America
- Department of Mathematics, Massachusetts Institute of Technology, Cambridge, Massachusetts, United States of America
- Lenore Cowen
- Department of Computer Science, Tufts University, Medford, Massachusetts, United States of America
56
ProCKSI: a decision support system for Protein (structure) Comparison, Knowledge, Similarity and Information. BMC Bioinformatics 2007; 8:416. [PMID: 17963510 PMCID: PMC2222653 DOI: 10.1186/1471-2105-8-416] [Citation(s) in RCA: 44] [Impact Index Per Article: 2.6] [Received: 06/08/2007] [Accepted: 10/26/2007] [Indexed: 11/19/2022]
Abstract
Background We introduce the decision support system for Protein (Structure) Comparison, Knowledge, Similarity and Information (ProCKSI). ProCKSI integrates various protein similarity measures through an easy to use interface that allows the comparison of multiple proteins simultaneously. It employs the Universal Similarity Metric (USM), the Maximum Contact Map Overlap (MaxCMO) of protein structures and other external methods such as the DaliLite and the TM-align methods, the Combinatorial Extension (CE) of the optimal path, and the FAST Align and Search Tool (FAST). Additionally, ProCKSI allows the user to upload a user-defined similarity matrix supplementing the methods mentioned, and computes a similarity consensus in order to provide a rich, integrated, multicriteria view of large datasets of protein structures. Results We present ProCKSI's architecture and workflow describing its intuitive user interface, and show its potential on three distinct test-cases. In the first case, ProCKSI is used to evaluate the results of a previous CASP competition, assessing the similarity of proposed models for given targets where the structures could have a large deviation from one another. To perform this type of comparison reliably, we introduce a new consensus method. The second study deals with the verification of a classification scheme for protein kinases, originally derived by sequence comparison by Hanks and Hunter, but here we use a consensus similarity measure based on structures. In the third experiment using the Rost and Sander dataset (RS126), we investigate how a combination of different sets of similarity measures influences the quality and performance of ProCKSI's new consensus measure. ProCKSI performs well with all three datasets, showing its potential for complex, simultaneous multi-method assessment of structural similarity in large protein datasets. 
Furthermore, combining different similarity measures is usually more robust than relying on one single, unique measure. Conclusion Based on a diverse set of similarity measures, ProCKSI computes a consensus similarity profile for the entire protein set. All results can be clustered, visualised, analysed and easily compared with each other through a simple and intuitive interface. ProCKSI is publicly available for academic and non-commercial use.
57
Kim C, Lee B. Accuracy of structure-based sequence alignment of automatic methods. BMC Bioinformatics 2007; 8:355. [PMID: 17883866 PMCID: PMC2039753 DOI: 10.1186/1471-2105-8-355] [Citation(s) in RCA: 36] [Impact Index Per Article: 2.1] [Received: 06/04/2007] [Accepted: 09/20/2007] [Indexed: 11/10/2022]
Abstract
Background Accurate sequence alignments are essential for homology searches and for building three-dimensional structural models of proteins. Since structure is better conserved than sequence, structure alignments have been used to guide sequence alignments and are commonly used as the gold standard for sequence alignment evaluation. Nonetheless, as far as we know, there is no report of a systematic evaluation of pairwise structure alignment programs in terms of the sequence alignment accuracy. Results In this study, we evaluate CE, DaliLite, FAST, LOCK2, MATRAS, SHEBA and VAST in terms of the accuracy of the sequence alignments they produce, using sequence alignments from NCBI's human-curated Conserved Domain Database (CDD) as the standard of truth. We find that 4 to 9% of the residues on average are either not aligned or aligned with more than 8 residues of shift error and that an additional 6 to 14% of residues on average are misaligned by 1–8 residues, depending on the program and the data set used. The fraction of correctly aligned residues generally decreases as the sequence similarity decreases or as the RMSD between the Cα positions of the two structures increases. It varies significantly across CDD superfamilies whether shift error is allowed or not. Also, alignments with different shift errors occur between proteins within the same CDD superfamily, leading to inconsistent alignments between superfamily members. In general, residue pairs that are more than 3.0 Å apart in the reference alignment are heavily (>= 25% on average) misaligned in the test alignments. In addition, each method shows a different pattern of relative weaknesses for different SCOP classes. CE gives relatively poor results for β-sheet-containing structures (all-β, α/β, and α+β classes), DaliLite for "others" class where all but the major four classes are combined, and LOCK2 and VAST for all-β and "others" classes. 
Conclusion When the sequence similarity is low, structure-based methods produce better sequence alignments than by using sequence similarities alone. However, current structure-based methods still mis-align 11–19% of the conserved core residues when compared to the human-curated CDD alignments. The alignment quality of each program depends on the protein structural type and similarity, with DaliLite showing the most agreement with CDD on average.
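The "shift error" used in this abstract can be illustrated with a small sketch. It assumes, hypothetically, that an alignment is stored as a dict mapping residue indices of the first protein to residue indices of the second, and that the shift error of a residue is the signed difference between its partner in the test alignment and its partner in the reference alignment. The names `shift_errors` and `summarize` are illustrative, and the 8-residue limit is taken from the abstract's wording.

```python
def shift_errors(test, reference):
    """Per-residue shift error of a test alignment against a reference.

    Both alignments map residue index in protein A -> residue index in
    protein B. Residues unaligned in the test alignment get None.
    """
    errors = {}
    for res_a, ref_b in reference.items():
        test_b = test.get(res_a)
        errors[res_a] = None if test_b is None else test_b - ref_b
    return errors

def summarize(errors, max_shift=8):
    """Count correct, shifted (1..max_shift) and missed residues."""
    counts = {"correct": 0, "shifted": 0, "missed": 0}
    for err in errors.values():
        if err is None or abs(err) > max_shift:
            counts["missed"] += 1
        elif err == 0:
            counts["correct"] += 1
        else:
            counts["shifted"] += 1
    return counts

# Toy example: residues 3 and 4 are shifted by two positions, 5 is unaligned.
reference = {1: 1, 2: 2, 3: 3, 4: 4, 5: 5}
test = {1: 1, 2: 2, 3: 5, 4: 6}
errors = shift_errors(test, reference)
counts = summarize(errors)
print(errors, counts)
```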
Affiliation(s)
- Changhoon Kim
- Laboratory of Molecular Biology, Center for Cancer Research, National Cancer Institute National Institutes of Health, Bethesda, Maryland, USA
- Byungkook Lee
- Laboratory of Molecular Biology, Center for Cancer Research, National Cancer Institute National Institutes of Health, Bethesda, Maryland, USA
58
Martínez L, Andreani R, Martínez JM. Convergent algorithms for protein structural alignment. BMC Bioinformatics 2007; 8:306. [PMID: 17714583 PMCID: PMC1995224 DOI: 10.1186/1471-2105-8-306] [Citation(s) in RCA: 56] [Impact Index Per Article: 3.3] [Received: 03/14/2007] [Accepted: 08/22/2007] [Indexed: 11/15/2022]
Abstract
Background Many algorithms exist for protein structural alignment, based on internal protein coordinates or on explicit superposition of the structures. These methods are usually successful for detecting structural similarities. However, current practical methods are seldom supported by convergence theories. In particular, although the goal of each algorithm is to maximize some scoring function, there is no practical method that theoretically guarantees score maximization. A practical algorithm with solid convergence properties would be useful for the refinement of protein folding maps, and for the development of new scores designed to be correlated with functional similarity. Results In this work, the maximization of scoring functions in protein alignment is interpreted as a Low Order Value Optimization (LOVO) problem. The new interpretation provides a framework for the development of algorithms based on well established methods of continuous optimization. The resulting algorithms are convergent and increase the scoring functions at every iteration. The solutions obtained are critical points of the scoring functions. Two algorithms are introduced: One is based on the maximization of the scoring function with Dynamic Programming followed by the continuous maximization of the same score, with respect to the protein position, using a smooth Newtonian method. The second algorithm replaces the Dynamic Programming step by a fast procedure for computing the correspondence between Cα atoms. The algorithms are shown to be very effective for the maximization of the STRUCTAL score. Conclusion The interpretation of protein alignment as a LOVO problem provides a new theoretical framework for the development of convergent protein alignment algorithms. These algorithms are shown to be very reliable for the maximization of the STRUCTAL score, and other distance-dependent scores may be optimized with same strategy. 
The improved score optimization provided by these algorithms provides a means for the refinement of protein fold maps and also for the development of scores designed to match biological function. The LOVO strategy may also be used for more general structural superposition problems such as flexible or non-sequential alignments. The package is available on-line at http://www.ime.unicamp.br/~martinez/lovoalign.
Affiliation(s)
- Leandro Martínez
- Institute of Chemistry, State University of Campinas, Campinas, SP, Brazil
- Roberto Andreani
- Department of Applied Mathematics, IMECC-UNICAMP, State University of Campinas, Campinas, SP, Brazil
- José Mario Martínez
- Department of Applied Mathematics, IMECC-UNICAMP, State University of Campinas, CP 6065, 13081-970, Campinas, SP, Brazil
59
Comparative analysis of protein structure alignments. BMC Structural Biology 2007; 7:50. [PMID: 17672887 PMCID: PMC1959231 DOI: 10.1186/1472-6807-7-50] [Citation(s) in RCA: 66] [Impact Index Per Article: 3.9] [Received: 01/30/2007] [Accepted: 07/26/2007] [Indexed: 11/25/2022]
Abstract
Background Several methods are currently available for the comparison of protein structures. These methods have been analysed regarding their performance in the identification of structurally/evolutionarily related proteins, but so far there has been less focus on the objective comparison between the alignments produced by different methods. Results We analysed and compared the structural alignments obtained by different methods using three sets of pairs of structurally related proteins. The first set corresponds to 355 pairs of remote homologous proteins according to the SCOP database (ASTRAL40 set). The second set was derived from the SISYPHUS database and includes 69 protein pairs (SISY set). The third set consists of 40 pairs that are challenging to align (RIPC set). The alignment of pairs of this set requires indels of considerable number and size and some of the proteins are related by circular permutations, show extensive conformational variability or include repetitions. Two standard methods (CE and DALI) were applied to align the proteins in the ASTRAL40 set. The extent of structural similarity identified by both methods is highly correlated and the alignments from the two methods agree on average in more than half of the aligned positions. CE, DALI, as well as four additional methods (FATCAT, MATRAS, Cα-match and SHEBA) were then compared using the SISY and RIPC sets. The accuracy of the alignments was assessed by comparison to reference alignments. The alignments generated by the different methods on average match more than half of the reference alignments in the SISY set. The alignments obtained in the more challenging RIPC set tend to differ considerably and match reference alignments less successfully than the SISY set alignments. Conclusion The alignments produced by different methods tend to agree to a considerable extent, but the agreement is lower for the more challenging pairs.
The results for the comparison to reference alignments are encouraging, but also indicate that there is still room for improvement.
60
Abstract
BACKGROUND Discerning the similarity between molecules is a challenging problem in drug discovery as well as in molecular biology. The importance of this problem is due to the fact that the biochemical characteristics of a molecule are closely related to its structure. Therefore molecular similarity is a key notion in investigations targeting exploration of molecular structural space, query-retrieval in molecular databases, and structure-activity modelling. Determining molecular similarity is related to the choice of molecular representation. Currently, representations with high descriptive power and physical relevance like 3D surface-based descriptors are available. Information from such representations is both surface-based and volumetric. However, most techniques for determining molecular similarity tend to focus on idealized 2D graph-based descriptors due to the complexity that accompanies reasoning with more elaborate representations. RESULTS This paper addresses the problem of determining similarity when molecules are described using complex surface-based representations. It proposes an intrinsic, spherical representation that systematically maps points on a molecular surface to points on a standard coordinate system (a sphere). Molecular surface properties such as shape, field strengths, and effects due to field super-positioning can then be captured as distributions on the surface of the sphere. Surface-based molecular similarity is subsequently determined by computing the similarity of the surface-property distributions using a novel formulation of histogram-intersection. The similarity formulation is not only sensitive to the 3D distribution of the surface properties, but is also highly efficient to compute. 
CONCLUSION The proposed method obviates the computationally expensive step of molecular pose-optimisation, can incorporate conformational variations, and facilitates highly efficient determination of similarity by directly comparing molecular surfaces and surface-based properties. Retrieval performance, applications in structure-activity modeling of complex biological properties, and comparisons with existing research and commercial methods demonstrate the validity and effectiveness of the approach.
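The abstract describes its similarity measure only as a "novel formulation of histogram-intersection" and does not give the formula. As background, the classic histogram intersection it presumably builds on can be sketched as follows; this is the textbook measure on normalized histograms, not the paper's variant, and the function name and toy bin counts are hypothetical.

```python
def histogram_intersection(hist_a, hist_b):
    """Classic histogram intersection of two histograms with matching bins.

    Each histogram is normalized to sum to 1, so the result lies in [0, 1],
    with 1 meaning identical distributions.
    """
    if len(hist_a) != len(hist_b):
        raise ValueError("histograms must have the same number of bins")
    total_a, total_b = sum(hist_a), sum(hist_b)
    if total_a == 0 or total_b == 0:
        raise ValueError("histograms must not be empty")
    return sum(min(a / total_a, b / total_b) for a, b in zip(hist_a, hist_b))

# Toy example: counts of a surface property over three bins on the sphere.
molecule_a = [2, 1, 1]
molecule_b = [1, 1, 2]
similarity = histogram_intersection(molecule_a, molecule_b)
print(similarity)
```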
Affiliation(s)
- Rahul Singh
- Department of Computer Science, San Francisco State University, San Francisco, CA 94132, USA.
61
Pandini A, Mauri G, Bordogna A, Bonati L. Detecting similarities among distant homologous proteins by comparison of domain flexibilities. Protein Eng Des Sel 2007; 20:285-99. [PMID: 17573407 DOI: 10.1093/protein/gzm021] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.2] [Indexed: 11/13/2022]
Abstract
The aim of this work is to assess the informativeness of protein dynamics in the detection of similarities among distant homologous proteins. To this end, an approach to perform large-scale comparisons of protein domain flexibilities is proposed. CONCOORD is confirmed as a reliable method for fast conformational sampling. The root mean square fluctuation of alpha carbon positions in the essential dynamics subspace is employed as a measure of local flexibility and a synthetic index of similarity is presented. The dynamics of a large collection of protein domains from ASTRAL/SCOP40 is analyzed and the possibility of identifying relationships, at both the family and the superfamily levels, on the basis of the dynamical features is discussed. The obtained picture is in agreement with the SCOP classification, and furthermore suggests the presence of a distinguishable familiar trend in the flexibility profiles. The results support the complementarity of the dynamical and the structural information, suggesting that information from dynamics analysis can arise from functional similarities, often partially hidden by a static comparison. On the basis of this first test, flexibility annotation can be expected to help in automatically detecting functional similarities otherwise unrecoverable.
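The root mean square fluctuation (RMSF) mentioned in this abstract has a standard definition: for each atom, the square root of the mean squared deviation of its position from its average position over an ensemble. The sketch below computes it directly in Cartesian space for frames that are assumed to be already superposed; it does not project onto the essential dynamics subspace as the paper does, and the function name and toy frames are hypothetical.

```python
import math

def rmsf(frames):
    """Per-atom RMSF over a list of conformations.

    frames: list of conformations, each a list of (x, y, z) tuples with the
    same atom order. Frames are assumed to be superposed beforehand.
    """
    n_frames = len(frames)
    n_atoms = len(frames[0])
    result = []
    for i in range(n_atoms):
        # Average position of atom i over the ensemble.
        mean = [sum(frame[i][k] for frame in frames) / n_frames for k in range(3)]
        # Mean squared deviation from that average position.
        msd = sum(
            sum((frame[i][k] - mean[k]) ** 2 for k in range(3)) for frame in frames
        ) / n_frames
        result.append(math.sqrt(msd))
    return result

# Toy ensemble: atom 0 is fixed, atom 1 moves along x between 1 and 3.
frames = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0)],
]
profile = rmsf(frames)
print(profile)
```

Comparing two such per-residue profiles is the step for which the abstract proposes its similarity index; that index is not specified there and is not reproduced here.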
Affiliation(s)
- Alessandro Pandini
- Dipartimento di Scienze dell'Ambiente e del Territorio, Università degli Studi di Milano-Bicocca, 20126 Milano, Italy
62
Lerman G, Shakhnovich BE. Defining functional distance using manifold embeddings of gene ontology annotations. Proc Natl Acad Sci U S A 2007; 104:11334-9. [PMID: 17595300 PMCID: PMC2040899 DOI: 10.1073/pnas.0702965104] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.4] [Indexed: 11/18/2022]
Abstract
Although rigorous measures of similarity for sequence and structure are now well established, the problem of defining functional relationships has been particularly daunting. Here, we present several manifold embedding techniques to compute distances between Gene Ontology (GO) functional annotations and consequently estimate functional distances between protein domains. To evaluate accuracy, we correlate the functional distance to the well established measures of sequence, structural, and phylogenetic similarities. Finally, we show that manual classification of structures into folds and superfamilies is mirrored by proximity in the newly defined function space. We show how functional distances place structure-function relationships in biological context resulting in insight into divergent and convergent evolution. The methods and results in this paper can be readily generalized and applied to a wide array of biologically relevant investigations, such as accuracy of annotation transference, the relationship between sequence, structure, and function, or coherence of expression modules.
Affiliation(s)
- Gilad Lerman
- Department of Mathematics, University of Minnesota, Minneapolis, MN 55455
- Boris E. Shakhnovich
- Program in Bioinformatics, Boston University, Boston, MA 02215
63
Dalton JAR, Jackson RM. An evaluation of automated homology modelling methods at low target template sequence similarity. Bioinformatics 2007; 23:1901-8. [PMID: 17510171 DOI: 10.1093/bioinformatics/btm262] [Citation(s) in RCA: 71] [Impact Index Per Article: 4.2] [Indexed: 11/12/2022]
Abstract
MOTIVATION There are two main areas of difficulty in homology modelling that are particularly important when sequence identity between target and template falls below 50%: sequence alignment and loop building. These problems become magnified with automatic modelling processes, as there is no human input to correct mistakes. As such we have benchmarked several stand-alone strategies that could be implemented in a workflow for automated high-throughput homology modelling. These include three new sequence-structure alignment programs: 3D-Coffee, Staccato and SAlign, plus five homology modelling programs and their respective loop building methods: Builder, Nest, Modeller, SegMod/ENCAD and Swiss-Model. The SABmark database provided 123 targets with at least five templates from the same SCOP family and sequence identities ≤50%. RESULTS When using Modeller as the common modelling program, 3D-Coffee outperforms Staccato and SAlign using both multiple templates and the best single template, and across the sequence identity range 20-50%. The mean model RMSD generated from 3D-Coffee using multiple templates is 15 and 28% (or using single templates, 3 and 13%) better than those generated by Staccato and SAlign, respectively. 3D-Coffee gives equivalent modelling accuracy from multiple and single templates, but Staccato and SAlign are more successful with single templates, their quality deteriorating as additional lower sequence identity templates are added. Evaluating the different homology modelling programs, on average Modeller performs marginally better in overall modelling than the others tested. However, on average Nest produces the best loops with an 8% improvement by mean RMSD compared to the loops generated by Builder.
Affiliation(s)
- James A R Dalton
- Institute of Molecular and Cellular Biology, Faculty of Biological Sciences, University of Leeds, Leeds, UK
64
Abstract
In this paper, we study the problem of computing the similarity of two protein structures by measuring their contact-map overlap. Contact-map overlap abstracts the problem of computing the similarity of two polygonal chains as a graph-theoretic problem. In $\mathbb{R}^3$, we present the first polynomial time algorithm with any guarantee on the approximation ratio for the 3-dimensional problem. More precisely, we give an algorithm for the contact-map overlap problem with an approximation ratio of $\sigma$, where $\sigma = \min\{\sigma(P_1), \sigma(P_2)\} \le O(n^{1/2})$ is a decomposition parameter depending on the input polygonal chains $P_1$ and $P_2$. In $\mathbb{R}^2$, we improve the running time of the previous best known approximation algorithm from $O(n^6)$ to $O(n^3 \log n)$ at the cost of decreasing the approximation ratio by half. We also give hardness results for the problem in three dimensions, suggesting that approximating it better than $O(n^{\epsilon})$, for some $\epsilon > 0$, is hard.
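To make the objective concrete, the sketch below builds a contact map from coordinates and evaluates the overlap for one given alignment. It does not search for the best alignment, which is the hard optimization problem the abstract addresses. The distance cutoff, the function names and the toy coordinates are assumptions for illustration.

```python
import math
from itertools import combinations

def contact_map(coords, cutoff):
    """Set of residue index pairs (i, j), i < j, closer than the cutoff."""
    return {
        (i, j)
        for i, j in combinations(range(len(coords)), 2)
        if math.dist(coords[i], coords[j]) <= cutoff
    }

def overlap(contacts_a, contacts_b, alignment):
    """Number of contacts of A that the alignment maps onto contacts of B.

    alignment: dict mapping residue index in A -> residue index in B.
    """
    shared = 0
    for i, j in contacts_a:
        if i in alignment and j in alignment:
            mapped = tuple(sorted((alignment[i], alignment[j])))
            if mapped in contacts_b:
                shared += 1
    return shared

# Toy chains with an arbitrary cutoff of 5 distance units.
chain_a = [(0, 0, 0), (4, 0, 0), (8, 0, 0), (8, 4, 0)]
chain_b = [(0, 0, 0), (4, 0, 0), (4, 4, 0)]
contacts_a = contact_map(chain_a, cutoff=5.0)
contacts_b = contact_map(chain_b, cutoff=5.0)

score_1 = overlap(contacts_a, contacts_b, {0: 0, 1: 1, 3: 2})
score_2 = overlap(contacts_a, contacts_b, {1: 0, 2: 1, 3: 2})
print(score_1, score_2)
```

The two alignments give different overlaps for the same pair of chains, which is why the problem is posed as a maximization over alignments.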
Affiliation(s)
- Pankaj K Agarwal
- Dept. of Computer Science, Duke University, Durham, North Carolina, USA.
|
65
|
Wang Y, Makedon F, Ford J, Huang H. A bipartite graph matching framework for finding correspondences between structural elements in two proteins. CONFERENCE PROCEEDINGS : ... ANNUAL INTERNATIONAL CONFERENCE OF THE IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. ANNUAL CONFERENCE 2007; 2004:2972-5. [PMID: 17270902 DOI: 10.1109/iembs.2004.1403843] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 05/13/2023]
Abstract
A protein molecule consists of one or more chains of amino acid sequences that fold into a complex three-dimensional structure. A protein's functions are often determined by its 3D structure, and so comparing the similarity of 3D structures between proteins is an important problem. To accomplish such a comparison, one must align two proteins properly with rotation and translation in 3D space. Finding the correspondences between structural elements in the two proteins is the key step in many protein structure alignment algorithms. We introduce a new graph theoretic framework based on bipartite graph matching for finding sufficiently good correspondences. It is capable of providing both sequence-dependent and sequence-independent correspondences. It is a general framework for pair-wise matching of atoms, amino acid residues or secondary structure elements.
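To make the idea of matching structural elements across two proteins concrete, here is a minimal sketch of maximum-cardinality bipartite matching using augmenting paths (Kuhn's algorithm). This is a generic textbook technique, not the framework from the paper, which may well use weighted matching. The `compatible` dictionary, listing which elements of protein B each element of protein A could plausibly correspond to, and all element names are hypothetical.

```python
def max_bipartite_matching(compatible):
    """Return a dict mapping elements of A to their matched elements of B.

    compatible maps each element of A to the elements of B it may be
    paired with (an assumed, unweighted notion of compatibility).
    """
    match_of_b = {}  # element of B -> element of A currently matched to it

    def try_assign(a, visited):
        # Look for a free partner for a, re-routing earlier matches if needed.
        for b in compatible.get(a, ()):
            if b in visited:
                continue
            visited.add(b)
            if b not in match_of_b or try_assign(match_of_b[b], visited):
                match_of_b[b] = a
                return True
        return False

    for a in compatible:
        try_assign(a, set())
    return {a: b for b, a in match_of_b.items()}


# Hypothetical secondary structure elements of two proteins.
compatible = {"a1": ["b1", "b2"], "a2": ["b1"], "a3": ["b2", "b3"]}
matching = max_bipartite_matching(compatible)
```

In this toy case `a1` is first paired with `b1` and then moved to `b2` so that `a2`, which has no other option, can take `b1`.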
Affiliation(s)
- Yuhang Wang
- Dept. of Comput. Sci., Dartmouth Coll., Hanover, NH, USA
|
66
|
Hill AD, Reilly PJ. Comparing programs for rigid-body multiple structural superposition of proteins. Proteins 2006; 64:219-26. [PMID: 16568449 DOI: 10.1002/prot.20975] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.2] [Indexed: 11/10/2022]
Abstract
Different programs and methods were employed to superimpose protein structures, using members of four very different protein families as test subjects, and the results of these efforts were compared. Algorithms based on human identification of key amino acid residues on which to base the superpositions were nearly always more successful than programs that used automated techniques to identify key residues. Among those programs automatically identifying key residues, MASS could not superimpose all members of some families, but was very efficient with other families. MODELLER, MultiProt, and STAMP had varying levels of success. A genetic algorithm program written for this project did not improve superpositions when results from neighbor-joining and pseudostar algorithms were used as its starting cases, but it always improved superpositions obtained by MODELLER and STAMP. A program entitled PyMSS is presented that includes three superposition algorithms featuring human interaction.
Affiliation(s)
- Anthony D Hill
- Department of Chemical and Biological Engineering, Iowa State University, Ames, Iowa 50011-2230, USA
|
67
|
Ohlson T, Aggarwal V, Elofsson A, MacCallum RM. Improved alignment quality by combining evolutionary information, predicted secondary structure and self-organizing maps. BMC Bioinformatics 2006; 7:357. [PMID: 16869963 PMCID: PMC1562450 DOI: 10.1186/1471-2105-7-357] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.4] [Received: 04/04/2006] [Accepted: 07/25/2006] [Indexed: 11/10/2022]
Abstract
BACKGROUND Protein sequence alignment is one of the basic tools in bioinformatics. Correct alignments are required for a range of tasks including the derivation of phylogenetic trees and protein structure prediction. Numerous studies have shown that the incorporation of predicted secondary structure information into alignment algorithms improves their performance. Secondary structure predictors have to be trained on a set of somewhat arbitrarily defined states (e.g. helix, strand, coil), and it has been shown that the choice of these states has some effect on alignment quality. However, it is not unlikely that prediction of other structural features also could provide an improvement. In this study we use an unsupervised clustering method, the self-organizing map, to assign sequence profile windows to "structural states" and assess their use in sequence alignment. RESULTS The addition of self-organizing map locations as inputs to a profile-profile scoring function improves the alignment quality of distantly related proteins slightly. The improvement is slightly smaller than that gained from the inclusion of predicted secondary structure. However, the information seems to be complementary as the two prediction schemes can be combined to improve the alignment quality by a further small but significant amount. CONCLUSION It has been observed in many studies that predicted secondary structure significantly improves the alignments. Here we have shown that the addition of self-organizing map locations can further improve the alignments as the self-organizing map locations seem to contain some information that is not captured by the predicted secondary structure.
Affiliation(s)
- Tomas Ohlson
- Stockholm Bioinformatics Center, Stockholm University, SE-106 91 Stockholm, Sweden
- Varun Aggarwal
- Stockholm Bioinformatics Center, Stockholm University, SE-106 91 Stockholm, Sweden
- Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
- Arne Elofsson
- Stockholm Bioinformatics Center, Stockholm University, SE-106 91 Stockholm, Sweden
- Center for Biomembrane Research, Stockholm University, SE-106 91 Stockholm, Sweden
- Robert M MacCallum
- Stockholm Bioinformatics Center, Stockholm University, SE-106 91 Stockholm, Sweden
- Division of Cell and Molecular Biology, Imperial College London, London, UK
|
68
|
Zotenko E, O'Leary DP, Przytycka TM. Secondary structure spatial conformation footprint: a novel method for fast protein structure comparison and classification. BMC STRUCTURAL BIOLOGY 2006; 6:12. [PMID: 16762072 PMCID: PMC1526735 DOI: 10.1186/1472-6807-6-12] [Citation(s) in RCA: 17] [Impact Index Per Article: 0.9] [Received: 10/26/2005] [Accepted: 06/08/2006] [Indexed: 11/10/2022]
Abstract
BACKGROUND Recently a new class of methods for fast protein structure comparison has emerged. We call the methods in this class projection methods as they rely on a mapping of protein structure into a high-dimensional vector space. Once the mapping is done, the structure comparison is reduced to distance computation between corresponding vectors. As structural similarity is approximated by distance between projections, the success of any projection method depends on how well its mapping function is able to capture the salient features of protein structure. There is no agreement on what constitutes a good projection technique and the three currently known projection methods utilize very different approaches to the mapping construction, both in terms of what structural elements are included and how this information is integrated to produce a vector representation. RESULTS In this paper we propose a novel projection method that uses secondary structure information to produce the mapping. First, a diverse set of spatial arrangements of triplets of secondary structure elements, a set of structural models, is automatically selected. Then, each protein structure is mapped into a high-dimensional vector of "counts" or footprint, where each count corresponds to the number of times a given structural model is observed in the structure, weighted by the precision with which the model is reproduced. We perform the first comprehensive evaluation of our method together with all other currently known projection methods. CONCLUSION The results of our evaluation suggest that the type of structural information used by a projection method affects the ability of the method to detect structural similarity. In particular, our method that uses the spatial conformations of triplets of secondary structure elements outperforms other methods in most of the tests.
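The projection idea can be sketched in a few lines: each structure becomes a vector of weighted counts over a fixed set of structural models, and comparison reduces to a distance between vectors. The sketch below assumes the observations (which model was seen, and with what precision weight) have already been extracted; the function names and the choice of Euclidean distance are assumptions for illustration, since the abstract does not state which distance is used.

```python
from math import sqrt


def footprint(observations, n_models):
    """Build a footprint vector from (model_index, weight) observations.

    Each observation adds its weight (a stand-in for the precision with
    which the model is reproduced) to the count for that model.
    """
    vector = [0.0] * n_models
    for model_index, weight in observations:
        vector[model_index] += weight
    return vector


def euclidean(u, v):
    """Euclidean distance between two footprint vectors of equal length."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Hypothetical observations for two structures over three structural models.
fp_a = footprint([(0, 1.0), (2, 0.5), (0, 0.5)], n_models=3)
fp_b = footprint([(0, 1.5), (1, 1.0)], n_models=3)
distance = euclidean(fp_a, fp_b)
```

Because the vectors are computed once per structure, comparing a query against a large database costs only one distance computation per entry.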
Affiliation(s)
- Elena Zotenko
- Department of Computer Science, University of Maryland, College Park, MD 20742, USA
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
- Dianne P O'Leary
- Department of Computer Science, University of Maryland, College Park, MD 20742, USA
- Institute for Advanced Computer Studies, University of Maryland, College Park, MD 20742, USA
- Teresa M Przytycka
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
|
69
|
Abstract
The Sixth Community Wide Experiment on the Critical Assessment of Techniques for Protein Structure Prediction (CASP6) held in December 2004 focused on the prediction of the structures of 90 protein domains from 64 targets. Thirty-eight of these were classified as "fold recognition," defined as being similar in fold to proteins of known structure at the time of submission of the predictions. Only the "first" predictions and those longer than 20 amino acids for each domain were assessed, resulting in 4527 predictions from 165 groups. The assessment was accomplished by the use of six structure alignment programs and three scoring measures based on these alignments. The use of a variety of measures resulted in scoring insensitive to the peculiarities of any one alignment method. The top-ranked methods in the prediction of structures that were clearly homologous to proteins in the Protein Data Bank primarily used servers and other programs based on achieving a consensus of many remote homology detection and fold recognition methods. The top-ranked methods in prediction of structures less clearly related or unrelated to proteins of known structures used fragment building methods in addition to the fold recognition meta methods.
Affiliation(s)
- Guoli Wang
- Institute for Cancer Research, Fox Chase Cancer Center, Philadelphia, Pennsylvania 19111, USA
|
70
|
Abstract
We present a novel algorithm named FAST for aligning protein three-dimensional structures. FAST uses a directionality-based scoring scheme to compare the intra-molecular residue-residue relationships in two structures. It employs an elimination heuristic to promote sparseness in the residue-pair graph and facilitate the detection of the global optimum. In order to test the overall accuracy of FAST, we determined its sensitivity and specificity with the SCOP classification (version 1.61) as the gold standard. FAST achieved higher sensitivities than several existing methods (DaliLite, CE, and K2) at all specificity levels. We also tested FAST against 1033 manually curated alignments in the HOMSTRAD database. The overall agreement was 96%. Close inspection of examples from broad structural classes indicated the high quality of FAST alignments. Moreover, FAST is an order of magnitude faster than other algorithms that attempt to establish residue-residue correspondence. Typical pairwise alignments take FAST less than a second with a Pentium III 1.2GHz CPU. FAST software and a web server are available at http://biowulf.bu.edu/FAST/.
Affiliation(s)
- Jianhua Zhu
- Bioinformatics Program, Boston University, Boston, Massachusetts 02215, USA
|
71
|
Sam V, Tai CH, Garnier J, Gibrat JF, Lee B, Munson PJ. ROC and confusion analysis of structure comparison methods identify the main causes of divergence from manual protein classification. BMC Bioinformatics 2006; 7:206. [PMID: 16613604 PMCID: PMC1513609 DOI: 10.1186/1471-2105-7-206] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.2] [Received: 11/08/2005] [Accepted: 04/13/2006] [Indexed: 11/30/2022]
Abstract
Background Current classifications of protein folds are based, ultimately, on visual inspection of similarities. Previous attempts to use computerized structure comparison methods show only partial agreement with curated databases, but have failed to provide detailed statistical and structural analysis of the causes of these divergences. Results We construct a map of similarities/dissimilarities among manually defined protein folds, using a score cutoff value determined by means of the Receiver Operating Characteristics curve. It identifies folds which appear to overlap or to be "confused" with each other by two distinct similarity measures. It also identifies folds which appear inhomogeneous in that they contain apparently dissimilar domains, as measured by both similarity measures. At a low (1%) false positive rate, 25 to 38% of domain pairs in the same SCOP folds do not appear similar. Our results suggest either that some of these folds are defined using criteria other than purely structural consideration or that the similarity measures used do not recognize some relevant aspects of structural similarity in certain cases. Specifically, variations of the "common core" of some folds are severe enough to defeat attempts to automatically detect structural similarity and/or to lead to false detection of similarity between domains in distinct folds. Structures in some folds vary greatly in size because they contain varying numbers of a repeating unit, while similarity scores are quite sensitive to size differences. Structures in different folds may contain similar substructures, which produce false positives. Finally, the common core within a structure may be too small relative to the entire structure, to be recognized as the basis of similarity to another.
Conclusion A detailed analysis of the entire available protein fold space by two automated similarity methods reveals the extent and the nature of the divergence between the automatically determined similarity/dissimilarity and the manual fold type classifications. Some of the observed divergences can probably be addressed with better structure comparison methods and better automatic, intelligent classification procedures. Others may be intrinsic to the problem, suggesting a continuous rather than discrete protein fold space.
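The cutoff selection mentioned in the abstract can be illustrated with a small sketch: pick the score threshold that admits at most a target fraction of different-fold pairs, then measure how many same-fold pairs fall below it. This is only one plausible reading of "a score cutoff determined by means of the ROC curve"; the function names, the convention that a pair is called similar when its score is strictly above the cutoff, and the toy scores are all assumptions.

```python
def cutoff_at_fpr(negative_scores, target_fpr):
    """Return a cutoff so that at most target_fpr of negatives score above it."""
    ranked = sorted(negative_scores, reverse=True)
    allowed = int(target_fpr * len(ranked))  # false positives we may tolerate
    if allowed >= len(ranked):
        return float("-inf")
    return ranked[allowed]


def missed_fraction(positive_scores, cutoff):
    """Fraction of same-fold pairs that do not score above the cutoff."""
    missed = sum(1 for score in positive_scores if score <= cutoff)
    return missed / len(positive_scores)


# Toy similarity scores (hypothetical): pairs from different folds and same folds.
different_fold = [1, 2, 3, 4, 5, 6, 7, 8]
same_fold = [9, 6, 5, 10]
cutoff = cutoff_at_fpr(different_fold, target_fpr=0.25)
missed = missed_fraction(same_fold, cutoff)
```

With real data the target would be a low rate such as the 1% quoted above, and the missed fraction corresponds to the share of same-fold pairs that "do not appear similar".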
Affiliation(s)
- Vichetra Sam
- Mathematical and Statistical Computing Laboratory, DCB, CIT, NIH, DHHS, Bethesda, MD, USA
- Chin-Hsien Tai
- Laboratory of Molecular Biology, CCR, NCI, NIH, DHHS, Bethesda, MD, USA
- Jean Garnier
- Mathematical and Statistical Computing Laboratory, DCB, CIT, NIH, DHHS, Bethesda, MD, USA
- Mathematique Informatique et Genome, INRA, Jouy-en-Josas, France
- Byungkook Lee
- Laboratory of Molecular Biology, CCR, NCI, NIH, DHHS, Bethesda, MD, USA
- Peter J Munson
- Mathematical and Statistical Computing Laboratory, DCB, CIT, NIH, DHHS, Bethesda, MD, USA
|
72
|
Jeong J, Berman P, Przytycka T. Fold classification based on secondary structure--how much is gained by including loop topology? BMC STRUCTURAL BIOLOGY 2006; 6:3. [PMID: 16524467 PMCID: PMC1434743 DOI: 10.1186/1472-6807-6-3] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.2] [Received: 09/02/2005] [Accepted: 03/08/2006] [Indexed: 11/18/2022]
Abstract
Background It has been proposed that secondary structure information can be used to classify (to some extent) protein folds. Since this method utilizes very limited information about the protein structure, it is not surprising that it has a higher error rate than the approaches that use full 3D fold description. On the other hand, comparing 3D protein structures is computationally intensive. This raises the question of to what extent the error rate can be decreased with each new source of information, especially if the new information can still be used with simple alignment algorithms. We consider the question of whether the information about closed loops can improve the accuracy of this approach. While the answer appears to be obvious, we had to overcome two challenges. First, how to code and to compare topological information in such a way that local alignment of strings will properly identify similar structures. Second, how to properly measure the effect of new information in a large data sample. We investigate alternative ways of computing and presenting this information. Results We used the set of beta proteins with at most 30% pairwise identity to test the approach; local alignment scores were used to build a tree of clusters which was evaluated using a new log-odds cluster scoring function. In particular, we derive a closed formula for the probability of obtaining a given score by chance. Parameters of the local alignment function were optimized using a genetic algorithm. Of 81 folds that had more than one representative in our data set, log-odds scores registered significantly better clustering in 27 cases and significantly worse in 6 cases, and small differences in the remaining cases. Various notions of significant change or average change were considered and tried, and the results all pointed in the same direction.
Conclusion We found that, on average, properly presented information about the loop topology noticeably improves the accuracy of the method, but the benefits vary between fold families, as measured by the log-odds cluster score.
Affiliation(s)
- Jieun Jeong
- Department of Computer Science and Engineering, The Pennsylvania State University, University Park, USA
- Piotr Berman
- Department of Computer Science and Engineering, The Pennsylvania State University, University Park, USA
- Teresa Przytycka
- National Center for Biotechnology Information, National Library of Medicine, National Institute of Health, Bethesda, USA
|
73
|
Vesterstrøm J, Taylor WR. Flexible Secondary Structure Based Protein Structure Comparison Applied to the Detection of Circular Permutation. J Comput Biol 2006; 13:43-63. [PMID: 16472021 DOI: 10.1089/cmb.2006.13.43] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.3] [Indexed: 11/13/2022]
Abstract
We present a novel method for structural comparison of protein structures. The approach consists of two main phases: 1) an initial search phase where, starting from aligned pairs of secondary structure elements, the space of 3D transformations is searched for similarities and 2) a subsequent refinement phase where interim solutions are subjected to parallel, local, iterative dynamic programming in the areas of possible improvement. The proposed method combines dynamic programming for finding alignments but does not restrict solutions to be sequential. In addition, to deal with the problem of nonuniqueness of optimal similarities, we introduce a consensus scoring method in selecting the preferred similarity and provide a list of top-ranked solutions. The method, called FASE (flexible alignment of secondary structure elements), was tested on well-known data and various standard problems from the literature. The results show that FASE is able to find remote and weak similarities consistently using a reasonable run time. The method was tested (using the SCOP database) on its ability to discriminate interfold pairs from intrafold pairs at the level of the best existing methods. The method was then applied to the problem of finding circular permutations in proteins.
Affiliation(s)
- Jakob Vesterstrøm
- BiRC-Bioinformatics Research Center, University of Aarhus, DK-8000 Aarhus C, Denmark
|
74
|
Björklund AK, Ekman D, Light S, Frey-Skött J, Elofsson A. Domain Rearrangements in Protein Evolution. J Mol Biol 2005; 353:911-23. [PMID: 16198373 DOI: 10.1016/j.jmb.2005.08.067] [Citation(s) in RCA: 138] [Impact Index Per Article: 7.3] [Received: 05/19/2005] [Revised: 08/19/2005] [Accepted: 08/26/2005] [Indexed: 10/25/2022]
Abstract
Most eukaryotic proteins are multi-domain proteins that are created from fusions of genes, deletions and internal repetitions. An investigation of such evolutionary events requires a method to find the domain architecture from which each protein originates. Therefore, we defined a novel measure, domain distance, which is calculated as the number of domains that differ between two domain architectures. Using this measure, the evolutionary events that distinguish a protein from its closest ancestor have been studied, and it was found that indels are more common than internal repetition and that the exchange of a domain is rare. Indels and repetitions are common at both the N- and C-termini, while they are rare between domains. The evolution of the majority of multi-domain proteins can be explained by the stepwise insertion of single domains, with the exception of repeats, which are sometimes duplicated as several domains in tandem. We show that domain distances agree with sequence similarity and semantic similarity based on gene ontology annotations. In addition, we demonstrate the use of the domain distance measure to build evolutionary trees. Finally, the evolution of multi-domain proteins is exemplified by a closer study of the evolution of two protein families, non-receptor tyrosine kinases and RhoGEFs.
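One simple way to make "the number of domains that differ between two domain architectures" concrete is an edit distance over ordered lists of domain names, where inserting, deleting or exchanging one domain each costs 1. The sketch below takes that reading as an assumption; the paper's exact definition (for example, how tandem repeats are handled) may differ, and the domain names are used only as illustrative labels.

```python
def domain_distance(arch_a, arch_b):
    """Edit distance between two domain architectures (lists of domain names).

    Assumed cost model: insertion, deletion and exchange of a single
    domain each count as one differing domain.
    """
    previous = list(range(len(arch_b) + 1))
    for i, dom_a in enumerate(arch_a, start=1):
        current = [i]
        for j, dom_b in enumerate(arch_b, start=1):
            exchange = 0 if dom_a == dom_b else 1
            current.append(min(
                previous[j] + 1,             # delete dom_a
                current[j - 1] + 1,          # insert dom_b
                previous[j - 1] + exchange,  # keep or exchange
            ))
        previous = current
    return previous[-1]


# Illustrative architectures, written N-terminus to C-terminus.
d_indel = domain_distance(["SH3", "SH2", "Kinase"], ["SH2", "Kinase"])
d_exchange = domain_distance(["SH3", "SH2", "Kinase"], ["PH", "SH2", "Kinase"])
```

Under this cost model a single-domain indel at the N-terminus and a single-domain exchange both give a distance of 1.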
Affiliation(s)
- Asa K Björklund
- Stockholm Bioinformatics Center, Stockholm University, SE-10691 Stockholm, Sweden
|
75
|
Ohlson T, Elofsson A. ProfNet, a method to derive profile-profile alignment scoring functions that improves the alignments of distantly related proteins. BMC Bioinformatics 2005; 6:253. [PMID: 16225676 PMCID: PMC1274300 DOI: 10.1186/1471-2105-6-253] [Citation(s) in RCA: 15] [Impact Index Per Article: 0.8] [Received: 05/09/2005] [Accepted: 10/14/2005] [Indexed: 11/10/2022]
Abstract
Background Profile-profile methods have been used for some years now to detect and align homologous proteins. The best such methods use information from the background distribution of amino acids and substitution tables either when constructing the profiles or in the scoring. This makes the methods dependent on the quality and choice of substitution table as well as the construction of the profiles. Here, we introduce a novel method called ProfNet that is used to derive a profile-profile scoring function. The method optimizes the discrimination between scores of related and unrelated residues and it is fast and straightforward to use. This new method derives a scoring function that is mainly dependent on the actual alignment of residues from a training set, and it does not use any additional information about the background distribution. Results It is shown that ProfNet improves the discrimination of related and unrelated residues. Further it can be used to improve the alignment of distantly related proteins. Conclusion The best performance is obtained using superfamily related proteins in the training of ProfNet, and a classifier that is related to the distance between the structurally aligned residues. The main difference between the new scoring function and a traditional profile-profile scoring function is that conserved residues on average score higher with the new function.
Affiliation(s)
- Tomas Ohlson
- Stockholm Bioinformatics Center, Stockholm University, SE-106 91 Stockholm, Sweden
- Arne Elofsson
- Stockholm Bioinformatics Center, Stockholm University, SE-106 91 Stockholm, Sweden
|
76
|
Yona G, Kedem K. The URMS-RMS hybrid algorithm for fast and sensitive local protein structure alignment. J Comput Biol 2005; 12:12-32. [PMID: 15725731 DOI: 10.1089/cmb.2005.12.12] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.7] [Indexed: 11/12/2022]
Abstract
We present an efficient and sensitive hybrid algorithm for local structure alignment of a pair of 3D protein structures. The hybrid algorithm employs both the URMS (unit-vector root mean squared) metric and the RMS metric. Our algorithm efficiently searches the transformation space using a fast screening protocol; initial transformations (rotations) are identified using the URMS algorithm. These rotations are then clustered and an RMS-based dynamic programming algorithm is invoked to find the maximal local similarities for representative rotations of the clusters. Statistical significance of the alignments is estimated using a model that accounts for both the score of the match and the RMS. We tested our algorithm over the SCOP classification of protein domains. Our algorithm performs very well; its main advantages are that (1) it combines the advantages of the RMS and the URMS metrics, (2) it extensively searches the transformation space, (3) it detects complex similarities and structural repeats, and (4) its results are symmetric. The software is available for download at biozon.org/ftp/software/urms/.
Affiliation(s)
- Golan Yona
- Department of Computer Science, Cornell University, Ithaca, NY 14853, USA.
|
77
|
Sierk ML, Kleywegt GJ. Déjà vu all over again: finding and analyzing protein structure similarities. Structure 2005; 12:2103-11. [PMID: 15576025 DOI: 10.1016/j.str.2004.09.016] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.7] [Received: 05/20/2004] [Revised: 09/07/2004] [Accepted: 09/23/2004] [Indexed: 10/26/2022]
Abstract
Structure comparison is a crucial aspect of structural biology today. The field of structure comparison is developing rapidly, with the development of new algorithms, similarity scores, and statistical scores. The predicted large increase of experimental structures and structural models made possible by high-throughput efforts means that structural comparison and searching of structural databases using automated methods will become increasingly common. This Ways & Means article is meant to guide the structural biologist in the basics of structural alignment, and to provide an overview of the available software tools. The main purpose is to encourage users to gain some understanding of the strengths and limitations of structural alignment, and to take these factors into account when interpreting the results of different programs.
Affiliation(s)
- Michael L Sierk
- Department of Biochemistry and Molecular Genetics, University of Virginia, P.O. Box 800733, Charlottesville, VA 22908, USA.
|
78
|
Comin M, Guerra C, Zanotti G. PROuST: a comparison method of three-dimensional structures of proteins using indexing techniques. J Comput Biol 2005; 11:1061-72. [PMID: 15662198 DOI: 10.1089/cmb.2004.11.1061] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.2] [Indexed: 11/13/2022]
Abstract
We present a new method for protein structure comparison that combines indexing and dynamic programming (DP). The method is based on simple geometric features of triplets of secondary structures of proteins. These features provide indexes to a hash table that allows fast retrieval of similarity information for a query protein. After the query protein is matched with all proteins in the hash table producing a list of putative similarities, the dynamic programming algorithm is used to align the query protein with each protein of this list. Since the pairwise comparison with DP is applied only to a small subset of proteins and, furthermore, DP re-uses information that is already computed and stored in the hash table, the approach is very fast even when searching the entire PDB. We have done extensive experimentation showing that our approach achieves results of quality comparable to that of other existing approaches but is generally faster.
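The indexing stage can be sketched as follows: geometric features of each triplet of secondary structures are quantized into a hash key, the table maps keys to the proteins that contain such a triplet, and a query is ranked by how many of its triplets hit each protein. This is a generic geometric-hashing sketch rather than the PROuST implementation; the feature tuples, the bin width, and all function names are assumptions, and the dynamic programming alignment that follows the lookup is not shown.

```python
from collections import Counter, defaultdict


def key_for(features, bin_width=15.0):
    """Quantize a tuple of geometric features (e.g. angles in degrees) into a hash key."""
    return tuple(int(value // bin_width) for value in features)


def build_index(database, bin_width=15.0):
    """Map each quantized triplet key to the set of proteins containing it."""
    index = defaultdict(set)
    for protein_id, triplets in database.items():
        for features in triplets:
            index[key_for(features, bin_width)].add(protein_id)
    return index


def candidates(index, query_triplets, bin_width=15.0):
    """Rank database proteins by the number of query triplets they share."""
    votes = Counter()
    for features in query_triplets:
        for protein_id in index.get(key_for(features, bin_width), ()):
            votes[protein_id] += 1
    return votes.most_common()


# Hypothetical database: each protein is a list of triplet feature tuples.
database = {
    "p1": [(10.0, 40.0, 80.0), (100.0, 20.0, 50.0)],
    "p2": [(12.0, 44.0, 89.0)],
}
index = build_index(database)
ranked = candidates(index, [(11.0, 41.0, 81.0), (101.0, 21.0, 51.0)])
```

Only the top-ranked candidates would then be passed to the more expensive pairwise alignment step, which is what keeps a whole-database search fast.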
Affiliation(s)
- Matteo Comin
- Department of Information Engineering, University of Padova, 35131 Padova, Italy
|
79
|
Blades MJ, Ison JC, Ranasinghe R, Findlay JBC. Automatic generation and evaluation of sparse protein signatures for families of protein structural domains. Protein Sci 2005; 14:13-23. [PMID: 15608116 PMCID: PMC2253312 DOI: 10.1110/ps.04929005] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 10/26/2022]
Abstract
We identified key residues from the structural alignment of families of protein domains from SCOP which we represented in the form of sparse protein signatures. A signature-generating algorithm (SigGen) was developed and used to automatically identify key residues based on several structural and sequence-based criteria. The capacity of the signatures to detect related sequences from the SWISSPROT database was assessed by receiver operator characteristic (ROC) analysis and jack-knife testing. Test signatures for families from each of the main SCOP classes are described in relation to the quality of the structural alignments, the SigGen parameters used, and their diagnostic performance. We show that automatically generated signatures are potently diagnostic for their family (ROC50 scores typically >0.8), consistently outperform random signatures, and can identify sequence relationships in the "twilight zone" of protein sequence similarity (<40%). Signatures based on 15%-30% of alignment positions occurred most frequently among the best-performing signatures. When alignment quality is poor, sparser signatures perform better, whereas signatures generated from higher-quality alignments of fewer structures require more positions to be diagnostic. Our validation of signatures from the Globin family shows that when sequences from the structural alignment are removed and new signatures generated, the omitted sequences are still detected. The positions highlighted by the signature often correspond (alignment specificity >0.7) to the key positions in the original (non-jack-knifed) alignment. We discuss potential applications of sparse signatures in sequence annotation and homology modeling.
Affiliation(s)
- Matthew J Blades
- AstraZeneca R&D Charnwood, Bakewell Road, Loughborough, Leicestershire LE11 5RH, England.
|
80
|
Backofen R, Will S. Local sequence-structure motifs in RNA. J Bioinform Comput Biol 2005; 2:681-98. [PMID: 15617161 DOI: 10.1142/s0219720004000818] [Citation(s) in RCA: 28] [Impact Index Per Article: 1.5] [Received: 10/01/2003] [Revised: 04/19/2004] [Accepted: 04/19/2004] [Indexed: 11/18/2022]
Abstract
Ribonucleic acid (RNA) enjoys increasing interest in molecular biology; despite this interest, fundamental algorithms are lacking, e.g. for identifying local motifs. Like proteins, RNA molecules have a distinctive structure. Therefore, in addition to sequence information, structure plays an important part in assessing the similarity of RNAs. Furthermore, common sequence-structure features in two or several RNA molecules are often only spatially local, where possibly large parts of the molecules are dissimilar. Consequently, we address the problem of comparing RNA molecules by computing an optimal local alignment with respect to sequence and structure information. While local alignment is superior to global alignment for identifying local similarities, no general local sequence-structure alignment algorithms are currently known. We suggest a new general definition of locality for sequence-structure alignments that is biologically motivated and efficiently tractable. To show the former, we discuss locality of RNA and prove that the defined locality means connectivity by atomic and non-atomic bonds. To show the latter, we present an efficient algorithm for the newly defined pairwise local sequence-structure alignment (lssa) problem for RNA. For molecules of lengths n and m, the algorithm has worst-case time complexity of O(n^2 × m^2 × max(n, m)) and a space complexity of only O(n × m). An implementation of our algorithm is available at http://www.bio.inf.uni-jena.de. Its runtime is competitive with global sequence-structure alignment.
Affiliation(s)
- Rolf Backofen
- Chair for Bioinformatics at the Institute of Computer Science, Friedrich-Schiller-Universitaet Jena, Ernst-Abbe-Platz 2, D-07743 Jena, Germany.
|
81
|
Jia Y, Dewey TG. A Random Polymer Model of the Statistical Significance of Structure Alignment. J Comput Biol 2005; 12:298-313. [PMID: 15857244 DOI: 10.1089/cmb.2005.12.298] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open
Abstract
A theory for assessing the statistical significance of structure alignment is developed using a random or Gaussian chain model. In this model, we consider the statistical distribution of the root mean square distance (rmsd) of the alignment between two random chains of equal length and common center of mass (referred to as Case 1). We demonstrate that the rmsd^2 is distributed as a sum of independent Gamma variables. Analytic results on the mean and variance of the rmsd^2 are presented. Since rmsd is strongly dependent on the length, we define the dimensionless quantity, reduced rmsd, as the rmsd divided by the radius of gyration. We find that the reduced rmsd can be accurately approximated by an extreme value distribution (EVD) that is independent of chain length and of bond length. The parameters of the EVD can be calculated from the mean and the variance of the rmsd^2. We also consider the case of two chains with a common center of mass that are then rotated to minimize the rmsd (Case 2). In this case, the distribution of reduced rmsd can again be accurately approximated by an EVD, which is independent of the chain length and expected bond length. This distribution is used to calculate the p-value for a given reduced rmsd. Performing an analogous comparison for proteins, we find that <rmsd> ~ M^ν, with ν = 0.28 and 0.32 for Case 1 and Case 2, respectively, where M is the chain length. This result for Case 2 exactly matches with previous scaling results and suggests that rmsd/M^ν is an appropriate metric for protein structure alignment and will be independent of chain length. We also find that the new score roughly follows the EVD.
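The two normalised quantities named in this abstract can be sketched in a few lines. This is a minimal illustration, not the authors' code: the chains are only translated to a common center of mass (Case 1, no rotation), equal masses are assumed, and because the abstract does not say which chain's radius of gyration is used, the sketch takes the root mean square of the two values. All function names are hypothetical.

```python
import math


def center(points):
    """Translate a chain so its center of mass (equal masses) is the origin."""
    n = len(points)
    c = [sum(p[k] for p in points) / n for k in range(3)]
    return [tuple(p[k] - c[k] for k in range(3)) for p in points]


def rmsd(a, b):
    """Root mean square distance between equal-length chains, no rotation."""
    if len(a) != len(b):
        raise ValueError("chains must have equal length")
    return math.sqrt(sum(math.dist(p, q) ** 2 for p, q in zip(a, b)) / len(a))


def radius_of_gyration(points):
    centered = center(points)
    return math.sqrt(sum(x * x + y * y + z * z for x, y, z in centered) / len(points))


def reduced_rmsd(a, b):
    """rmsd divided by a radius of gyration.

    Assumption: the RMS of the two chains' radii of gyration is used.
    """
    a0, b0 = center(a), center(b)
    rg = math.sqrt((radius_of_gyration(a0) ** 2 + radius_of_gyration(b0) ** 2) / 2)
    return rmsd(a0, b0) / rg


def length_normalised_rmsd(rmsd_value, chain_length, nu=0.32):
    """rmsd / M^nu, with nu = 0.32 as reported for Case 2."""
    return rmsd_value / chain_length ** nu


chain_a = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
chain_b = [(5.0, 5.0, 5.0), (5.0, 7.0, 5.0)]
print(reduced_rmsd(chain_a, chain_b))
```

Case 2 in the abstract would additionally require an optimal rotation before computing the rmsd, which this sketch deliberately omits.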
Affiliation(s)
- Yuting Jia
- Keck Graduate Institute of Applied Life Sciences, Claremont, CA 91711, USA
|
82
|
Ekman D, Björklund AK, Frey-Skött J, Elofsson A. Multi-domain Proteins in the Three Kingdoms of Life: Orphan Domains and Other Unassigned Regions. J Mol Biol 2005; 348:231-43. [PMID: 15808866 DOI: 10.1016/j.jmb.2005.02.007] [Citation(s) in RCA: 165] [Impact Index Per Article: 8.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/27/2004] [Revised: 01/31/2005] [Accepted: 02/02/2005] [Indexed: 11/17/2022]
Abstract
Comparative studies of the proteomes from different organisms have provided valuable information about protein domain distribution in the kingdoms of life. Earlier studies have been limited by the fact that only about 50% of the proteomes could be matched to a domain. Here, we have extended these studies by including less well-defined domain definitions, Pfam-B and clustered domains, MAS, in addition to Pfam-A and SCOP domains. It was found that a significant fraction of these domain families are homologous to Pfam-A or SCOP domains. Further, we show that all regions that do not match a Pfam-A or SCOP domain contain a significantly higher fraction of disordered structure. These unstructured regions may be contained within orphan domains or function as linkers between structured domains. Using several different definitions we have re-estimated the number of multi-domain proteins in different organisms and found that several methods all predict that eukaryotes have approximately 65% multi-domain proteins, while the prokaryotes consist of approximately 40% multi-domain proteins. However, these numbers are strongly dependent on the exact choice of cut-off for domains in unassigned regions. In conclusion, all eukaryotes have similar fractions of multi-domain proteins and disorder, whereas a high fraction of repeating domain is distinguished only in multicellular eukaryotes. This implies a role for repeats in cell-cell contacts while the other two features are important for intracellular functions.
Affiliation(s)
- Diana Ekman
- Stockholm Bioinformatics Center, Stockholm University, SE-106 91 Stockholm, Sweden
|
83
|
Jia Y, Dewey TG, Shindyalov IN, Bourne PE. A new scoring function and associated statistical significance for structure alignment by CE. J Comput Biol 2005; 11:787-99. [PMID: 15700402 DOI: 10.1089/cmb.2004.11.787] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open
Abstract
A new scoring function for assessing the statistical significance of protein structure alignment has been developed. The new scores were tested empirically using the combinatorial extension (CE) algorithm. The significance of a given score was given a p-value by curve-fitting the distribution of the scores generated by a random comparison of proteins taken from the PDB_SELECT database and the structural classification of proteins (SCOP) database. Although the scoring function was developed based on the CE algorithm, it is portable to any other protein structure alignment algorithm. The new scoring function is examined by sensitivity, specificity, and ROC curves.
Affiliation(s)
- Yuting Jia
- Keck Graduate Institute of Applied Life Sciences, 535 Watson Drive, Claremont, CA 91711, USA
|
84
|
Abstract
We have recently developed a flexible protein structure alignment program (FATCAT) that identifies structural similarity, at the same time accounting for flexibility of protein structures. One of the most important applications of a structure alignment method is to aid in functional annotations by identifying similar structures in large structural databases. However, none of the flexible structure alignment methods were applied in this task because of a lack of significance estimation of flexible alignments. In this paper, we developed an estimate of the statistical significance of FATCAT alignment score, allowing us to use it as a database-searching tool. The results reported here show that (1) the distribution of the similarity score of FATCAT alignment between two unrelated protein structures follows the extreme value distribution (EVD), adding one more example to the current collection of EVDs of sequence and structure similarities; (2) introducing flexibility into structure comparison only slightly influences the sensitivity and specificity of identifying similar structures; and (3) the overall performance of FATCAT as a database searching tool is comparable to that of the widely used rigid-body structure comparison programs DALI and CE. Two examples illustrating the advantages of using flexible structure alignments in database searching are also presented. The conformational flexibilities that were detected in the first example may be involved with substrate specificity, and the conformational flexibilities detected in the second example may reflect the evolution of structures by block building.
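To make the role of the extreme value distribution concrete, the sketch below fits a Gumbel (type I EVD) to a set of background scores by the method of moments and converts a new score into an upper-tail p-value. This is a generic illustration under stated assumptions: the abstract does not describe how FATCAT's distribution is fitted, the function names are hypothetical, and the background scores are made-up numbers rather than real alignment scores.

```python
import math
import statistics

EULER_GAMMA = 0.5772156649015329


def fit_gumbel_moments(scores):
    """Estimate Gumbel location (mu) and scale (beta) from mean and variance."""
    mean = statistics.fmean(scores)
    var = statistics.pvariance(scores)
    beta = math.sqrt(6.0 * var) / math.pi
    mu = mean - EULER_GAMMA * beta
    return mu, beta


def evd_p_value(score, mu, beta):
    """P(S >= score) for a Gumbel-distributed similarity score."""
    z = (score - mu) / beta
    if z < -30.0:
        # exp(-exp(-z)) underflows to 0 here, so the p-value is 1.
        return 1.0
    # 1 - exp(-exp(-z)), written with expm1 for accuracy at small p-values.
    return -math.expm1(-math.exp(-z))


# Hypothetical scores from comparisons of unrelated structures.
background = [1.0, 2.0, 3.0, 4.0, 5.0]
mu, beta = fit_gumbel_moments(background)
print(evd_p_value(10.0, mu, beta))
```

In a database search, hits would then be ranked or filtered by this p-value instead of by the raw score.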
Affiliation(s)
- Yuzhen Ye
- Program in Bioinformatics and Systems Biology, The Burnham Institute, 10901 N. Torrey Pines Road, La Jolla, CA 92037, USA.
|
85
|
Kolodny R, Koehl P, Levitt M. Comprehensive evaluation of protein structure alignment methods: scoring by geometric measures. J Mol Biol 2005; 346:1173-88. [PMID: 15701525 PMCID: PMC2692023 DOI: 10.1016/j.jmb.2004.12.032] [Citation(s) in RCA: 226] [Impact Index Per Article: 11.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/26/2004] [Revised: 12/13/2004] [Accepted: 12/15/2004] [Indexed: 11/22/2022]
Abstract
We report the largest and most comprehensive comparison of protein structural alignment methods. Specifically, we evaluate six publicly available structure alignment programs: SSAP, STRUCTAL, DALI, LSQMAN, CE and SSM by aligning all 8,581,970 protein structure pairs in a test set of 2930 protein domains specially selected from CATH v.2.4 to ensure sequence diversity. We consider an alignment good if it matches many residues, and the two substructures are geometrically similar. Even with this definition, evaluating structural alignment methods is not straightforward. At first, we compared the rates of true and false positives using receiver operating characteristic (ROC) curves with the CATH classification taken as a gold standard. This proved unsatisfactory in that the quality of the alignments is not taken into account: sometimes a method that finds less good alignments scores better than a method that finds better alignments. We correct this intrinsic limitation by using four different geometric match measures (SI, MI, SAS, and GSAS) to evaluate the quality of each structural alignment. With this improved analysis we show that there is a wide variation in the performance of different methods; the main reason for this is that it can be difficult to find a good structural alignment between two proteins even when such an alignment exists. We find that STRUCTAL and SSM perform best, followed by LSQMAN and CE. Our focus on the intrinsic quality of each alignment allows us to propose a new method, called "Best-of-All" that combines the best results of all methods. Many commonly used methods miss 10-50% of the good Best-of-All alignments. By putting existing structural alignments into proper perspective, our study allows better comparison of protein structures. By highlighting limitations of existing methods, it will spur the further development of better structural alignment methods. 
This will have significant biological implications now that structural comparison has come to play a central role in the analysis of experimental work on protein structure, protein function and protein evolution.
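Two details of this abstract lend themselves to a short sketch. First, 8,581,970 equals 2930 × 2929, which suggests that every domain pair was aligned in both directions; this reading is an inference from the arithmetic, not a statement in the abstract. Second, the "Best-of-All" idea can be illustrated by keeping, for each pair, the alignment with the best quality score across methods. The data layout, the assumption that lower scores are better, and the toy numbers are all hypothetical.

```python
def ordered_pair_count(n_domains):
    """Number of ordered pairs (a, b) with a != b."""
    return n_domains * (n_domains - 1)


def best_of_all(results_by_method):
    """Pick, per structure pair, the method with the lowest score.

    results_by_method: {method_name: {(dom_a, dom_b): score}}
    Assumption: a lower score means a better alignment.
    """
    best = {}
    for method, scores in results_by_method.items():
        for pair, score in scores.items():
            if pair not in best or score < best[pair][1]:
                best[pair] = (method, score)
    return best


# Made-up quality scores for illustration only.
toy_results = {
    "STRUCTAL": {("d1", "d2"): 3.1, ("d1", "d3"): 7.4},
    "SSM": {("d1", "d2"): 3.6, ("d1", "d3"): 5.2},
    "CE": {("d1", "d2"): 4.0},
}

print(ordered_pair_count(2930))
print(best_of_all(toy_results))
```

A method "misses" a good alignment in this picture whenever its own score for a pair is clearly worse than the Best-of-All score for the same pair.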
Affiliation(s)
- Rachel Kolodny
- Department of Structural Biology, Fairchild Building, Stanford University, Stanford CA 94305, USA.
|
86
|
Stevens FJ. Efficient recognition of protein fold at low sequence identity by conservative application of Psi-BLAST: validation. J Mol Recognit 2005; 18:139-49. [PMID: 15558595 DOI: 10.1002/jmr.721] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/07/2022]
Abstract
A substantial fraction of protein sequences derived from genomic analyses is currently classified as representing 'hypothetical proteins of unknown function'. In part, this reflects the limitations of methods for comparison of sequences with very low identity. We evaluated the effectiveness of a Psi-BLAST search strategy to identify proteins of similar fold at low sequence identity. Psi-BLAST searches for structurally characterized low-sequence-identity matches were carried out on a set of over 300 proteins of known structure. Searches were conducted in NCBI's non-redundant database and were limited to three rounds. Some 614 potential homologs with 25% or lower sequence identity to 166 members of the search set were obtained. Disregarding the expect value, level of sequence identity and span of alignment, correspondence of fold between the target and potential homolog was found in more than 95% of the Psi-BLAST matches. Restrictions on expect value or span of alignment improved the false positive rate at the expense of eliminating many true homologs. Approximately three-quarters of the putative homologs obtained by three rounds of Psi-BLAST revealed no significant sequence similarity to the target protein upon direct sequence comparison by BLAST, and therefore could not be found by a conventional search. Although three rounds of Psi-BLAST identified many more homologs than a standard BLAST search, most homologs were undetected. It appears that more than 80% of all homologs to a target protein may be characterized by a lack of significant sequence similarity. We suggest that conservative use of Psi-BLAST has the potential to propose experimentally testable functions for the majority of proteins currently annotated as 'hypothetical proteins of unknown function'.
Affiliation(s)
- F J Stevens
- Biosciences Division, Argonne National Laboratory, Argonne, IL 60439, USA.
|
87
|
Abstract
A procedure is presented for the automatic determination of the amino acid sequence of peptides by processing data obtained from mass spectrometry analysis. This is a basic and relevant problem in the field of proteomics. Furthermore, it has an even higher conceptual and applicative interest in peptide research, as well as in other connected fields. The analysis does not rely on known protein databases, but on the computation of all amino acid sequences compatible with the given spectral data. By formulating a mathematical model for such combinatorial problems, the structural limitations of known methods are overcome, and efficient solution algorithms can be developed. The results are very encouraging both from the accuracy and computational points of view.
Affiliation(s)
- Renato Bruni
- PolyDART: Data Analysis Research Team for Polymers, 03015 Fiuggi (FR), Italy.
|
88
|
Ye J, Janardan R. Approximate Multiple Protein Structure Alignment Using the Sum-of-Pairs Distance. J Comput Biol 2004; 11:986-1000. [PMID: 15700413 DOI: 10.1089/cmb.2004.11.986] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
Abstract
An algorithm is presented to compute a multiple structure alignment for a set of proteins and to generate a consensus (pseudo) protein for the set. The algorithm is a heuristic in that it computes an approximation to the optimal multiple structure alignment that minimizes the sum of the pairwise distances between the protein structures. The algorithm chooses an input protein as the initial consensus and computes a correspondence between the protein structures (which are represented as sets of unit vectors) using an approach analogous to the center-star method for multiple sequence alignment. From this correspondence, a set of rotation matrices (optimal for the given correspondence) is derived to align the structures and derive the new consensus. The process is iterated until the sum of pairwise distances converges. The computation of the optimal rotations is itself an iterative process that both makes use of the current consensus and generates simultaneously a new one. This approach is based on an interesting result that allows the sum of all pairwise distances to be represented compactly as distances to the consensus. Experimental results on several protein families are presented, showing that the algorithm converges quite rapidly.
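The abstract mentions a result that lets the sum of all pairwise distances be written as distances to the consensus. A standard identity of this kind holds for squared Euclidean distances: for n structures with a position-wise mean as consensus, the sum over all pairs equals n times the sum of squared distances to that mean. Whether the paper's result takes exactly this form is an assumption. The sketch below checks the identity numerically on toy structures that are already in correspondence (no gaps), and it does not renormalise the averaged vectors to unit length.

```python
import math
from itertools import combinations


def squared_distance(s, t):
    """Sum of squared distances between corresponding vectors."""
    return sum(math.dist(p, q) ** 2 for p, q in zip(s, t))


def consensus(structures):
    """Position-wise mean of the structures (not renormalised)."""
    n = len(structures)
    length = len(structures[0])
    return [
        tuple(sum(s[i][k] for s in structures) / n for k in range(3))
        for i in range(length)
    ]


def sum_of_pairs(structures):
    return sum(squared_distance(s, t) for s, t in combinations(structures, 2))


def sum_to_consensus(structures):
    c = consensus(structures)
    return sum(squared_distance(s, c) for s in structures)


# Three toy "structures", each a list of two 3D vectors.
toy = [
    [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
    [(0.0, 1.0, 0.0), (0.0, 0.0, 1.0)],
    [(0.6, 0.8, 0.0), (0.0, 0.6, 0.8)],
]
print(sum_of_pairs(toy), len(toy) * sum_to_consensus(toy))
```

The practical benefit is that an objective over all pairs can be evaluated with one pass over the structures instead of a pass over every pair.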
Affiliation(s)
- Jieping Ye
- Department of Computer Science and Engineering, University of Minnesota, Minneapolis, MN 55455, USA.
|
89
|
Julenius K, Mølgaard A, Gupta R, Brunak S. Prediction, conservation analysis, and structural characterization of mammalian mucin-type O-glycosylation sites. Glycobiology 2004; 15:153-64. [PMID: 15385431 DOI: 10.1093/glycob/cwh151] [Citation(s) in RCA: 688] [Impact Index Per Article: 34.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/23/2022] Open
Abstract
O-GalNAc-glycosylation is one of the main types of glycosylation in mammalian cells. No consensus recognition sequence for the O-glycosyltransferases is known, making prediction methods necessary to bridge the gap between the large number of known protein sequences and the small number of proteins experimentally investigated with regard to glycosylation status. From O-GLYCBASE a total of 86 mammalian proteins experimentally investigated for in vivo O-GalNAc sites were extracted. Mammalian protein homolog comparisons showed that a glycosylated serine or threonine is less likely to be precisely conserved than a nonglycosylated one. The Protein Data Bank was analyzed for structural information, and 12 glycosylated structures were obtained. All positive sites were found in coil or turn regions. A method for predicting the location for mucin-type glycosylation sites was trained using a neural network approach. The best overall network used as input amino acid composition, averaged surface accessibility predictions together with substitution matrix profile encoding of the sequence. To improve prediction on isolated (single) sites, networks were trained on isolated sites only. The final method combines predictions from the best overall network and the best isolated site network; this prediction method correctly predicted 76% of the glycosylated residues and 93% of the nonglycosylated residues. NetOGlyc 3.1 can predict sites for completely new proteins without losing its performance. The fact that the sites could be predicted from averaged properties together with the fact that glycosylation sites are not precisely conserved indicates that mucin-type glycosylation in most cases is a bulk property and not a very site-specific one. NetOGlyc 3.1 is made available at www.cbs.dtu.dk/services/netoglyc.
Affiliation(s)
- Karin Julenius
- Center for Biological Sequence Analysis, BioCentrum, Building 208, Technical University of Denmark, DK-2800 Lyngby, Denmark.
|
90
|
Abstract
We developed a variant of the intermediate sequence search method (ISS(new)) for detection and alignment of weakly similar pairs of protein sequences. ISS(new) relates two query sequences by an intermediate sequence that is potentially homologous to both queries. The improvement was achieved by a more robust overlap score for a match between the queries through an intermediate. The approach was benchmarked on a data set of 2369 sequences of known structure with insignificant sequence similarity to each other (BLAST E-value larger than 0.001); 2050 of these sequences had a related structure in the set. ISS(new) performed significantly better than both PSI-BLAST and a previously described intermediate sequence search method. PSI-BLAST could not detect correct homologs for 1619 of the 2369 sequences. In contrast, ISS(new) assigned a correct homolog as the top hit for 121 of these 1619 sequences, while incorrectly assigning homologs for only nine targets; it did not assign homologs for the remainder of the sequences. By estimate, ISS(new) may be able to assign the folds of domains in approximately 29,000 of the approximately 500,000 sequences unassigned by PSI-BLAST, with 90% specificity (1 - false positives fraction). In addition, we show that the 15 alignments with the most significant BLAST E-values include the nearly best alignments constructed by ISS(new).
Affiliation(s)
- Bino John
- Laboratory of Molecular Biophysics, Pels Family Center for Biochemistry and Structural Biology, The Rockefeller University, New York, New York 10021, USA
|
91
|
Das R, Gerstein M. A method using active-site sequence conservation to find functional shifts in protein families: application to the enzymes of central metabolism, leading to the identification of an anomalous isocitrate dehydrogenase in pathogens. Proteins 2004; 55:455-63. [PMID: 15048835 DOI: 10.1002/prot.10639] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022]
Abstract
We have introduced a method to identify functional shifts in protein families. Our method is based on the calculation of an active-site conservation ratio, which we call the "ASC ratio." For a structurally based alignment of a protein family, this ratio is the average sequence similarity of the active-site region compared to the full-length protein. The active-site region is defined as all the residues within a certain radius of the known functionally important groups. Using our method, we have analyzed enzymes of central metabolism from a large number of genomes (35). We found that for most of the enzymes, the active-site region is more highly conserved than the full-length sequence. However, for three tricarboxylic acid (TCA)-cycle enzymes, active-site sequences are considerably more diverged (than full-length ones). In particular, we were able to identify in six pathogens a novel isocitrate dehydrogenase that has very low sequence similarity around the active site. Detailed sequence-structure analysis indicates that while the active-site structure of isocitrate dehydrogenase is most likely similar between pathogens and nonpathogens, the unusual sequence divergence could result from an extra domain added at the N-terminus. This domain has a leucine-rich motif similar to one in the Yersinia pestis cytotoxin and may therefore confer additional pathogenic functions.
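The ASC ratio described above can be sketched as a ratio of two average similarities computed from the same alignment. The sketch assumes that "sequence similarity" is measured as mean pairwise percent identity and that a gap never counts as a match; the abstract does not specify either choice. The function names, the toy alignment, and the chosen active-site columns are hypothetical.

```python
from itertools import combinations


def mean_pairwise_identity(alignment, columns):
    """Average fraction of identical residues over the given columns."""
    cols = list(columns)
    pairs = list(combinations(alignment, 2))
    if not cols or not pairs:
        raise ValueError("need at least two sequences and one column")
    total = 0.0
    for s, t in pairs:
        # Assumption: a gap ('-') never counts as a match.
        matches = sum(1 for c in cols if s[c] == t[c] and s[c] != "-")
        total += matches / len(cols)
    return total / len(pairs)


def asc_ratio(alignment, active_site_columns):
    """Active-site similarity divided by full-length similarity."""
    full = mean_pairwise_identity(alignment, range(len(alignment[0])))
    active = mean_pairwise_identity(alignment, active_site_columns)
    return active / full


# Toy alignment; columns 0-2 play the role of the active-site region.
toy_alignment = ["ACDEFG", "ACDKLM", "ACDEFM"]
print(asc_ratio(toy_alignment, [0, 1, 2]))
```

A ratio above 1 corresponds to the usual case of a well-conserved active site, while a ratio below 1 would flag the kind of functional shift the abstract reports for some TCA-cycle enzymes.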
Affiliation(s)
- Rajdeep Das
- Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut 06520, USA
|
92
|
Karchin R, Cline M, Karplus K. Evaluation of local structure alphabets based on residue burial. Proteins 2004; 55:508-18. [PMID: 15103615 DOI: 10.1002/prot.20008] [Citation(s) in RCA: 52] [Impact Index Per Article: 2.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/07/2022]
Abstract
Residue burial, which describes a protein residue's exposure to solvent and neighboring atoms, is key to protein structure prediction, modeling, and analysis. We assessed 21 alphabets representing residue burial, according to their predictability from amino acid sequence, conservation in structural alignments, and utility in one fold-recognition scenario. This follows upon our previous work in assessing nine representations of backbone geometry [1]. The alphabet found to be most effective overall has seven states and is based on a count of Cβ atoms within a 14 Å radius sphere centered at the Cβ of a residue of interest. When incorporated into a hidden Markov model (HMM), this alphabet gave us a 38% performance boost in fold recognition and 23% in alignment quality.
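The burial measure highlighted in this abstract is simple to sketch: count the Cβ atoms inside a 14 Å sphere around each residue's Cβ, then map the count to one of seven letters. The cut points that separate the seven states are not given in the abstract, so the values below are placeholders, as are the function names and the decision to exclude the residue's own Cβ from the count.

```python
import math
from bisect import bisect_right


def burial_counts(cb_coords, radius=14.0):
    """Number of other C-beta atoms within `radius` of each C-beta."""
    counts = []
    for i, a in enumerate(cb_coords):
        n = sum(
            1
            for j, b in enumerate(cb_coords)
            if j != i and math.dist(a, b) <= radius
        )
        counts.append(n)
    return counts


# Hypothetical cut points: six thresholds give seven states.
HYPOTHETICAL_CUTS = (10, 20, 30, 40, 50, 60)


def burial_states(counts, cuts=HYPOTHETICAL_CUTS):
    """Encode counts as letters A (most exposed) to G (most buried)."""
    return "".join("ABCDEFG"[bisect_right(cuts, c)] for c in counts)


toy_cb = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0), (30.0, 0.0, 0.0)]
print(burial_counts(toy_cb))
print(burial_states([0, 10, 25, 60, 100]))
```

The resulting string has one letter per residue, which is the form an HMM track would consume alongside the amino acid sequence.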
Affiliation(s)
- Rachel Karchin
- Department of Biopharmaceutical Sciences, University of California, San Francisco 94143-2240, USA.
|
93
|
Day R, Beck DAC, Armen RS, Daggett V. A consensus view of fold space: combining SCOP, CATH, and the Dali Domain Dictionary. Protein Sci 2004; 12:2150-60. [PMID: 14500873 PMCID: PMC2366924 DOI: 10.1110/ps.0306803] [Citation(s) in RCA: 82] [Impact Index Per Article: 4.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/27/2022]
Abstract
We have determined consensus protein-fold classifications on the basis of three classification methods, SCOP, CATH, and Dali. These classifications make use of different methods of defining and categorizing protein folds that lead to different views of protein-fold space. Pairwise comparisons of domains on the basis of their fold classifications show that much of the disagreement between the classification systems is due to differing domain definitions rather than assigning the same domain to different folds. However, there are significant differences in the fold assignments between the three systems. These remaining differences can be explained primarily in terms of the breadth of the fold classifications. Many structures may be defined as having one fold in one system, whereas far fewer are defined as having the analogous fold in another system. By comparing these folds for a nonredundant set of proteins, the consensus method breaks up broad fold classifications and combines restrictive fold classifications into metafolds, creating, in effect, an averaged view of fold space. This averaged view requires that the structural similarities between proteins having the same metafold be recognized by multiple classification systems. Thus, the consensus map is useful for researchers looking for fold similarities that are relatively independent of the method used to compare proteins. The 30 most populated metafolds, representing the folds of about half of a nonredundant subset of the PDB, are presented here. The full list of metafolds is presented on the Web.
Affiliation(s)
- Ryan Day
- Biomolecular Structure and Design Program and Department of Medicinal Chemistry, University of Washington, Seattle, Washington 98195, USA
|
94
|
Ohlson T, Wallner B, Elofsson A. Profile-profile methods provide improved fold-recognition: A study of different profile-profile alignment methods. Proteins 2004; 57:188-97. [PMID: 15326603 DOI: 10.1002/prot.20184] [Citation(s) in RCA: 81] [Impact Index Per Article: 4.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/05/2022]
Abstract
To improve the detection of related proteins, it is often useful to include evolutionary information for both the query and target proteins. One method to include this information is by the use of profile-profile alignments, where a profile from the query protein is compared with the profiles from the target proteins. Profile-profile alignments can be implemented in several fundamentally different ways. The similarity between two positions can be calculated using a dot-product, a probabilistic model, or an information theoretical measure. Here, we present a large-scale comparison of different profile-profile alignment methods. We show that the profile-profile methods perform at least 30% better than standard sequence-profile methods both in their ability to recognize superfamily-related proteins and in the quality of the obtained alignments. Although the performance of all methods is quite similar, profile-profile methods that use a probabilistic scoring function have an advantage as they can create good alignments and show a good fold recognition capacity using the same gap-penalties, while the other methods need to use different parameters to obtain comparable performances.
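Two of the column-scoring families named in this abstract can be sketched briefly: a dot product between two profile columns, and an information-theoretic measure. The abstract does not say which information-theoretic measure was tested, so the Jensen-Shannon divergence used here is an assumed stand-in. Columns are plain frequency vectors over a toy four-letter alphabet, and all names are hypothetical.

```python
import math


def dot_product_score(p, q):
    """Similarity of two profile columns given as frequency vectors."""
    return sum(a * b for a, b in zip(p, q))


def _kl_bits(p, m):
    """Kullback-Leibler divergence in bits; terms with p = 0 contribute 0."""
    return sum(a * math.log2(a / b) for a, b in zip(p, m) if a > 0)


def jensen_shannon_divergence(p, q):
    """0 for identical columns, 1 bit for columns with disjoint support."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * _kl_bits(p, m) + 0.5 * _kl_bits(q, m)


col_a = [0.7, 0.1, 0.1, 0.1]
col_b = [0.6, 0.2, 0.1, 0.1]
col_c = [0.0, 0.0, 0.5, 0.5]
print(dot_product_score(col_a, col_b), dot_product_score(col_a, col_c))
print(jensen_shannon_divergence(col_a, col_b), jensen_shannon_divergence(col_a, col_c))
```

Note the opposite orientation of the two measures: a higher dot product means more similar columns, whereas a lower divergence does, so a divergence has to be converted into a score before it is used in dynamic programming.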
Affiliation(s)
- Tomas Ohlson
- Stockholm Bioinformatics Center, Stockholm University, Stockholm, Sweden
|
95
|
Abstract
The structural comparison of two proteins comes up in many applications in structural biology where it is often necessary to find similarities in very large conformation sets. This work describes techniques to achieve significant speedup in the computation of structural similarity between two given conformations, at the expense of introducing a small error in the similarity measure. Furthermore, the proposed computational scheme allows for a tradeoff between speedup and error. This scheme exploits the fact that the Cα representation of a protein conformation contains redundant information, due to the chain topology and limited compactness of proteins. This redundancy can be reduced by approximating subchains of a protein by their centers of mass, resulting in a smaller number of points to describe a conformation. A Haar wavelet analysis of random chains and proteins is used to justify this approximated representation. Similarity measures computed with this representation are highly correlated to the measures computed with the original Cα representation. Therefore, they can be used in applications where small similarity errors can be tolerated or as fast filters in applications that require exact measures. Computational tests have been conducted on two applications, nearest neighbor search and automatic structural classification.
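The reduction step described in this abstract, replacing each subchain by its center of mass, can be sketched as block averaging of consecutive alpha-carbon positions. Equal masses are assumed, the block size `m` is the knob that trades speed against error, and averaging a shorter final block is a choice made here for illustration. The function name is hypothetical.

```python
def coarse_grain(ca_coords, m):
    """Replace each run of m consecutive points by its centroid.

    Assumptions: equal masses, and a final block shorter than m
    is averaged on its own rather than dropped.
    """
    if m < 1:
        raise ValueError("block size must be at least 1")
    reduced = []
    for start in range(0, len(ca_coords), m):
        block = ca_coords[start:start + m]
        reduced.append(
            tuple(sum(p[k] for p in block) / len(block) for k in range(3))
        )
    return reduced


toy_chain = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (4.0, 0.0, 0.0),
             (6.0, 0.0, 0.0), (8.0, 0.0, 0.0)]
print(coarse_grain(toy_chain, 2))
```

Any similarity measure that works on point lists can then be run on the shorter lists, with the exact measure reserved for the candidates that pass this fast filter.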
Affiliation(s)
- Itay Lotan
- Department of Computer Science, 353 Serra Mall, Stanford University, Stanford, CA 94305, USA.
|
96
|
Ochagavía ME, Wodak S. Progressive combinatorial algorithm for multiple structural alignments: Application to distantly related proteins. Proteins 2004; 55:436-54. [PMID: 15048834 DOI: 10.1002/prot.10587] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/08/2022]
Abstract
MALECON is a progressive combinatorial procedure for multiple alignments of protein structures. It searches a library of pairwise alignments for all three-protein alignments in which a specified number of residues is consistently aligned. These alignments are progressively expanded to include additional proteins and more spatially equivalent residues, subject to certain criteria. This action involves superimposing the aligned proteins by their hitherto equivalent residues and searching for additional Cα atoms that lie close in space. The performance of MALECON is illustrated and compared with several extant multiple structure alignment methods by using as test cases the globin homologous superfamily, the OB and the Jellyrolls folds. MALECON gives better definitions of the common structural features in the structurally more diverse proteins of the OB and Jellyrolls folds, but it yields comparable results for the more similar globins. When no consistent multiple alignments can be derived for all members of a protein group, our procedure is still capable of automatically generating consistent alignments and common core definitions for subgroups of the members. This finding is illustrated for proteins of the OB fold and SH3 domains, believed to share common structural features, and should be very instrumental in homology modeling and investigations of protein evolution.
|
97
|
Alexandrov V, Gerstein M. Using 3D Hidden Markov Models that explicitly represent spatial coordinates to model and compare protein structures. BMC Bioinformatics 2004; 5:2. [PMID: 14715091 PMCID: PMC344530 DOI: 10.1186/1471-2105-5-2] [Citation(s) in RCA: 17] [Impact Index Per Article: 0.9] [Received: 08/19/2003] [Accepted: 01/09/2004] [Indexed: 11/26/2022] Open
Abstract
BACKGROUND: Hidden Markov Models (HMMs) have proven very useful in computational biology for such applications as sequence pattern matching, gene-finding, and structure prediction. Thus far, however, they have been confined to representing 1D sequence (or the aspects of structure that could be represented by character strings). RESULTS: We develop an HMM formalism that explicitly uses 3D coordinates in its match states. The match states are modeled by 3D Gaussian distributions centered on the mean coordinate position of each alpha carbon in a large structural alignment. The transition probabilities depend on the spread of the neighboring match states and on the number of gaps found in the structural alignment. We also develop methods for aligning query structures against 3D HMMs and scoring the result probabilistically. For 1D HMMs these tasks are accomplished by the Viterbi and forward algorithms. However, these will not work in unmodified form for the 3D problem, due to non-local quality of structural alignment, so we develop extensions of these algorithms for the 3D case. Several applications of 3D HMMs for protein structure classification are reported. A good separation of scores for different fold families suggests that the described construct is quite useful for protein structure analysis. CONCLUSION: We have created a rigorous 3D HMM representation for protein structures and implemented a complete set of routines for building 3D HMMs in C and Perl. The code is freely available from http://www.molmovdb.org/geometry/3dHMM, and at this site we also have a simple prototype server to demonstrate the features of the described approach.
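The idea of a match state that emits 3D coordinates can be made concrete with a minimal sketch. It assumes a spherical (isotropic) Gaussian per alignment column, which is a simplification chosen here for brevity; the abstract does not state the covariance form. Transition probabilities and the extended alignment algorithms are not reproduced. `fit_match_state`, `log_emission`, and the variance floor `min_var` are hypothetical names.

```python
import math

def fit_match_state(column_coords, min_var=1e-6):
    """Estimate (mean, variance) of a spherical 3D Gaussian.

    column_coords holds the alpha-carbon positions that one column of a
    structural alignment places on top of each other (already superimposed).
    min_var is an assumed floor that avoids a zero variance.
    """
    n = len(column_coords)
    mean = tuple(sum(p[k] for p in column_coords) / n for k in range(3))
    mean_sq_dist = sum(math.dist(p, mean) ** 2 for p in column_coords) / n
    # For a spherical Gaussian the per-axis variance is one third of the
    # mean squared distance from the centre.
    return mean, max(mean_sq_dist / 3.0, min_var)

def log_emission(point, mean, var):
    """Log density of a spherical 3D Gaussian evaluated at point."""
    d2 = math.dist(point, mean) ** 2
    return -1.5 * math.log(2.0 * math.pi * var) - d2 / (2.0 * var)
```

A query residue that lands near the column mean scores higher than one farther away, and a tightly conserved column (small variance) penalizes deviations more sharply than a loosely conserved one.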
Affiliation(s)
- Vadim Alexandrov
- Department of Molecular Biophysics and Biochemistry, Yale University, 266 Whitney Ave., New Haven, CT 06511, USA
- Mark Gerstein
- Department of Molecular Biophysics and Biochemistry, Yale University, 266 Whitney Ave., New Haven, CT 06511, USA
- Department of Computer Science, Yale University, 266 Whitney Ave., New Haven, CT 06511, USA
|
98
|
Blankenbecler R, Ohlsson M, Peterson C, Ringner M. Matching protein structures with fuzzy alignments. Proc Natl Acad Sci U S A 2003; 100:11936-40. [PMID: 14526099 PMCID: PMC218691 DOI: 10.1073/pnas.1635048100] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.1] [Indexed: 11/18/2022] Open
Abstract
Unraveling functional and ancestral relationships between proteins as well as structure-prediction procedures require powerful protein-alignment methods. A structure-alignment method is presented where the problem is mapped onto a cost function containing both fuzzy (Potts) assignment variables and atomic coordinates. The cost function is minimized by using an iterative scheme, where at each step mean field theory methods at finite "temperatures" are used for determining fuzzy assignment variables followed by exact translation and rotation of atomic coordinates weighted by their corresponding fuzzy assignment variables. The approach performs very well when compared with other methods, requires modest central processing unit consumption, and is robust with respect to choice of iteration parameters for a wide range of proteins.
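One half-step of such an iterative scheme can be sketched under strong simplifications. The sketch assumes the cost is just the squared distance between matched atoms, so the mean-field update at temperature `temperature` reduces to a row-wise softmax over squared distances. Gap costs and the rotation update are omitted, and only the exact weighted translation is shown. All function names are hypothetical, and this is not the authors' cost function.

```python
import math

def fuzzy_assignments(a, b, temperature):
    """Row-normalised soft assignment of each atom in a to the atoms in b.

    v[i][j] is proportional to exp(-|a_i - b_j|^2 / temperature); each row sums to 1.
    """
    rows = []
    for p in a:
        d2 = [math.dist(p, q) ** 2 for q in b]
        d2_min = min(d2)  # subtracting the minimum keeps the exponentials stable
        weights = [math.exp(-(d - d2_min) / temperature) for d in d2]
        total = sum(weights)
        rows.append([w / total for w in weights])
    return rows

def weighted_translation(a, b, v):
    """Translation t minimising sum_ij v[i][j] * |a_i + t - b_j|^2 (rotation held fixed)."""
    total_weight = sum(sum(row) for row in v)
    shift = [0.0, 0.0, 0.0]
    for i, p in enumerate(a):
        for j, q in enumerate(b):
            for k in range(3):
                shift[k] += v[i][j] * (q[k] - p[k])
    return tuple(s / total_weight for s in shift)
```

At high temperature every row is close to uniform, so the translation simply matches the two centroids. As the temperature is lowered the rows sharpen toward a one-to-one assignment, which is the usual annealing behaviour of mean-field schemes.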
|
99
|
Van Walle I, Lasters I, Wyns L. Consistency matrices: quantified structure alignments for sets of related proteins. Proteins 2003; 51:1-9. [PMID: 12596259 DOI: 10.1002/prot.10293] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 11/08/2022]
Abstract
Comparing two remotely similar structures is a difficult problem: more often than not, resulting structure alignments will show ambiguities and a unique answer usually does not even exist. In addition, alignments in general have a limited information content because every aligned residue is considered equally important. To solve these issues to a certain extent, one can take the perspective of a whole group of similar structures and then evaluate common structural features. Here, we describe a consistency approach that, although not actually performing a multiple structure alignment, does produce the information that one would conceivably want from such an experiment: the key structural features of the group, e.g., a fold, which in this case are projected onto either a pair of proteins or a single protein. Both representations are useful for a number of applications, ranging from the detection of (partially) wrong structure alignments to protein structure classification and fold recognition. To demonstrate some of these applications, the procedure was applied to 195 SCOP folds containing a total of 1802 domains sharing very low sequence similarity.
Affiliation(s)
- Ivo Van Walle
- Department of Ultrastructure, Vrije Universiteit Brussel, Sint-Genesius Rode, Belgium.
|
100
|
Rogen P, Fain B. Automatic classification of protein structure by using Gauss integrals. Proc Natl Acad Sci U S A 2003; 100:119-24. [PMID: 12506205 PMCID: PMC140900 DOI: 10.1073/pnas.2636460100] [Citation(s) in RCA: 139] [Impact Index Per Article: 6.6] [Received: 09/12/2002] [Accepted: 10/24/2002] [Indexed: 11/18/2022] Open
Abstract
We introduce a method of looking at, analyzing, and comparing protein structures. The topology of a protein is captured by 30 numbers inspired by Vassiliev knot invariants. To illustrate the simplicity and power of this topological approach, we construct a measure (scaled Gauss metric, SGM) of similarity of protein shapes. Under this metric, protein chains naturally separate into fold clusters. We use SGM to construct an automatic classification procedure for the CATH2.4 database. The method is very fast because it requires neither alignment of the chains nor any chain-chain comparison. It also has only one adjustable parameter. We assign 95.51% of the chains into the proper C (class), A (architecture), T (topology), and H (homologous superfamily) fold, find all new folds, and detect no false geometric positives. Using the SGM, we display a "map" of the space of folds projected onto two dimensions, show the relative locations of the major structural classes, and "zoom into" the space of proteins to show architecture, topology, and fold clusters. The existence of a simple measure of a protein fold computed from the chain path will have a major impact on automatic fold classification.
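To give a flavour of a descriptor computed from the chain path alone, here is a minimal sketch of the simplest Gauss integral, the writhe, for an open polygonal chain. It assumes the writhe is a reasonable stand-in for one of the 30 numbers. The higher-order integrals and the SGM itself are not reproduced. The closed-form solid angle for a pair of straight segments is a standard expression from the polymer-modelling literature, swapped in here as an assumption rather than taken from the paper, and all function names are hypothetical.

```python
import math

def _sub(u, v):
    return (u[0] - v[0], u[1] - v[1], u[2] - v[2])

def _dot(u, v):
    return u[0] * v[0] + u[1] * v[1] + u[2] * v[2]

def _cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])

def _unit(u):
    norm = math.sqrt(_dot(u, u))
    return None if norm < 1e-12 else (u[0] / norm, u[1] / norm, u[2] / norm)

def _signed_solid_angle(p1, p2, p3, p4):
    """Signed Gauss-integral contribution of segments p1-p2 and p3-p4."""
    r12, r34 = _sub(p2, p1), _sub(p4, p3)
    r13, r14 = _sub(p3, p1), _sub(p4, p1)
    r23, r24 = _sub(p3, p2), _sub(p4, p2)
    normals = [_unit(_cross(r13, r14)), _unit(_cross(r14, r24)),
               _unit(_cross(r24, r23)), _unit(_cross(r23, r13))]
    orientation = _dot(_cross(r34, r12), r13)
    # Coplanar or degenerate segment pairs contribute nothing.
    if orientation == 0.0 or any(n is None for n in normals):
        return 0.0
    omega = 0.0
    for k in range(4):
        cosine = _dot(normals[k], normals[(k + 1) % 4])
        omega += math.asin(max(-1.0, min(1.0, cosine)))
    return math.copysign(omega, orientation)

def writhe(points):
    """Writhe of an open polygonal chain given as a list of 3D points."""
    n_seg = len(points) - 1
    total = 0.0
    for i in range(n_seg):
        # Adjacent segments share a point and are coplanar, so start at i + 2.
        for j in range(i + 2, n_seg):
            total += _signed_solid_angle(points[i], points[i + 1],
                                         points[j], points[j + 1])
    return total / (2.0 * math.pi)
```

The value depends only on the chain itself, so no alignment or chain-chain comparison is needed. A mirror image gives the opposite sign and a planar chain gives zero.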
Affiliation(s)
- Peter Rogen
- Department of Mathematics, Technical University of Denmark, Building 303, DK-2800 Kongens Lyngby, Denmark
|