1
Xu S, Zou S, Wang L. A geometric clustering algorithm with applications to structural data. J Comput Biol 2014; 22:436-50. [PMID: 25517067 DOI: 10.1089/cmb.2014.0162] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Indexed: 11/12/2022]
Abstract
An important feature of structural data, especially those from structural determination and protein-ligand docking programs, is that their distribution could be mostly uniform. Traditional clustering algorithms developed specifically for nonuniformly distributed data may not be adequate for their classification. Here we present a geometric partitional algorithm that could be applied to both uniformly and nonuniformly distributed data. The algorithm is a top-down approach that recursively selects the outliers as the seeds to form new clusters until all the structures within a cluster satisfy a classification criterion. The algorithm has been evaluated on a diverse set of real structural data and six sets of test data. The results show that it is superior to the previous algorithms for the clustering of structural data and is similar to or better than them for the classification of the test data. The algorithm should be especially useful for the identification of the best but minor clusters and for speeding up an iterative process widely used in NMR structure determination.
Affiliation(s)
- Shutan Xu
- College of Computer Science and Technology, Jilin University , Changchun, P.R. China
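The abstract above describes a top-down scheme in which outliers become seeds for new clusters until every cluster satisfies a classification criterion. A minimal sketch of that general idea follows. It uses farthest-point seeding as a simple stand-in and is not the authors' algorithm; the function name `outlier_seeded_clusters`, the Euclidean distance, and the `max_radius` criterion are all assumptions made for illustration.

```python
import math


def outlier_seeded_clusters(points, max_radius):
    """Group points so that each lies within max_radius of its cluster seed.

    Hypothetical stand-in (farthest-point seeding), not the published method:
    the point farthest from every current seed is treated as the outlier
    and promoted to a new seed until the radius criterion holds.
    """
    if max_radius < 0:
        raise ValueError("max_radius must be non-negative")
    seeds = [points[0]]  # assumption: the first point is the initial seed
    while True:
        clusters = [[] for _ in seeds]
        worst_point, worst_dist = None, 0.0
        for p in points:
            # Assign each point to its nearest seed.
            d, i = min((math.dist(p, s), i) for i, s in enumerate(seeds))
            clusters[i].append(p)
            if d > worst_dist:
                worst_point, worst_dist = p, d
        if worst_dist <= max_radius:
            return clusters
        seeds.append(worst_point)  # the outlier seeds a new cluster


example = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
clusters = outlier_seeded_clusters(example, max_radius=2.0)
```

The loop terminates because each new seed is a point not yet used as a seed, so in the worst case every point ends up as its own cluster.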
2
Pelta DA, González JR, Moreno Vega M. A simple and fast heuristic for protein structure comparison. BMC Bioinformatics 2008; 9:161. [PMID: 18366735 PMCID: PMC2335283 DOI: 10.1186/1471-2105-9-161] [Citation(s) in RCA: 37] [Impact Index Per Article: 2.3] [Received: 03/19/2007] [Accepted: 03/25/2008] [Indexed: 11/20/2022]
Abstract
Background: Protein structure comparison is a key problem in bioinformatics. Several methods exist for comparing proteins, one of which is to solve the Maximum Contact Map Overlap problem (MAX-CMO). Although this problem may be solved using exact algorithms, researchers require approximate algorithms that obtain good-quality solutions using fewer computational resources than the former. Results: We propose a variable neighborhood search metaheuristic for solving MAX-CMO. We analyze this strategy in two aspects: 1) from an optimization point of view, the strategy is tested on two different datasets, obtaining an error of 3.5% (over 2702 pairs) and 1.7% (over 161 pairs) with respect to optimal values, thus leading to highly accurate solutions in a simpler and less expensive way than exact algorithms; 2) in terms of protein structure classification, we conduct experiments on three datasets and show that it is feasible to detect structural similarities at SCOP's family and CATH's architecture levels using normalized overlap values. Some limitations and the role of normalization are outlined for doing classification at SCOP's fold level. Conclusion: We designed, implemented and tested a new tool for solving MAX-CMO, based on a well-known metaheuristic technique. The good balance between solution quality and computational effort makes it a valuable tool. Moreover, to the best of our knowledge, this is the first time the MAX-CMO measure is tested at SCOP's fold and CATH's architecture levels with encouraging results. Software is available for download at .
Affiliation(s)
- David A Pelta
- Models of Decision and Optimization Research Group, Dept. of Computer Science and Artificial Intelligence, University of Granada, Spain.
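The quantity that MAX-CMO maximizes can be illustrated with a short sketch. For a fixed, order-preserving alignment between two proteins, the overlap is the number of contacts in the first protein whose aligned residue pair is also a contact in the second. The code below only evaluates that objective for a given alignment; it does not implement the variable neighborhood search from the abstract, and the names `contact_map_overlap` and `is_order_preserving` are hypothetical.

```python
def is_order_preserving(alignment):
    """Check that a residue mapping A -> B is one-to-one and keeps sequence order."""
    mapped = [alignment[i] for i in sorted(alignment)]
    return all(x < y for x, y in zip(mapped, mapped[1:]))


def contact_map_overlap(contacts_a, contacts_b, alignment):
    """Count contacts of A whose aligned residue pair is also a contact in B.

    contacts_a, contacts_b: iterables of (i, j) residue index pairs.
    alignment: dict mapping residue indices of A to residue indices of B.
    """
    if not is_order_preserving(alignment):
        raise ValueError("alignment must be strictly increasing")
    # Store contacts as sorted tuples so (i, j) and (j, i) are the same contact.
    a = {tuple(sorted(c)) for c in contacts_a}
    b = {tuple(sorted(c)) for c in contacts_b}
    overlap = 0
    for i, j in a:
        if i in alignment and j in alignment:
            if tuple(sorted((alignment[i], alignment[j]))) in b:
                overlap += 1
    return overlap


# Toy example with made-up contact maps.
contacts_a = [(0, 2), (1, 3), (2, 4)]
contacts_b = [(0, 2), (1, 4), (3, 5)]
alignment = {0: 0, 1: 1, 2: 2, 3: 4, 4: 5}
overlap = contact_map_overlap(contacts_a, contacts_b, alignment)  # 2
```

A search method such as the one described in the abstract would explore many candidate alignments and keep the one with the largest overlap value.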
3
Theobald DL, Wuttke DS. Divergent evolution within protein superfolds inferred from profile-based phylogenetics. J Mol Biol 2005; 354:722-37. [PMID: 16266719 PMCID: PMC1769326 DOI: 10.1016/j.jmb.2005.08.071] [Citation(s) in RCA: 35] [Impact Index Per Article: 1.8] [Received: 06/06/2005] [Revised: 08/29/2005] [Accepted: 08/30/2005] [Indexed: 11/19/2022]
Abstract
Many dissimilar protein sequences fold into similar structures. A central and persistent challenge facing protein structural analysis is the discrimination between homology and convergence for structurally similar domains that lack significant sequence similarity. Classic examples are the OB-fold and SH3 domains, both small, modular beta-barrel protein superfolds. The similarities among these domains have variously been attributed to common descent or to convergent evolution. Using a sequence profile-based phylogenetic technique, we analyzed all structurally characterized OB-fold, SH3, and PDZ domains with less than 40% mutual sequence identity. An all-against-all, profile-versus-profile analysis of these domains revealed many previously undetectable significant interrelationships. The matrices of scores were used to infer phylogenies based on our derivation of the relationships between sequence similarity E-values and evolutionary distances. The resulting clades of domains correlate remarkably well with biological function, as opposed to structural similarity, indicating that the functionally distinct sub-families within these superfolds are homologous. This method extends phylogenetics into the challenging "twilight zone" of sequence similarity, providing the first objective resolution of deep evolutionary relationships among distant protein families.
Affiliation(s)
- Douglas L. Theobald
- Department of Chemistry and Biochemistry, UCB 215 University of Colorado Boulder, CO 80309-0215, USA
- Deborah S. Wuttke
- Department of Chemistry and Biochemistry, UCB 215 University of Colorado Boulder, CO 80309-0215, USA
4
Bostick DL, Shen M, Vaisman II. A simple topological representation of protein structure: implications for new, fast, and robust structural classification. Proteins 2004; 56:487-501. [PMID: 15229882 DOI: 10.1002/prot.20146] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.3] [Indexed: 11/10/2022]
Abstract
A topological representation of proteins is developed that makes use of two metrics: the Euclidean metric for identifying natural nearest neighboring residues via the Delaunay tessellation in Cartesian space and the distance between residues in sequence space. Using this representation, we introduce a quantitative and computationally inexpensive method for the comparison of protein structural topology. The method ultimately results in a numerical score quantifying the distance between proteins in a heuristically defined topological space. The properties of this scoring scheme are investigated and correlated with the standard Cα distance root-mean-square deviation measure of protein similarity calculated by rigid body structural alignment. The topological comparison method is shown to have a characteristic dependence on protein conformational differences and secondary structure. This distinctive behavior is also observed in the comparison of proteins within families of structural relatives. The ability of the comparison method to successfully classify proteins into classes, superfamilies, folds, and families that are consistent with standard classification methods, both automated and human-driven, is demonstrated. Furthermore, it is shown that the scoring method allows for a fine-grained classification on the family, protein, and species level that agrees very well with currently established phylogenetic hierarchies. This fine classification is achieved without requiring visual inspection of proteins, sequence analysis, or the use of structural superimposition methods. Implications of the method for a fast, automated, topological hierarchical classification of proteins are discussed.
Affiliation(s)
- David L Bostick
- Department of Physics and Program in Molecular/Cell Biophysics, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, USA
5
Day R, Beck DAC, Armen RS, Daggett V. A consensus view of fold space: combining SCOP, CATH, and the Dali Domain Dictionary. Protein Sci 2004; 12:2150-60. [PMID: 14500873 PMCID: PMC2366924 DOI: 10.1110/ps.0306803] [Citation(s) in RCA: 82] [Impact Index Per Article: 4.1] [Indexed: 10/27/2022]
Abstract
We have determined consensus protein-fold classifications on the basis of three classification methods, SCOP, CATH, and Dali. These classifications make use of different methods of defining and categorizing protein folds that lead to different views of protein-fold space. Pairwise comparisons of domains on the basis of their fold classifications show that much of the disagreement between the classification systems is due to differing domain definitions rather than assigning the same domain to different folds. However, there are significant differences in the fold assignments between the three systems. These remaining differences can be explained primarily in terms of the breadth of the fold classifications. Many structures may be defined as having one fold in one system, whereas far fewer are defined as having the analogous fold in another system. By comparing these folds for a nonredundant set of proteins, the consensus method breaks up broad fold classifications and combines restrictive fold classifications into metafolds, creating, in effect, an averaged view of fold space. This averaged view requires that the structural similarities between proteins having the same metafold be recognized by multiple classification systems. Thus, the consensus map is useful for researchers looking for fold similarities that are relatively independent of the method used to compare proteins. The 30 most populated metafolds, representing the folds of about half of a nonredundant subset of the PDB, are presented here. The full list of metafolds is presented on the Web.
Affiliation(s)
- Ryan Day
- Biomolecular Structure and Design Program and Department of Medicinal Chemistry, University of Washington, Seattle, Washington 98195, USA
6
Wintjens R, Noël C, May ACW, Gerbod D, Dufernez F, Capron M, Viscogliosi E, Rooman M. Specificity and Phenetic Relationships of Iron- and Manganese-containing Superoxide Dismutases on the Basis of Structure and Sequence Comparisons. J Biol Chem 2004; 279:9248-54. [PMID: 14672935 DOI: 10.1074/jbc.m312329200] [Citation(s) in RCA: 66] [Impact Index Per Article: 3.3] [Indexed: 11/06/2022]
Abstract
The iron- and manganese-containing superoxide dismutases (Fe/Mn-SOD) share the same chemical function and spatial structure but can be distinguished according to their modes of oligomerization and their metal ion specificity. They appear as homodimers or homotetramers and usually require a specific metal for activity. On the basis of 261 aligned SOD sequences and 12 superimposed x-ray structures, two phenetic trees were constructed, one sequence-based and the other structure-based. Their comparison reveals the imperfect correlation of sequence and structural changes; hyperthermophilicity requires the largest sequence alterations, whereas dimer/tetramer and manganese/iron specificities are induced by the most sizable structural differences within the monomers. A systematic investigation of sequence and structure characteristics conserved in all aligned SOD sequences or in subsets sharing common oligomeric and/or metal specificities was performed. Several residues were identified as guaranteeing the common function and dimeric conformation, others as determining the tetramer formation, and yet others as potentially responsible for metal specificity. Some form cation-π interactions between an aromatic ring and a fully or partially positively charged group, suggesting that these interactions play a significant role in the structure and function of SOD enzymes. Dimer/tetramer- and iron/manganese-specific fingerprints were derived from the set of conserved residues; they can be used to propose selected residue substitutions in view of the experimental validation of our in silico derived hypotheses.
Affiliation(s)
- René Wintjens
- Université Libre de Bruxelles, Institut de Pharmacie, Chimie Générale, CP 206/04, Campus de la Plaine, Boulevard du Triomphe, B-1050 Bruxelles, Belgium
7
8
May ACW. Definition of the tempo of sequence diversity across an alignment and automatic identification of sequence motifs: Application to protein homologous families and superfamilies. Protein Sci 2002; 11:2825-35. [PMID: 12441381 PMCID: PMC2373737 DOI: 10.1110/ps.0211202] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.0] [Indexed: 10/27/2022]
Abstract
It is often possible to identify sequence motifs that characterize a protein family in terms of its fold and/or function from aligned protein sequences. Such motifs can be used to search for new family members. Partitioning of sequence alignments into regions of similar amino acid variability is usually done by hand. Here, I present a completely automatic method for this purpose: one that is guaranteed to produce globally optimal solutions at all levels of partition granularity. The method is used to compare the tempo of sequence diversity across reliable three-dimensional (3D) structure-based alignments of 209 protein families (HOMSTRAD) and that for 69 superfamilies (CAMPASS). (The mean alignment lengths for HOMSTRAD and CAMPASS are very similar.) Surprisingly, the optimal segmentation distributions for the closely related proteins and distantly related ones are found to be very similar. Also, optimal segmentation identifies an unusual protein superfamily. Finally, protein 3D structure clues from the tempo of sequence diversity across alignments are examined. The method is general, and could be applied to any area of comparative biological sequence and 3D structure analysis where the constraint of the inherent linear organization of the data imposes an ordering on the set of objects to be clustered.
Affiliation(s)
- Alex C W May
- Division of Mathematical Biology, National Institute for Medical Research, The Ridgeway, London NW7 1AA, UK.
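Partitioning an ordered series into contiguous regions with a global optimality guarantee is a classic dynamic programming problem. The sketch below splits a list of per-position variability scores into `k` contiguous segments that minimize the total within-segment squared deviation. The cost function, the function name `optimal_segmentation`, and the toy scores are assumptions chosen for illustration; the abstract does not state the objective actually used.

```python
def optimal_segmentation(values, k):
    """Split values into k contiguous segments with minimal total squared deviation.

    Returns (bounds, cost), where bounds is a list of half-open (start, end) pairs.
    Assumed objective: sum over segments of squared deviations from the segment mean.
    """
    n = len(values)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and len(values)")
    # Prefix sums give the cost of any segment in constant time.
    s, q = [0.0], [0.0]
    for v in values:
        s.append(s[-1] + v)
        q.append(q[-1] + v * v)

    def cost(i, j):
        total = s[j] - s[i]
        return (q[j] - q[i]) - total * total / (j - i)

    inf = float("inf")
    # best[m][j]: minimal cost of splitting values[:j] into m segments.
    best = [[inf] * (n + 1) for _ in range(k + 1)]
    cut = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for m in range(1, k + 1):
        for j in range(m, n + 1):
            for i in range(m - 1, j):
                c = best[m - 1][i] + cost(i, j)
                if c < best[m][j]:
                    best[m][j], cut[m][j] = c, i
    # Walk the stored cut points backwards to recover the segments.
    bounds, j = [], n
    for m in range(k, 0, -1):
        i = cut[m][j]
        bounds.append((i, j))
        j = i
    bounds.reverse()
    return bounds, best[k][n]


scores = [1, 1, 1, 5, 5, 9, 9, 9]  # made-up variability scores
bounds, total_cost = optimal_segmentation(scores, 3)
```

Because the table `best[m][n]` is filled for every `m` up to `k`, one run yields the optimal cost at each coarser level of granularity as well, at a cost of roughly `k * n^2` segment evaluations.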
9
Orengo CA, Sillitoe I, Reeves G, Pearl FM. Review: what can structural classifications reveal about protein evolution? J Struct Biol 2001; 134:145-65. [PMID: 11551176 DOI: 10.1006/jsbi.2001.4398] [Citation(s) in RCA: 42] [Impact Index Per Article: 1.8] [Indexed: 11/22/2022]
Abstract
In this article we present a review of the methods used for comparing and classifying protein structures. We discuss the hierarchies and populations of fold groups and evolutionary families in some of the major classifications and we consider some of the problems confronting any general analyses of structural evolution in protein families. We also review some more recent analyses that have expanded these classifications by identifying sequence relatives in the genomes and thereby reveal interesting trends in fold usage and recurrence.
Affiliation(s)
- C A Orengo
- Department of Biochemistry and Molecular Biology, University College, Gower Street, London, WC1E 6BT, United Kingdom
10
Abstract
Typically, protein spatial structures are more conserved in evolution than amino acid sequences. However, the recent explosion of sequence and structure information accompanied by the development of powerful computational methods led to the accumulation of examples of homologous proteins with globally distinct structures. Significant sequence conservation, local structural resemblance, and functional similarity strongly indicate evolutionary relationships between these proteins despite pronounced structural differences at the fold level. Several mechanisms such as insertions/deletions/substitutions, circular permutations, and rearrangements in beta-sheet topologies account for the majority of detected structural irregularities. The existence of evolutionarily related proteins that possess different folds brings new challenges to the homology modeling techniques and the structure classification strategies and offers new opportunities for protein design in experimental studies.
Affiliation(s)
- N V Grishin
- Howard Hughes Medical Institute, Department of Biochemistry, University of Texas Southwestern Medical Center, 5323 Harry Hines Boulevard, Dallas, Texas 75390-9050, USA
11
May AC. Optimal classification of protein sequences and selection of representative sets from multiple alignments: application to homologous families and lessons for structural genomics. Protein Engineering 2001; 14:209-17. [PMID: 11391012 DOI: 10.1093/protein/14.4.209] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.5] [Indexed: 11/12/2022]
Abstract
Hierarchical classification is probably the most popular approach to group related proteins. However, there are a number of problems associated with its use for this purpose. One is that the resulting tree showing a nested sequence of groups may not be the most suitable representation of the data. Another is that visual inspection is the most common method to decide the most appropriate number of subsets from a tree. In fact, classification of proteins in general is bedevilled with the need for subjective thresholds to define group membership (e.g., 'significant' sequence identity for homologous families). Such arbitrariness is not only intellectually unsatisfying but also has important practical consequences. For instance, it hinders meaningful identification of protein targets for structural genomics. I describe an alternative approach to cluster related proteins without the need for an a priori threshold: one, through its use of dynamic programming, which is guaranteed to produce globally optimal solutions at all levels of partition granularity. Grouping proteins according to weights assigned to their aligned sequences makes it possible to delineate dynamically a 'core-periphery' structure within families. The 'core' of a protein family comprises the most typical sequences while the 'periphery' consists of the atypical ones. Further, a new sequence weighting scheme that combines the information in all the multiply aligned positions of an alignment in a novel way is put forward. Instead of averaging over all positions, this procedure takes into account directly the distribution of sequence variability along an alignment. The relationships between sequence weights and sequence identity are investigated for 168 families taken from HOMSTRAD, a database of protein structure alignments for homologous families. An exact solution is presented for the problem of how to select the most representative pair of sequences for a protein family. 
Extension of this approach by a greedy algorithm allows automatic identification of a minimal set of aligned sequences. The results of this analysis are available on the Web at http://mathbio.nimr.mrc.ac.uk/~amay.
Affiliation(s)
- A C May
- Division of Mathematical Biology, National Institute for Medical Research, The Ridgeway, Mill Hill, London NW7 1AA, UK.
12
Copley RR, Bork P. Homology among (βα)8 barrels: implications for the evolution of metabolic pathways. J Mol Biol 2000; 303:627-41. [PMID: 11054297 DOI: 10.1006/jmbi.2000.4152] [Citation(s) in RCA: 163] [Impact Index Per Article: 6.8] [Indexed: 11/22/2022]
Abstract
We provide statistically reliable sequence evidence indicating that at least 12 of 23 SCOP (βα)8 (TIM) barrel superfamilies share a common origin. This includes all but one of the known and predicted TIM barrels found in central metabolism. The statistical evidence is complemented by an examination of the details of protein structure, with certain structural locations favouring catalytic residues even though the nature of their molecular function may change. The combined analysis of sequence, structure and function also enables us to propose a phylogeny of TIM barrels. Based on these data, we are able to examine differing theories of pathway and enzyme evolution, by mapping known TIM barrel folds to the pathways of central metabolism. The results favour widespread recruitment of enzymes between pathways, rather than a "backwards evolution" model, and support the idea that modern proteins may have arisen from common ancestors that bound key metabolites.
Affiliation(s)
- R R Copley
- Biocomputing, European Molecular Biology Laboratory, Meyerhofstrasse 1, Heidelberg, 69117, Germany.
13
Gutiérrez G, Ganfornina MD, Sánchez D. Evolution of the lipocalin family as inferred from a protein sequence phylogeny. Biochimica et Biophysica Acta 2000; 1482:35-45. [PMID: 11058745 DOI: 10.1016/s0167-4838(00)00151-5] [Citation(s) in RCA: 34] [Impact Index Per Article: 1.4] [Indexed: 10/17/2022]
Abstract
The lipocalins constitute a family of proteins that have been found in eubacteria and a variety of eukaryotic cells, where they play diverse physiological roles. It is the primary goal of this review to examine the patterns of change followed by lipocalins through their complex history, in order to stimulate scientists in the field to experimentally contrast our phylogeny-derived hypotheses. We reexamine our previous work on lipocalin phylogeny and update the phylogenetic analysis of the family. Lipocalins separate into 14 monophyletic clades, some of which are grouped in well supported superclades. The lipocalin tree was rooted with the bacterial lipocalin genes under the assumption that they have evolved from a single common ancestor with the metazoan lipocalins, and not by horizontal transfer. The topology of the rooted tree and the species distribution of lipocalins suggest that the newly arising lipocalins show a higher rate of amino acid sequence divergence, a higher rate of gene duplication, and their internal pocket has evolved towards binding smaller hydrophobic ligands with more efficiency.
Affiliation(s)
- G Gutiérrez
- Departamento de Genética, Universidad de Sevilla, Spain