1
|
Chandonia JM, Fox NK, Brenner SE. SCOPe: Manual Curation and Artifact Removal in the Structural Classification of Proteins - extended Database. J Mol Biol 2016; 429:348-355. [PMID: 27914894 DOI: 10.1016/j.jmb.2016.11.023] [Citation(s) in RCA: 68] [Impact Index Per Article: 8.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/30/2016] [Revised: 11/23/2016] [Accepted: 11/24/2016] [Indexed: 12/23/2022]
Abstract
SCOPe (Structural Classification of Proteins-extended, http://scop.berkeley.edu) is a database of relationships between protein structures that extends the Structural Classification of Proteins (SCOP) database. SCOP is an expert-curated ordering of domains from the majority of proteins of known structure in a hierarchy according to structural and evolutionary relationships. SCOPe classifies the majority of protein structures released since SCOP development concluded in 2009, using a combination of manual curation and highly precise automated tools, aiming to have the same accuracy as fully hand-curated SCOP releases. SCOPe also incorporates and updates the ASTRAL compendium, which provides several databases and tools to aid in the analysis of the sequences and structures of proteins classified in SCOPe. SCOPe continues high-quality manual classification of new superfamilies, a key feature of SCOP. Artifacts such as expression tags are now separated into their own class, in order to distinguish them from the homology-based annotations in the remainder of the SCOPe hierarchy. SCOPe 2.06 contains 77,439 Protein Data Bank entries, double the 38,221 structures classified in SCOP.
Collapse
Affiliation(s)
- John-Marc Chandonia
- Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA; Molecular Biophysics and Integrated Bioimaging Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA.
| | - Naomi K Fox
- Molecular Biophysics and Integrated Bioimaging Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
| | - Steven E Brenner
- Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA; Department of Plant and Microbial Biology, University of California, Berkeley, CA 94720, USA
| |
Collapse
|
2
|
Xu J, Zhang J. Impact of structure space continuity on protein fold classification. Sci Rep 2016; 6:23263. [PMID: 27006112 PMCID: PMC4804218 DOI: 10.1038/srep23263] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/24/2015] [Accepted: 03/03/2016] [Indexed: 11/09/2022] Open
Abstract
Protein structure classification hierarchically clusters domain structures based on structure and/or sequence similarities and plays important roles in the study of protein structure-function relationship and protein evolution. Among many classifications, SCOP and CATH are widely viewed as the gold standards. Fold classification is of special interest because this is the lowest level of classification that does not depend on protein sequence similarity. The current fold classifications such as those in SCOP and CATH are controversial because they implicitly assume that folds are discrete islands in the structure space, whereas increasing evidence suggests significant similarities among folds and supports a continuous fold space. Although this problem is widely recognized, its impact on fold classification has not been quantitatively evaluated. Here we develop a likelihood method to classify a domain into the existing folds of CATH or SCOP using both query-fold structure similarities and within-fold structure heterogeneities. The new classification differs from the original classification for 3.4-12% of domains, depending on factors such as the structure similarity score and original classification scheme used. Because these factors differ for different biological purposes, our results indicate that the importance of considering structure space continuity in fold classification depends on the specific question asked.
Collapse
Affiliation(s)
- Jinrui Xu
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA
| | - Jianzhi Zhang
- Department of Ecology and Evolutionary Biology, University of Michigan, Ann Arbor, MI 48109, USA
| |
Collapse
|
3
|
Mohan A, Rao MD, Sunderrajan S, Pennathur G. Automatic classification of protein structures using physicochemical parameters. Interdiscip Sci 2014; 6:176-86. [PMID: 25205495 DOI: 10.1007/s12539-013-0199-0] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/03/2013] [Revised: 11/12/2013] [Accepted: 12/05/2013] [Indexed: 11/26/2022]
Abstract
Protein classification is the first step to functional annotation; SCOP and Pfam databases are currently the most relevant protein classification schemes. However, the disproportion in the number of three dimensional (3D) protein structures generated versus their classification into relevant superfamilies/families emphasizes the need for automated classification schemes. Predicting function of novel proteins based on sequence information alone has proven to be a major challenge. The present study focuses on the use of physicochemical parameters in conjunction with machine learning algorithms (Naive Bayes, Decision Trees, Random Forest and Support Vector Machines) to classify proteins into their respective SCOP superfamily/Pfam family, using sequence derived information. Spectrophores™, a 1D descriptor of the 3D molecular field surrounding a structure was used as a benchmark to compare the performance of the physicochemical parameters. The machine learning algorithms were modified to select features based on information gain for each SCOP superfamily/Pfam family. The effect of combining physicochemical parameters and spectrophores on classification accuracy (CA) was studied. Machine learning algorithms trained with the physicochemical parameters consistently classified SCOP superfamilies and Pfam families with a classification accuracy above 90%, while spectrophores performed with a CA of around 85%. Feature selection improved classification accuracy for both physicochemical parameters and spectrophores based machine learning algorithms. Combining both attributes resulted in a marginal loss of performance. Physicochemical parameters were able to classify proteins from both schemes with classification accuracy ranging from 90-96%. These results suggest the usefulness of this method in classifying proteins from amino acid sequences.
Collapse
Affiliation(s)
- Abhilash Mohan
- The Center for Biotechnology, Anna University, Chennai, 600025, Tamilnadu, India
| | | | | | | |
Collapse
|
4
|
Angadi UB, Venkatesulu M. Structural SCOP superfamily level classification using unsupervised machine learning. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2011; 9:601-608. [PMID: 21844638 DOI: 10.1109/tcbb.2011.114] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/31/2023]
Abstract
One of the major research directions in bioinformatics is that of assigning superfamily classification to a given set of proteins. The classification reflects the structural, evolutionary, and functional relatedness. These relationships are embodied in a hierarchical classification, such as the Structural Classification of Protein (SCOP), which is mostly manually curated. Such a classification is essential for the structural and functional analyses of proteins. Yet a large number of proteins remain unclassified. In this study, we have proposed an unsupervised machine learning approach to classify and assign a given set of proteins to SCOP superfamilies. In the method, we have constructed a database and similarity matrix using P-values obtained from an all-against-all BLAST run and trained the network with the ART2 unsupervised learning algorithm using the rows of the similarity matrix as input vectors, enabling the trained network to classify the proteins from 0.82 to 0.97 f-measure accuracy. The performance of ART2 has been compared with that of spectral clustering, Random forest, SVM, and HHpred. ART2 performs better than the others except HHpred. HHpred performs better than ART2 and the sum of errors is smaller than that of the other methods evaluated.
Collapse
|
5
|
Penner RC, Knudsen M, Wiuf C, Andersen JE. An Algebro-topological description of protein domain structure. PLoS One 2011; 6:e19670. [PMID: 21629687 PMCID: PMC3101207 DOI: 10.1371/journal.pone.0019670] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/05/2011] [Accepted: 04/03/2011] [Indexed: 11/25/2022] Open
Abstract
The space of possible protein structures appears vast and continuous, and the relationship between primary, secondary and tertiary structure levels is complex. Protein structure comparison and classification is therefore a difficult but important task since structure is a determinant for molecular interaction and function. We introduce a novel mathematical abstraction based on geometric topology to describe protein domain structure. Using the locations of the backbone atoms and the hydrogen bonds, we build a combinatorial object – a so-called fatgraph. The description is discrete yet gives rise to a 2-dimensional mathematical surface. Thus, each protein domain corresponds to a particular mathematical surface with characteristic topological invariants, such as the genus (number of holes) and the number of boundary components. Both invariants are global fatgraph features reflecting the interconnectivity of the domain by hydrogen bonds. We introduce the notion of robust variables, that is variables that are robust towards minor changes in the structure/fatgraph, and show that the genus and the number of boundary components are robust. Further, we invesigate the distribution of different fatgraph variables and show how only four variables are capable of distinguishing different folds. We use local (secondary) and global (tertiary) fatgraph features to describe domain structures and illustrate that they are useful for classification of domains in CATH. In addition, we combine our method with two other methods thereby using primary, secondary, and tertiary structure information, and show that we can identify a large percentage of new and unclassified structures in CATH.
Collapse
Affiliation(s)
- Robert Clark Penner
- Center for the Topology and Quantization of Moduli Spaces, Department of Mathematical Sciences, Aarhus University, Aarhus, Denmark
- Departments of Mathematics and Physics/Astronomy, University of Southern California, Los Angeles, California, United States of America
| | - Michael Knudsen
- Bioinformatics Research Centre, Aarhus University, Aarhus, Denmark
| | - Carsten Wiuf
- Bioinformatics Research Centre, Aarhus University, Aarhus, Denmark
- Centre for Membrane Pumps in Cells and Disease, Aarhus University, Aarhus, Denmark
- * E-mail:
| | - Jørgen Ellegaard Andersen
- Center for the Topology and Quantization of Moduli Spaces, Department of Mathematical Sciences, Aarhus University, Aarhus, Denmark
| |
Collapse
|
6
|
Angadi UB, Venkatesulu M. FuzzyART neural network for protein classification. J Bioinform Comput Biol 2010; 8:825-41. [PMID: 20981890 DOI: 10.1142/s0219720010004951] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/11/2010] [Revised: 05/13/2010] [Accepted: 05/13/2010] [Indexed: 11/18/2022]
Abstract
One of the major research directions in bioinformatics is that of predicting the protein superfamily in large databases and classifying a given set of protein domains into superfamilies. The classification reflects the structural, evolutionary and functional relatedness. These relationships are embodied in hierarchical classification such as Structural Classification of Protein (SCOP), which is manually curated. Such classification is essential for the structural and functional analysis of proteins. Yet, a large number of proteins remain unclassified. We have proposed an unsupervised machine-learning FuzzyART neural network algorithm to classify a given set of proteins into SCOP superfamilies. The proposed method is fast learning and uses an atypical non-linear pattern recognition technique. In this approach, we have constructed a similarity matrix from p-values of BLAST all-against-all, trained the network with FuzzyART unsupervised learning algorithm using the similarity matrix as input vectors and finally the trained network offers SCOP superfamily level classification. In this experiment, we have evaluated the performance of our method with existing techniques on six different datasets. We have shown that the trained network is able to classify a given similarity matrix of a set of sequences into SCOP superfamilies at high classification accuracy.
Collapse
Affiliation(s)
- Ulavappa B Angadi
- Department of Computer Applications, Kalasalingam University, Krishnankoil, Srivilliputtur (via), Tamil Nadu, 626190, India.
| | | |
Collapse
|
7
|
Müller CL, Sbalzarini IF, van Gunsteren WF, Zagrović B, Hünenberger PH. In the eye of the beholder: Inhomogeneous distribution of high-resolution shapes within the random-walk ensemble. J Chem Phys 2009; 130:214904. [PMID: 19508095 DOI: 10.1063/1.3140090] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022] Open
Abstract
The concept of high-resolution shapes (also referred to as folds or states, depending on the context) of a polymer chain plays a central role in polymer science, structural biology, bioinformatics, and biopolymer dynamics. However, although the idea of shape is intuitively very useful, there is no unambiguous mathematical definition for this concept. In the present work, the distributions of high-resolution shapes within the ideal random-walk ensembles with N=3,...,6 beads (or up to N=10 for some properties) are investigated using a systematic (grid-based) approach based on a simple working definition of shapes relying on the root-mean-square atomic positional deviation as a metric (i.e., to define the distance between pairs of structures) and a single cutoff criterion for the shape assignment. Although the random-walk ensemble appears to represent the paramount of homogeneity and randomness, this analysis reveals that the distribution of shapes within this ensemble, i.e., in the total absence of interatomic interactions characteristic of a specific polymer (beyond the generic connectivity constraint), is significantly inhomogeneous. In particular, a specific (densest) shape occurs with a local probability that is 1.28, 1.79, 2.94, and 10.05 times (N=3,...,6) higher than the corresponding average over all possible shapes (these results can tentatively be extrapolated to a factor as large as about 10(28) for N=100). The qualitative results of this analysis lead to a few rather counterintuitive suggestions, namely, that, e.g., (i) a fold classification analysis applied to the random-walk ensemble would lead to the identification of random-walk "folds;" (ii) a clustering analysis applied to the random-walk ensemble would also lead to the identification random-walk "states" and associated relative free energies; and (iii) a random-walk ensemble of polymer chains could lead to well-defined diffraction patterns in hypothetical fiber or crystal diffraction experiments. The inhomogeneous nature of the shape probability distribution identified here for random walks may represent a significant underlying baseline effect in the analysis of real polymer chain ensembles (i.e., in the presence of specific interatomic interactions). As a consequence, a part of what is called a polymer shape may actually reside just "in the eye of the beholder" rather than in the nature of the interactions between the constituting atoms, and the corresponding observation-related bias should be taken into account when drawing conclusions from shape analyses as applied to real structural ensembles.
Collapse
Affiliation(s)
- Christian L Müller
- Institute of Computational Science and Swiss Institute of Bioinformatics, ETH Zürich, Switzerland
| | | | | | | | | |
Collapse
|
8
|
Sam V, Tai CH, Garnier J, Gibrat JF, Lee B, Munson PJ. Towards an automatic classification of protein structural domains based on structural similarity. BMC Bioinformatics 2008; 9:74. [PMID: 18237410 PMCID: PMC2267780 DOI: 10.1186/1471-2105-9-74] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/18/2007] [Accepted: 01/31/2008] [Indexed: 11/10/2022] Open
Abstract
Background Formal classification of a large collection of protein structures aids the understanding of evolutionary relationships among them. Classifications involving manual steps, such as SCOP and CATH, face the challenge of increasing volume of available structures. Automatic methods such as FSSP or Dali Domain Dictionary, yield divergent classifications, for reasons not yet fully investigated. One possible reason is that the pairwise similarity scores used in automatic classification do not adequately reflect the judgments made in manual classification. Another possibility is the difference between manual and automatic classification procedures. We explore the degree to which these two factors might affect the final classification. Results We use DALI, SHEBA and VAST pairwise scores on the SCOP C class domains, to investigate a variety of hierarchical clustering procedures. The constructed dendrogram is cut in a variety of ways to produce a partition, which is compared to the SCOP fold classification. Ward's method dendrograms led to partitions closest to the SCOP fold classification. Dendrogram- or tree-cutting strategies fell into four categories according to the similarity of resulting partitions to the SCOP fold partition. Two strategies which optimize similarity to SCOP, gave an average of 72% true positives rate (TPR), at a 1% false positive rate. Cutting the largest size cluster at each step gave an average of 61% TPR which was one of the best strategies not making use of prior knowledge of SCOP. Cutting the longest branch at each step produced one of the worst strategies. We also developed a method to detect irreducible differences between the best possible automatic partitions and SCOP, regardless of the cutting strategy. These differences are substantial. Visual examination of hard-to-classify proteins confirms our previous finding, that global structural similarity of domains is not the only criterion used in the SCOP classification. Conclusion Different clustering procedures give rise to different levels of agreement between automatic and manual protein classifications. None of the tested procedures completely eliminates the divergence between automatic and manual protein classifications. Achieving full agreement between these two approaches would apparently require additional information.
Collapse
Affiliation(s)
- Vichetra Sam
- Mathematical and Statistical Computing Laboratory, DCB, CIT, NIH, DHHS, Bethesda, MD, USA.
| | | | | | | | | | | |
Collapse
|