1
|
Wang J, Lu CH, Kong XZ, Dai LY, Yuan S, Zhang X. Multi-view manifold regularized compact low-rank representation for cancer samples clustering on multi-omics data. BMC Bioinformatics 2022; 22:334. [PMID: 35057729 PMCID: PMC8772048 DOI: 10.1186/s12859-021-04220-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/25/2021] [Accepted: 05/27/2021] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND The identification of cancer types is of great significance for early diagnosis and clinical treatment of cancer. Clustering cancer samples is an important means of identifying cancer types and has received much attention in the field of bioinformatics. The purpose of cancer clustering is to find expression patterns of different cancer types, so that samples with similar expression patterns can be gathered into the same type. To improve the accuracy and reliability of cancer clustering, many clustering methods have begun to focus on the integrated analysis of cancer multi-omics data. Methods based on multi-omics data have advantages over those using single-omics data. However, the high heterogeneity and noise of cancer multi-omics data pose a great challenge to multi-omics analysis methods. RESULTS In this study, in order to extract more complementary information from cancer multi-omics data for cancer clustering, we propose a low-rank subspace clustering method called multi-view manifold regularized compact low-rank representation (MmCLRR). In MmCLRR, each omics data type is regarded as a view, and the method learns a consistent subspace representation by imposing a consistency constraint on the low-rank affinity matrix of each view to balance the agreement between different views. Moreover, manifold regularization and concept factorization are introduced into our method. Relying on concept factorization, the dictionary can be updated during learning, which greatly improves the subspace learning ability of low-rank representation. We adopt the linearized alternating direction method with adaptive penalty to solve the optimization problem of the MmCLRR method. CONCLUSIONS Finally, we apply MmCLRR to the clustering of cancer samples based on multi-omics data, and the clustering results show that our method outperforms the existing multi-view methods.
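For orientation, the single-view low-rank representation (LRR) model that subspace clustering methods of this kind commonly build on can be written as below. This is the standard LRR formulation, not the MmCLRR objective, which the abstract does not spell out. The symbols X (data matrix with samples as columns), Z (representation matrix), E (noise term), λ (trade-off weight) and W (affinity matrix) are notation assumed here.

```latex
\min_{Z,\,E}\ \|Z\|_{*} + \lambda \,\|E\|_{2,1}
\quad \text{s.t.} \quad X = XZ + E,
\qquad
W = \tfrac{1}{2}\left(|Z| + |Z|^{\top}\right)
```

Here the nuclear norm encourages a low-rank Z, the ℓ2,1 norm models sample-specific noise, and W is the symmetric affinity matrix usually passed to spectral clustering.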
Affiliation(s)
- Juan Wang
- School of Computer Science, Qufu Normal University, Rizhao, 276826 China
- Cong-Hai Lu
- School of Computer Science, Qufu Normal University, Rizhao, 276826 China
- Xiang-Zhen Kong
- School of Computer Science, Qufu Normal University, Rizhao, 276826 China
- Ling-Yun Dai
- School of Computer Science, Qufu Normal University, Rizhao, 276826 China
- Shasha Yuan
- School of Computer Science, Qufu Normal University, Rizhao, 276826 China
- Xiaofeng Zhang
- School of Information and Electrical Engineering, Ludong University, Yantai, 264025 China
|
2
|
Cui X, Naveed H, Gao X. Finding optimal interaction interface alignments between biological complexes. Bioinformatics 2015; 31:i133-41. [PMID: 26072475 PMCID: PMC4765866 DOI: 10.1093/bioinformatics/btv242] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.6] [Indexed: 12/02/2022] Open
Abstract
Motivation: Biological molecules perform their functions through interactions with other molecules. Structure alignment of interaction interfaces between biological complexes is an indispensable step in detecting their structural similarities, which are key to understanding their evolutionary histories and functions. Although various structure alignment methods have been developed to successfully assess the similarities of protein structures or certain types of interaction interfaces, existing alignment tools cannot directly align arbitrary types of interfaces formed by protein, DNA or RNA molecules. Specifically, they require a ‘blackbox preprocessing’ to standardize interface types and chain identifiers. Yet their performance is limited and sometimes unsatisfactory. Results: Here we introduce a novel method, PROSTA-inter, that automatically determines and aligns interaction interfaces between two arbitrary types of complex structures. Our method uses sequentially remote fragments to search for the optimal superimposition. The optimal residue matching problem is then formulated as a maximum weighted bipartite matching problem to detect the optimal sequence order-independent alignment. Benchmark evaluation on all non-redundant protein–DNA complexes in PDB shows significant performance improvement of our method over TM-align and iAlign (with the ‘blackbox preprocessing’). Two case studies where our method discovers, for the first time, structural similarities between two pairs of functionally related protein–DNA complexes are presented. We further demonstrate the power of our method on detecting structural similarities between a protein–protein complex and a protein–RNA complex, which is biologically known as a protein–RNA mimicry case. Availability and implementation: The PROSTA-inter web-server is publicly available at http://www.cbrc.kaust.edu.sa/prosta/. Contact: xin.gao@kaust.edu.sa
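The maximum weighted bipartite matching formulation mentioned in the abstract can be illustrated with a toy example. The sketch below is not PROSTA-inter's implementation: it brute-forces all assignments, which is only feasible for tiny inputs, and the function name `max_weight_matching` and the integer weight matrix are hypothetical. A practical tool would use a polynomial-time assignment algorithm instead.

```python
from itertools import permutations


def max_weight_matching(weights):
    """Return (best_total, pairs) for a small weight matrix.

    weights[i][j] is a hypothetical similarity score between residue i of
    interface A and residue j of interface B; 0 means "do not match".
    Brute force over permutations, so only suitable for tiny inputs.
    """
    n_rows, n_cols = len(weights), len(weights[0])
    if n_rows > n_cols:
        # Transpose so the rows are the smaller side, then flip pairs back.
        transposed = [list(col) for col in zip(*weights)]
        total, pairs = max_weight_matching(transposed)
        return total, sorted((i, j) for j, i in pairs)
    best_total, best_pairs = float("-inf"), []
    for cols in permutations(range(n_cols), n_rows):
        total = sum(weights[i][j] for i, j in enumerate(cols))
        if total > best_total:
            best_total = total
            # Keep only pairs that carry a positive score.
            best_pairs = [(i, j) for i, j in enumerate(cols) if weights[i][j] > 0]
    return best_total, best_pairs


# Toy scores (assumed values, scaled to integers).
weights = [
    [9, 1, 0],
    [2, 0, 8],
    [0, 7, 3],
]
total, pairs = max_weight_matching(weights)
print(total, pairs)
```

In this toy case residue 1 of A pairs with residue 2 of B while residue 2 of A pairs with residue 1 of B, so the matching does not follow sequence order, which is the property the abstract describes.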
Affiliation(s)
- Xuefeng Cui
- Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Saudi Arabia
- Hammad Naveed
- Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Saudi Arabia
- Xin Gao
- Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Saudi Arabia
|
3
|
Minami S, Sawada K, Chikenji G. How a spatial arrangement of secondary structure elements is dispersed in the universe of protein folds. PLoS One 2014; 9:e107959. [PMID: 25243952 PMCID: PMC4171485 DOI: 10.1371/journal.pone.0107959] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.1] [Received: 08/12/2014] [Accepted: 08/18/2014] [Indexed: 11/18/2022] Open
Abstract
It is known that topologically different proteins of the same class sometimes share the same spatial arrangement of secondary structure elements (SSEs). However, the frequency with which topologically different structures share the same spatial arrangement of SSEs is unclear. It is important to estimate this frequency because it provides both a deeper understanding of the geometry of protein folds and a valuable suggestion for predicting protein structures with novel folds. Here we clarified the frequency with which protein folds share the same SSE packing arrangement with other folds, the types of spatial arrangement of SSEs that are frequently observed across different folds, and the diversity of protein folds that share the same spatial arrangement of SSEs with a given fold, using the protein structure alignment program MICAN, which we have been developing. By performing a comprehensive structural comparison of SCOP fold representatives, we found that approximately 80% of protein folds share the same spatial arrangement of SSEs with other folds. We also observed that many protein pairs that share the same spatial arrangement of SSEs belong to different classes, often with an opposing N- to C-terminal direction of the polypeptide chain. The most frequently observed spatial arrangement of SSEs was the 2-layer α/β packing arrangement, and it was dispersed among as many as 27% of SCOP fold representatives. These results suggest that the same spatial arrangements of SSEs are adopted by a wide variety of different folds and that the spatial arrangement of SSEs is highly robust against the N- to C-terminal direction of the polypeptide chain.
Affiliation(s)
- Shintaro Minami
- Department of Complex Systems Science, Nagoya University, Nagoya, Aichi, Japan
- Kengo Sawada
- Department of Applied Physics, Nagoya University, Nagoya, Aichi, Japan
- George Chikenji
- Department of Computational Science and Engineering, Nagoya University, Nagoya, Aichi, Japan
|
4
|
Rosenman DJ, Huang YM, Xia K, Fraser K, Jones VE, Lamberson CM, Van Roey P, Colón W, Bystroff C. Green-lighting green fluorescent protein: faster and more efficient folding by eliminating a cis-trans peptide isomerization event. Protein Sci 2014; 23:400-10. [PMID: 24408076 DOI: 10.1002/pro.2421] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.7] [Received: 07/26/2013] [Revised: 01/02/2014] [Accepted: 01/06/2014] [Indexed: 11/06/2022]
Abstract
Wild-type green fluorescent protein (GFP) folds on a time scale of minutes. The slow step in folding is a cis-trans peptide bond isomerization. The only conserved cis-peptide bond in the native GFP structure, at P89, was remodeled by the insertion of two residues, followed by iterative energy minimization and side chain design. The engineered GFP was synthesized and found to fold faster and more efficiently than its template protein, recovering 50% more of its fluorescence upon refolding. The slow phase of folding is faster and smaller in amplitude, and hysteresis in refolding has been eliminated. The elimination of a previously reported kinetically trapped state in refolding suggests that X-P89 is trans in the trapped state. A 2.55 Å resolution crystal structure revealed that the new variant contains only trans-peptide bonds, as designed. This is the first instance of a computationally remodeled fluorescent protein that folds faster and more efficiently than wild type.
Affiliation(s)
- David J Rosenman
- Rensselaer Polytechnic Institute, Biological Sciences, 110 8th St., Troy, New York, 12180
|
5
|
Going over the three dimensional protein structure similarity problem. Artif Intell Rev 2013. [DOI: 10.1007/s10462-013-9416-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/26/2022]
|
6
|
Minami S, Sawada K, Chikenji G. MICAN: a protein structure alignment algorithm that can handle Multiple-chains, Inverse alignments, C(α) only models, Alternative alignments, and Non-sequential alignments. BMC Bioinformatics 2013; 14:24. [PMID: 23331634 PMCID: PMC3637537 DOI: 10.1186/1471-2105-14-24] [Citation(s) in RCA: 38] [Impact Index Per Article: 3.5] [Received: 09/11/2012] [Accepted: 01/08/2013] [Indexed: 11/10/2022] Open
Abstract
Background Protein pairs that have the same secondary structure packing arrangement but different topologies have attracted much attention in terms of both the evolution and the physical chemistry of protein structures. Further investigation of such protein relationships would give us a hint as to how proteins can change their fold in the course of evolution, as well as an insight into the physico-chemical properties of secondary structure packing. For this purpose, highly accurate sequence order independent structure comparison methods are needed. Results We have developed a novel protein structure alignment algorithm, MICAN (a structure alignment algorithm that can handle Multiple-chain complexes, Inverse direction of secondary structures, Cα only models, Alternative alignments, and Non-sequential alignments). The algorithm was designed to identify the best structural alignment between protein pairs by disregarding the connectivity between secondary structure elements (SSEs). One of the key features of the algorithm is the use of a multiple vector representation for each SSE, which enables us to correctly treat the bent or twisted nature of long SSEs. We compared MICAN with nine other publicly available structure alignment programs, using both reference-dependent and reference-independent evaluation methods on a variety of benchmark test sets that include both sequential and non-sequential alignments. We show that MICAN outperforms the other existing methods in reproducing reference alignments of non-sequential test sets. Further, although MICAN does not specialize in sequential structure alignment, it showed top-level performance on the sequential test sets. We also show that MICAN is the fastest non-sequential structure alignment program among all the programs we examined here. Conclusions MICAN is the fastest and the most accurate program among the non-sequential alignment programs we examined here. These results suggest that MICAN is a highly effective tool for automatically detecting non-trivial structural relationships of proteins, such as circular permutations and segment-swapping, many of which have been identified manually by human experts so far. The source code of MICAN is freely downloadable at http://www.tbp.cse.nagoya-u.ac.jp/MICAN.
Affiliation(s)
- Shintaro Minami
- Department of Computational Science and Engineering, Nagoya University, Nagoya 464-8603, Japan
|
7
|
Yang Y, Zhan J, Zhao H, Zhou Y. A new size-independent score for pairwise protein structure alignment and its application to structure classification and nucleic-acid binding prediction. Proteins 2012; 80:2080-8. [PMID: 22522696 DOI: 10.1002/prot.24100] [Citation(s) in RCA: 34] [Impact Index Per Article: 2.8] [Received: 02/06/2012] [Revised: 04/13/2012] [Accepted: 04/17/2012] [Indexed: 11/12/2022]
Abstract
A structure alignment program aligns two structures by optimizing a scoring function that measures structural similarity. It is highly desirable that such a scoring function be independent of the sizes of the proteins being compared, so that the significance of alignments across different sizes of aligned protein regions is comparable. Here, we developed a new score called SP-score that fixes the cutoff distance at 4 Å and removes the size dependence using a normalization prefactor. We further built a program called SPalign that optimizes SP-score for structure alignment. SPalign was applied to recognize proteins within the same structure fold and having the same function of DNA or RNA binding. For fold discrimination, SPalign improves sensitivity over TMalign for the chain-level comparison by 12% and over DALI for the domain-level comparison by 13% at the same specificity of 99.6%. The difference between TMalign and SPalign at the chain level is due to the inability of TMalign to detect single domain similarity between multidomain proteins. For recognizing nucleic acid binding proteins, SPalign consistently improves over TMalign by 12% and DALI by 31% in the average value of Matthews correlation coefficients for four datasets. SPalign with default settings is 14% faster than TMalign. SPalign is expected to be useful for function prediction and for comparing structures with or without domains defined. The source code for SPalign and the server are available at http://sparks.informatics.iupui.edu.
Affiliation(s)
- Yuedong Yang
- Indiana University School of Informatics, Indiana University-Purdue University, Indianapolis, Indiana 46202, USA
|
8
|
|
9
|
Salem S, Zaki MJ, Bystroff C. Iterative non-sequential protein structural alignment. J Bioinform Comput Biol 2011; 7:571-96. [PMID: 19507290 DOI: 10.1142/s0219720009004205] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Received: 07/01/2008] [Revised: 11/05/2008] [Accepted: 11/06/2008] [Indexed: 11/18/2022]
Abstract
Structural similarity between proteins gives us insights into their evolutionary relationships when there is low sequence similarity. In this paper, we present a novel approach called SNAP for non-sequential pair-wise structural alignment. Starting from an initial alignment, our approach iterates over a two-step process consisting of a superposition step and an alignment step, until convergence. We propose a novel greedy algorithm to construct both sequential and non-sequential alignments. The quality of SNAP alignments was assessed by comparing against the manually curated reference alignments in the challenging SISY and RIPC datasets. Moreover, when applied to a dataset of 4410 protein pairs selected from the CATH database, SNAP produced longer alignments with lower rmsd than several state-of-the-art alignment methods. Classification of folds using SNAP alignments was both highly sensitive and highly selective. The SNAP software along with the datasets is available online at
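The alignment half of a superposition/alignment loop like the one described above can be sketched as a greedy pairing of nearby residues. This is only an illustration under stated assumptions, not the SNAP algorithm: the structures are assumed to be already superimposed, the function name `greedy_pairs` and the 4.0 Å cutoff are hypothetical, and the superposition step (for example a least-squares fit) is omitted.

```python
import math


def greedy_pairs(coords_a, coords_b, cutoff=4.0):
    """Greedily pair points of two already-superimposed structures.

    Candidate pairs within `cutoff` (an assumed value, in Å) are taken
    closest first, and each point is used at most once. Sequence order
    is ignored, so the result may be non-sequential.
    """
    candidates = []
    for i, a in enumerate(coords_a):
        for j, b in enumerate(coords_b):
            d = math.dist(a, b)
            if d <= cutoff:
                candidates.append((d, i, j))
    candidates.sort()  # closest pairs first

    used_a, used_b, pairs = set(), set(), []
    for _, i, j in candidates:
        if i not in used_a and j not in used_b:
            used_a.add(i)
            used_b.add(j)
            pairs.append((i, j))
    return sorted(pairs)


# Toy Cα-like coordinates: B holds the same three positions as A,
# slightly perturbed and listed in a different order.
coords_a = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
coords_b = [(7.5, 0.5, 0.0), (0.2, 0.0, 0.0), (3.9, 0.3, 0.0)]
pairs = greedy_pairs(coords_a, coords_b)
print(pairs)
```

In a full iterative scheme, the pairs found here would feed the next superposition step, and the two steps would repeat until the alignment stops changing.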
Affiliation(s)
- Saeed Salem
- Department of Computer Science, Rensselaer Polytechnic Institute, 110 8th St., Troy, New York 12180, USA
- Mohammed J. Zaki
- Department of Computer Science, Rensselaer Polytechnic Institute, 110 8th St., Troy, New York 12180, USA
- Christopher Bystroff
- Department of Computer Science, Rensselaer Polytechnic Institute, 110 8th St., Troy, New York 12180, USA
- Department of Biology, Rensselaer Polytechnic Institute, 110 8th St., Troy, New York 12180, USA
|
10
|
Teyra J, Hawkins J, Zhu H, Pisabarro MT. Studies on the inference of protein binding regions across fold space based on structural similarities. Proteins 2011; 79:499-508. [PMID: 21069715 DOI: 10.1002/prot.22897] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.6] [Indexed: 01/27/2023]
Abstract
The emerging picture of a continuous protein fold space highlights the existence of non-obvious structural similarities between proteins with apparently different topologies. The identification of structural resemblances across fold space and the analysis of similar recognition regions may be a valuable source of information towards protein structure-based functional characterization. In this work, we use non-sequential structural alignment methods (ns-SAs) to identify structural similarities between protein pairs independently of their SCOP hierarchy, and we calculate the significance of binding region conservation using the overlap of interacting residues in the ns-SA. We cluster the binding inferences for each family to distinguish already known family binding regions from putative new ones. Our methodology exploits the enormous amount of data available in the PDB to identify binding region similarities within protein families and to propose putative binding regions. Our results indicate that there is a plethora of structurally common binding regions among proteins, independently of current fold classifications. We obtain a 6- to 8-fold enrichment of novel binding regions, and identify binding inferences for 728 protein families that so far lack binding information in the PDB. We explore binding mode analogies between ligands from commonly clustered binding regions to investigate the utility of our methodology. A comprehensive analysis of the obtained binding inferences may help in the functional characterization of protein recognition and assist rational engineering. The data obtained in this work are available via the download link at www.scowlp.org.
Affiliation(s)
- Joan Teyra
- Structural Bioinformatics, BIOTEC, Technical University of Dresden, Tatzberg 47-51, 01307 Dresden, Germany.
|
11
|
Venkateswaran JG, Song B, Kahveci T, Jermaine C. TRIAL: a tool for finding distant structural similarities. IEEE/ACM Trans Comput Biol Bioinform 2011; 8:819-831. [PMID: 21393655 DOI: 10.1109/tcbb.2009.28] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 05/30/2023]
Abstract
Finding structural similarities in distantly related proteins can reveal functional relationships that cannot be identified using sequence comparison. Given two proteins A and B and a threshold ε Å, we develop an algorithm, TRiplet-based Iterative ALignment (TRIAL), for computing the transformation of B that maximizes the number of aligned residues such that the root mean square deviation (RMSD) of the alignment is at most ε Å. Our algorithm is designed with the specific goal of effectively handling proteins with low similarity in primary structure, where existing algorithms perform particularly poorly. Experiments show that our method outperforms existing methods. TRIAL alignment brings the secondary structures of distantly related proteins to similar orientations. It also finds a larger number of secondary structure matches at lower RMSD values and increased overall alignment lengths. Its classification accuracy is up to 63 percent better than that of other methods, including CE and DALI. TRIAL successfully aligns 83 percent of the residues from the smaller protein in reasonable time, while other methods align only 29 to 65 percent of the residues for the same set of proteins.
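The objective stated above, the largest set of aligned residues whose RMSD stays at or below ε, has a simple form once the transformation is fixed. The sketch below covers only that fixed-transformation sub-problem and is not the TRIAL algorithm, which also searches over transformations; the function name `max_aligned_under_rmsd` and the sample distances are hypothetical. For a fixed superposition, taking the k closest pairs minimizes the RMSD for any given k, so scanning the sorted distances is sufficient.

```python
import math


def max_aligned_under_rmsd(distances, eps):
    """Largest number of residue pairs whose RMSD is at most eps.

    `distances` holds per-pair distances (in Å) after a fixed
    superposition. Pairs are added closest first; the RMSD of the k
    closest pairs never decreases as k grows, so the scan can stop at
    the first violation.
    """
    total_sq, best = 0.0, 0
    for k, d in enumerate(sorted(distances), start=1):
        total_sq += d * d
        if math.sqrt(total_sq / k) <= eps:
            best = k
        else:
            break
    return best


# Assumed per-pair distances for one candidate transformation.
count = max_aligned_under_rmsd([3.0, 0.5, 6.0, 1.0, 1.5], eps=2.0)
print(count)
```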
|
12
|
Dai L, Zhou Y. Characterizing the existing and potential structural space of proteins by large-scale multiple loop permutations. J Mol Biol 2011; 408:585-95. [PMID: 21376059 DOI: 10.1016/j.jmb.2011.02.056] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.6] [Received: 11/16/2010] [Revised: 02/22/2011] [Accepted: 02/24/2011] [Indexed: 10/18/2022]
Abstract
Worldwide structural genomics projects are increasing structure coverage of sequence space but have not significantly expanded the protein structure space itself (i.e., the number of unique structural folds) since 2007. Discovering new structural folds experimentally by directed evolution and random recombination of secondary-structure blocks has also proved rarely successful. Meanwhile, previous computational efforts at large-scale mapping of protein structure space were limited to simple model proteins and led to an inconclusive answer on the completeness of the existing observed protein structure space. Here, we build novel protein structures by extending naturally occurring circular (single-loop) permutation to multiple loop permutations (MLPs). These structures are clustered by a structural similarity measure called TM-score. The computational technique allows us to produce different structural clusters on the same naturally occurring, packed, stable core but with alternatively connected secondary-structure segments. A large-scale MLP of 2936 domains from the structural classification of protein domains reproduces those existing structural clusters (63%) mostly as hubs for many nonredundant sequences and illustrates newly discovered novel clusters as islands adopted by a few sequences only. Results further show that there exist a significant number of novel, potentially stable clusters for medium-size or large-size single-domain proteins, in particular those with >100 amino acid residues, that are either not yet adopted by nature or adopted only by a few sequences. This study suggests that MLP provides a simple yet highly effective tool for engineering and design of novel protein structures (including naturally knotted proteins). The implication of recovering new-fold targets from critical assessment of structure prediction techniques (CASP) by MLP on template-based structure prediction is also discussed. Our MLP structures are available for download at the publication page of the Web site http://sparks.informatics.iupui.edu.
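The TM-score used for clustering above can be computed with a few lines once residue pairs and their distances are known. The sketch below uses the standard published definition, with a length-dependent scale d0 = 1.24·(L − 15)^(1/3) − 1.8 Å, and evaluates it for one fixed superposition. The function names and the 0.5 Å floor for very short chains are assumptions made for this illustration, and the search over superpositions that alignment programs perform to maximize the score is omitted.

```python
def tm_d0(target_length):
    """Distance scale d0 (in Å) for a target of `target_length` residues."""
    if target_length <= 15:
        return 0.5  # assumption: floor for very short chains
    return max(0.5, 1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8)


def tm_score(distances, target_length):
    """TM-score of one fixed superposition.

    `distances` are the distances (in Å) between aligned residue pairs;
    unaligned target residues contribute nothing to the sum but still
    count in the normalization by `target_length`.
    """
    d0 = tm_d0(target_length)
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / target_length


# A perfect, full-length alignment scores 1; half coverage scores 0.5.
print(tm_score([0.0] * 100, 100), tm_score([0.0] * 50, 100))
```

Because each term is scaled by d0, which grows with chain length, the score is designed to be comparable across proteins of different sizes, which is what makes it usable as a clustering measure.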
Affiliation(s)
- Liang Dai
- School of Informatics, Indiana University Purdue University Indianapolis, and Center for Computational Biology and Bioinformatics, Indiana University School of Medicine, 719 Indiana Avenue, Walker Plaza Building Suite 319, Indianapolis, IN 46202, USA
|
13
|
Reeder PJ, Huang YM, Dordick JS, Bystroff C. A rewired green fluorescent protein: folding and function in a nonsequential, noncircular GFP permutant. Biochemistry 2010; 49:10773-9. [PMID: 21090791 DOI: 10.1021/bi100975z] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.7] [Indexed: 11/28/2022]
Abstract
The sequential order of secondary structural elements in proteins affects the folding and activity to an unknown extent. To test the dependence on sequential connectivity, we reconnected secondary structural elements by their solvent-exposed ends, permuting their sequential order, called "rewiring". This new protein design strategy changes the topology of the backbone without changing the core side chain packing arrangement. While circular and noncircular permutations have been observed in protein structures that are not related by sequence homology, to date no one has attempted to rationally design and construct a protein with a sequence that is noncircularly permuted while conserving three-dimensional structure. Herein, we show that green fluorescent protein can be rewired, still functionally fold, and exhibit wild-type fluorescence excitation and emission spectra.
Affiliation(s)
- Philippa J Reeder
- Department of Chemical and Biological Engineering, University of Colorado, Boulder, Colorado 80309, United States
|
14
|
Abstract
A web service was created for the analysis of protein structures that are sequentially or non-sequentially similar. Recently, the non-sequential structure alignment algorithm GANGSTA+ was introduced. GANGSTA+ can detect non-sequential structural analogs for proteins stated to possess novel folds. Since GANGSTA+ ignores the polypeptide chain connectivity of secondary structure elements (i.e. α-helices and β-strands), it is also able to detect structural similarities between proteins whose sequences were reshuffled during evolution. GANGSTA+ was applied in an all-against-all comparison on the ASTRAL40 database (SCOP version 1.75), which consists of >10 000 protein domains, yielding about 55 × 10⁶ possible protein structure alignments. Here, we provide the resulting protein structure alignments as a public web-based service, named GANGSTA+ Internet Services (GIS). We also allow users to browse the ASTRAL40 database of protein structures with GANGSTA+ relative to an externally given protein structure, using different constraints to select specific results. GIS allows users to analyze protein structure families according to the SCOP classification scheme. Additionally, users can upload their own protein structures for pairwise protein structure comparison, alignment against all protein structures of the ASTRAL40 database (SCOP version 1.75) or symmetry analysis. GIS is publicly available at http://agknapp.chemie.fu-berlin.de/gplus.
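The size of an all-against-all comparison follows from counting unordered pairs, n(n − 1)/2. As a quick check of the order of magnitude quoted above, the snippet below uses a hypothetical domain count of 10 500; the abstract only says the database holds more than 10 000 domains, so this exact figure is an assumption.

```python
import math

# Hypothetical domain count; the abstract only states ">10 000".
n_domains = 10_500

# Number of unordered pairs, i.e. n * (n - 1) / 2.
n_pairs = math.comb(n_domains, 2)
print(f"{n_pairs:,} pairwise alignments")
```

With this assumed count the result is roughly 55 million pairs, which matches the scale of alignments reported for the ASTRAL40 comparison.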
Affiliation(s)
- Aysam Guerler
- Freie Universität Berlin, Institute of Chemistry and Biochemistry, Fabeckstrasse 36a, 14195 Berlin, Germany
|
15
|
FlexSnap: flexible non-sequential protein structure alignment. Algorithms Mol Biol 2010; 5:12. [PMID: 20047669 PMCID: PMC2846951 DOI: 10.1186/1748-7188-5-12] [Citation(s) in RCA: 34] [Impact Index Per Article: 2.4] [Received: 08/19/2009] [Accepted: 01/04/2010] [Indexed: 11/10/2022] Open
Abstract
Background Proteins have evolved subject to energetic selection pressure for stability and flexibility. Structural similarity between proteins that have gone through conformational changes can be captured effectively if flexibility is considered. Topologically unrelated proteins that preserve secondary structure packing interactions can be detected if both flexibility and sequential permutations are considered. We propose the FlexSnap algorithm for flexible non-topological protein structural alignment. Results The effectiveness of FlexSnap is demonstrated by measuring the agreement of its alignments with manually curated non-sequential structural alignments. FlexSnap showed competitive results against state-of-the-art algorithms such as DALI, SARF2, MultiProt, FlexProt, and FATCAT. Moreover, on the DynDom dataset, FlexSnap reported longer alignments with smaller rmsd. Conclusions We have introduced FlexSnap, a greedy chaining algorithm that reports both sequential and non-sequential alignments and allows twists (hinges). We assessed the quality of the FlexSnap alignments by measuring their agreement with manually curated non-sequential alignments. On the FlexProt dataset, FlexSnap was competitive with state-of-the-art flexible alignment methods. Moreover, we demonstrated the benefits of introducing hinges by showing significant improvements in the alignments reported by FlexSnap for the structure pairs for which rigid alignment methods reported alignments with either low coverage or large rmsd. Availability An implementation of the FlexSnap algorithm will be made available online at http://www.cs.rpi.edu/~zaki/software/flexsnap.
|
16
|
Schmidt-Goenner T, Guerler A, Kolbeck B, Knapp EW. Circular permuted proteins in the universe of protein folds. Proteins 2009; 78:1618-30. [DOI: 10.1002/prot.22678] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.8] [Indexed: 11/06/2022]
|
17
|
Micheletti C, Orland H. MISTRAL: a tool for energy-based multiple structural alignment of proteins. Bioinformatics 2009; 25:2663-9. [PMID: 19692555 DOI: 10.1093/bioinformatics/btp506] [Citation(s) in RCA: 26] [Impact Index Per Article: 1.7] [Indexed: 11/14/2022]
Abstract
MOTIVATION The steady growth of the number of available protein structures has constantly motivated the development of new algorithms for detecting structural correspondences in proteins. Detecting structural equivalences in two or more proteins is computationally demanding as it typically entails the exploration of the combinatorial space of all possible amino acid pairings in the parent proteins. The search is often aided by the introduction of various constraints such as considering protein fragments, rather than single amino acids, and/or seeking only sequential correspondences in the given proteins. An additional challenge is represented by the difficulty of associating to a given alignment, a reliable a priori measure of its statistical significance. RESULTS Here, we present and discuss MISTRAL (Multiple STRuctural ALignment), a novel strategy for multiple protein alignment based on the minimization of an energy function over the low-dimensional space of the relative rotations and translations of the molecules. The energy minimization avoids combinatorial searches and returns pairwise alignment scores for which a reliable a priori statistical significance can be given. AVAILABILITY MISTRAL is freely available for academic users as a standalone program and as a web service at http://ipht.cea.fr/protein.php.
Affiliation(s)
- Cristian Micheletti
- SISSA, CNR-INFM Democritos and Italian Institute of Technology, Via Beirut 2-4, 34014 Trieste, Italy.
|
18
|
da Silveira CH, Pires DEV, Minardi RC, Ribeiro C, Veloso CJM, Lopes JCD, Meira W, Neshich G, Ramos CHI, Habesch R, Santoro MM. Protein cutoff scanning: A comparative analysis of cutoff dependent and cutoff free methods for prospecting contacts in proteins. Proteins 2009; 74:727-43. [PMID: 18704933 DOI: 10.1002/prot.22187] [Citation(s) in RCA: 67] [Impact Index Per Article: 4.5] [Indexed: 11/11/2022]
Affiliation(s)
- Carlos H da Silveira
- Department of Biochemistry and Immunology, Institute of Biological Sciences, Federal University of Minas Gerais, UFMG, Brazil.
|
19
|
Chu CH, Tang CY, Tang CY, Pai TW. Angle-distance image matching techniques for protein structure comparison. J Mol Recognit 2008; 21:442-52. [DOI: 10.1002/jmr.914] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 11/11/2022]
|
20
|
Wang S, Zheng WM. CLePAPS: fast pair alignment of protein structures based on conformational letters. J Bioinform Comput Biol 2008; 6:347-66. [PMID: 18464327 DOI: 10.1142/s0219720008003461] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.6] [Received: 10/20/2007] [Revised: 11/22/2007] [Accepted: 12/05/2007] [Indexed: 11/18/2022]
Abstract
Fast, efficient, and reliable algorithms for pairwise alignment of protein structures are in ever-increasing demand for analyzing the rapidly growing data on protein structures. CLePAPS is a tool developed for this purpose. It distinguishes itself from other existing algorithms by the use of conformational letters, which are discretized states of 3D segmental structural states. A letter corresponds to a cluster of combinations of the three angles formed by Cα pseudobonds of four contiguous residues. A substitution matrix called CLESUM is available to measure the similarity between any two such letters. CLePAPS regards an aligned fragment pair (AFP) as an ungapped string pair with a high sum of pairwise CLESUM scores. Using CLESUM scores as the similarity measure, CLePAPS searches for AFPs by simple string comparison. The transformation which best superimposes a highly similar AFP can be used to superimpose the structure pairs under comparison. A highly scored AFP which is consistent with several other AFPs determines an initial alignment. CLePAPS then joins consistent AFPs guided by their similarity scores to extend the alignment by several "zoom-in" iteration steps. A follow-up refinement produces the final alignment. CLePAPS does not implement dynamic programming. The utility of CLePAPS is tested on various protein structure pairs.
Affiliation(s)
- Sheng Wang
- Institute of Theoretical Physics, Academia Sinica, Beijing 100080, China
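The AFP search described in the CLePAPS abstract above (ungapped string pairs with a high sum of pairwise substitution scores) can be illustrated with a minimal sketch. This is not the CLePAPS implementation: the function name `find_afps`, the three-letter alphabet, the window length, the threshold, and the toy scoring matrix are all hypothetical stand-ins, and the toy matrix is not the real CLESUM.

```python
from typing import Dict, List, Tuple


def find_afps(
    a: str,
    b: str,
    score: Dict[Tuple[str, str], int],
    length: int,
    threshold: int,
) -> List[Tuple[int, int, int]]:
    """Return (i, j, total) for every ungapped window pair of the given
    length whose summed substitution score is at least `threshold`.

    `a` and `b` are structures already encoded as strings of conformational
    letters; `score` plays the role of a substitution matrix.
    """
    hits = []
    for i in range(len(a) - length + 1):
        for j in range(len(b) - length + 1):
            # Ungapped comparison: position k of one window is paired
            # with position k of the other.
            total = sum(score[(a[i + k], b[j + k])] for k in range(length))
            if total >= threshold:
                hits.append((i, j, total))
    return hits


# Toy matrix (an assumption, NOT CLESUM): +2 for identical letters, -1 otherwise.
letters = "ABC"
toy_matrix = {(x, y): (2 if x == y else -1) for x in letters for y in letters}

hits = find_afps("ABCAB", "CABCA", toy_matrix, length=3, threshold=6)
print(hits)
```

With this toy matrix a threshold of 6 admits only exact three-letter matches; lowering it would also admit approximate fragment pairs, which is the role a real substitution matrix is meant to play.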
|
21
|
Guerler A, Knapp EW. Novel protein folds and their nonsequential structural analogs. Protein Sci 2008; 17:1374-82. [PMID: 18583523 DOI: 10.1110/ps.035469.108] [Citation(s) in RCA: 44] [Impact Index Per Article: 2.8] [Indexed: 01/25/2023]
Abstract
Newly determined protein structures are classified to belong to a new fold, if the structures are sufficiently dissimilar from all other so far known protein structures. To analyze structural similarities of proteins, structure alignment tools are used. We demonstrate that the usage of nonsequential structure alignment tools, which neglect the polypeptide chain connectivity, can yield structure alignments with significant similarities between proteins of known three-dimensional structure and newly determined protein structures that possess a new fold. The recently introduced protein structure alignment tool, GANGSTA, is specialized to perform nonsequential alignments with proper assignment of the secondary structure types by focusing on helices and strands only. In the new version, GANGSTA+, the underlying algorithms were completely redesigned, yielding enhanced quality of structure alignments, offering alignment against a larger database of protein structures, and being more efficient. We applied DaliLite, TM-align, and GANGSTA+ on three protein crystal structures considered to be novel folds. Applying GANGSTA+ to these novel folds, we find proteins in the ASTRAL40 database, which possess significant structural similarities, albeit the alignments are nonsequential and in some cases involve secondary structure elements aligned in reverse orientation. A web server is available at http://agknapp.chemie.fu-berlin.de/gplus for pairwise alignment, visualization, and database comparison.
Affiliation(s)
- Aysam Guerler
- Department of Chemistry and Biochemistry, Freie Universität Berlin, 14195 Berlin, Germany
|
22
|
Backbone structure of the infectious epsilon15 virus capsid revealed by electron cryomicroscopy. Nature 2008; 451:1130-4. [PMID: 18305544 DOI: 10.1038/nature06665] [Citation(s) in RCA: 170] [Impact Index Per Article: 10.6] [Received: 09/23/2007] [Accepted: 01/03/2008] [Indexed: 01/04/2023]
Abstract
A half-century after the determination of the first three-dimensional crystal structure of a protein, more than 40,000 structures ranging from single polypeptides to large assemblies have been reported. The challenge for crystallographers, however, remains the growing of a diffracting crystal. Here we report the 4.5-Å resolution structure of a 22-MDa macromolecular assembly, the capsid of the infectious epsilon15 (ε15) particle, by single-particle electron cryomicroscopy. From this density map we constructed a complete backbone trace of its major capsid protein, gene product 7 (gp7). The structure reveals a similar protein architecture to that of other tailed double-stranded DNA viruses, even in the absence of detectable sequence similarity. However, the connectivity of the secondary structure elements (topology) in gp7 is unique. Protruding densities are observed around the two-fold axes that cannot be accounted for by gp7. A subsequent proteomic analysis of the whole virus identifies these densities as gp10, a 12-kDa protein. Its structure, location and high binding affinity to the capsid indicate that the gp10 dimer functions as a molecular staple between neighbouring capsomeres to ensure the particle's stability. Beyond epsilon15, this method potentially offers a new approach for modelling the backbone conformations of the protein subunits in other macromolecular assemblies at near-native solution states.
|
23
|
Gao F, Zaki MJ. Indexing protein structures using suffix trees. Methods Mol Biol 2008; 413:147-169. [PMID: 18075165 DOI: 10.1007/978-1-59745-574-9_6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/25/2023]
Abstract
Approaches for indexing proteins and for fast and scalable searching for structures similar to a query structure have important applications such as protein structure and function prediction, protein classification and drug discovery. In this chapter, we describe a new method for extracting the local feature vectors of protein structures. Each residue is represented by a triangle, and the correlation between a set of residues is described by the distances between Cα atoms and the angles between the normals of planes in which the triangles lie. The normalized local feature vectors are indexed using a suffix tree. For all query segments, suffix trees can be used effectively to retrieve the maximal matches, which are then chained to obtain alignments with database proteins. Similar proteins are selected by their alignment score against the query. Our results show classification accuracy up to 97.8% and 99.4% at the superfamily and class level according to the SCOP classification and show that on average 7.49 out of 10 proteins from the same superfamily are obtained among the top 10 matches. These results outperform the best previous methods.
Affiliation(s)
- Feng Gao
- Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY, USA
|
24
|
Shatsky M, Nussinov R, Wolfson HJ. Algorithms for multiple protein structure alignment and structure-derived multiple sequence alignment. Methods Mol Biol 2008; 413:125-46. [PMID: 18075164 PMCID: PMC10773980 DOI: 10.1007/978-1-59745-574-9_5] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 05/25/2023]
Abstract
Primary amino acid content and the geometry of the folded protein 3D structure are major parameters of protein function. During the course of evolution the protein 3D structure is more preserved than its primary sequence. Thus, analysis of protein structures is expected to lead to a deep insight into protein function. Recognition of a structural core common to a set of protein structures serves as a basic tool for the studies of protein evolution and classification, analysis of similar structural motifs and functional binding sites, and for homology modeling and threading. In this chapter, we discuss several biologically related computational aspects of the multiple structure alignment and propose a method that provides solutions to these problems. Finally, we address the problem of structure-based multiple sequence alignment and propose an optimization method that unifies primary sequence and 3D structure information.
Affiliation(s)
- Maxim Shatsky
- School of Computer Science, Tel Aviv University, Tel Aviv, Israel
|
25
|
Menke M, Berger B, Cowen L. Matt: local flexibility aids protein multiple structure alignment. PLoS Comput Biol 2008; 4:e10. [PMID: 18193941 PMCID: PMC2186361 DOI: 10.1371/journal.pcbi.0040010] [Citation(s) in RCA: 172] [Impact Index Per Article: 10.8] [Received: 06/12/2007] [Accepted: 12/06/2007] [Indexed: 11/20/2022]
Abstract
Even when there is agreement on what measure a protein multiple structure alignment should be optimizing, finding the optimal alignment is computationally prohibitive. One approach used by many previous methods is aligned fragment pair chaining, where short structural fragments from all the proteins are aligned against each other optimally, and the final alignment chains these together in geometrically consistent ways. Ye and Godzik have recently suggested that adding geometric flexibility may help better model protein structures in a variety of contexts. We introduce the program Matt (Multiple Alignment with Translations and Twists), an aligned fragment pair chaining algorithm that, in intermediate steps, allows local flexibility between fragments: small translations and rotations are temporarily allowed to bring sets of aligned fragments closer, even if they are physically impossible under rigid body transformations. After a dynamic programming assembly guided by these "bent" alignments, geometric consistency is restored in the final step before the alignment is output. Matt is tested against other recent multiple protein structure alignment programs on the popular Homstrad and SABmark benchmark datasets. Matt's global performance is competitive with the other programs on Homstrad, but outperforms the other programs on SABmark, a benchmark of multiple structure alignments of proteins with more distant homology. On both datasets, Matt demonstrates an ability to better align the ends of alpha-helices and beta-strands, an important characteristic of any structure alignment program intended to help construct a structural template library for threading approaches to the inverse protein-folding problem. The related question of whether Matt alignments can be used to distinguish distantly homologous structure pairs from pairs of proteins that are not homologous is also considered. 
For this purpose, a p-value score based on the length of the common core and average root mean squared deviation (RMSD) of Matt alignments is shown to largely separate decoys from homologous protein structures in the SABmark benchmark dataset. We postulate that Matt's strong performance comes from its ability to model proteins in different conformational states and, perhaps even more important, its ability to model backbone distortions in more distantly related proteins.
Affiliation(s)
- Matthew Menke
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts, United States of America
- Bonnie Berger
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts, United States of America
- Department of Mathematics, Massachusetts Institute of Technology, Cambridge, Massachusetts, United States of America
- Lenore Cowen
- Department of Computer Science, Tufts University, Medford, Massachusetts, United States of America
|
26
|
Abyzov A, Ilyin VA. A comprehensive analysis of non-sequential alignments between all protein structures. BMC Struct Biol 2007; 7:78. [PMID: 18005453 PMCID: PMC2213659 DOI: 10.1186/1472-6807-7-78] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.0] [Received: 05/14/2007] [Accepted: 11/16/2007] [Indexed: 05/02/2023]
Abstract
Background The majority of relations between proteins can be represented as a conventional sequential alignment. Nevertheless, unusual non-sequential alignments with different connectivity of the aligned fragments in compared proteins have been reported by many researchers. It is interesting to understand those non-sequential alignments: are they unique, sporadic cases or do they occur frequently; do they belong to a few specific folds or are they spread among many different folds, as a common feature of protein structure? We present here a comprehensive large-scale study of non-sequential alignments between available protein structures in the Protein Data Bank. Results The study has been conducted on a non-redundant set of 8,865 protein structures aligned with the aid of the TOPOFIT method. It has been estimated that between 17.4% and 35.2% of all alignments are non-sequential depending on variations in the parameters. Analysis of the data revealed that non-sequential relations between proteins do occur systematically and in large quantities. Various sizes and numbers of non-sequential fragments have been observed with all possible complexities of fragment rearrangements found for alignments consisting of up to 12 fragments. It has been found that non-sequential alignments are not limited to proteins of any particular fold and are present in more than two hundred of them. Moreover, many of them are found between proteins with different fold assignments. It has been shown that protein structure symmetry does not explain non-sequential alignments. Therefore, compelling evidence has been provided that non-sequential alignments between proteins are systematic and widespread across the protein universe. Conclusion The phenomenon of the widespread occurrence of non-sequential alignments between proteins might represent a missing rule of protein structure organization. More detailed study of this phenomenon will enhance our understanding of protein stability, folding, and evolution.
Affiliation(s)
- Alexej Abyzov
- Department of Biology, Northeastern University 360 Huntington Avenue, Boston, MA 02115, USA.
|
27
|
Dundas J, Binkowski TA, DasGupta B, Liang J. Topology independent protein structural alignment. BMC Bioinformatics 2007; 8:388. [PMID: 17937816 PMCID: PMC2096629 DOI: 10.1186/1471-2105-8-388] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.5] [Received: 07/02/2007] [Accepted: 10/15/2007] [Indexed: 11/10/2022]
Abstract
BACKGROUND Identifying structurally similar proteins with different chain topologies can aid studies in homology modeling, protein folding, protein design, and protein evolution. These include circular permuted protein structures, and the more general cases of non-cyclic permutations between similar structures, which are related by non-topological rearrangement beyond circular permutation. We present a method based on an approximation algorithm that finds sequence-order independent structural alignments that are close to optimal. We formulate the structural alignment problem as a special case of the maximum-weight independent set problem, and solve this computationally intensive problem approximately by iteratively solving relaxations of a corresponding integer programming problem. The resulting structural alignment is sequence order independent. Our method is also insensitive to insertions, deletions, and gaps. RESULTS Using a novel similarity score and a statistical model for significance p-value, we are able to discover previously unknown circular permuted proteins between nucleoplasmin-core protein and auxin binding protein, between aspartate racemase and 3-dehydroquinate dehydratase, as well as between migration inhibition factor and arginine repressor which involves an additional strand-swapping. We also report the finding of non-cyclic permuted protein structures existing in nature between AML1/core binding factor and riboflavin synthase. Our method can be used for large scale alignment of protein structures regardless of the topology. CONCLUSION The approximation algorithm introduced in this work can find good solutions for the problem of protein structure alignment. Furthermore, this algorithm can detect topological differences between two spatially similar protein structures. The alignment between MIF and the arginine repressor demonstrates our algorithm's ability to detect structural similarities even when spatial rearrangement of structural units has occurred. The effectiveness of our method is also demonstrated by the discovery of previously unknown circular permutations. In addition, we report in this study the finding of a naturally occurring non-cyclic permuted protein between AML1/Core Binding Factor chain F and riboflavin synthase chain A.
Affiliation(s)
- Joe Dundas
- Department of Bioengineering, University of Illinois at Chicago, Chicago, IL 60607-7053, USA.
|
28
|
Abstract
This paper proposes a parameterized polynomial time approximation scheme (PTAS) for aligning two protein structures, in the case where one protein structure is represented by a contact map graph and the other by a contact map graph or a distance matrix. If the sequential order of alignment is not required, the time complexity is polynomial in the protein size and exponential with respect to two parameters D(u)/D(l) and D(c)/D(l), which usually can be treated as constants. In particular, D(u) is the distance threshold determining if two residues are in contact or not, D(c) is the maximally allowed distance between two matched residues after two proteins are superimposed, and D(l) is the minimum inter-residue distance in a typical protein. This result clearly demonstrates that the computational hardness of the contact map based protein structure alignment problem is related not to protein size but to several parameters modeling the problem. The result is achieved by decomposing the protein structure using tree decomposition and discretizing the rigid-body transformation space. Preliminary experimental results indicate that on a Linux PC, it takes from ten minutes to one hour to align two proteins with approximately 100 residues.
Affiliation(s)
- Jinbo Xu
- Toyota Technological Institute at Chicago, Chicago, Illinois 60637, USA.
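The abstract above defines D(u) as the distance threshold that decides whether two residues are in contact. A minimal sketch of building a contact map from such a threshold follows. It assumes one representative point per residue (for example a Cα position), treats a distance less than or equal to the threshold as a contact, and excludes self-contacts; none of these choices is stated in the abstract, and the function name `contact_map` and the coordinates are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def contact_map(coords: Sequence[Point], d_u: float) -> List[List[bool]]:
    """Return an n x n boolean matrix whose entry [i][j] is True when
    residues i and j (i != j) lie within distance d_u of each other."""
    n = len(coords)
    return [
        [i != j and math.dist(coords[i], coords[j]) <= d_u for j in range(n)]
        for i in range(n)
    ]


# Four made-up residue positions in angstroms (illustrative only).
coords = [
    (0.0, 0.0, 0.0),
    (3.8, 0.0, 0.0),
    (7.6, 0.0, 0.0),
    (3.8, 3.8, 0.0),
]
cmap = contact_map(coords, d_u=6.0)
for row in cmap:
    print(["x" if c else "." for c in row])
```

The resulting matrix is symmetric and can be read as the adjacency matrix of a contact map graph, the representation the abstract refers to.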
|
29
|
Connectivity independent protein-structure alignment: a hierarchical approach. BMC Bioinformatics 2006; 7:510. [PMID: 17118190 PMCID: PMC1683948 DOI: 10.1186/1471-2105-7-510] [Citation(s) in RCA: 42] [Impact Index Per Article: 2.3] [Received: 06/06/2006] [Accepted: 11/21/2006] [Indexed: 11/13/2022]
Abstract
Background Protein-structure alignment is a fundamental tool to study protein function, evolution and model building. In the last decade several methods for structure alignment were introduced, but most of them ignore that structurally similar proteins can share the same spatial arrangement of secondary structure elements (SSE) but differ in the underlying polypeptide chain connectivity (non-sequential SSE connectivity). Results We perform protein-structure alignment using a two-level hierarchical approach implemented in the program GANGSTA. On the first level, pair contacts and relative orientations between SSEs (i.e. α-helices and β-strands) are maximized with a genetic algorithm (GA). On the second level residue pair contacts from the best SSE alignments are optimized. We have tested the method on visually optimized structure alignments of protein pairs (pairwise mode) and for database scans. For a given protein structure, our method is able to detect significant structural similarity of functionally important folds with non-sequential SSE connectivity. The performance for structure alignments with strictly sequential SSE connectivity is comparable to that of other structure alignment methods. Conclusion As demonstrated for several applications, GANGSTA finds meaningful protein-structure alignments independent of the SSE connectivity. GANGSTA is able to detect structural similarity of protein folds that are assigned to different superfamilies but nevertheless possess similar structures and perform related functions, even if these proteins differ in SSE connectivity.
|
30
|
Yao Z, Xiao J, Tung AKH, Sung WK. FAMCS: finding all maximal common substructures in proteins. Genomics Proteomics Bioinformatics 2006; 3:107-19. [PMID: 16393147 PMCID: PMC5172543 DOI: 10.1016/s1672-0229(05)03015-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/26/2022]
Abstract
Finding the common substructures shared by two proteins is considered one of the central issues in computational biology because of its usefulness in understanding the structure-function relationship and its application in drug and vaccine design. In this paper, we propose a novel algorithm called FAMCS (Finding All Maximal Common Substructures) for the common substructure identification problem. Our method works initially at the protein secondary structural element (SSE) level and starts with the identification of all structurally similar SSE pairs. These SSE pairs are then merged into sets using a modified Apriori algorithm, which tests the similarity of various sets of SSE pairs incrementally until all the maximal sets of SSE pairs that are deemed to be similar are found. The maximal common substructures of the two proteins are formed from these maximal sets. A refinement algorithm is also proposed to fine-tune the alignment from the SSE level to the residue level. Comparison of FAMCS with other methods on various proteins shows that FAMCS can address all four requirements and infer interesting biological discoveries.
Affiliation(s)
- Zhen Yao
- Department of Computer Science, National University of Singapore.
|
31
|
|
32
|
Gao F, Zaki MJ. PSIST: indexing protein structures using suffix trees. Proc IEEE Comput Syst Bioinform Conf 2005:212-22. [PMID: 16447979 DOI: 10.1109/csb.2005.46] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 05/06/2023]
Abstract
Approaches for indexing proteins, and for fast and scalable searching for structures similar to a query structure, have important applications such as protein structure and function prediction, protein classification and drug discovery. In this paper, we developed a new method for extracting the local feature vectors of protein structures. Each residue is represented by a triangle, and the correlation between a set of residues is described by the distances between Cα atoms and the angles between the normals of planes in which the triangles lie. The normalized local feature vectors are indexed using a suffix tree. For all query segments, suffix trees can be used effectively to retrieve the maximal matches, which are then chained to obtain alignments with database proteins. Similar proteins are selected by their alignment score against the query. Our results show classification accuracy up to 97.8% and 99.4% at the superfamily and class level according to the SCOP classification, and show that on average 7.49 out of 10 proteins from the same superfamily are obtained among the top 10 matches. These results are competitive with the best previous methods.
Affiliation(s)
- Feng Gao
- Department of Computer Science, Rensselaer Polytechnic Institute, 110 8th Street, Troy, NY 12180, USA.
|