1
Sadique N, Ahmed AAN, Islam MT, Pervage MN, Shatabda S. Image-based effective feature generation for protein structural class and ligand binding prediction. PeerJ Comput Sci 2020; 6:e253. [PMID: 33816905 PMCID: PMC7924679 DOI: 10.7717/peerj-cs.253] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 05/19/2019] [Accepted: 12/23/2019] [Indexed: 06/12/2023]
Abstract
Proteins are the building blocks of the cells of all living organisms, including humans. Most of the work in a living organism is performed by proteins. Proteins are macromolecules: polymers of amino acid monomers. The tertiary structure of a protein is its three-dimensional shape. A protein's function, classification and binding sites are governed by its tertiary structure. If two protein structures are alike, the two proteins can be of the same kind, implying a similar structural class and similar ligand binding properties. In this paper, we use protein tertiary structure to generate effective features for structural-similarity applications, namely detecting structural class and ligand binding. First, we analyze the effectiveness of a group of image-based features for predicting the structural class of a protein. These features are derived from the image generated from the distance matrix of the tertiary structure of a given protein. They include the local binary pattern (LBP) histogram, the Gabor-filtered LBP histogram, the separate row multiplication matrix with uniform LBP histogram, the neighbor block subtraction matrix with uniform LBP histogram, and atom bond. The separate row multiplication matrix and neighbor block subtraction matrix filters, as well as the atom bond feature, are our novel contributions. The experiments were done on a standard benchmark dataset. We demonstrate the effectiveness of these features over a large variety of supervised machine learning algorithms. Experiments suggest that the support vector machine is the best-performing classifier on the selected dataset using this set of features. We believe the excellent performance of Hybrid LBP in terms of accuracy will motivate researchers and practitioners to use it to identify protein structural class. To facilitate that, a classification model using Hybrid LBP is readily available for use at http://brl.uiu.ac.bd/PL/.
Protein-ligand binding is responsible for regulating the tasks of biological receptors, which helps in curing diseases, among other things. Therefore, predicting binding between a protein and a ligand is important for understanding a protein's activity and for accelerating docking computations in virtual screening-based drug design. Protein-ligand binding prediction requires the three-dimensional tertiary structure of the target protein, which is searched for ligand binding. In this paper, we propose a supervised learning algorithm for predicting protein-ligand binding; it is a similarity-based clustering approach that uses the same set of features. Our algorithm works better than the most popular and widely used machine learning algorithms.
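To make the image-based idea in the abstract above concrete, the following is a minimal sketch, not the authors' implementation. It builds a distance matrix from 3-D coordinates and computes a basic 8-neighbour LBP histogram over it. The function names `distance_matrix` and `lbp_histogram` and the toy coordinates are assumptions made for illustration; the Gabor-filtered, uniform-LBP and Hybrid LBP variants named in the abstract are not reproduced here.

```python
import math

def distance_matrix(coords):
    """Pairwise Euclidean distances between 3-D points (e.g. C-alpha atoms)."""
    return [[math.dist(p, q) for q in coords] for p in coords]

def lbp_histogram(image):
    """256-bin histogram of basic 8-neighbour LBP codes over interior pixels."""
    # Neighbour offsets, clockwise starting from the top-left pixel.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    hist = [0] * 256
    rows, cols = len(image), len(image[0])
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            center = image[r][c]
            code = 0
            for bit, (dr, dc) in enumerate(offsets):
                # Set the bit when the neighbour is at least as large as the centre.
                if image[r + dr][c + dc] >= center:
                    code |= 1 << bit
            hist[code] += 1
    return hist

# Toy "backbone" (hypothetical data): five points on a line, 3.8 units apart.
coords = [(3.8 * i, 0.0, 0.0) for i in range(5)]
dm = distance_matrix(coords)
hist = lbp_histogram(dm)
```

The resulting histogram is a fixed-length vector regardless of chain length, which is what makes it usable as input to a standard classifier such as a support vector machine.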
Affiliation(s)
- Nafees Sadique
- Department of Computer Science and Engineering, United International University, Dhaka, Bangladesh
- Al Amin Neaz Ahmed
- Department of Computer Science and Engineering, United International University, Dhaka, Bangladesh
- Md Tajul Islam
- Department of Computer Science and Engineering, United International University, Dhaka, Bangladesh
- Md. Nawshad Pervage
- Department of Computer Science and Engineering, United International University, Dhaka, Bangladesh
- Swakkhar Shatabda
- Department of Computer Science and Engineering, United International University, Dhaka, Bangladesh
2
Karim R, Aziz MMA, Shatabda S, Rahman MS, Mia MAK, Zaman F, Rakin S. CoMOGrad and PHOG: From Computer Vision to Fast and Accurate Protein Tertiary Structure Retrieval. Sci Rep 2015; 5:13275. [PMID: 26293226 PMCID: PMC4543952 DOI: 10.1038/srep13275] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 10/23/2014] [Accepted: 06/26/2015] [Indexed: 11/09/2022]
Abstract
The number of entries in structural databases of proteins is increasing day by day. Methods for retrieving protein tertiary structures from such large databases have turned out to be the key to comparative analysis of structures, which plays an important role in understanding proteins and their functions. In this paper, we present fast and accurate methods for retrieving proteins whose tertiary structures are similar to that of a query protein from a large database. Our proposed methods borrow ideas from the field of computer vision. The speed and accuracy of our methods come from two newly introduced features (the co-occurrence matrix of the oriented gradient and the pyramid histogram of oriented gradient) and from the use of Euclidean distance as the distance measure. Experimental results clearly indicate the superiority of our approach in both running time and accuracy. Our method is readily available for use from this website: http://research.buet.ac.bd:8080/Comograd/.
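As a rough illustration of gradient-orientation features ranked by Euclidean distance, here is a simplified sketch. It computes a single magnitude-weighted orientation histogram per image, which is neither the co-occurrence matrix of the oriented gradient nor the pyramid histogram described in the abstract. The names `orientation_histogram` and `rank_by_euclidean`, the bin count and the toy images are all assumptions.

```python
import math

def orientation_histogram(image, bins=8):
    """Magnitude-weighted histogram of gradient orientations (central differences)."""
    hist = [0.0] * bins
    for r in range(1, len(image) - 1):
        for c in range(1, len(image[0]) - 1):
            gx = image[r][c + 1] - image[r][c - 1]
            gy = image[r + 1][c] - image[r - 1][c]
            mag = math.hypot(gx, gy)
            if mag == 0.0:
                continue
            # Unsigned orientation folded into [0, pi).
            angle = math.atan2(gy, gx) % math.pi
            hist[min(int(angle / math.pi * bins), bins - 1)] += mag
    total = sum(hist)
    return [h / total for h in hist] if total else hist

def rank_by_euclidean(query, database):
    """Return database keys sorted by Euclidean distance to the query vector."""
    return sorted(database, key=lambda k: math.dist(query, database[k]))

# Toy images (hypothetical data): intensity ramps along columns and along rows.
horizontal = [[float(c) for c in range(5)] for _ in range(5)]
vertical = [[float(r) for _ in range(5)] for r in range(5)]
db = {"horizontal": orientation_histogram(horizontal),
      "vertical": orientation_histogram(vertical)}
ranking = rank_by_euclidean(orientation_histogram(horizontal), db)
```

Because each structure is reduced to a short fixed-length vector, a query costs one vector comparison per database entry, which is where the speed of this family of methods comes from.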
Affiliation(s)
- Rezaul Karim
- AlEDA Group, Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Bangladesh
- Mohd Momin Al Aziz
- AlEDA Group, Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Bangladesh
- Swakkhar Shatabda
- AlEDA Group, Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Bangladesh; Department of Computer Science and Engineering, United International University, Bangladesh
- M Sohel Rahman
- AlEDA Group, Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Bangladesh
- Md Abul Kashem Mia
- AlEDA Group, Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Bangladesh
- Farhana Zaman
- AlEDA Group, Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Bangladesh
- Salman Rakin
- AlEDA Group, Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Bangladesh
3
A real-time all-atom structural search engine for proteins. PLoS Comput Biol 2014; 10:e1003750. [PMID: 25079944 PMCID: PMC4117414 DOI: 10.1371/journal.pcbi.1003750] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.0] [Received: 11/19/2013] [Accepted: 06/09/2014] [Indexed: 12/01/2022]
Abstract
Protein designers use a wide variety of software tools for de novo design, yet their repertoire still lacks a fast and interactive all-atom search engine. To solve this, we have built the Suns program: a real-time, atomic search engine integrated into the PyMOL molecular visualization system. Users build atomic-level structural search queries within PyMOL and receive a stream of search results aligned to their query within a few seconds. This instant feedback cycle enables a new “designability”-inspired approach to protein design where the designer searches for and interactively incorporates native-like fragments from proven protein structures. We demonstrate the use of Suns to interactively build protein motifs, tertiary interactions, and to identify scaffolds compatible with hot-spot residues. The official web site and installer are located at http://www.degradolab.org/suns/ and the source code is hosted at https://github.com/godotgildor/Suns (PyMOL plugin, BSD license), https://github.com/Gabriel439/suns-cmd (command line client, BSD license), and https://github.com/Gabriel439/suns-search (search engine server, GPLv2 license).
4
Wylie T, Zhu B. Protein chain pair simplification under the discrete Fréchet distance. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2013; 10:1372-1383. [PMID: 24407296 DOI: 10.1109/tcbb.2013.17] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 06/03/2023]
Abstract
For protein structure alignment and comparison, a lot of work has been done using RMSD as the distance measure, which has drawbacks under certain circumstances. Thus, the discrete Fréchet distance was recently applied to the problem of protein (backbone) structure alignment and comparison with promising results. For this problem, visualization is also important because protein chain backbones can have as many as 500-600 $\alpha$-carbon atoms, which constitute the vertices in the comparison. Even with an excellent alignment, the similarity of two polygonal chains can be difficult to visualize unless the chains are nearly identical. Thus, the chain pair simplification problem (CPS-3F) was proposed in 2008 to simultaneously simplify both chains with respect to each other under the discrete Fréchet distance. The complexity of CPS-3F is unknown, so heuristic methods have been developed. Here, we define a variation of CPS-3F, called the constrained CPS-3F problem (CPS-3F$^+$), and prove that it is polynomially solvable by presenting a dynamic programming solution, which we then prove is a factor-2 approximation for CPS-3F. We then compare the CPS-3F$^+$ solutions with previous empirical results, and further demonstrate some of the benefits of the simplified comparisons. Chain pair simplification based on the Hausdorff distance (CPS-2H) is known to be NP-complete, and here we prove that the constrained version (CPS-2H$^+$) is also NP-complete. Finally, we discuss future work and implications along with a software library implementation, named the Fréchet-based Protein Alignment & Comparison Toolkit (FPACT).
5
Jiang M, Xu Y, Zhu B. Protein structure–structure alignment with discrete Fréchet distance. J Bioinform Comput Biol 2011; 6:51-64. [PMID: 18324745 DOI: 10.1142/s0219720008003278] [Citation(s) in RCA: 36] [Impact Index Per Article: 2.8] [Received: 04/15/2007] [Accepted: 10/25/2007] [Indexed: 11/18/2022]
Abstract
Matching two geometric objects in two-dimensional (2D) and three-dimensional (3D) spaces is a central problem in computer vision, pattern recognition, and protein structure prediction. In particular, the problem of aligning two polygonal chains under translation and rotation to minimize their distance has been studied using various distance measures. It is well known that the Hausdorff distance is useful for matching two point sets, and that the Fréchet distance is a superior measure for matching two polygonal chains. The discrete Fréchet distance closely approximates the (continuous) Fréchet distance, and is a natural measure for the geometric similarity of the folded 3D structures of biomolecules such as proteins. In this paper, we present new algorithms for matching two polygonal chains in two dimensions to minimize their discrete Fréchet distance under translation and rotation, and an effective heuristic for matching two polygonal chains in three dimensions. We also describe our empirical results on the application of the discrete Fréchet distance to protein structure–structure alignment.
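The discrete Fréchet distance itself can be computed with a standard dynamic program over pairs of vertices. The sketch below shows only that distance for two fixed chains; it does not search over translations and rotations, which is the actual contribution described in the abstract. The function name `discrete_frechet` and the toy chains are assumptions.

```python
import math

def discrete_frechet(p, q):
    """Discrete Fréchet distance between two polygonal chains (lists of points)."""
    n, m = len(p), len(q)
    # ca[i][j] is the coupling distance for the prefixes p[:i+1] and q[:j+1].
    ca = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = math.dist(p[i], q[j])
            if i == 0 and j == 0:
                ca[i][j] = d
            elif i == 0:
                ca[i][j] = max(ca[0][j - 1], d)
            elif j == 0:
                ca[i][j] = max(ca[i - 1][0], d)
            else:
                ca[i][j] = max(min(ca[i - 1][j], ca[i][j - 1], ca[i - 1][j - 1]), d)
    return ca[n - 1][m - 1]

# Toy chains (hypothetical data): two parallel lines one unit apart.
chain_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
chain_b = [(0.0, 1.0, 0.0), (1.0, 1.0, 0.0), (2.0, 1.0, 0.0)]
d_ab = discrete_frechet(chain_a, chain_b)
```

The table has one cell per vertex pair, so the running time is proportional to the product of the two chain lengths. Unlike RMSD, the measure does not require the chains to have the same number of vertices.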
Affiliation(s)
- Minghui Jiang
- Department of Computer Science, Utah State University, Logan, UT 84322-4205, USA
- Ying Xu
- Department of Biochemistry and Molecular Biology, University of Georgia, Athens, GA 30602-7229, USA
- Binhai Zhu
- Department of Computer Science, Montana State University, Bozeman, MT 59717-3880, USA
6
Abstract
The CATH database provides hierarchical classification of protein domains based on their folding patterns. Domains are obtained from protein structures deposited in the Protein Data Bank and both domain identification and subsequent classification use manual as well as automated procedures. The accompanying website http://www.cathdb.info provides an easy-to-use entry to the classification, allowing for both browsing and downloading of data. Here, we give a brief review of the database, its corresponding website and some related tools.
Affiliation(s)
- Michael Knudsen
- Bioinformatics Research Centre, Aarhus University, DK-8000 Aarhus C, Denmark
7
Shyu CR, Pang B, Chi PH, Zhao N, Korkin D, Xu D. ProteinDBS v2.0: a web server for global and local protein structure search. Nucleic Acids Res 2010; 38:W53-8. [PMID: 20538653 PMCID: PMC2896110 DOI: 10.1093/nar/gkq522] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 11/14/2022]
Abstract
ProteinDBS v2.0 is a web server designed for efficient and accurate comparisons and searches of structurally similar proteins from a large-scale database. It provides two comparison methods, global-to-global and local-to-local, to facilitate searches of protein structures or substructures. ProteinDBS v2.0 applies advanced feature extraction algorithms and scalable indexing techniques to achieve a high running speed while preserving reasonably high precision of structural comparison. The experimental results show that our system is able to return results of global comparisons in seconds from a complete Protein Data Bank (PDB) database of 152,959 protein chains, and that it takes much less time than other accurate comparison methods to complete local comparisons from a non-redundant database of 3276 proteins. ProteinDBS v2.0 supports query by PDB protein ID and by new structures uploaded by users. To our knowledge, this is the only search engine that can simultaneously support global and local comparisons. ProteinDBS v2.0 is a useful tool for investigating functional or evolutionary relationships among proteins. Moreover, the common substructures identified by local comparison can potentially be used to assist the human curation process in discovering new domains or folds from the ever-growing protein structure databases. The system is hosted at http://ProteinDBS.rnet.missouri.edu.
Affiliation(s)
- Chi-Ren Shyu
- Informatics Institute, University of Missouri, Columbia, MO 65211, USA.
8
Improving Performance of Protein Structure Similarity Searching by Distributing Computations in Hierarchical Multi-Agent System. 2010. [DOI: 10.1007/978-3-642-16693-8_34] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 02/15/2023]
9
Chi PH, Pang B, Korkin D, Shyu CR. Efficient SCOP-fold classification and retrieval using index-based protein substructure alignments. Bioinformatics 2009; 25:2559-65. [PMID: 19667079 DOI: 10.1093/bioinformatics/btp474] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.5] [Indexed: 11/13/2022]
Abstract
MOTIVATION: To investigate structure-function relationships, life sciences researchers usually retrieve and classify proteins with similar substructures into the same fold. A manually constructed database, SCOP, is believed to be highly accurate; however, it is labor intensive. Another known method, DALI, is also precise but computationally expensive. We have developed an efficient algorithm, namely, index-based protein substructure alignment (IPSA), for protein-fold classification. IPSA constructs a two-layer indexing tree to quickly retrieve similar substructures in proteins and suggests possible folds by aligning these substructures. RESULTS: Compared with known algorithms, such as DALI, CE, MultiProt and MAMMOTH, on a sample dataset of non-redundant proteins from SCOP v1.73, IPSA exhibits speedups of 53.10, 16.87, 3.60 and 1.64 times, respectively. Evaluated on three different datasets of non-redundant proteins from SCOP, the average accuracy of IPSA is approximately equal to that of DALI and better than that of CE, MAMMOTH, MultiProt and SSM. With reliable accuracy and efficiency, this work will benefit the study of high-throughput protein structure-function relationships. AVAILABILITY: IPSA is publicly accessible at http://ProteinDBS.rnet.missouri.edu/IPSA.php
Affiliation(s)
- Pin-Hao Chi
- Medical and Biological Digital Library Research Lab, Informatics Institute, University of Missouri, Columbia, MO 65211, USA
10
Han GW, Rife C, Sawaya MR. Applications of bioinformatics to protein structures: how protein structure and bioinformatics overlap. Methods Mol Biol 2009; 569:157-172. [PMID: 19623490 DOI: 10.1007/978-1-59745-524-4_8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/28/2023]
Abstract
In this chapter, we focus on the role of bioinformatics in analyzing a protein after its structure has been determined. First, we present how to validate protein structures for quality assurance. Then, we discuss how to analyze protein-protein interfaces and how to predict the biomolecule, that is, the biological oligomeric state of the protein. Finally, we discuss how to search for homologs based on the 3-D structure, which is an essential step in understanding protein function.
Affiliation(s)
- Gye Won Han
- Burnham Institute for Medical Research, La Jolla, CA, USA
11
Miao X, Waddell PJ, Valafar H. TALI: local alignment of protein structures using backbone torsion angles. J Bioinform Comput Biol 2008; 6:163-81. [PMID: 18324751 DOI: 10.1142/s0219720008003370] [Citation(s) in RCA: 15] [Impact Index Per Article: 0.9] [Received: 05/02/2007] [Revised: 08/22/2007] [Accepted: 09/17/2007] [Indexed: 11/18/2022]
Abstract
Torsion angle alignment (TALI) is a novel approach to local structural motif alignment, based on backbone torsion angles (phi, psi) rather than the more traditional atomic distance matrices. Representing a protein structure as a sequence of torsion angles enables easy integration of sequence and structural information, and allows mature sequence alignment techniques to be adopted to improve performance and alignment quality. We show that TALI is able to match local structural motifs as well as identify global structural similarity. TALI is also compared to other structure alignment methods such as DALI, CE, and SSM, as well as to sequence alignment based on PSI-BLAST; TALI is shown to be equally successful as, or more successful than, these other methods when applied to challenging structural alignments. The inference of the evolutionary tree of class II aminoacyl-tRNA synthetase shows the potential of TALI for estimating protein structural evolution and for identifying structural divergence among homologous structures. AVAILABILITY: http://redcat.cse.sc.edu/index.php/ PROJECT TALI/.
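A backbone torsion angle is the dihedral angle defined by four consecutive atoms: phi uses C(i-1), N(i), CA(i), C(i), and psi uses N(i), CA(i), C(i), N(i+1). The sketch below computes a signed dihedral from four points using the standard projection formula. It illustrates only the angle computation; TALI's alignment and scoring are not shown. The function name `dihedral`, the helpers and the test points are assumptions.

```python
import math

def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in degrees defined by four points."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    norm = math.sqrt(_dot(b1, b1))
    b1 = tuple(x / norm for x in b1)
    # Project b0 and b2 onto the plane perpendicular to the central bond b1.
    v = _sub(b0, tuple(_dot(b0, b1) * x for x in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * x for x in b1))
    x = _dot(v, w)
    y = _dot(_cross(b1, v), w)
    return math.degrees(math.atan2(y, x))

# Toy geometry (hypothetical points): planar trans, planar cis, and a right angle.
trans = dihedral((0, 1, 0), (0, 0, 0), (1, 0, 0), (1, -1, 0))
cis = dihedral((0, 1, 0), (0, 0, 0), (1, 0, 0), (1, 1, 0))
right = dihedral((0, 1, 0), (0, 0, 0), (1, 0, 0), (1, 0, 1))
```

Applying this along a chain yields one (phi, psi) pair per residue, so a structure becomes a one-dimensional sequence that sequence-alignment machinery can process.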
Affiliation(s)
- Xijiang Miao
- Department of Computer Science and Engineering, University of South Carolina, 301 Main Street, Columbia, SC 29208, USA.
12
Shapiro LG, Atmosukarto I, Cho H, Lin HJ, Ruiz-Correa S, Yuen J. Similarity-Based Retrieval for Biomedical Applications. Case-Based Reasoning on Images and Signals 2008. [DOI: 10.1007/978-3-540-73180-1_12] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.8] [Indexed: 01/09/2023]
13
Abstract
As protein databases continue to grow in size, exhaustive search methods that compare a query structure against every database structure can no longer provide satisfactory performance. Instead, the filter-and-refine paradigm offers an efficient alternative for database search without compromising the accuracy of the answers. In this paradigm, protein structures are represented in an abstract form. During querying, the filtering phase uses the abstract representations to prune away dissimilar structures quickly, so that only a small collection of promising structures is examined using a detailed structure alignment technique in the refinement phase. This article mainly reviews techniques developed for the filtering phase.
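The two phases can be sketched as follows. This is a toy illustration under stated assumptions: the cheap signature (chain length plus radius of gyration) and the "detailed" comparison (mean difference of internal distances) are stand-ins chosen for brevity, not techniques taken from the review, and all names and data are hypothetical.

```python
import math

def radius_of_gyration(chain):
    n = len(chain)
    center = [sum(p[k] for p in chain) / n for k in range(3)]
    return math.sqrt(sum(math.dist(p, center) ** 2 for p in chain) / n)

def signature(chain):
    """Cheap abstract representation used by the filtering phase."""
    return (float(len(chain)), radius_of_gyration(chain))

def detailed_distance(a, b):
    """Costlier comparison: mean absolute difference of internal distances."""
    n = min(len(a), len(b))
    total = sum(abs(math.dist(a[i], a[j]) - math.dist(b[i], b[j]))
                for i in range(n) for j in range(i + 1, n))
    return total / (n * (n - 1) / 2)

def filter_and_refine(query, database, keep=2):
    """Shortlist by signature distance, then re-rank the shortlist in detail."""
    q_sig = signature(query)
    shortlist = sorted(
        database, key=lambda name: math.dist(q_sig, signature(database[name]))
    )[:keep]
    return sorted(shortlist, key=lambda name: detailed_distance(query, database[name]))

# Toy database (hypothetical chains laid out on a line).
query = [(float(i), 0.0, 0.0) for i in range(4)]
database = {
    "same": [(float(i), 0.0, 0.0) for i in range(4)],
    "stretched": [(1.1 * i, 0.0, 0.0) for i in range(4)],
    "long": [(float(i), 0.0, 0.0) for i in range(10)],
}
result = filter_and_refine(query, database)
```

Only `keep` structures ever reach `detailed_distance`, so the expensive step is bounded regardless of database size. In practice the signatures would be precomputed and indexed rather than recomputed per query.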
Affiliation(s)
- Zeyar Aung
- Institute for Infocomm Research, 21 Heng Mui Keng Terrace, Singapore 119613, Singapore.
14
Scott G, Shyu CR. Knowledge-Driven Multidimensional Indexing Structure for Biomedical Media Database Retrieval. IEEE Trans Inf Technol Biomed 2007; 11:320-31. [PMID: 17521082 DOI: 10.1109/titb.2006.880551] [Citation(s) in RCA: 29] [Impact Index Per Article: 1.7] [Indexed: 11/06/2022]
Abstract
Today, biomedical media data are being generated at rates unimaginable only years ago. Content-based retrieval of biomedical media from large databases is becoming increasingly important to clinical, research, and educational communities. In this paper, we present the recently developed entropy balanced statistical (EBS) k-d tree and its applications to biomedical media, including a high-resolution computed tomography (HRCT) lung image database and the first real-time protein tertiary structure search engine. Our index utilizes statistical properties inherent in large-scale biomedical media databases for efficient and accurate searches. By applying concepts from pattern recognition and information theory, the EBS k-d tree is built through top-down decision tree induction. Experimentation shows similarity searches against a protein structure database of 53 363 structures consistently execute in less than 8.14 ms for the top 100 most similar structures. Additionally, we have shown improved retrieval precision over adaptive and statistical k-d trees. Retrieval precision of the EBS k-d tree is 81.6% for content-based retrieval of HRCT lung images and 94.9% at 10% recall for protein structure similarity search. The EBS k-d tree has enormous potential for use in biomedical applications embedded with ground-truth knowledge and multidimensional signatures.
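For readers unfamiliar with k-d trees, the sketch below builds a plain median-split k-d tree and runs a nearest-neighbour query with branch pruning. It is the textbook structure only; the entropy-balanced statistical (EBS) construction described in the abstract is not reproduced. The function names and the toy points are assumptions.

```python
import math

def build_kdtree(points, depth=0):
    """Recursively split points at the median along a cycling axis."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {"point": points[mid], "axis": axis,
            "left": build_kdtree(points[:mid], depth + 1),
            "right": build_kdtree(points[mid + 1:], depth + 1)}

def nearest(node, target, best=None):
    """Return the stored point closest to target (Euclidean distance)."""
    if node is None:
        return best
    if best is None or math.dist(target, node["point"]) < math.dist(target, best):
        best = node["point"]
    diff = target[node["axis"]] - node["point"][node["axis"]]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = nearest(near, target, best)
    # Visit the far side only if the splitting plane is closer than the best match.
    if abs(diff) < math.dist(target, best):
        best = nearest(far, target, best)
    return best

# Toy feature vectors (hypothetical 2-D signatures).
points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build_kdtree(points)
hit = nearest(tree, (9, 2))
```

The pruning test is what lets a query skip most of the tree on well-distributed data, which is the property that index-based retrieval systems rely on for millisecond-scale searches.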
Affiliation(s)
- Grant Scott
- Medical and Biological Digital Library Research Laboratory, Department of Computer Science, University of Missouri, Columbia, MO 65211, USA.
15
Shih ESC, Gan RCR, Hwang MJ. OPAAS: a web server for optimal, permuted, and other alternative alignments of protein structures. Nucleic Acids Res 2006; 34:W95-8. [PMID: 16845117 PMCID: PMC1538888 DOI: 10.1093/nar/gkl264] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.1] [Indexed: 11/16/2022]
Abstract
The large number of experimentally determined protein 3D structures is a rich resource for studying protein function and evolution, and protein structure comparison (PSC) is a key method for such studies. When comparing two protein structures, almost all currently available PSC servers report a single and sequential (i.e. topological) alignment, whereas the existence of good alternative alignments, including those involving permutations (i.e. non-sequential or non-topological alignments), is well known. We have recently developed a novel PSC method that can detect alternative alignments of statistical significance (alignment similarity P-value $<10^{-5}$), including structural permutations at all levels of complexity. OPAAS, the server of this PSC method, is freely accessible at our website and provides an easy-to-read hierarchical layout of output to display detailed information on all of the significant alternative alignments detected. Because these alternative alignments can offer a more complete picture of the structural, evolutionary and functional relationship between two proteins, OPAAS can be used in structural bioinformatics research to gain additional insight that is not readily provided by existing PSC servers.
Affiliation(s)
- Ming-Jing Hwang
- To whom correspondence should be addressed. Tel: +886 2 2789 9033; Fax: +886 2 2788 7641
16
Yang JM, Tung CH. Protein structure database search and evolutionary classification. Nucleic Acids Res 2006; 34:3646-59. [PMID: 16885238 PMCID: PMC1540718 DOI: 10.1093/nar/gkl395] [Citation(s) in RCA: 80] [Impact Index Per Article: 4.4] [Received: 03/08/2006] [Revised: 05/06/2006] [Accepted: 05/09/2006] [Indexed: 11/14/2022]
Abstract
As more protein structures become available and structural genomics efforts provide structural models in a genome-wide strategy, there is a growing need for fast and accurate methods for discovering homologous proteins and evolutionary classifications of newly determined structures. We have developed 3D-BLAST, in part, to address these issues. 3D-BLAST is as fast as BLAST and calculates the statistical significance (E-value) of an alignment to indicate the reliability of the prediction. Using this method, we first identified 23 states of the structural alphabet that represent pattern profiles of the backbone fragments and then used them to represent protein structure databases as structural alphabet sequence databases (SADB). Our method enhanced BLAST as a search method, using a new structural alphabet substitution matrix (SASM) to find the longest common substructures with high-scoring structured segment pairs from an SADB database. Using personal computers with Intel Pentium4 (2.8 GHz) processors, our method searched more than 10 000 protein structures in 1.3 s and achieved a good agreement with search results from detailed structure alignment methods. [3D-BLAST is available at http://3d-blast.life.nctu.edu.tw].
Affiliation(s)
- Jinn-Moon Yang
- Department of Biological Science and Technology, National Chiao Tung University, Hsinchu, 30050, Taiwan.
17
Chi PH, Shyu CR, Xu D. A fast SCOP fold classification system using content-based E-Predict algorithm. BMC Bioinformatics 2006; 7:362. [PMID: 16872501 PMCID: PMC1579235 DOI: 10.1186/1471-2105-7-362] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.7] [Received: 12/29/2005] [Accepted: 07/26/2006] [Indexed: 11/10/2022]
Abstract
Background: Domain experts manually construct the Structural Classification of Proteins (SCOP) database to categorize and compare protein structures. Even though using the SCOP database is believed to be more reliable than classification results from other methods, it is labor intensive. To mimic human classification processes, we develop an automatic SCOP fold classification system to assign possible known SCOP folds and recognize novel folds for newly discovered proteins. Results: With a sufficient amount of ground truth data, our system is able to assign the known folds for newly discovered proteins in the latest SCOP v1.69 release with 92.17% accuracy. Our system also recognizes the novel folds with 89.27% accuracy using 10-fold cross-validation. The average response time to complete the classification process for proteins with 500 and 1409 amino acids is 4.1 and 17.4 seconds, respectively. Compared with several structural alignment algorithms, our approach outperforms previous methods in both classification accuracy and efficiency. Conclusion: In this paper, we build an advanced, non-parametric classifier to accelerate the manual classification processes of SCOP. With satisfactory ground truth data from the SCOP database, our approach identifies relevant domain knowledge and yields reasonably accurate classifications. Our system is publicly accessible.
Affiliation(s)
- Pin-Hao Chi
- Medical and Biological Digital Library Research Lab, Department of Computer Science, University of Missouri, Columbia, MO 65211, USA
- Chi-Ren Shyu
- Medical and Biological Digital Library Research Lab, Department of Computer Science, University of Missouri, Columbia, MO 65211, USA
- Dong Xu
- Digital Biology Laboratory, Department of Computer Science and Life Sciences Center, University of Missouri, Columbia, MO 65211, USA
18
Topinka CM, Shyu CR. Predicting cancer interaction networks using text-mining and structure understanding. AMIA ... Annual Symposium Proceedings. AMIA Symposium 2006; 2006:1123. [PMID: 17238742 PMCID: PMC1839458] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/13/2023]
Abstract
Extended biomolecule binding or interaction networks can be built by computationally predicting protein-protein interactions from diverse data sources. To construct networks focused on cancer research, our approach combines domain-specific, natural language processing (NLP)-assisted text-mining of biomedical literature databases with structure-based protein-protein interaction prediction reinforced with sub-cellular localization and evolutionary information. Fast retrieval of structure-based queries will be accomplished by using a novel knowledge discovery process developed previously.