51
|
ProFold: Protein Fold Classification with Additional Structural Features and a Novel Ensemble Classifier. BIOMED RESEARCH INTERNATIONAL 2016; 2016:6802832. [PMID: 27660761 PMCID: PMC5021882 DOI: 10.1155/2016/6802832] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 06/01/2016] [Revised: 07/15/2016] [Accepted: 08/07/2016] [Indexed: 11/17/2022]
Abstract
Protein fold classification plays an important role in both protein functional analysis and drug design. The number of proteins in PDB is very large, but only a very small part is categorized and stored in the SCOPe database. Therefore, it is necessary to develop an efficient method for protein fold classification. In recent years, a variety of classification methods have been used in many protein fold classification studies. In this study, we propose a novel classification method called proFold. We import protein tertiary structure in the period of feature extraction and employ a novel ensemble strategy in the period of classifier training. Compared with existing similar ensemble classifiers using the same widely used dataset (DD-dataset), proFold achieves 76.2% overall accuracy. Another two commonly used datasets, EDD-dataset and TG-dataset, are also tested, of which the accuracies are 93.2% and 94.3%, higher than the existing methods. ProFold is available to the public as a web-server.
Collapse
|
52
|
Zhang L, Wang H, Yan L, Su L, Xu D. OMPcontact: An Outer Membrane Protein Inter-Barrel Residue Contact Prediction Method. J Comput Biol 2016; 24:217-228. [PMID: 27513917 DOI: 10.1089/cmb.2015.0236] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
Abstract
In the two transmembrane protein types, outer membrane proteins (OMPs) perform diverse important biochemical functions, including substrate transport and passive nutrient uptake and intake. Hence their 3D structures are expected to reveal these functions. Because experimental structures are scarce, predicted 3D structures are more adapted to OMP research instead, and the inter-barrel residue contact is becoming one of the most remarkable features, improving prediction accuracy by describing the structural information of OMPs. To predict OMP structures accurately, we explored an OMP inter-barrel residue contact prediction method: OMPcontact. Multiple OMP-specific features were integrated in the method, including residue evolutionary covariation, topology-based transmembrane segment relative residue position, OMP lipid layer accessibility, and residue evolution conservation. These features describe the properties of a residue pair in different respects: sequential, structural, evolutionary, and biochemical. Within a 3-residues slide window, a Support Vector Machine (SVM) could accurately determinate the inter-barrel contact residue pair using above features. A 5-fold cross-valuation process was applied in testing the OMPcontact performance against a non-redundant OMP set with 75 samples inside. The tests compared four evolutionary covariation methods and screen analyzed the adaptive ones for inter-barrel contact prediction. The results showed our method not only efficiently realized the prediction, but also scored the possibility for residue pairs reliably. This is expected to improve OMP tertiary structure prediction. Therefore, OMPcontact will be helpful in compiling a structural census of outer membrane protein.
Collapse
Affiliation(s)
- Li Zhang
- 1 School of Computer Science and Technology, Jilin University , Changchun, China .,4 School of Computer Science and Engineering, Changchun University of Technology , Changchun, China
| | - Han Wang
- 2 School of Computer Science and Information Technology, Northeast Normal University , Changchun, China
| | - Lun Yan
- 1 School of Computer Science and Technology, Jilin University , Changchun, China
| | - Lingtao Su
- 1 School of Computer Science and Technology, Jilin University , Changchun, China
| | - Dong Xu
- 3 Department of Computer Science, Christopher S. Bond Life Sciences Center, University of Missouri , Columbia, Missouri, U.S.A
| |
Collapse
|
53
|
Cui X, Lu Z, Wang S, Jing-Yan Wang J, Gao X. CMsearch: simultaneous exploration of protein sequence space and structure space improves not only protein homology detection but also protein structure prediction. Bioinformatics 2016; 32:i332-i340. [PMID: 27307635 PMCID: PMC4908355 DOI: 10.1093/bioinformatics/btw271] [Citation(s) in RCA: 36] [Impact Index Per Article: 4.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/16/2022] Open
Abstract
MOTIVATION Protein homology detection, a fundamental problem in computational biology, is an indispensable step toward predicting protein structures and understanding protein functions. Despite the advances in recent decades on sequence alignment, threading and alignment-free methods, protein homology detection remains a challenging open problem. Recently, network methods that try to find transitive paths in the protein structure space demonstrate the importance of incorporating network information of the structure space. Yet, current methods merge the sequence space and the structure space into a single space, and thus introduce inconsistency in combining different sources of information. METHOD We present a novel network-based protein homology detection method, CMsearch, based on cross-modal learning. Instead of exploring a single network built from the mixture of sequence and structure space information, CMsearch builds two separate networks to represent the sequence space and the structure space. It then learns sequence-structure correlation by simultaneously taking sequence information, structure information, sequence space information and structure space information into consideration. RESULTS We tested CMsearch on two challenging tasks, protein homology detection and protein structure prediction, by querying all 8332 PDB40 proteins. Our results demonstrate that CMsearch is insensitive to the similarity metrics used to define the sequence and the structure spaces. By using HMM-HMM alignment as the sequence similarity metric, CMsearch clearly outperforms state-of-the-art homology detection methods and the CASP-winning template-based protein structure prediction methods. AVAILABILITY AND IMPLEMENTATION Our program is freely available for download from http://sfb.kaust.edu.sa/Pages/Software.aspx CONTACT : xin.gao@kaust.edu.sa SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Xuefeng Cui
- King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, Thuwal 23955-6900, Saudi Arabia
| | - Zhiwu Lu
- Beijing Key Laboratory of Big Data Management and Analysis Methods, School of Information, Renmin University of China, Beijing 100872, China
| | - Sheng Wang
- Toyota Technological Institute at Chicago, 6045 Kenwood Avenue, Chicago, IL 60637, USA Department of Human Genetics, University of Chicago, E. 58th St, Chicago, IL 60637, USA
| | - Jim Jing-Yan Wang
- King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, Thuwal 23955-6900, Saudi Arabia
| | - Xin Gao
- King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, Thuwal 23955-6900, Saudi Arabia
| |
Collapse
|
54
|
Xu J, Zhang J. Impact of structure space continuity on protein fold classification. Sci Rep 2016; 6:23263. [PMID: 27006112 PMCID: PMC4804218 DOI: 10.1038/srep23263] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/24/2015] [Accepted: 03/03/2016] [Indexed: 11/09/2022] Open
Abstract
Protein structure classification hierarchically clusters domain structures based on structure and/or sequence similarities and plays important roles in the study of protein structure-function relationship and protein evolution. Among many classifications, SCOP and CATH are widely viewed as the gold standards. Fold classification is of special interest because this is the lowest level of classification that does not depend on protein sequence similarity. The current fold classifications such as those in SCOP and CATH are controversial because they implicitly assume that folds are discrete islands in the structure space, whereas increasing evidence suggests significant similarities among folds and supports a continuous fold space. Although this problem is widely recognized, its impact on fold classification has not been quantitatively evaluated. Here we develop a likelihood method to classify a domain into the existing folds of CATH or SCOP using both query-fold structure similarities and within-fold structure heterogeneities. The new classification differs from the original classification for 3.4-12% of domains, depending on factors such as the structure similarity score and original classification scheme used. Because these factors differ for different biological purposes, our results indicate that the importance of considering structure space continuity in fold classification depends on the specific question asked.
Collapse
Affiliation(s)
- Jinrui Xu
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA
| | - Jianzhi Zhang
- Department of Ecology and Evolutionary Biology, University of Michigan, Ann Arbor, MI 48109, USA
| |
Collapse
|
55
|
Improving Protein Fold Recognition by Deep Learning Networks. Sci Rep 2015; 5:17573. [PMID: 26634993 PMCID: PMC4669437 DOI: 10.1038/srep17573] [Citation(s) in RCA: 90] [Impact Index Per Article: 10.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/05/2015] [Accepted: 11/02/2015] [Indexed: 12/31/2022] Open
Abstract
For accurate recognition of protein folds, a deep learning network method (DN-Fold) was developed to predict if a given query-template protein pair belongs to the same structural fold. The input used stemmed from the protein sequence and structural features extracted from the protein pair. We evaluated the performance of DN-Fold along with 18 different methods on Lindahl’s benchmark dataset and on a large benchmark set extracted from SCOP 1.75 consisting of about one million protein pairs, at three different levels of fold recognition (i.e., protein family, superfamily, and fold) depending on the evolutionary distance between protein sequences. The correct recognition rate of ensembled DN-Fold for Top 1 predictions is 84.5%, 61.5%, and 33.6% and for Top 5 is 91.2%, 76.5%, and 60.7% at family, superfamily, and fold levels, respectively. We also evaluated the performance of single DN-Fold (DN-FoldS), which showed the comparable results at the level of family and superfamily, compared to ensemble DN-Fold. Finally, we extended the binary classification problem of fold recognition to real-value regression task, which also show a promising performance. DN-Fold is freely available through a web server at http://iris.rnet.missouri.edu/dnfold.
Collapse
|
56
|
Cabezas-Cruz A, Valdés JJ, Lancelot J, Pierce RJ. Fast evolutionary rates associated with functional loss in class I glucose transporters of Schistosoma mansoni. BMC Genomics 2015; 16:980. [PMID: 26584526 PMCID: PMC4653847 DOI: 10.1186/s12864-015-2144-6] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/13/2015] [Accepted: 10/26/2015] [Indexed: 11/24/2022] Open
Abstract
Background The trematode parasite, Schistosoma mansoni, has evolved to switch from oxidative phosphorylation to glycolysis in the presence of glucose immediately after invading the human host. This metabolic switch is dependent on extracellular glucose concentration. Four glucose transporters are encoded in the genome of S. mansoni, however, only two were shown to facilitate glucose diffusion. Results By modeling the phase of human host infection, we showed that transporter transcript expression profiles of recently transformed schistosomula have two opposing responses to increased glucose concentrations. Concurring with the transcription profiles, our phylogenetic analyses revealed that S. mansoni glucose transporters belong to two separate clusters, one associated with class I glucose transporters from vertebrates and insects, and the other specific to parasitic Platyhelminthes. To study the evolutionary paths of both groups and their functional implications, we determined evolutionary rates, relative divergence times, genomic organization and performed structural analyses with the protein sequences. We finally used the modelled structures of the S. mansoni glucose transporters to biophysically (i) analyze the dynamics of key residues during glucose binding, (ii) test glucose stability within the active site, and (iii) demonstrate glucose diffusion. The two S. mansoni Platyhelminthes-specific glucose transporters, which seem to be younger than the other two, exhibit slower rates of molecular evolution, are encoded by intron-poor genes, and transport glucose. Interestingly, our molecular dynamic analyses suggest that S. mansoni class I glucose transporters are not able to transport glucose. Conclusions The glucose transporter family in S. mansoni exhibit different evolutionary histories. Our results suggested that S. mansoni class I glucose transporters lost their capacity to transport glucose and that this function evolved independently in the Platyhelminthes-specific glucose transporters. Finally, taking into account the differences in the dynamics of glucose transport of the Platyhelminthes-specific transporters of S. mansoni compared to that of humans, we conclude that S. mansoni glucose transporters may be targets for rationally designed drugs against schistosomiasis. Electronic supplementary material The online version of this article (doi:10.1186/s12864-015-2144-6) contains supplementary material, which is available to authorized users.
Collapse
Affiliation(s)
- Alejandro Cabezas-Cruz
- Univ. Lille, CNRS, Inserm, CHU Lille, Institut Pasteur de Lille, U1019 - UMR 8204 - CIIL - Centre d'Infection et d'Immunité de Lille, F-59000, Lille, France.
| | - James J Valdés
- Institute of Parasitology, Biology Centre of the Academy of Sciences of the Czech Republic, 37005, České Budějovice, Czech Republic.
| | - Julien Lancelot
- Univ. Lille, CNRS, Inserm, CHU Lille, Institut Pasteur de Lille, U1019 - UMR 8204 - CIIL - Centre d'Infection et d'Immunité de Lille, F-59000, Lille, France.
| | - Raymond J Pierce
- Univ. Lille, CNRS, Inserm, CHU Lille, Institut Pasteur de Lille, U1019 - UMR 8204 - CIIL - Centre d'Infection et d'Immunité de Lille, F-59000, Lille, France.
| |
Collapse
|
57
|
Lyons J, Dehzangi A, Heffernan R, Yang Y, Zhou Y, Sharma A, Paliwal K. Advancing the Accuracy of Protein Fold Recognition by Utilizing Profiles From Hidden Markov Models. IEEE Trans Nanobioscience 2015. [DOI: 10.1109/tnb.2015.2457906] [Citation(s) in RCA: 32] [Impact Index Per Article: 3.6] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022]
|
58
|
Pinheiro-Silva R, Borges L, Coelho LP, Cabezas-Cruz A, Valdés JJ, do Rosário V, de la Fuente J, Domingos A. Gene expression changes in the salivary glands of Anopheles coluzzii elicited by Plasmodium berghei infection. Parasit Vectors 2015; 8:485. [PMID: 26395987 PMCID: PMC4580310 DOI: 10.1186/s13071-015-1079-8] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/02/2015] [Accepted: 09/09/2015] [Indexed: 11/10/2022] Open
Abstract
Background Malaria is a devastating infectious disease caused by Plasmodium parasites transmitted through the bites of infected Anopheles mosquitoes. Salivary glands are the only mosquito tissue invaded by Plasmodium sporozoites, being a key stage for the effective parasite transmission, making the study of Anopheles sialome highly relevant. Methods RNA-sequencing was used to compare differential gene expression in salivary glands of uninfected and Plasmodium berghei-infected Anopheles coluzzii mosquitoes. RNA-seq results were validated by quantitative RT-PCR. The transmembrane glucose transporter gene AGAP007752 was selected for functional analysis by RNA interference. The effect of gene silencing on infection level was evaluated. The putative function and tertiary structure of the protein was assessed. Results RNA-seq data showed that 2588 genes were differentially expressed in mosquitoes salivary glands in response to P. berghei infection, being 1578 upregulated and 1010 downregulated. Metabolism, Immunity, Replication/Transcription/Translation, Proteolysis and Transport were the mosquito gene functional classes more affected by parasite infection. Endopeptidase coding genes were the most abundant within the differentially expressed genes in infected salivary glands (P < 0.001). Based on its putative function and expression level, the transmembrane glucose transporter gene, AGAP007752, was selected for functional analysis by RNA interference. The results demonstrated that the number of sporozoites was 44.3 % lower in mosquitoes fed on infected mice after AGAPP007752 gene knockdown when compared to control (P < 0.01). Conclusions Our hypothesis is that the protein encoded by the gene AGAPP007752 may play a role on An. coluzzii salivary glands infection by Plasmodium parasite, working as a sporozoite receptor and/or promoting a favorable environment for the capacity of sporozoites. Electronic supplementary material The online version of this article (doi:10.1186/s13071-015-1079-8) contains supplementary material, which is available to authorized users.
Collapse
Affiliation(s)
| | - Lara Borges
- Instituto de Higiene e Medicina Tropical (IHMT), Lisbon, Portugal. .,Global Health and Tropical Medicine (GHMT), Instituto de Higiene e Medicina Tropical (IHMT), Lisbon, Portugal.
| | - Luís Pedro Coelho
- Unidade de Biofísica e Expressão Genética, Instituto de Medicina Molecular (IMM), Lisbon, Portugal.
| | - Alejandro Cabezas-Cruz
- Center for Infection and Immunity of Lille (CIIL), Institut Pasteur de Lille, Lille, France. .,SaBio. Instituto de Investigación de Recursos Cinegéticos, IREC-CSIC-UCLM-JCCM, Ciudad Real, Spain.
| | - James J Valdés
- Institute of Parasitology, Biology Centre of the Academy of Sciences of the Czech Republic, České Budějovice, Czech Republic.
| | | | - José de la Fuente
- SaBio. Instituto de Investigación de Recursos Cinegéticos, IREC-CSIC-UCLM-JCCM, Ciudad Real, Spain. .,Department of Veterinary Pathobiology, Center for Veterinary Health Sciences, Oklahoma State University, Stillwater, USA.
| | - Ana Domingos
- Instituto de Higiene e Medicina Tropical (IHMT), Lisbon, Portugal. .,Global Health and Tropical Medicine (GHMT), Instituto de Higiene e Medicina Tropical (IHMT), Lisbon, Portugal.
| |
Collapse
|
59
|
Iqbal S, Mishra A, Hoque MT. Improved prediction of accessible surface area results in efficient energy function application. J Theor Biol 2015; 380:380-91. [DOI: 10.1016/j.jtbi.2015.06.012] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.9] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/10/2015] [Revised: 05/15/2015] [Accepted: 06/02/2015] [Indexed: 01/16/2023]
|
60
|
Li C, Lin X, Hui C, Lam KM, Zhang S. Computer-Aided Diagnosis for Distinguishing Pancreatic Mucinous Cystic Neoplasms From Serous Oligocystic Adenomas in Spectral CT Images. Technol Cancer Res Treat 2014; 15:44-54. [PMID: 25520271 DOI: 10.1177/1533034614563013] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/24/2014] [Accepted: 11/10/2014] [Indexed: 12/15/2022] Open
Abstract
OBJECTIVE This preliminary study aims to verify the effectiveness of the additional information provided by spectral computed tomography (CT) with the proposed computer-aided diagnosis (CAD) scheme to differentiate pancreatic serous oligocystic adenomas (SOAs) from mucinous cystic neoplasms of pancreas cystic lesions. MATERIALS AND METHODS This study was conducted from January 2010 to October 2013. Twenty-three patients (5 men and 18 women; mean age, 43.96 years old) with SOA and 19 patients (3 men and 16 women; mean age, 41.74 years old) with MCN were included in this retrospective study. Two types of features were collected by dual-energy spectral CT imaging as follows: conventional and additional quantitative spectral CT features. Classification results of the CAD scheme were compared using the conventional features and full feature data set. Important features were selected using support vector machine classification method combined with feature-selection technique. The optimal cutoff values of selected features were determined through receiver-operating characteristic curve analyses. RESULTS Combining conventional features with additional spectral CT features improved the overall accuracy from 88.37% to 93.02%. The selected features of the proposed CAD scheme were tumor size, contour, location, and low-energy CT values (43 keV). Iodine-water basis material pair densities in both arterial phase (AP) and portal venous phase (PP) were important factors for differential diagnosis of SOA and MCN. The optimal cutoff values of long axis, short axis, 40 keV monochromatic CT value in AP, iodine (water) density in AP, 43 keV monochromatic CT value in PP, and iodine (water) density in PP were 3.4 mm, 3.1 mm, 35.7 Hu, 0.32533 mg/mL, 39.4 Hu, and 0.348 mg/mL, respectively. CONCLUSION The combination of conventional features and additional information provided by dual-energy spectral CT shows a high accuracy in the CAD scheme. The quantitative information of spectral CT may prove useful in the diagnosis and classification of SOAs and MCNs with machine learning algorithms.
Collapse
Affiliation(s)
- Chao Li
- Department of Biomedical Engineering, School of Biomedical Engineering, Shanghai Jiao Tong University, Shanghai, China
| | - Xiaozhu Lin
- Department of Radiology, Ruijin Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, China
| | - Chun Hui
- Department of Biomedical Engineering, School of Biomedical Engineering, Shanghai Jiao Tong University, Shanghai, China
| | - Kin Man Lam
- Department of Electronic and Information Engineering, Centre for Signal Processing, the Hong Kong Polytechnic University, Hong Kong, China
| | - Su Zhang
- Department of Biomedical Engineering, School of Biomedical Engineering, Shanghai Jiao Tong University, Shanghai, China
| |
Collapse
|
61
|
Eickholt J, Wang Z. PCP-ML: protein characterization package for machine learning. BMC Res Notes 2014; 7:810. [PMID: 25406415 PMCID: PMC4246511 DOI: 10.1186/1756-0500-7-810] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/27/2014] [Accepted: 10/31/2014] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Machine Learning (ML) has a number of demonstrated applications in protein prediction tasks such as protein structure prediction. To speed further development of machine learning based tools and their release to the community, we have developed a package which characterizes several aspects of a protein commonly used for protein prediction tasks with machine learning. FINDINGS A number of software libraries and modules exist for handling protein related data. The package we present in this work, PCP-ML, is unique in its small footprint and emphasis on machine learning. Its primary focus is on characterizing various aspects of a protein through sets of numerical data. The generated data can then be used with machine learning tools and/or techniques. PCP-ML is very flexible in how the generated data is formatted and as a result is compatible with a variety of existing machine learning packages. Given its small size, it can be directly packaged and distributed with community developed tools for protein prediction tasks. CONCLUSIONS Source code and example programs are available under a BSD license at http://mlid.cps.cmich.edu/eickh1jl/tools/PCPML/. The package is implemented in C++ and accessible as a Python module.
Collapse
Affiliation(s)
- Jesse Eickholt
- Department of Computer Science, Central Michigan University, Mount Pleasant, MI 48859, USA.
| | | |
Collapse
|
62
|
Abstract
BACKGROUND Recognizing the correct structural fold among known template protein structures for a target protein (i.e. fold recognition) is essential for template-based protein structure modeling. Since the fold recognition problem can be defined as a binary classification problem of predicting whether or not the unknown fold of a target protein is similar to an already known template protein structure in a library, machine learning methods have been effectively applied to tackle this problem. In our work, we developed RF-Fold that uses random forest - one of the most powerful and scalable machine learning classification methods - to recognize protein folds. RESULTS RF-Fold consists of hundreds of decision trees that can be trained efficiently on very large datasets to make accurate predictions on a highly imbalanced dataset. We evaluated RF-Fold on the standard Lindahl's benchmark dataset comprised of 976 × 975 target-template protein pairs through cross-validation. Compared with 17 different fold recognition methods, the performance of RF-Fold is generally comparable to the best performance in fold recognition of different difficulty ranging from the easiest family level, the medium-hard superfamily level, and to the hardest fold level. Based on the top-one template protein ranked by RF-Fold, the correct recognition rate is 84.5%, 63.4%, and 40.8% at family, superfamily, and fold levels, respectively. Based on the top-five template protein folds ranked by RF-Fold, the correct recognition rate increases to 91.5%, 79.3% and 58.3% at family, superfamily, and fold levels. CONCLUSIONS The good performance achieved by the RF-Fold demonstrates the random forest's effectiveness for protein fold recognition.
Collapse
Affiliation(s)
- Taeho Jo
- Department of Computer Science, Informatics Institute, C. Bond Life Science Center, University of Missouri, Columbia, MO 65211, USA
| | - Jianlin Cheng
- Department of Computer Science, Informatics Institute, C. Bond Life Science Center, University of Missouri, Columbia, MO 65211, USA
| |
Collapse
|
63
|
Reconstructing protein structures by neural network pairwise interaction fields and iterative decoy set construction. Biomolecules 2014; 4:160-80. [PMID: 24970210 PMCID: PMC4030983 DOI: 10.3390/biom4010160] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/24/2013] [Revised: 01/22/2014] [Accepted: 01/30/2014] [Indexed: 11/17/2022] Open
Abstract
Predicting the fold of a protein from its amino acid sequence is one of the grand problems in computational biology. While there has been progress towards a solution, especially when a protein can be modelled based on one or more known structures (templates), in the absence of templates, even the best predictions are generally much less reliable. In this paper, we present an approach for predicting the three-dimensional structure of a protein from the sequence alone, when templates of known structure are not available. This approach relies on a simple reconstruction procedure guided by a novel knowledge-based evaluation function implemented as a class of artificial neural networks that we have designed: Neural Network Pairwise Interaction Fields (NNPIF). This evaluation function takes into account the contextual information for each residue and is trained to identify native-like conformations from non-native-like ones by using large sets of decoys as a training set. The training set is generated and then iteratively expanded during successive folding simulations. As NNPIF are fast at evaluating conformations, thousands of models can be processed in a short amount of time, and clustering techniques can be adopted for model selection. Although the results we present here are very preliminary, we consider them to be promising, with predictions being generated at state-of-the-art levels in some of the cases.
Collapse
|
64
|
Abbasi E, Ghatee M, Shiri M. FRAN and RBF-PSO as two components of a hyper framework to recognize protein folds. Comput Biol Med 2013; 43:1182-91. [DOI: 10.1016/j.compbiomed.2013.05.017] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/10/2012] [Revised: 05/21/2013] [Accepted: 05/22/2013] [Indexed: 10/26/2022]
|
65
|
Kuksa PP. Biological sequence classification with multivariate string kernels. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2013; 10:1201-1210. [PMID: 24384708 DOI: 10.1109/tcbb.2013.15] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/03/2023]
Abstract
String kernel-based machine learning methods have yielded great success in practical tasks of structured/sequential data analysis. They often exhibit state-of-the-art performance on many practical tasks of sequence analysis such as biological sequence classification, remote homology detection, or protein superfamily and fold prediction. However, typical string kernel methods rely on the analysis of discrete 1D string data (e.g., DNA or amino acid sequences). In this paper, we address the multiclass biological sequence classification problems using multivariate representations in the form of sequences of features vectors (as in biological sequence profiles, or sequences of individual amino acid physicochemical descriptors) and a class of multivariate string kernels that exploit these representations. On three protein sequence classification tasks, the proposed multivariate representations and kernels show significant 15-20 percent improvements compared to existing state-of-the-art sequence classification methods.
Collapse
|
66
|
Wang H, He Z, Zhang C, Zhang L, Xu D. Transmembrane protein alignment and fold recognition based on predicted topology. PLoS One 2013; 8:e69744. [PMID: 23894534 PMCID: PMC3716705 DOI: 10.1371/journal.pone.0069744] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/29/2012] [Accepted: 06/15/2013] [Indexed: 11/18/2022] Open
Abstract
BACKGROUND Although Transmembrane Proteins (TMPs) are highly important in various biological processes and pharmaceutical developments, general prediction of TMP structures is still far from satisfactory. Because TMPs have significantly different physicochemical properties from soluble proteins, current protein structure prediction tools for soluble proteins may not work well for TMPs. With the increasing number of experimental TMP structures available, template-based methods have the potential to become broadly applicable for TMP structure prediction. However, the current fold recognition methods for TMPs are not as well developed as they are for soluble proteins. METHODOLOGY We developed a novel TMP Fold Recognition method, TMFR, to recognize TMP folds based on sequence-to-structure pairwise alignment. The method utilizes topology-based features in alignment together with sequence profile and solvent accessibility. It also incorporates a gap penalty that depends on predicted topology structure segments. Given the difference between α-helical transmembrane protein (αTMP) and β-strands transmembrane protein (βTMP), parameters of scoring functions are trained respectively for these two protein categories using 58 αTMPs and 17 βTMPs in a non-redundant training dataset. RESULTS We compared our method with HHalign, a leading alignment tool using a non-redundant testing dataset including 72 αTMPs and 30 βTMPs. Our method achieved 10% and 9% better accuracies than HHalign in αTMPs and βTMPs, respectively. The raw score generated by TMFR is negatively correlated with the structure similarity between the target and the template, which indicates its effectiveness for fold recognition. The result demonstrates TMFR provides an effective TMP-specific fold recognition and alignment method.
Collapse
Affiliation(s)
- Han Wang
- School of Computer Science and Information Technology, Northeast Normal University, Changchun, People’s Republic of China
- Department of Computer Science, Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, Missouri, United States of America
| | - Zhiquan He
- Department of Computer Science, Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, Missouri, United States of America
| | - Chao Zhang
- Department of Computer Science, Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, Missouri, United States of America
| | - Li Zhang
- School of Computer Science and Engineering, Changchun University of Technology, Changchun, People’s Republic of China
| | - Dong Xu
- Department of Computer Science, Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, Missouri, United States of America
| |
Collapse
|
67
|
Guilloux A, Caudron B, Jestin JL. A method to predict edge strands in beta-sheets from protein sequences. Comput Struct Biotechnol J 2013; 7:e201305001. [PMID: 24688737 PMCID: PMC3962219 DOI: 10.5936/csbj.201305001] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/22/2013] [Revised: 05/27/2013] [Accepted: 05/30/2013] [Indexed: 12/15/2022] Open
Abstract
There is a need for rules allowing three-dimensional structure information to be derived from protein sequences. In this work, consideration of an elementary protein folding step allows protein sub-sequences which optimize folding to be derived for any given protein sequence. Classical mechanics applied to this system and the energy conservation law during the elementary folding step yields an equation whose solutions are taken over the field of rational numbers. This formalism is applied to beta-sheets containing two edge strands and at least two central strands. The number of protein sub-sequences optimized for folding per amino acid in beta-strands is shown in particular to predict edge strands from protein sequences. Topological information on beta-strands and loops connecting them is derived for protein sequences with a prediction accuracy of 75%. The statistical significance of the finding is given. Applications in protein structure prediction are envisioned such as for the quality assessment of protein structure models.
Collapse
Affiliation(s)
- Antonin Guilloux
- Analyse algébrique, Institut de Mathématiques de Jussieu, Université Pierre et Marie Curie, Paris VI, France
| | - Bernard Caudron
- Centre d'Informatique pour la Biologie, Institut Pasteur, Paris, France
| | | |
Collapse
|
68
|
Wang R, Gao X. A Two-Layer Learning Architecture for Multi-Class Protein Folds Classification. Bioinformatics 2013. [DOI: 10.4018/978-1-4666-3604-0.ch041] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/09/2022] Open
Abstract
Classification of protein folds plays a very important role in the protein structure discovery process, especially when traditional sequence alignment methods fail to yield convincing structural homologies. In this chapter, we have developed a two-layer learning architecture, named TLLA, for multi-class protein folds classification. In the first layer, OET-KNN (Optimized Evidence-Theoretic K Nearest Neighbors) is used as the component classifier to find the most probable K-folds of the query protein. In the second layer, we use support vector machine (SVM) to build the multi-class classifier just on the K-folds, generated in the first layer, rather than on all the 27 folds. For multi-feature combination, ensemble strategy based on voting is selected to give the final classification result. The standard percentage accuracy of our method at ~63% is achieved on the independent testing dataset, where most of the proteins have <25% sequence identity with those in the training dataset. The experimental evaluation based on a widely used benchmark dataset has shown that our approach outperforms the competing methods, implying our approach might become a useful vehicle in the literature.
Collapse
|
69
|
Kaushik S, Mutt E, Chellappan A, Sankaran S, Srinivasan N, Sowdhamini R. Improved detection of remote homologues using cascade PSI-BLAST: influence of neighbouring protein families on sequence coverage. PLoS One 2013; 8:e56449. [PMID: 23437136 PMCID: PMC3577913 DOI: 10.1371/journal.pone.0056449] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/16/2012] [Accepted: 01/13/2013] [Indexed: 12/31/2022] Open
Abstract
Background Development of sensitive sequence search procedures for the detection of distant relationships between proteins at superfamily/fold level is still a big challenge. The intermediate sequence search approach is the most frequently employed manner of identifying remote homologues effectively. In this study, examination of serine proteases of prolyl oligopeptidase, rhomboid and subtilisin protein families were carried out using plant serine proteases as queries from two genomes including A. thaliana and O. sativa and 13 other families of unrelated folds to identify the distant homologues which could not be obtained using PSI-BLAST. Methodology/Principal Findings We have proposed to start with multiple queries of classical serine protease members to identify remote homologues in families, using a rigorous approach like Cascade PSI-BLAST. We found that classical sequence based approaches, like PSI-BLAST, showed very low sequence coverage in identifying plant serine proteases. The algorithm was applied on enriched sequence database of homologous domains and we obtained overall average coverage of 88% at family, 77% at superfamily or fold level along with specificity of ∼100% and Mathew’s correlation coefficient of 0.91. Similar approach was also implemented on 13 other protein families representing every structural class in SCOP database. Further investigation with statistical tests, like jackknifing, helped us to better understand the influence of neighbouring protein families. Conclusions/Significance Our study suggests that employment of multiple queries of a family for the Cascade PSI-BLAST searches is useful for predicting distant relationships effectively even at superfamily level. We have proposed a generalized strategy to cover all the distant members of a particular family using multiple query sequences. Our findings reveal that prior selection of sequences as query and the presence of neighbouring families can be important for covering the search space effectively in minimal computational time. This study also provides an understanding of the ‘bridging’ role of related families.
Collapse
Affiliation(s)
- Swati Kaushik
- National Centre for Biological Sciences, Tata Institute of Fundamental Research, Bangalore, Karnataka, India
| | - Eshita Mutt
- National Centre for Biological Sciences, Tata Institute of Fundamental Research, Bangalore, Karnataka, India
| | - Ajithavalli Chellappan
- National Centre for Biological Sciences, Tata Institute of Fundamental Research, Bangalore, Karnataka, India
- School of Chemical and Biotechnology, Shanmugha Arts, Science, Technology & Research Academy, Thanjavur, Tamil Nadu, India
| | - Sandhya Sankaran
- Molecular Biophysics Unit, Indian Institute of Science, Bangalore, Bangalore, India
| | - Narayanaswamy Srinivasan
- Molecular Biophysics Unit, Indian Institute of Science, Bangalore, Bangalore, India
- * E-mail: (NS); (RS)
| | - Ramanathan Sowdhamini
- National Centre for Biological Sciences, Tata Institute of Fundamental Research, Bangalore, Karnataka, India
- * E-mail: (NS); (RS)
| |
Collapse
|
70
|
SNPranker 2.0: a gene-centric data mining tool for diseases associated SNP prioritization in GWAS. BMC Bioinformatics 2013; 14 Suppl 1:S9. [PMID: 23369106 PMCID: PMC3548692 DOI: 10.1186/1471-2105-14-s1-s9] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022] Open
Abstract
Background The capability of correlating specific genotypes with human diseases is a complex issue in spite of all advantages arisen from high-throughput technologies, such as Genome Wide Association Studies (GWAS). New tools for genetic variants interpretation and for Single Nucleotide Polymorphisms (SNPs) prioritization are actually needed. Given a list of the most relevant SNPs statistically associated to a specific pathology as result of a genotype study, a critical issue is the identification of genes that are effectively related to the disease by re-scoring the importance of the identified genetic variations. Vice versa, given a list of genes, it can be of great importance to predict which SNPs can be involved in the onset of a particular disease, in order to focus the research on their effects. Results We propose a new bioinformatics approach to support biological data mining in the analysis and interpretation of SNPs associated to pathologies. This system can be employed to design custom genotyping chips for disease-oriented studies and to re-score GWAS results. The proposed method relies (1) on the data integration of public resources using a gene-centric database design, (2) on the evaluation of a set of static biomolecular annotations, defined as features, and (3) on the SNP scoring function, which computes SNP scores using parameters and weights set by users. We employed a machine learning classifier to set default feature weights and an ontological annotation layer to enable the enrichment of the input gene set. We implemented our method as a web tool called SNPranker 2.0 (http://www.itb.cnr.it/snpranker), improving our first published release of this system. A user-friendly interface allows the input of a list of genes, SNPs or a biological process, and to customize the features set with relative weights. As result, SNPranker 2.0 returns a list of SNPs, localized within input and ontologically enriched genes, combined with their prioritization scores. Conclusions Different databases and resources are already available for SNPs annotation, but they do not prioritize or re-score SNPs relying on a-priori biomolecular knowledge. SNPranker 2.0 attempts to fill this gap through a user-friendly integrated web resource. End users, such as researchers in medical genetics and epidemiology, may find in SNPranker 2.0 a new tool for data mining and interpretation able to support SNPs analysis. Possible scenarios are GWAS data re-scoring, SNPs selection for custom genotyping arrays and SNPs/diseases association studies.
Collapse
|
71
|
eThread: a highly optimized machine learning-based approach to meta-threading and the modeling of protein tertiary structures. PLoS One 2012. [PMID: 23185577 PMCID: PMC3503980 DOI: 10.1371/journal.pone.0050200] [Citation(s) in RCA: 30] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/17/2023] Open
Abstract
Template-based modeling that employs various meta-threading techniques is currently the most accurate, and consequently the most commonly used, approach for protein structure prediction. Despite the evident progress in this field, accurate structure models cannot be constructed for a significant fraction of gene products, thus the development of new algorithms is required. Here, we describe the development, optimization and large-scale benchmarking of eThread, a highly accurate meta-threading procedure for the identification of structural templates and the construction of corresponding target-to-template alignments. eThread integrates ten state-of-the-art threading/fold recognition algorithms in a local environment and extensively uses various machine learning techniques to carry out fully automated template-based protein structure modeling. Tertiary structure prediction employs two protocols based on widely used modeling algorithms: Modeller and TASSER-Lite. As a part of eThread, we also developed eContact, which is a Bayesian classifier for the prediction of inter-residue contacts and eRank, which effectively ranks generated multiple protein models and provides reliable confidence estimates as structure quality assessment. Excluding closely related templates from the modeling process, eThread generates models, which are correct at the fold level, for >80% of the targets; 40–50% of the constructed models are of a very high quality, which would be considered accurate at the family level. Furthermore, in large-scale benchmarking, we compare the performance of eThread to several alternative methods commonly used in protein structure prediction. Finally, we estimate the upper bound for this type of approach and discuss the directions towards further improvements.
Collapse
|
72
|
3D profile-based approach to proteome-wide discovery of novel human chemokines. PLoS One 2012; 7:e36151. [PMID: 22586462 PMCID: PMC3346806 DOI: 10.1371/journal.pone.0036151] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/01/2012] [Accepted: 03/27/2012] [Indexed: 12/29/2022] Open
Abstract
Chemokines are small secreted proteins with important roles in immune responses. They consist of a conserved three-dimensional (3D) structure, so-called IL8-like chemokine fold, which is supported by disulfide bridges characteristic of this protein family. Sequence- and profile-based computational methods have been proficient in discovering novel chemokines by making use of their sequence-conserved cysteine patterns. However, it has been recently shown that some chemokines escaped annotation by these methods due to low sequence similarity to known chemokines and to different arrangement of cysteines in sequence and in 3D. Innovative methods overcoming the limitations of current techniques may allow the discovery of new remote homologs in the still functionally uncharacterized fraction of the human genome. We report a novel computational approach for proteome-wide identification of remote homologs of the chemokine family that uses fold recognition techniques in combination with a scaffold-based automatic mapping of disulfide bonds to define a 3D profile of the chemokine protein family. By applying our methodology to all currently uncharacterized human protein sequences, we have discovered two novel proteins that, without having significant sequence similarity to known chemokines or characteristic cysteine patterns, show strong structural resemblance to known anti-HIV chemokines. Detailed computational analysis and experimental structural investigations based on mass spectrometry and circular dichroism support our structural predictions and highlight several other chemokine-like features. The results obtained support their functional annotation as putative novel chemokines and encourage further experimental characterization. The identification of remote homologs of human chemokines may provide new insights into the molecular mechanisms causing pathologies such as cancer or AIDS, and may contribute to the development of novel treatments. Besides, the genome-wide applicability of our methodology based on 3D protein family profiles may open up new possibilities for improving and accelerating protein function annotation processes.
Collapse
|
73
|
Pires DEV, de Melo-Minardi RC, dos Santos MA, da Silveira CH, Santoro MM, Meira W. Cutoff Scanning Matrix (CSM): structural classification and function prediction by protein inter-residue distance patterns. BMC Genomics 2011; 12 Suppl 4:S12. [PMID: 22369665 PMCID: PMC3287581 DOI: 10.1186/1471-2164-12-s4-s12] [Citation(s) in RCA: 49] [Impact Index Per Article: 3.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/27/2022] Open
Abstract
Background The unforgiving pace of growth of available biological data has increased the demand for efficient and scalable paradigms, models and methodologies for automatic annotation. In this paper, we present a novel structure-based protein function prediction and structural classification method: Cutoff Scanning Matrix (CSM). CSM generates feature vectors that represent distance patterns between protein residues. These feature vectors are then used as evidence for classification. Singular value decomposition is used as a preprocessing step to reduce dimensionality and noise. The aspect of protein function considered in the present work is enzyme activity. A series of experiments was performed on datasets based on Enzyme Commission (EC) numbers and mechanistically different enzyme superfamilies as well as other datasets derived from SCOP release 1.75. Results CSM was able to achieve a precision of up to 99% after SVD preprocessing for a database derived from manually curated protein superfamilies and up to 95% for a dataset of the 950 most-populated EC numbers. Moreover, we conducted experiments to verify our ability to assign SCOP class, superfamily, family and fold to protein domains. An experiment using the whole set of domains found in last SCOP version yielded high levels of precision and recall (up to 95%). Finally, we compared our structural classification results with those in the literature to place this work into context. Our method was capable of significantly improving the recall of a previous study while preserving a compatible precision level. Conclusions We showed that the patterns derived from CSMs could effectively be used to predict protein function and thus help with automatic function annotation. We also demonstrated that our method is effective in structural classification tasks. These facts reinforce the idea that the pattern of inter-residue distances is an important component of family structural signatures. Furthermore, singular value decomposition provided a consistent increase in precision and recall, which makes it an important preprocessing step when dealing with noisy data.
Collapse
Affiliation(s)
- Douglas E V Pires
- Department of Biochemistry and Immunology, Universidade Federal de Minas Gerais, Belo Horizonte, 31270-901, Brazil.
| | | | | | | | | | | |
Collapse
|
74
|
PSS-3D1D: an improved 3D1D profile method of protein fold recognition for the annotation of twilight zone sequences. ACTA ACUST UNITED AC 2011; 12:181-9. [DOI: 10.1007/s10969-011-9119-x] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/15/2011] [Accepted: 11/24/2011] [Indexed: 10/14/2022]
|
75
|
CHEN YUEHUI, CHEN FENG, YANG JACKY, YANG MARYQU. ENSEMBLE VOTING SYSTEM FOR MULTICLASS PROTEIN FOLD RECOGNITION. INT J PATTERN RECOGN 2011. [DOI: 10.1142/s0218001408006454] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022]
Abstract
Protein structure classification is an important issue in understanding the associations between sequence and structure as well as possible functional and evolutionary relationships. Recently structural genomes initiatives and other high-throughput experiments have populated the biological databases at a rapid pace. In this paper, three types of classifiers, k nearest neighbors, class center and nearest neighbor and probabilistic neural networks and their homogenous ensemble for multiclass protein fold recognition problem are evaluated firstly, and then a heterogenous ensemble Voting System is designed for the same problem. The different features and/or their combinations extracted from the protein fold dataset are used in these classification models. The heterogenous classification results are then put into a voting system to get the final result. The experimental results show that the proposed method can improve prediction accuracy by 4%–10% on a benchmark dataset containing 27 SCOP folds.
Collapse
Affiliation(s)
- YUEHUI CHEN
- School of Information Science and Engineering, University of Jinan, 106 Jiwei Road, 250022 Jinan, P. R. China
| | - FENG CHEN
- School of Software, University of Electronic Science and Technology of China, Chengdu 610054, P. R. China
| | - JACK Y. YANG
- Harvard Medical School, Harvard University, P.O. Box 400888, Cambridge, MA 02140, USA
| | - MARY QU YANG
- National Human Genome Research Institute, National Institutes of Health, US Department of Health and Human Services Bethesda, MD 20852, USA
| |
Collapse
|
76
|
Yang Y, Faraggi E, Zhao H, Zhou Y. Improving protein fold recognition and template-based modeling by employing probabilistic-based matching between predicted one-dimensional structural properties of query and corresponding native properties of templates. Bioinformatics 2011; 27:2076-82. [PMID: 21666270 DOI: 10.1093/bioinformatics/btr350] [Citation(s) in RCA: 245] [Impact Index Per Article: 18.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION In recent years, development of a single-method fold-recognition server lags behind consensus and multiple template techniques. However, a good consensus prediction relies on the accuracy of individual methods. This article reports our efforts to further improve a single-method fold recognition technique called SPARKS by changing the alignment scoring function and incorporating the SPINE-X techniques that make improved prediction of secondary structure, backbone torsion angle and solvent accessible surface area. RESULTS The new method called SPARKS-X was tested with the SALIGN benchmark for alignment accuracy, Lindahl and SCOP benchmarks for fold recognition, and CASP 9 blind test for structure prediction. The method is compared to several state-of-the-art techniques such as HHPRED and BoostThreader. Results show that SPARKS-X is one of the best single-method fold recognition techniques. We further note that incorporating multiple templates and refinement in model building will likely further improve SPARKS-X. AVAILABILITY The method is available as a SPARKS-X server at http://sparks.informatics.iupui.edu/
Collapse
Affiliation(s)
- Yuedong Yang
- School of Informatics, Indiana University Purdue University, Indianapolis, IN 46202, USA
| | | | | | | |
Collapse
|
77
|
|
78
|
Yang JY, Chen X. Improving taxonomy-based protein fold recognition by using global and local features. Proteins 2011; 79:2053-64. [DOI: 10.1002/prot.23025] [Citation(s) in RCA: 55] [Impact Index Per Article: 4.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/12/2010] [Revised: 02/05/2011] [Accepted: 03/03/2011] [Indexed: 11/05/2022]
|
79
|
Pandit SB, Skolnick J. TASSER_low-zsc: an approach to improve structure prediction using low z-score-ranked templates. Proteins 2011; 78:2769-80. [PMID: 20635423 DOI: 10.1002/prot.22791] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/27/2022]
Abstract
In a variety of threading methods, often poorly ranked (low z-score) templates have good alignments. Here, a new method, TASSER_low-zsc that identifies these low z-score-ranked templates to improve protein structure prediction accuracy, is described. The approach consists of clustering of threading templates by affinity propagation on the basis of structural similarity (thread_cluster) followed by TASSER modeling, with final models selected by using a TASSER_QA variant. To establish the generality of the approach, templates provided by two threading methods, SP(3) and SPARKS(2), are examined. The SP(3) and SPARKS(2) benchmark datasets consist of 351 and 357 medium/hard proteins (those with moderate to poor quality templates and/or alignments) of length < or =250 residues, respectively. For SP(3) medium and hard targets, using thread_cluster, the TM-scores of the best template improve by approximately 4 and 9% over the original set (without low z-score templates) respectively; after TASSER modeling/refinement and ranking, the best model improves by approximately 7 and 9% over the best model generated with the original template set. Moreover, TASSER_low-zsc generates 22% (43%) more foldable medium (hard) targets. Similar improvements are observed with low-ranked templates from SPARKS(2). The template clustering approach could be applied to other modeling methods that utilize multiple templates to improve structure prediction.
Collapse
Affiliation(s)
- Shashi B Pandit
- Center for the Study of Systems Biology, School of Biology, Georgia Institute of Technology, Atlanta, Georgia 30318, USA
| | | |
Collapse
|
80
|
Hu Y, Dong X, Wu A, Cao Y, Tian L, Jiang T. Incorporation of local structural preference potential improves fold recognition. PLoS One 2011; 6:e17215. [PMID: 21365008 PMCID: PMC3041821 DOI: 10.1371/journal.pone.0017215] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/01/2010] [Accepted: 01/25/2011] [Indexed: 11/19/2022] Open
Abstract
Fold recognition, or threading, is a popular protein structure modeling approach that uses known structure templates to build structures for those of unknown. The key to the success of fold recognition methods lies in the proper integration of sequence, physiochemical and structural information. Here we introduce another type of information, local structural preference potentials of 3-residue and 9-residue fragments, for fold recognition. By combining the two local structural preference potentials with the widely used sequence profile, secondary structure information and hydrophobic score, we have developed a new threading method called FR-t5 (fold recognition by use of 5 terms). In benchmark testings, we have found the consideration of local structural preference potentials in FR-t5 not only greatly enhances the alignment accuracy and recognition sensitivity, but also significantly improves the quality of prediction models.
Collapse
Affiliation(s)
- Yun Hu
- National Laboratory of Biomacromolecules, Institute of Biophysics, Chinese Academy of Sciences, Beijing, China
- Graduate University of Chinese Academy of Sciences, Beijing, China
| | - Xiaoxi Dong
- National Laboratory of Biomacromolecules, Institute of Biophysics, Chinese Academy of Sciences, Beijing, China
- Graduate University of Chinese Academy of Sciences, Beijing, China
| | - Aiping Wu
- National Laboratory of Biomacromolecules, Institute of Biophysics, Chinese Academy of Sciences, Beijing, China
| | - Yang Cao
- National Laboratory of Biomacromolecules, Institute of Biophysics, Chinese Academy of Sciences, Beijing, China
- Graduate University of Chinese Academy of Sciences, Beijing, China
| | - Liqing Tian
- National Laboratory of Biomacromolecules, Institute of Biophysics, Chinese Academy of Sciences, Beijing, China
- Graduate University of Chinese Academy of Sciences, Beijing, China
| | - Taijiao Jiang
- National Laboratory of Biomacromolecules, Institute of Biophysics, Chinese Academy of Sciences, Beijing, China
- * E-mail:
| |
Collapse
|
81
|
Chen H, Kihara D. Effect of using suboptimal alignments in template-based protein structure prediction. Proteins 2011; 79:315-34. [PMID: 21058297 PMCID: PMC3058269 DOI: 10.1002/prot.22885] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/28/2022]
Abstract
Computational protein structure prediction remains a challenging task in protein bioinformatics. In the recent years, the importance of template-based structure prediction is increasing because of the growing number of protein structures solved by the structural genomics projects. To capitalize the significant efforts and investments paid on the structural genomics projects, it is urgent to establish effective ways to use the solved structures as templates by developing methods for exploiting remotely related proteins that cannot be simply identified by homology. In this work, we examine the effect of using suboptimal alignments in template-based protein structure prediction. We showed that suboptimal alignments are often more accurate than the optimal one, and such accurate suboptimal alignments can occur even at a very low rank of the alignment score. Suboptimal alignments contain a significant number of correct amino acid residue contacts. Moreover, suboptimal alignments can improve template-based models when used as input to Modeller. Finally, we use suboptimal alignments for handling a contact potential in a probabilistic way in a threading program, SUPRB. The probabilistic contacts strategy outperforms the partly thawed approach, which only uses the optimal alignment in defining residue contacts, and also the re-ranking strategy, which uses the contact potential in re-ranking alignments. The comparison with existing methods in the template-recognition test shows that SUPRB is very competitive and outperforms existing methods.
Collapse
Affiliation(s)
- Hao Chen
- Department of Biological Sciences College of Science, Purdue University, West Lafayette, IN, 47907, USA
| | - Daisuke Kihara
- Department of Biological Sciences College of Science, Purdue University, West Lafayette, IN, 47907, USA
- Department of Computer Science College of Science, Purdue University, West Lafayette, IN, 47907, USA
- Markey Center for Structural Biology College of Science, Purdue University, West Lafayette, IN, 47907, USA
| |
Collapse
|
82
|
Ahmed F, Benedito VA, Zhao PX. Mining Functional Elements in Messenger RNAs: Overview, Challenges, and Perspectives. FRONTIERS IN PLANT SCIENCE 2011; 2:84. [PMID: 22639614 PMCID: PMC3355573 DOI: 10.3389/fpls.2011.00084] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 08/31/2011] [Accepted: 11/03/2011] [Indexed: 05/03/2023]
Abstract
Eukaryotic messenger RNA (mRNA) contains not only protein-coding regions but also a plethora of functional cis-elements that influence or coordinate a number of regulatory aspects of gene expression, such as mRNA stability, splicing forms, and translation rates. Understanding the rules that apply to each of these element types (e.g., whether the element is defined by primary or higher-order structure) allows for the discovery of novel mechanisms of gene expression as well as the design of transcripts with controlled expression. Bioinformatics plays a major role in creating databases and finding non-evident patterns governing each type of eukaryotic functional element. Much of what we currently know about mRNA regulatory elements in eukaryotes is derived from microorganism and animal systems, with the particularities of plant systems lagging behind. In this review, we provide a general introduction to the most well-known eukaryotic mRNA regulatory motifs (splicing regulatory elements, internal ribosome entry sites, iron-responsive elements, AU-rich elements, zipcodes, and polyadenylation signals) and describe available bioinformatics resources (databases and analysis tools) to analyze eukaryotic transcripts in search of functional elements, focusing on recent trends in bioinformatics methods and tool development. We also discuss future directions in the development of better computational tools based upon current knowledge of these functional elements. Improved computational tools would advance our understanding of the processes underlying gene regulations. We encourage plant bioinformaticians to turn their attention to this subject to help identify novel mechanisms of gene expression regulation using RNA motifs that have potentially evolved or diverged in plant species.
Collapse
Affiliation(s)
- Firoz Ahmed
- Bioinformatics Laboratory, Plant Biology Division, Samuel Roberts Noble FoundationArdmore, OK, USA
| | - Vagner A. Benedito
- Genetics and Developmental Biology, Plant and Soil Sciences Division, West Virginia UniversityMorgantown, WV, USA
| | - Patrick Xuechun Zhao
- Bioinformatics Laboratory, Plant Biology Division, Samuel Roberts Noble FoundationArdmore, OK, USA
- *Correspondence: Patrick Xuechun Zhao, Bioinformatics Laboratory, Plant Biology Division, Samuel Roberts Noble Foundation, 2510 Sam Noble Parkway, Ardmore, OK 73401, USA e-mail:
| |
Collapse
|
83
|
Zhou H, Skolnick J. Improving threading algorithms for remote homology modeling by combining fragment and template comparisons. Proteins 2010; 78:2041-8. [PMID: 20455261 DOI: 10.1002/prot.22717] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/01/2023]
Abstract
In this work, we develop a method called fragment comparison and the template comparison (FTCOM) for assessing the global quality of protein structural models for targets of medium and hard difficulty (remote homology) produced by structure prediction approaches such as threading or ab initio structure prediction. FTCOM requires the C(alpha) coordinates of full length models and assesses model quality based on fragment comparison and a score derived from comparison of the model to top threading templates. On a set of 361 medium/hard targets, FTCOM was applied to and assessed for its ability to improve on the results from the SP(3), SPARKS, PROSPECTOR_3, and PRO-SP(3)-TASSER threading algorithms. The average TM-score improves by 5-10% for the first selected model by the new method over models obtained by the original selection procedure in the respective threading methods. Moreover, the number of foldable targets (TM-score >or= 0.4) increases from least 7.6% for SP(3) to 54% for SPARKS. Thus, FTCOM is a promising approach to template selection. Proteins 2010. (c) 2010 Wiley-Liss, Inc.
Collapse
Affiliation(s)
- Hongyi Zhou
- Center for the Study of Systems Biology, School of Biology, Georgia Institute of Technology, Atlanta, Georgia 30318, USA
| | | |
Collapse
|
84
|
Kumar A, Cowen L. Recognition of beta-structural motifs using hidden Markov models trained with simulated evolution. Bioinformatics 2010; 26:i287-93. [PMID: 20529918 PMCID: PMC2881384 DOI: 10.1093/bioinformatics/btq199] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/25/2022] Open
Abstract
Motivation: One of the most successful methods to date for recognizing protein sequences that are evolutionarily related, has been profile hidden Markov models. However, these models do not capture pairwise statistical preferences of residues that are hydrogen bonded in β-sheets. We thus explore methods for incorporating pairwise dependencies into these models. Results: We consider the remote homology detection problem for β-structural motifs. In particular, we ask if a statistical model trained on members of only one family in a SCOP β-structural superfamily, can recognize members of other families in that superfamily. We show that HMMs trained with our pairwise model of simulated evolution achieve nearly a median 5% improvement in AUC for β-structural motif recognition as compared to ordinary HMMs. Availability: All datasets and HMMs are available at: http://bcb.cs.tufts.edu/pairwise/ Contact:anoop.kumar@tufts.edu; lenore.cowen@tufts.edu
Collapse
Affiliation(s)
- Anoop Kumar
- Department of Computer Science, Tufts University, Medford, MA, USA.
| | | |
Collapse
|
85
|
Jain P, Hirst JD. Automatic structure classification of small proteins using random forest. BMC Bioinformatics 2010; 11:364. [PMID: 20594334 PMCID: PMC2916923 DOI: 10.1186/1471-2105-11-364] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/25/2010] [Accepted: 07/01/2010] [Indexed: 11/29/2022] Open
Abstract
Background Random forest, an ensemble based supervised machine learning algorithm, is used to predict the SCOP structural classification for a target structure, based on the similarity of its structural descriptors to those of a template structure with an equal number of secondary structure elements (SSEs). An initial assessment of random forest is carried out for domains consisting of three SSEs. The usability of random forest in classifying larger domains is demonstrated by applying it to domains consisting of four, five and six SSEs. Results Random forest, trained on SCOP version 1.69, achieves a predictive accuracy of up to 94% on an independent and non-overlapping test set derived from SCOP version 1.73. For classification to the SCOP Class, Fold, Super-family or Family levels, the predictive quality of the model in terms of Matthew's correlation coefficient (MCC) ranged from 0.61 to 0.83. As the number of constituent SSEs increases the MCC for classification to different structural levels decreases. Conclusions The utility of random forest in classifying domains from the place-holder classes of SCOP to the true Class, Fold, Super-family or Family levels is demonstrated. Issues such as introduction of a new structural level in SCOP and the merger of singleton levels can also be addressed using random forest. A real-world scenario is mimicked by predicting the classification for those protein structures from the PDB, which are yet to be assigned to the SCOP classification hierarchy.
Collapse
Affiliation(s)
- Pooja Jain
- School of Chemistry, The University of Nottingham, University Park, Nottingham, NG7 2RD, UK
| | | |
Collapse
|
86
|
Karakaş M, Woetzel N, Meiler J. BCL::contact-low confidence fold recognition hits boost protein contact prediction and de novo structure determination. J Comput Biol 2010; 17:153-68. [PMID: 19772383 DOI: 10.1089/cmb.2009.0030] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open
Abstract
Knowledge of all residue-residue contacts within a protein allows determination of the protein fold. Accurate prediction of even a subset of long-range contacts (contacts between amino acids far apart in sequence) can be instrumental for determining tertiary structure. Here we present BCL::Contact, a novel contact prediction method that utilizes artificial neural networks (ANNs) and specializes in the prediction of medium to long-range contacts. BCL::Contact comes in two modes: sequence-based and structure-based. The sequence-based mode uses only sequence information and has individual ANNs specialized for helix-helix, helix-strand, strand-helix, strand-strand, and sheet-sheet contacts. The structure-based mode combines results from 32-fold recognition methods with sequence information to a consensus prediction. The two methods were presented in the 6(th) and 7(th) Critical Assessment of Techniques for Protein Structure Prediction (CASP) experiments. The present work focuses on elucidating the impact of fold recognition results onto contact prediction via a direct comparison of both methods on a joined benchmark set of proteins. The sequence-based mode predicted contacts with 42% accuracy (7% false positive rate), while the structure-based mode achieved 45% accuracy (2% false positive rate). Predictions by both modes of BCL::Contact were supplied as input to the protein tertiary structure prediction program Rosetta for a benchmark of 17 proteins with no close sequence homologs in the protein data bank (PDB). Rosetta created higher accuracy models, signified by an improvement of 1.3 A on average root mean square deviation (RMSD), when driven by the predicted contacts. Further, filtering Rosetta models by agreement with the predicted contacts enriches for native-like fold topologies.
Collapse
Affiliation(s)
- Mert Karakaş
- Department of Chemistry, Center for Structural Biology, Vanderbilt University, Nashville, Tennessee, USA
| | | | | |
Collapse
|
87
|
Wang Z, Eickholt J, Cheng J. MULTICOM: a multi-level combination approach to protein structure prediction and its assessments in CASP8. Bioinformatics 2010; 26:882-8. [PMID: 20150411 PMCID: PMC2844995 DOI: 10.1093/bioinformatics/btq058] [Citation(s) in RCA: 73] [Impact Index Per Article: 5.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/30/2009] [Revised: 02/02/2010] [Accepted: 02/08/2010] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION Protein structure prediction is one of the most important problems in structural bioinformatics. Here we describe MULTICOM, a multi-level combination approach to improve the various steps in protein structure prediction. In contrast to those methods which look for the best templates, alignments and models, our approach tries to combine complementary and alternative templates, alignments and models to achieve on average better accuracy. RESULTS The multi-level combination approach was implemented via five automated protein structure prediction servers and one human predictor which participated in the eighth Critical Assessment of Techniques for Protein Structure Prediction (CASP8), 2008. The MULTICOM servers and human predictor were consistently ranked among the top predictors on the CASP8 benchmark. The methods can predict moderate- to high-resolution models for most template-based targets and low-resolution models for some template-free targets. The results show that the multi-level combination of complementary templates, alternative alignments and similar models aided by model quality assessment can systematically improve both template-based and template-free protein modeling. AVAILABILITY The MULTICOM server is freely available at http://casp.rnet.missouri.edu/multicom_3d.html .
Collapse
Affiliation(s)
- Zheng Wang
- Department of Computer Science, Informatics Institute and C. Bond Life Science Center, University of Missouri, Columbia, MO 65211, USA
| | | | | |
Collapse
|
88
|
Yan RX, Si JN, Wang C, Zhang Z. DescFold: a web server for protein fold recognition. BMC Bioinformatics 2009; 10:416. [PMID: 20003426 PMCID: PMC2803855 DOI: 10.1186/1471-2105-10-416] [Citation(s) in RCA: 26] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/09/2009] [Accepted: 12/14/2009] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Machine learning-based methods have been proven to be powerful in developing new fold recognition tools. In our previous work [Zhang, Kochhar and Grigorov (2005) Protein Science, 14: 431-444], a machine learning-based method called DescFold was established by using Support Vector Machines (SVMs) to combine the following four descriptors: a profile-sequence-alignment-based descriptor using Psi-blast e-values and bit scores, a sequence-profile-alignment-based descriptor using Rps-blast e-values and bit scores, a descriptor based on secondary structure element alignment (SSEA), and a descriptor based on the occurrence of PROSITE functional motifs. In this work, we focus on the improvement of DescFold by incorporating more powerful descriptors and setting up a user-friendly web server. RESULTS In seeking more powerful descriptors, the profile-profile alignment score generated from the COMPASS algorithm was first considered as a new descriptor (i.e., PPA). When considering a profile-profile alignment between two proteins in the context of fold recognition, one protein is regarded as a template (i.e., its 3D structure is known). Instead of a sequence profile derived from a Psi-blast search, a structure-seeded profile for the template protein was generated by searching its structural neighbors with the assistance of the TM-align structural alignment algorithm. Moreover, the COMPASS algorithm was used again to derive a profile-structural-profile-alignment-based descriptor (i.e., PSPA). We trained and tested the new DescFold in a total of 1,835 highly diverse proteins extracted from the SCOP 1.73 version. When the PPA and PSPA descriptors were introduced, the new DescFold boosts the performance of fold recognition substantially. Using the SCOP_1.73_40% dataset as the fold library, the DescFold web server based on the trained SVM models was further constructed. To provide a large-scale test for the new DescFold, a stringent test set of 1,866 proteins were selected from the SCOP 1.75 version. At a less than 5% false positive rate control, the new DescFold is able to correctly recognize structural homologs at the fold level for nearly 46% test proteins. Additionally, we also benchmarked the DescFold method against several well-established fold recognition algorithms through the LiveBench targets and Lindahl dataset. CONCLUSIONS The new DescFold method was intensively benchmarked to have very competitive performance compared with some well-established fold recognition methods, suggesting that it can serve as a useful tool to assist in template-based protein structure prediction. The DescFold server is freely accessible at http://202.112.170.199/DescFold/index.html.
Collapse
Affiliation(s)
- Ren-Xiang Yan
- State Key Laboratory of Agrobiotechnology, College of Biological Sciences, China Agricultural University, Beijing 100193, China.
| | | | | | | |
Collapse
|
89
|
Chen CC, Hwang JK, Yang JM. (PS)2-v2: template-based protein structure prediction server. BMC Bioinformatics 2009; 10:366. [PMID: 19878598 PMCID: PMC2775752 DOI: 10.1186/1471-2105-10-366] [Citation(s) in RCA: 93] [Impact Index Per Article: 6.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/29/2009] [Accepted: 10/31/2009] [Indexed: 03/11/2024] Open
Abstract
Background Template selection and target-template alignment are critical steps for template-based modeling (TBM) methods. To identify the template for the twilight zone of 15~25% sequence similarity between targets and templates is still difficulty for template-based protein structure prediction. This study presents the (PS)2-v2 server, based on our original server with numerous enhancements and modifications, to improve reliability and applicability. Results To detect homologous proteins with remote similarity, the (PS)2-v2 server utilizes the S2A2 matrix, which is a 60 × 60 substitution matrix using the secondary structure propensities of 20 amino acids, and the position-specific sequence profile (PSSM) generated by PSI-BLAST. In addition, our server uses multiple templates and multiple models to build and assess models. Our method was evaluated on the Lindahl benchmark for fold recognition and ProSup benchmark for sequence alignment. Evaluation results indicated that our method outperforms sequence-profile approaches, and had comparable performance to that of structure-based methods on these benchmarks. Finally, we tested our method using the 154 TBM targets of the CASP8 (Critical Assessment of Techniques for Protein Structure Prediction) dataset. Experimental results show that (PS)2-v2 is ranked 6th among 72 severs and is faster than the top-rank five serves, which utilize ab initio methods. Conclusion Experimental results demonstrate that (PS)2-v2 with the S2A2 matrix is useful for template selections and target-template alignments by blending the amino acid and structural propensities. The multiple-template and multiple-model strategies are able to significantly improve the accuracies for target-template alignments in the twilight zone. We believe that this server is useful in structure prediction and modeling, especially in detecting homologous templates with sequence similarity in the twilight zone.
Collapse
Affiliation(s)
- Chih-Chieh Chen
- Institute of Bioinformatics, National Chiao Tung University, Hsinchu 30050, Taiwan, Republic of China.
| | | | | |
Collapse
|
90
|
Wang Z, Tegge AN, Cheng J. Evaluating the absolute quality of a single protein model using structural features and support vector machines. Proteins 2009; 75:638-47. [PMID: 19004001 DOI: 10.1002/prot.22275] [Citation(s) in RCA: 78] [Impact Index Per Article: 5.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022]
Abstract
Knowing the quality of a protein structure model is important for its appropriate usage. We developed a model evaluation method to assess the absolute quality of a single protein model using only structural features with support vector machine regression. The method assigns an absolute quantitative score (i.e. GDT-TS) to a model by comparing its secondary structure, relative solvent accessibility, contact map, and beta sheet structure with their counterparts predicted from its primary sequence. We trained and tested the method on the CASP6 dataset using cross-validation. The correlation between predicted and true scores is 0.82. On the independent CASP7 dataset, the correlation averaged over 95 protein targets is 0.76; the average correlation for template-based and ab initio targets is 0.82 and 0.50, respectively. Furthermore, the predicted absolute quality scores can be used to rank models effectively. The average difference (or loss) between the scores of the top-ranked models and the best models is 5.70 on the CASP7 targets. This method performs favorably when compared with the other methods used on the same dataset. Moreover, the predicted absolute quality scores are comparable across models for different proteins. These features make the method a valuable tool for model quality assurance and ranking.
Collapse
Affiliation(s)
- Zheng Wang
- Computer Science Department, Informatics Institute, University of Missouri, Columbia, MO 65211, USA
| | | | | |
Collapse
|
91
|
Jain P, Hirst JD. Exploring protein structural dissimilarity to facilitate structure classification. BMC STRUCTURAL BIOLOGY 2009; 9:60. [PMID: 19765314 PMCID: PMC2754988 DOI: 10.1186/1472-6807-9-60] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 02/27/2009] [Accepted: 09/19/2009] [Indexed: 12/04/2022]
Abstract
BACKGROUND Classification of newly resolved protein structures is important in understanding their architectural, evolutionary and functional relatedness to known protein structures. Among various efforts to improve the database of Structural Classification of Proteins (SCOP), automation has received particular attention. Herein, we predict the deepest SCOP structural level that an unclassified protein shares with classified proteins with an equal number of secondary structure elements (SSEs). RESULTS We compute a coefficient of dissimilarity (Omega) between proteins, based on structural and sequence-based descriptors characterising the respective constituent SSEs. For a set of 1,661 pairs of proteins with sequence identity up to 35%, the performance of Omega in predicting shared Class, Fold and Super-family levels is comparable to that of DaliLite Z score and shows a greater than four-fold increase in the true positive rate (TPR) for proteins sharing the Family level. On a larger set of 600 domains representing 200 families, the performance of Z score improves in predicting a shared Family, but still only achieves about half of the TPR of Omega. The TPR for structures sharing a Super-family is lower than in the first dataset, but Omega performs slightly better than Z score. Overall, the sensitivity of Omega in predicting common Fold level is higher than that of the DaliLite Z score. CONCLUSION Classification to a deeper level in the hierarchy is specific and difficult. So the efficiency of Omega may be attractive to the curators and the end-users of SCOP. We suggest Omega may be a better measure for structure classification than the DaliLite Z score, with the caveat that currently we are restricted to comparing structures with equal number of SSEs.
Collapse
Affiliation(s)
- Pooja Jain
- School of Chemistry, The University of Nottingham, University Park, Nottingham, NG7 2RD, UK
| | - Jonathan D Hirst
- School of Chemistry, The University of Nottingham, University Park, Nottingham, NG7 2RD, UK
| |
Collapse
|
92
|
Horst J, Samudrala R. Diversity of protein structures and difficulties in fold recognition: the curious case of protein G. F1000 BIOLOGY REPORTS 2009; 1:69. [PMID: 20209018 PMCID: PMC2832337 DOI: 10.3410/b1-69] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 11/17/2022]
Abstract
We examine the ability of current state-of-the-art methods in protein structure prediction to discriminate topologically distant folds encoded by highly similar (>90% sequence identity) designed proteins in blind protein structure prediction experiments. We detail the corresponding prognosis for the protein fold recognition field and highlight the features of the methodologies that successfully deciphered this folding riddle.
Collapse
Affiliation(s)
- Jeremy Horst
- Department of Oral Biology, University of Washington School of Dentistry1959 NE Pacific Street, Seattle, WA 98195-7132USA
- Department of Microbiology, University of Washington School of Medicine1959 NE Pacific Street, Seattle, WA 98195-7132USA
| | - Ram Samudrala
- Department of Oral Biology, University of Washington School of Dentistry1959 NE Pacific Street, Seattle, WA 98195-7132USA
- Department of Microbiology, University of Washington School of Medicine1959 NE Pacific Street, Seattle, WA 98195-7132USA
| |
Collapse
|
93
|
Dong Q, Zhou S, Guan J. A new taxonomy-based protein fold recognition approach based on autocross-covariance transformation. Bioinformatics 2009; 25:2655-62. [DOI: 10.1093/bioinformatics/btp500] [Citation(s) in RCA: 150] [Impact Index Per Article: 10.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open
|
94
|
Zhang H, Zhang T, Chen K, Shen S, Ruan J, Kurgan L. On the relation between residue flexibility and local solvent accessibility in proteins. Proteins 2009; 76:617-36. [DOI: 10.1002/prot.22375] [Citation(s) in RCA: 62] [Impact Index Per Article: 4.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/08/2022]
|
95
|
Liu Y, Carbonell J, Gopalakrishnan V, Weigele P. Conditional graphical models for protein structural motif recognition. J Comput Biol 2009; 16:639-57. [PMID: 19432536 DOI: 10.1089/cmb.2008.0176] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open
Abstract
Determining protein structures is crucial to understanding the mechanisms of infection and designing drugs. However, the elucidation of protein folds by crystallographic experiments can be a bottleneck in the development process. In this article, we present a probabilistic graphical model framework, conditional graphical models, for predicting protein structural motifs. It represents the structure characteristics of a structural motif using a graph, where the nodes denote the secondary structure elements, and the edges indicate the side-chain interactions between the components either within one protein chain or between chains. Then the model defines the optimal segmentation of a protein sequence against the graph by maximizing its "conditional" probability so that it can take advantages of the discriminative training approach. Efficient approximate inference algorithms using reversible jump Markov Chain Monte Carlo (MCMC) algorithm are developed to handle the resulting complex graphical models. We test our algorithm on four important structural motifs, and our method outperforms other state-of-art algorithms for motif recognition. We also hypothesize potential membership proteins of target folds from Swiss-Prot, which further supports the evolutionary hypothesis about viral folds.
Collapse
Affiliation(s)
- Yan Liu
- IBM T.J. Watson Research Center, Yorktown Heights, New York 10598, USA.
| | | | | | | |
Collapse
|
96
|
Deschavanne P, Tufféry P. Enhanced protein fold recognition using a structural alphabet. Proteins 2009; 76:129-37. [DOI: 10.1002/prot.22324] [Citation(s) in RCA: 30] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/08/2022]
|
97
|
Lee SY, Lee JY, Jung KS, Ryu KH. A 9-state hidden Markov model using protein secondary structure information for protein fold recognition. Comput Biol Med 2009; 39:527-34. [DOI: 10.1016/j.compbiomed.2009.03.008] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/11/2008] [Revised: 01/20/2009] [Accepted: 03/11/2009] [Indexed: 11/30/2022]
|
98
|
Zhou H, Skolnick J. Protein structure prediction by pro-Sp3-TASSER. Biophys J 2009; 96:2119-27. [PMID: 19289038 DOI: 10.1016/j.bpj.2008.12.3898] [Citation(s) in RCA: 54] [Impact Index Per Article: 3.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/12/2008] [Revised: 11/12/2008] [Accepted: 12/03/2008] [Indexed: 12/29/2022] Open
Abstract
An automated protein structure prediction algorithm, pro-sp3-Threading/ASSEmbly/Refinement (TASSER), is described and benchmarked. Structural templates are identified using five different scoring functions derived from the previously developed threading methods PROSPECTOR_3 and SP(3). Top templates identified by each scoring function are combined to derive contact and distant restraints for subsequent model refinement by short TASSER simulations. For Medium/Hard targets (those with moderate to poor quality templates and/or alignments), alternative template alignments are also generated by parametric alignment and the top models selected by TASSER-QA are included in the contact and distance restraint derivation. Then, multiple short TASSER simulations are used to generate an ensemble of full-length models. Subsequently, the top models are selected from the ensemble by TASSER-QA and used to derive TASSER contacts and distant restraints for another round of full TASSER refinement. The final models are selected from both rounds of TASSER simulations by TASSER-QA. We compare pro-sp3-TASSER with our previously developed MetaTASSER method (enhanced with chunk-TASSER for Medium/Hard targets) on a representative test data set of 723 proteins <250 residues in length. For the 348 proteins classified as easy targets (those templates with good alignments and global structure similarity to the target), the cumulative TM-score of the best of top five models by pro-sp3-TASSER shows a 2.1% improvement over MetaTASSER. For the 155/220 medium/hard targets, the improvements in TM-score are 2.8% and 2.2%, respectively. All improvements are statistically significant. More importantly, the number of foldable targets (those having models whose TM-score to native >0.4 in the top five clusters) increases from 472 to 497 for all targets, and the relative increases for medium and hard targets are 10% and 15%, respectively. A server that implements the above algorithm is available at http://cssb.biology.gatech.edu/skolnick/webservice/pro-sp3-TASSER/. The source code is also available upon request.
Collapse
Affiliation(s)
- Hongyi Zhou
- Center for the Study of Systems Biology, School of Biology, Georgia Institute of Technology, Atlanta, Georgia, USA
| | | |
Collapse
|
99
|
Gao X, Bu D, Xu J, Li M. Improving consensus contact prediction via server correlation reduction. BMC STRUCTURAL BIOLOGY 2009; 9:28. [PMID: 19419562 PMCID: PMC2689239 DOI: 10.1186/1472-6807-9-28] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 11/25/2008] [Accepted: 05/06/2009] [Indexed: 11/10/2022]
Abstract
Background Protein inter-residue contacts play a crucial role in the determination and prediction of protein structures. Previous studies on contact prediction indicate that although template-based consensus methods outperform sequence-based methods on targets with typical templates, such consensus methods perform poorly on new fold targets. However, we find out that even for new fold targets, the models generated by threading programs can contain many true contacts. The challenge is how to identify them. Results In this paper, we develop an integer linear programming model for consensus contact prediction. In contrast to the simple majority voting method assuming that all the individual servers are equally important and independent, the newly developed method evaluates their correlation by using maximum likelihood estimation and extracts independent latent servers from them by using principal component analysis. An integer linear programming method is then applied to assign a weight to each latent server to maximize the difference between true contacts and false ones. The proposed method is tested on the CASP7 data set. If the top L/5 predicted contacts are evaluated where L is the protein size, the average accuracy is 73%, which is much higher than that of any previously reported study. Moreover, if only the 15 new fold CASP7 targets are considered, our method achieves an average accuracy of 37%, which is much better than that of the majority voting method, SVM-LOMETS, SVM-SEQ, and SAM-T06. These methods demonstrate an average accuracy of 13.0%, 10.8%, 25.8% and 21.2%, respectively. Conclusion Reducing server correlation and optimally combining independent latent servers show a significant improvement over the traditional consensus methods. This approach can hopefully provide a powerful tool for protein structure refinement and prediction use.
Collapse
Affiliation(s)
- Xin Gao
- David R, Cheriton School of Computer Science, University of Waterloo, N2L3G1, Canada.
| | | | | | | |
Collapse
|
100
|
Jain P, Garibaldi JM, Hirst JD. Supervised machine learning algorithms for protein structure classification. Comput Biol Chem 2009; 33:216-23. [PMID: 19473879 DOI: 10.1016/j.compbiolchem.2009.04.004] [Citation(s) in RCA: 50] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/13/2008] [Revised: 03/25/2009] [Accepted: 04/23/2009] [Indexed: 10/20/2022]
Abstract
We explore automation of protein structural classification using supervised machine learning methods on a set of 11,360 pairs of protein domains (up to 35% sequence identity) consisting of three secondary structure elements. Fifteen algorithms from five categories of supervised algorithms are evaluated for their ability to learn for a pair of protein domains, the deepest common structural level within the SCOP hierarchy, given a one-dimensional representation of the domain structures. This representation encapsulates evolutionary information in terms of sequence identity and structural information characterising the secondary structure elements and lengths of the respective domains. The evaluation is performed in two steps, first selecting the best performing base learners and subsequently evaluating boosted and bagged meta learners. The boosted random forest, a collection of decision trees, is found to be the most accurate, with a cross-validated accuracy of 97.0% and F-measures of 0.97, 0.85, 0.93 and 0.98 for classification of proteins to the Class, Fold, Super-Family and Family levels in the SCOP hierarchy. The meta learning regime, especially boosting, improved performance by more accurately classifying the instances from less populated classes.
Collapse
Affiliation(s)
- Pooja Jain
- School of Chemistry, The University of Nottingham, University Park, Nottingham, NG7 2RD, UK
| | | | | |
Collapse
|