1
|
Gormez Y, Aydin Z. IGPRED-MultiTask: A Deep Learning Model to Predict Protein Secondary Structure, Torsion Angles and Solvent Accessibility. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2023; 20:1104-1113. [PMID: 35849663 DOI: 10.1109/tcbb.2022.3191395] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Indexed: 05/04/2023]
Abstract
Protein secondary structure, solvent accessibility and torsion angle predictions are preliminary steps to predict 3D structure of a protein. Deep learning approaches have achieved significant improvements in predicting various features of protein structure. In this study, IGPRED-Multitask, a deep learning model with multi task learning architecture based on deep inception network, graph convolutional network and a bidirectional long short-term memory is proposed. Moreover, hyper-parameters of the model are fine-tuned using Bayesian optimization, which is faster and more effective than grid search. The same benchmark test data sets as in the OPUS-TASS paper including TEST2016, TEST2018, CASP12, CASP13, CASPFM, HARD68, CAMEO93, CAMEO93_HARD, as well as the train and validation sets, are used for fair comparison with the literature. Statistically significant improvements are observed in secondary structure prediction on 4 datasets, in phi angle prediction on 2 datasets and in psi angle prediction on 3 datasets compared to the state-of-the-art methods. For solvent accessibility prediction, TEST2016 and TEST2018 datasets are used only to assess the performance of the proposed model.
|
2
|
Görmez Y, Sabzekar M, Aydın Z. IGPRED: Combination of convolutional neural and graph convolutional networks for protein secondary structure prediction. Proteins 2021; 89:1277-1288. [PMID: 33993559 DOI: 10.1002/prot.26149] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 01/07/2021] [Revised: 04/21/2021] [Accepted: 05/11/2021] [Indexed: 11/10/2022]
Abstract
There is a close relationship between the tertiary structure and the function of a protein. One of the important steps to determine the tertiary structure is protein secondary structure prediction (PSSP). For this reason, predicting secondary structure with higher accuracy will give valuable information about the tertiary structure. Recently, deep learning techniques have obtained promising improvements in several machine learning applications including PSSP. In this article, a novel deep learning model, based on convolutional neural network and graph convolutional network is proposed. PSIBLAST PSSM, HHMAKE PSSM, physico-chemical properties of amino acids are combined with structural profiles to generate a rich feature set. Furthermore, the hyper-parameters of the proposed network are optimized using Bayesian optimization. The proposed model IGPRED obtained 89.19%, 86.34%, 87.87%, 85.76%, and 86.54% Q3 accuracies for CullPDB, EVAset, CASP10, CASP11, and CASP12 datasets, respectively.
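The Q3 figures quoted in this abstract are per-residue accuracies over three states. As a minimal sketch, assuming predictions and observations are given as equal-length strings over H (helix), E (strand) and C (coil), the measure can be computed as below. The function name `q_accuracy` and the example strings are illustrative and are not taken from the paper.

```python
def q_accuracy(predicted: str, observed: str) -> float:
    """Percentage of residues whose predicted state equals the observed state.

    Works for any state alphabet, so the same function gives Q3 for
    three-state strings and Q8 for eight-state strings.
    """
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not observed:
        raise ValueError("sequences must be non-empty")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * matches / len(observed)


# Illustrative (made-up) three-state strings: 8 of 10 residues agree.
observed = "HHHHCCEEEE"
predicted = "HHHCCCEEEC"
q3 = q_accuracy(predicted, observed)
print(f"Q3 = {q3:.1f}%")
```

Published benchmarks may aggregate differently (for example, pooling residues across all proteins rather than averaging per protein), so this sketch should not be read as the evaluation protocol used by IGPRED.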
Affiliation(s)
- Yasin Görmez: Faculty of Economics and Administrative Sciences, Management Information Systems, Sivas Cumhuriyet University, Sivas, Turkey
- Mostafa Sabzekar: Department of Computer Engineering, Birjand University of Technology, Birjand, Iran
- Zafer Aydın: Engineering Faculty, Computer Engineering Department, Abdullah Gül University, Kayseri, Turkey
|
3
|
Krieger S, Kececioglu J. Boosting the accuracy of protein secondary structure prediction through nearest neighbor search and method hybridization. Bioinformatics 2021; 36:i317-i325. [PMID: 32657384 PMCID: PMC7355242 DOI: 10.1093/bioinformatics/btaa336] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Indexed: 11/14/2022]
Abstract
MOTIVATION Protein secondary structure prediction is a fundamental precursor to many bioinformatics tasks. Nearly all state-of-the-art tools when computing their secondary structure prediction do not explicitly leverage the vast number of proteins whose structure is known. Leveraging this additional information in a so-called template-based method has the potential to significantly boost prediction accuracy. METHOD We present a new hybrid approach to secondary structure prediction that gains the advantages of both template- and non-template-based methods. Our core template-based method is an algorithmic approach that uses metric-space nearest neighbor search over a template database of fixed-length amino acid words to determine estimated class-membership probabilities for each residue in the protein. These probabilities are then input to a dynamic programming algorithm that finds a physically valid maximum-likelihood prediction for the entire protein. Our hybrid approach exploits a novel accuracy estimator for our core method, which estimates the unknown true accuracy of its prediction, to discern when to switch between template- and non-template-based methods. RESULTS On challenging CASP benchmarks, the resulting hybrid approach boosts the state-of-the-art Q8 accuracy by more than 2-10%, and Q3 accuracy by more than 1-3%, yielding the most accurate method currently available for both 3- and 8-state secondary structure prediction. AVAILABILITY AND IMPLEMENTATION A preliminary implementation in a new tool we call Nnessy is available free for non-commercial use at http://nnessy.cs.arizona.edu.
Affiliation(s)
- Spencer Krieger: Department of Computer Science, The University of Arizona, Tucson, AZ 85721, USA
- John Kececioglu: Department of Computer Science, The University of Arizona, Tucson, AZ 85721, USA
|
4
|
Smolarczyk T, Roterman-Konieczna I, Stapor K. Protein Secondary Structure Prediction: A Review of Progress and Directions. Curr Bioinform 2020. [DOI: 10.2174/1574893614666191017104639] [Citation(s) in RCA: 28] [Impact Index Per Article: 7.0] [Indexed: 11/22/2022]
Abstract
Background: Over the last few decades, a search for the theory of protein folding has grown into a full-fledged research field at the intersection of biology, chemistry and informatics. Despite enormous effort, there are still open questions and challenges, like understanding the rules by which amino acid sequence determines protein secondary structure.
Objective: In this review, we depict the progress of the prediction methods over the years and identify sources of improvement.
Methods: The protein secondary structure prediction problem is described followed by the discussion on theoretical limitations, description of the commonly used data sets, features and a review of three generations of methods with the focus on the most recent advances. Additionally, methods with available online servers are assessed on the independent data set.
Results: The state-of-the-art methods are currently reaching almost 88% for 3-class prediction and 76.5% for an 8-class prediction.
Conclusion: This review summarizes recent advances and outlines further research directions.
Affiliation(s)
- Tomasz Smolarczyk: Institute of Informatics, Silesian University of Technology, Gliwice, Poland
- Irena Roterman-Konieczna: Department of Bioinformatics and Telemedicine, Jagiellonian University Medical College, Krakow, Poland
- Katarzyna Stapor: Institute of Informatics, Silesian University of Technology, Gliwice, Poland
|
5
|
Aydin Z, Azginoglu N, Bilgin HI, Celik M. Developing structural profile matrices for protein secondary structure and solvent accessibility prediction. Bioinformatics 2019; 35:4004-4010. [DOI: 10.1093/bioinformatics/btz238] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Received: 08/19/2018] [Revised: 02/17/2019] [Accepted: 03/29/2019] [Indexed: 11/13/2022]
Abstract
Motivation
Predicting secondary structure and solvent accessibility of proteins are among the essential steps that precede more elaborate 3D structure prediction tasks. Incorporating class label information contained in templates with known structures has the potential to improve the accuracy of prediction methods. Building a structural profile matrix is one such technique that provides a distribution for class labels at each amino acid position of the target.
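To make the idea of a structural profile matrix concrete, here is a minimal sketch that turns template labels aligned to a target into a per-position distribution over class labels. It assumes three-state labels (H, E, C), `-` for alignment gaps, optional per-template weights, and a uniform fallback where no template covers a position. All of these choices, and the name `structural_profile`, are illustrative assumptions and do not describe the DSPRED or Homolpro implementations.

```python
from collections import Counter

STATES = "HEC"  # assumed three-state alphabet: helix, strand, coil


def structural_profile(aligned_labels, weights=None):
    """Return one {state: probability} dict per target position.

    aligned_labels: label strings of templates aligned to the target,
                    all of target length, with '-' marking gaps.
    weights:        optional per-template weights (hypothetical; e.g.
                    derived from alignment scores). Defaults to 1.0 each.
    """
    if not aligned_labels:
        raise ValueError("need at least one aligned template")
    length = len(aligned_labels[0])
    if any(len(row) != length for row in aligned_labels):
        raise ValueError("all aligned label strings must have equal length")
    if weights is None:
        weights = [1.0] * len(aligned_labels)

    profile = []
    for pos in range(length):
        totals = Counter()
        for row, weight in zip(aligned_labels, weights):
            if row[pos] in STATES:  # skip gaps and unknown symbols
                totals[row[pos]] += weight
        mass = sum(totals.values())
        if mass == 0:
            # No template covers this position: fall back to uniform.
            profile.append({s: 1.0 / len(STATES) for s in STATES})
        else:
            profile.append({s: totals[s] / mass for s in STATES})
    return profile


# Three made-up templates aligned to a five-residue target.
templates = ["HHHC-", "HHCCE", "-HHCE"]
profile = structural_profile(templates)
```

Each row of the resulting matrix sums to one and can be appended to sequence-profile features as additional input to a predictor.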
Results
In this paper, a new structural profiling technique is proposed that is based on deriving PFAM families and is combined with an existing approach. Cross-validation experiments on two benchmark datasets and at various similarity intervals demonstrate that the proposed profiling strategy performs significantly better than Homolpro, a state-of-the-art method for incorporating template information, as assessed by statistical hypothesis tests.
Availability and implementation
The DSPRED method can be accessed by visiting the PSP server at http://psp.agu.edu.tr. Source code and binaries are freely available at https://github.com/yusufzaferaydin/dspred.
Supplementary information
Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Zafer Aydin: Department of Computer Engineering, Abdullah Gul University, Kayseri, Turkey
- Nuh Azginoglu: Department of Computer Engineering, Nevsehir Haci Bektas Veli University, Nevsehir, Turkey
- Mete Celik: Department of Computer Engineering, Erciyes University, Kayseri, Turkey
|
6
|
Aydin Z, Kaynar O, Görmez Y. Dimensionality reduction for protein secondary structure and solvent accesibility prediction. J Bioinform Comput Biol 2018; 16:1850020. [PMID: 30353781 DOI: 10.1142/s0219720018500208] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Indexed: 11/18/2022]
Abstract
Secondary structure and solvent accessibility prediction provide valuable information for estimating the three dimensional structure of a protein. As new feature extraction methods are developed the dimensionality of the input feature space increases steadily. Reducing the number of dimensions provides several advantages such as faster model training, faster prediction and noise elimination. In this work, several dimensionality reduction techniques have been employed including various feature selection methods, autoencoders and PCA for protein secondary structure and solvent accessibility prediction. The reduced feature set is used to train a support vector machine at the second stage of a hybrid classifier. Cross-validation experiments on two difficult benchmarks demonstrate that the dimension of the input space can be reduced substantially while maintaining the prediction accuracy. This will enable the incorporation of additional informative features derived for predicting the structural properties of proteins without reducing the accuracy due to overfitting.
Affiliation(s)
- Zafer Aydin: Department of Computer Engineering, Abdullah Gul University, Kayseri 38080, Turkey
- Oğuz Kaynar: Department of Management Information Systems, Cumhuriyet University, Sivas 58000, Turkey
- Yasin Görmez: Department of Management Information Systems, Cumhuriyet University, Sivas 58000, Turkey
|
7
|
Yang Y, Gao J, Wang J, Heffernan R, Hanson J, Paliwal K, Zhou Y. Sixty-five years of the long march in protein secondary structure prediction: the final stretch? Brief Bioinform 2018; 19:482-494. [PMID: 28040746 PMCID: PMC5952956 DOI: 10.1093/bib/bbw129] [Citation(s) in RCA: 84] [Impact Index Per Article: 14.0] [Received: 10/10/2016] [Revised: 11/15/2016] [Indexed: 11/13/2022]
Abstract
Protein secondary structure prediction began in 1951 when Pauling and Corey predicted helical and sheet conformations for protein polypeptide backbone even before the first protein structure was determined. Sixty-five years later, powerful new methods breathe new life into this field. The highest three-state accuracy without relying on structure templates is now at 82-84%, a number unthinkable just a few years ago. These improvements came from increasingly larger databases of protein sequences and structures for training, the use of template secondary structure information and more powerful deep learning techniques. As we are approaching to the theoretical limit of three-state prediction (88-90%), alternative to secondary structure prediction (prediction of backbone torsion angles and Cα-atom-based angles and torsion angles) not only has more room for further improvement but also allows direct prediction of three-dimensional fragment structures with constantly improved accuracy. About 20% of all 40-residue fragments in a database of 1199 non-redundant proteins have <6 Å root-mean-squared distance from the native conformations by SPIDER2. More powerful deep learning methods with improved capability of capturing long-range interactions begin to emerge as the next generation of techniques for secondary structure prediction. The time has come to finish off the final stretch of the long march towards protein secondary structure prediction.
Affiliation(s)
- Yuedong Yang: Institute for Glycomics and School of Information and Communication Technology, Griffith University, Parklands Drive, Southport, QLD, Australia
- Jianzhao Gao: School of Mathematical Sciences and LPMC, Nankai University, Tianjin, China
- Jihua Wang: Shandong Provincial Key Laboratory of Biophysics, Institute of Biophysics, Dezhou University, Dezhou, China
- Rhys Heffernan: Signal Processing Laboratory, Griffith University, Brisbane, Australia
- Jack Hanson: Signal Processing Laboratory, Griffith University, Brisbane, Australia
- Kuldip Paliwal: Signal Processing Laboratory, Griffith University, Brisbane, Australia
- Yaoqi Zhou: Institute for Glycomics and School of Information and Communication Technology, Griffith University, Parklands Drive, Southport, QLD, Australia; Shandong Provincial Key Laboratory of Biophysics, Institute of Biophysics, Dezhou University, Dezhou, China
|
8
|
|
9
|
Protein secondary structure prediction: A survey of the state of the art. J Mol Graph Model 2017; 76:379-402. [DOI: 10.1016/j.jmgm.2017.07.015] [Citation(s) in RCA: 50] [Impact Index Per Article: 7.1] [Received: 06/05/2017] [Revised: 07/14/2017] [Accepted: 07/17/2017] [Indexed: 11/21/2022]
|
10
|
周 鹏. A Predictor of Protein Secondary Structure Based on a Continuously Updated Templet Library. ACTA ACUST UNITED AC 2017. [DOI: 10.12677/hjcb.2017.72002] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/09/2022]
|
11
|
Szafran AT, Mancini MG, Nickerson JA, Edwards DP, Mancini MA. Use of HCA in subproteome-immunization and screening of hybridoma supernatants to define distinct antibody binding patterns. Methods 2015; 96:75-84. [PMID: 26521976 DOI: 10.1016/j.ymeth.2015.10.021] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 09/07/2015] [Revised: 10/28/2015] [Accepted: 10/29/2015] [Indexed: 11/15/2022]
Abstract
Understanding the properties and functions of complex biological systems depends upon knowing the proteins present and the interactions between them. Recent advances in mass spectrometry have given us greater insights into the participating proteomes, however, monoclonal antibodies remain key to understanding the structures, functions, locations and macromolecular interactions of the involved proteins. The traditional single immunogen method to produce monoclonal antibodies using hybridoma technology are time, resource and cost intensive, limiting the number of reagents that are available. Using a high content analysis screening approach, we have developed a method in which a complex mixture of proteins (e.g., subproteome) is used to generate a panel of monoclonal antibodies specific to a subproteome located in a defined subcellular compartment such as the nucleus. The immunofluorescent images in the primary hybridoma screen are analyzed using an automated processing approach and classified using a recursive partitioning forest classification model derived from images obtained from the Human Protein Atlas. Using an ammonium sulfate purified nuclear matrix fraction as an example of reverse proteomics, we identified 866 hybridoma supernatants with a positive immunofluorescent signal. Of those, 402 produced a nuclear signal from which patterns similar to known nuclear matrix associated proteins were identified. Detailed here is our method, the analysis techniques, and a discussion of the application to further in vivo antibody production.
Affiliation(s)
- Adam T Szafran: Department of Molecular and Cellular Biology, Baylor College of Medicine, Houston, TX 77030, United States
- Maureen G Mancini: Department of Molecular and Cellular Biology, Baylor College of Medicine, Houston, TX 77030, United States
- Jeffrey A Nickerson: Department of Cell Biology, University of Massachusetts Medical School, Worcester, MA 01655, United States
- Dean P Edwards: Department of Molecular and Cellular Biology, Baylor College of Medicine, Houston, TX 77030, United States; Department of Immunology & Pathology, Baylor College of Medicine, Houston, TX 77030, United States
- Michael A Mancini: Department of Molecular and Cellular Biology, Baylor College of Medicine, Houston, TX 77030, United States
|
12
|
Nguyen T, Khosravi A, Creighton D, Nahavandi S. Multi-Output Interval Type-2 Fuzzy Logic System for Protein Secondary Structure Prediction. INT J UNCERTAIN FUZZ 2015. [DOI: 10.1142/s0218488515500324] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Indexed: 11/18/2022]
Abstract
A new multi-output interval type-2 fuzzy logic system (MOIT2FLS) is introduced for protein secondary structure prediction in this paper. Three outputs of the MOIT2FLS correspond to three structure classes including helix, strand (sheet) and coil. Quantitative properties of amino acids are employed to characterize twenty amino acids rather than the widely used computationally expensive binary encoding scheme. Three clustering tasks are performed using the adaptive vector quantization method to construct an equal number of initial rules for each type of secondary structure. Genetic algorithm is applied to optimally adjust parameters of the MOIT2FLS. The genetic fitness function is designed based on the Q3 measure. Experimental results demonstrate the dominance of the proposed approach against the traditional methods that are Chou-Fasman method, Garnier-Osguthorpe-Robson method, and artificial neural network models.
Affiliation(s)
- Thanh Nguyen: Centre for Intelligent Systems Research (CISR), Deakin University, Waurn Ponds, Victoria, 3216, Australia
- Abbas Khosravi: Centre for Intelligent Systems Research (CISR), Deakin University, Waurn Ponds, Victoria, 3216, Australia
- Douglas Creighton: Centre for Intelligent Systems Research (CISR), Deakin University, Waurn Ponds, Victoria, 3216, Australia
- Saeid Nahavandi: Centre for Intelligent Systems Research (CISR), Deakin University, Waurn Ponds, Victoria, 3216, Australia
|
13
|
Dai HL. Imbalanced Protein Data Classification Using Ensemble FTM-SVM. IEEE Trans Nanobioscience 2015; 14:350-359. [DOI: 10.1109/tnb.2015.2431292] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.4] [Indexed: 11/08/2022]
|
14
|
Supek F, Miñana B, Valcárcel J, Gabaldón T, Lehner B. Synonymous Mutations Frequently Act as Driver Mutations in Human Cancers. Cell 2014; 156:1324-1335. [DOI: 10.1016/j.cell.2014.01.051] [Citation(s) in RCA: 331] [Impact Index Per Article: 33.1] [Received: 05/01/2013] [Revised: 11/20/2013] [Accepted: 01/15/2014] [Indexed: 01/05/2023]
|
15
|
Mao W, Cong P, Wang Z, Lu L, Zhu Z, Li T. NMRDSP: an accurate prediction of protein shape strings from NMR chemical shifts and sequence data. PLoS One 2013; 8:e83532. [PMID: 24376713 PMCID: PMC3871590 DOI: 10.1371/journal.pone.0083532] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 07/27/2013] [Accepted: 11/04/2013] [Indexed: 11/28/2022]
Abstract
Shape string is structural sequence and is an extremely important structure representation of protein backbone conformations. Nuclear magnetic resonance chemical shifts give a strong correlation with the local protein structure, and are exploited to predict protein structures in conjunction with computational approaches. Here we demonstrate a novel approach, NMRDSP, which can accurately predict the protein shape string based on nuclear magnetic resonance chemical shifts and structural profiles obtained from sequence data. The NMRDSP uses six chemical shifts (HA, H, N, CA, CB and C) and eight elements of structure profiles as features, a non-redundant set (1,003 entries) as the training set, and a conditional random field as a classification algorithm. For an independent testing set (203 entries), we achieved an accuracy of 75.8% for S8 (the eight states accuracy) and 87.8% for S3 (the three states accuracy). This is higher than only using chemical shifts or sequence data, and confirms that the chemical shift and the structure profile are significant features for shape string prediction and their combination prominently improves the accuracy of the predictor. We have constructed the NMRDSP web server and believe it could be employed to provide a solid platform to predict other protein structures and functions. The NMRDSP web server is freely available at http://cal.tongji.edu.cn/NMRDSP/index.jsp.
Affiliation(s)
- Wusong Mao: Department of Chemistry, Tongji University, Shanghai, China
- Peisheng Cong: Department of Chemistry, Tongji University, Shanghai, China
- Zhiheng Wang: Department of Chemistry, Tongji University, Shanghai, China
- Longjian Lu: Department of Chemistry, Tongji University, Shanghai, China
- Zhongliang Zhu: Department of Chemistry, Tongji University, Shanghai, China
- Tonghua Li: Department of Chemistry, Tongji University, Shanghai, China
|
16
|
Ghanty P, Pal NR, Mudi RK. Prediction of protein secondary structure using probability based features and a hybrid system. J Bioinform Comput Biol 2013; 11:1350012. [DOI: 10.1142/s0219720013500121] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Indexed: 11/18/2022]
Abstract
In this paper, we propose some co-occurrence probability-based features for prediction of protein secondary structure. The features are extracted using occurrence/nonoccurrence of secondary structures in the protein sequences. We explore two types of features: position-specific (based on position of amino acid on fragments of protein sequences) as well as position-independent (independent of amino acid position on fragments of protein sequences). We use a hybrid system, NEUROSVM, consisting of neural networks and support vector machines for classification of secondary structures. We propose two schemes NSVMps and NSVM for protein secondary structure prediction. The NSVMps uses position-specific probability-based features and NEUROSVM classifier whereas NSVM uses the same classifier with position-independent probability-based features. The proposed method falls in the single-sequence category of methods because it does not use any sequence profile information such as position specific scoring matrices (PSSM) derived from PSI-BLAST. Two widely used datasets RS126 and CB513 are used in the experiments. The results obtained using the proposed features and NEUROSVM classifier are better than most of the existing single-sequence prediction methods. Most importantly, the results using NSVMps that are obtained using lower dimensional features, are comparable to those by other existing methods. The NSVMps and NSVM are finally tested on target proteins of the critical assessment of protein structure prediction experiment-9 (CASP9). A larger dataset is used to compare the performance of the proposed methods with that of two recent single-sequence prediction methods. We also investigate the impact of presence of different amino acid residues (in protein sequences) that are responsible for the formation of different secondary structures.
Affiliation(s)
- Pradip Ghanty: Praxis Softek Solutions Private Limited, Module 616, SDF Building, Sector V, Saltlake, Kolkata, India
- Nikhil R. Pal: Electronics and Communication Sciences Unit, Indian Statistical Institute, 203 B. T. Road, Calcutta 700108, India
- Rajani K. Mudi: Department of Instrumentation and Electronics Engineering, Jadavpur University, Saltlake Campus, Kolkata, India
|
17
|
Cong P, Li D, Wang Z, Tang S, Li T. SPSSM8: an accurate approach for predicting eight-state secondary structures of proteins. Biochimie 2013; 95:2460-4. [PMID: 24056076 DOI: 10.1016/j.biochi.2013.09.007] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Received: 06/08/2013] [Accepted: 09/09/2013] [Indexed: 11/15/2022]
Abstract
Protein eight-state secondary structure prediction is challenging, but is necessary to determine protein structure and function. Here, we report the development of a novel approach, SPSSM8, to predict eight-state secondary structures of proteins accurately from sequences based on the structural position-specific scoring matrix (SPSSM). The SPSSM has been successfully utilized to predict three-state secondary structures. Now we employ an eight-state SPSSM as a feature that is obtained from sequence structure alignment against a large database of 9 million sequences with putative structural information. The SPSSM8 uses a low sequence identity dataset (9062 entries) as a training set and conditional random field for the classification algorithm. The SPSSM8 achieved an average eight-state secondary structure accuracy (Q8) of 71.7% (Q3, 81.6%) for an independent testing set (463 entries), which had an improved accuracy of 10.1% and 4.6% compared with SSPro8 and CNF, respectively, and significantly improved the accuracy of eight-state secondary structure prediction. For CASP 9 dataset (92 entries) the SPSSM8 achieved a Q8 accuracy of 80.1% (Q3, 83.0%). The SPSSM8 was confirmed as an outstanding predictor for eight-state secondary structures of proteins. SPSSM8 is freely available at http://cal.tongji.edu.cn/SPSSM8.
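The abstract reports both Q8 and Q3 for the same predictions, which implies collapsing eight states to three before scoring Q3. The sketch below uses one widely used reduction of DSSP codes (H, G, I to helix; E, B to strand; everything else to coil). The abstract does not say which reduction SPSSM8 uses, so this mapping and the name `reduce_to_three_states` are assumptions for illustration only.

```python
# One common DSSP eight-to-three-state reduction (assumed, not from the paper).
# '-' is used here for the eighth state (no assigned structure).
EIGHT_TO_THREE = {
    "H": "H", "G": "H", "I": "H",   # alpha, 3-10 and pi helices -> helix
    "E": "E", "B": "E",             # extended strand, isolated bridge -> strand
    "T": "C", "S": "C", "-": "C",   # turn, bend, other -> coil
}


def reduce_to_three_states(eight_state: str) -> str:
    """Map an eight-state DSSP string to three states (H, E, C)."""
    try:
        return "".join(EIGHT_TO_THREE[s] for s in eight_state)
    except KeyError as err:
        raise ValueError(f"unknown DSSP state: {err.args[0]!r}") from None


three = reduce_to_three_states("HGIEBTS-")
print(three)
```

Because several eight-state classes merge into one, two predictions that differ at the eight-state level can agree after reduction, which is why Q3 is never lower than Q8 for the same prediction under a fixed mapping.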
Affiliation(s)
- Peisheng Cong: Department of Chemistry, Tongji University, Shanghai, PR China
|
18
|
Zhang XY, Lu LJ, Song Q, Yang QQ, Li DP, Sun JM, Li TH, Cong PS. DomHR: accurately identifying domain boundaries in proteins using a hinge region strategy. PLoS One 2013; 8:e60559. [PMID: 23593247 PMCID: PMC3623903 DOI: 10.1371/journal.pone.0060559] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.9] [Received: 11/03/2012] [Accepted: 02/27/2013] [Indexed: 11/18/2022]
Abstract
Motivation The precise prediction of protein domains, which are the structural, functional and evolutionary units of proteins, has been a research focus in recent years. Although many methods have been presented for predicting protein domains and boundaries, the accuracy of predictions could be improved. Results In this study we present a novel approach, DomHR, which is an accurate predictor of protein domain boundaries based on a creative hinge region strategy. A hinge region was defined as a segment of amino acids that covers part of a domain region and a boundary region. We developed a strategy to construct profiles of domain-hinge-boundary (DHB) features generated by sequence-domain/hinge/boundary alignment against a database of known domain structures. The DHB features had three elements: normalized domain, hinge, and boundary probabilities. The DHB features were used as input to identify domain boundaries in a sequence. DomHR used a nonredundant dataset as the training set, the DHB and predicted shape string as features, and a conditional random field as the classification algorithm. In predicted hinge regions, a residue was determined to be a domain or a boundary according to a decision threshold. After decision thresholds were optimized, DomHR was evaluated by cross-validation, large-scale prediction, independent test and CASP (Critical Assessment of Techniques for Protein Structure Prediction) tests. All results confirmed that DomHR outperformed other well-established, publicly available domain boundary predictors for prediction accuracy. Availability The DomHR is available at http://cal.tongji.edu.cn/domain/.
Affiliation(s)
- Xiao-yan Zhang: Department of Chemistry, Tongji University, Shanghai, China
- Long-jian Lu: Department of Chemistry, Tongji University, Shanghai, China
- Qi Song: Department of Chemistry, Tongji University, Shanghai, China
- Qian-qian Yang: Department of Chemistry, Tongji University, Shanghai, China
- Da-peng Li: Department of Chemistry, Tongji University, Shanghai, China
- Jiang-ming Sun: Department of Chemistry, Tongji University, Shanghai, China
- Tong-hua Li: Department of Chemistry, Tongji University, Shanghai, China
- Pei-sheng Cong: Department of Chemistry, Tongji University, Shanghai, China
|
19
|
Song Q, Li T, Cong P, Sun J, Li D, Tang S. Predicting turns in proteins with a unified model. PLoS One 2012; 7:e48389. [PMID: 23144872 PMCID: PMC3492357 DOI: 10.1371/journal.pone.0048389] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Received: 06/22/2012] [Accepted: 09/24/2012] [Indexed: 11/18/2022]
Abstract
MOTIVATION Turns are a critical element of the structure of a protein; turns play a crucial role in loops, folds, and interactions. Current prediction methods are well developed for the prediction of individual turn types, including α-turn, β-turn, and γ-turn, etc. However, for further protein structure and function prediction it is necessary to develop a uniform model that can accurately predict all types of turns simultaneously. RESULTS In this study, we present a novel approach, TurnP, which offers the ability to investigate all the turns in a protein based on a unified model. The main characteristics of TurnP are: (i) using newly exploited features of structural evolution information (secondary structure and shape string of protein) based on structure homologies, (ii) considering all types of turns in a unified model, and (iii) practical capability of accurate prediction of all turns simultaneously for a query. TurnP utilizes predicted secondary structures and predicted shape strings, both of which have greater accuracy, based on innovative technologies which were both developed by our group. Then, sequence and structural evolution features, which are profile of sequence, profile of secondary structures and profile of shape strings are generated by sequence and structure alignment. When TurnP was validated on a non-redundant dataset (4,107 entries) by five-fold cross-validation, we achieved an accuracy of 88.8% and a sensitivity of 71.8%, which exceeded the most state-of-the-art predictors of certain type of turn. Newly determined sequences, the EVA and CASP9 datasets were used as independent tests and the results we achieved were outstanding for turn predictions and confirmed the good performance of TurnP for practical applications.
Affiliation(s)
- Qi Song
- Department of Chemistry, Tongji University, Shanghai, China
- Tonghua Li
- Department of Chemistry, Tongji University, Shanghai, China
- Peisheng Cong
- Department of Chemistry, Tongji University, Shanghai, China
- Jiangming Sun
- Department of Chemistry, Tongji University, Shanghai, China
- Dapeng Li
- Department of Chemistry, Tongji University, Shanghai, China
- Shengnan Tang
- Department of Chemistry, Tongji University, Shanghai, China
|
20
|
Sun J, Tang S, Xiong W, Cong P, Li T. DSP: a protein shape string and its profile prediction server. Nucleic Acids Res 2012; 40:W298-302. [PMID: 22553364 PMCID: PMC3394270 DOI: 10.1093/nar/gks361] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.1] [Indexed: 11/14/2022]
Abstract
Many studies have demonstrated that shape string is an extremely important structure representation, since it is more complete than the classical secondary structure. The shape string provides detailed information also in the regions denoted random coil. But few services are provided for systematic analysis of protein shape string. To fill this gap, we have developed an accurate shape string predictor based on two innovative technologies: a knowledge-driven sequence alignment and a sequence shape string profile method. The performance on blind test data demonstrates that the proposed method can be used for accurate prediction of protein shape string. The DSP server provides both predicted shape string and sequence shape string profile for each query sequence. Using this information, the users can compare protein structure or display protein evolution in shape string space. The DSP server is available at both http://cheminfo.tongji.edu.cn/dsp/ and its main mirror http://chemcenter.tongji.edu.cn/dsp/.
Affiliation(s)
- Jiangming Sun
- Department of Chemistry, Tongji University, 1239 Siping Road, Shanghai 200092, China
|
21
|
Sun JM, Li TH, Cong PS, Tang SN, Xiong WW. Retrieving backbone string neighbors provides insights into structural modeling of membrane proteins. Mol Cell Proteomics 2012; 11:M111.016808. [PMID: 22415040 DOI: 10.1074/mcp.m111.016808] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 01/12/2023]
Abstract
Identification of protein structural neighbors to a query is fundamental in structure and function prediction. Here we present BS-align, a systematic method to retrieve backbone string neighbors from primary sequences as templates for protein modeling. The backbone conformation of a protein is represented by the backbone string, as defined in Ramachandran space. The backbone string of a query can be accurately predicted by two innovative technologies: a knowledge-driven sequence alignment and encoding of a backbone string element profile. Then, the predicted backbone string is employed to align against a backbone string database and retrieve a set of backbone string neighbors. The backbone string neighbors were shown to be close to native structures of query proteins. BS-align was successfully employed to predict models of 10 membrane proteins with lengths ranging between 229 and 595 residues, and whose high-resolution structural determinations were difficult to elucidate both by experiment and prediction. The obtained TM-scores and root mean square deviations of the models confirmed that the models based on the backbone string neighbors retrieved by the BS-align were very close to the native membrane structures although the query and the neighbor shared a very low sequence identity. The backbone string system represents a new road for the prediction of protein structure from sequence, and suggests that the similarity of the backbone string would be more informative than describing a protein as belonging to a fold.
Affiliation(s)
- Jiang-Ming Sun
- Department of Chemistry, Tongji University, 1239 Siping Road, Shanghai 200092, China
|