1
|
Sidi T, Keasar C. Redundancy-weighting the PDB for detailed secondary structure prediction using deep-learning models. Bioinformatics 2020; 36:3733-3738. [PMID: 32186698 DOI: 10.1093/bioinformatics/btaa196] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/10/2019] [Revised: 03/12/2020] [Accepted: 03/16/2020] [Indexed: 11/15/2022] Open
Abstract
MOTIVATION The Protein Data Bank (PDB), the ultimate source for data in structural biology, is inherently imbalanced. To alleviate biases, virtually all structural biology studies use nonredundant (NR) subsets of the PDB, which include only a fraction of the available data. An alternative approach, dubbed redundancy-weighting (RW), down-weights redundant entries rather than discarding them. This approach may be particularly helpful for machine-learning (ML) methods that use the PDB as their source for data. Methods for secondary structure prediction (SSP) have greatly improved over the years with recent studies achieving above 70% accuracy for eight-class (DSSP) prediction. As these methods typically incorporate ML techniques, training on RW datasets might improve accuracy, as well as pave the way toward larger and more informative secondary structure classes. RESULTS This study compares the SSP performances of deep-learning models trained on either RW or NR datasets. We show that training on RW sets consistently results in better prediction of 3- (HCE), 8- (DSSP) and 13-class (STR2) secondary structures. AVAILABILITY AND IMPLEMENTATION The ML models, the datasets used for their derivation and testing, and a stand-alone SSP program for DSSP and STR2 predictions, are freely available under LGPL license in http://meshi1.cs.bgu.ac.il/rw. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Tomer Sidi
- Department of Computer Science, Ben-Gurion University, P.O.B 653, Be'er Sheva 84105, Israel
| | - Chen Keasar
- Department of Computer Science, Ben-Gurion University, P.O.B 653, Be'er Sheva 84105, Israel
| |
Collapse
|
2
|
Huang GD, Liu XM, Huang TL, Xia LC. The statistical power of k-mer based aggregative statistics for alignment-free detection of horizontal gene transfer. Synth Syst Biotechnol 2019; 4:150-156. [PMID: 31508512 PMCID: PMC6723412 DOI: 10.1016/j.synbio.2019.08.001] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/06/2019] [Revised: 07/14/2019] [Accepted: 08/05/2019] [Indexed: 12/21/2022] Open
Abstract
Alignment-based database search and sequence comparison are commonly used to detect horizontal gene transfer (HGT). However, with the rapid increase of sequencing depth, hundreds of thousands of contigs are routinely assembled from metagenomics studies, which challenges alignment-based HGT analysis by overwhelming the known reference sequences. Detecting HGT by k-mer statistics thus becomes an attractive alternative. These alignment-free statistics have been demonstrated in high performance and efficiency in whole-genome and transcriptome comparisons. To adapt k-mer statistics for HGT detection, we developed two aggregative statistics TsumS and Tsum*, which subsample metagenome contigs by their representative regions, and summarize the regional D2S and D2* metrics by their upper bounds. We systematically studied the aggregative statistics’ power at different k-mer size using simulations. Our analysis showed that, in general, the power of TsumS and Tsum* increases with sequencing coverage, and reaches a maximum power >80% at k = 6, with 5% Type-I error and the coverage ratio >0.2x. The statistical power of TsumS and Tsum* was evaluated with realistic simulations of HGT mechanism, sequencing depth, read length, and base error. We expect these statistics to be useful distance metrics for identifying HGT in metagenomic studies.
Collapse
Affiliation(s)
- Guan-Da Huang
- School of Physics and Optoelectronics, South China University of Technology, Guangzhou, 510640, China
| | - Xue-Mei Liu
- School of Physics and Optoelectronics, South China University of Technology, Guangzhou, 510640, China
| | - Tian-Lai Huang
- School of Physics and Optoelectronics, South China University of Technology, Guangzhou, 510640, China
| | - Li-C Xia
- Department of Medicine, Stanford University School of Medicine, Stanford, CA, 94305, USA
| |
Collapse
|
3
|
Predicting protein-ligand interactions based on bow-pharmacological space and Bayesian additive regression trees. Sci Rep 2019; 9:7703. [PMID: 31118426 PMCID: PMC6531441 DOI: 10.1038/s41598-019-43125-6] [Citation(s) in RCA: 27] [Impact Index Per Article: 5.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/29/2017] [Accepted: 04/12/2019] [Indexed: 02/08/2023] Open
Abstract
Identifying potential protein-ligand interactions is central to the field of drug discovery as it facilitates the identification of potential novel drug leads, contributes to advancement from hits to leads, predicts potential off-target explanations for side effects of approved drugs or candidates, as well as de-orphans phenotypic hits. For the rapid identification of protein-ligand interactions, we here present a novel chemogenomics algorithm for the prediction of protein-ligand interactions using a new machine learning approach and novel class of descriptor. The algorithm applies Bayesian Additive Regression Trees (BART) on a newly proposed proteochemical space, termed the bow-pharmacological space. The space spans three distinctive sub-spaces that cover the protein space, the ligand space, and the interaction space. Thereby, the model extends the scope of classical target prediction or chemogenomic modelling that relies on one or two of these subspaces. Our model demonstrated excellent prediction power, reaching accuracies of up to 94.5–98.4% when evaluated on four human target datasets constituting enzymes, nuclear receptors, ion channels, and G-protein-coupled receptors . BART provided a reliable probabilistic description of the likelihood of interaction between proteins and ligands, which can be used in the prioritization of assays to be performed in both discovery and vigilance phases of small molecule development.
Collapse
|
4
|
Oldfield CJ, Chen K, Kurgan L. Computational Prediction of Secondary and Supersecondary Structures from Protein Sequences. Methods Mol Biol 2019; 1958:73-100. [PMID: 30945214 DOI: 10.1007/978-1-4939-9161-7_4] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/12/2022]
Abstract
Many new methods for the sequence-based prediction of the secondary and supersecondary structures have been developed over the last several years. These and older sequence-based predictors are widely applied for the characterization and prediction of protein structure and function. These efforts have produced countless accurate predictors, many of which rely on state-of-the-art machine learning models and evolutionary information generated from multiple sequence alignments. We describe and motivate both types of predictions. We introduce concepts related to the annotation and computational prediction of the three-state and eight-state secondary structure as well as several types of supersecondary structures, such as β hairpins, coiled coils, and α-turn-α motifs. We review 34 predictors focusing on recent tools and provide detailed information for a selected set of 14 secondary structure and 3 supersecondary structure predictors. We conclude with several practical notes for the end users of these predictive methods.
Collapse
Affiliation(s)
- Christopher J Oldfield
- Department of Computer Science, College of Engineering, Virginia Commonwealth University, Richmond, VA, USA
| | - Ke Chen
- School of Computer Science and Software Engineering, Tianjin Polytechnic University, Tianjin, People's Republic of China
| | - Lukasz Kurgan
- Department of Computer Science, College of Engineering, Virginia Commonwealth University, Richmond, VA, USA.
| |
Collapse
|
5
|
Wang L, Law J, Kale SD, Murali TM, Pandey G. Large-scale protein function prediction using heterogeneous ensembles. F1000Res 2018; 7. [PMID: 30450194 PMCID: PMC6221071 DOI: 10.12688/f1000research.16415.1] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Accepted: 09/26/2018] [Indexed: 12/24/2022] Open
Abstract
Heterogeneous ensembles are an effective approach in scenarios where the ideal data type and/or individual predictor are unclear for a given problem. These ensembles have shown promise for protein function prediction (PFP), but their ability to improve PFP at a large scale is unclear. The overall goal of this study is to critically assess this ability of a variety of heterogeneous ensemble methods across a multitude of functional terms, proteins and organisms. Our results show that these methods, especially Stacking using Logistic Regression, indeed produce more accurate predictions for a variety of Gene Ontology terms differing in size and specificity. To enable the application of these methods to other related problems, we have publicly shared the HPC-enabled code underlying this work as LargeGOPred ( https://github.com/GauravPandeyLab/LargeGOPred).
Collapse
Affiliation(s)
- Linhua Wang
- Department of Genetics and Genomic Sciences and Icahn Institute for Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, New York, NY, 10029, USA
| | - Jeffrey Law
- Genetics, Bioinformatics, and Computational Biology Ph.D. Program, Virginia Polytechnic Institute and State University, Blacksburg, VA, 24061, USA
| | - Shiv D Kale
- Biocomplexity Institute, Virginia Polytechnic Institute and State University, Blacksburg, VA, 24061, USA
| | - T M Murali
- Department of Computer Science, Virginia Polytechnic Institute and State University, Blacksburg, VA, 24061, USA
| | - Gaurav Pandey
- Department of Genetics and Genomic Sciences and Icahn Institute for Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, New York, NY, 10029, USA
| |
Collapse
|
6
|
Protein secondary structure prediction: A survey of the state of the art. J Mol Graph Model 2017; 76:379-402. [DOI: 10.1016/j.jmgm.2017.07.015] [Citation(s) in RCA: 50] [Impact Index Per Article: 7.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/05/2017] [Revised: 07/14/2017] [Accepted: 07/17/2017] [Indexed: 11/21/2022]
|
7
|
RNAMethPre: A Web Server for the Prediction and Query of mRNA m6A Sites. PLoS One 2016; 11:e0162707. [PMID: 27723837 PMCID: PMC5056760 DOI: 10.1371/journal.pone.0162707] [Citation(s) in RCA: 40] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/16/2016] [Accepted: 08/27/2016] [Indexed: 01/01/2023] Open
Abstract
N6-Methyladenosine (m6A) is the most common mRNA modification; it occurs in a wide range of taxon and is associated with many key biological processes. High-throughput experiments have identified m6A-peaks and sites across the transcriptome, but studies of m6A sites at the transcriptome-wide scale are limited to a few species and tissue types. Therefore, the computational prediction of mRNA m6A sites has become an important strategy. In this study, we integrated multiple features of mRNA (flanking sequences, local secondary structure information, and relative position information) and trained a SVM classifier to predict m6A sites in mammalian mRNA sequences. Our method achieves ideal performance in both cross-validation tests and rigorous independent dataset tests. The server also provides a comprehensive database of predicted transcriptome-wide m6A sites and curated m6A-seq peaks from the literature for both human and mouse, and these can be queried and visualized in a genome browser. The RNAMethPre web server provides a user-friendly tool for the prediction and query of mRNA m6A sites, which is freely accessible for public use at http://bioinfo.tsinghua.edu.cn/RNAMethPre/index.html.
Collapse
|
8
|
Liu B, Wang S, Dong Q, Li S, Liu X. Identification of DNA-binding proteins by combining auto-cross covariance transformation and ensemble learning. IEEE Trans Nanobioscience 2016; 15:328-334. [PMID: 28113908 DOI: 10.1109/tnb.2016.2555951] [Citation(s) in RCA: 65] [Impact Index Per Article: 8.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022]
Abstract
DNA-binding proteins play a pivotal role in various intra- and extra-cellular activities ranging from DNA replication to gene expression control. With the rapid development of next generation of sequencing technique, the number of protein sequences is unprecedentedly increasing. Thus it is necessary to develop computational methods to identify the DNA-binding proteins only based on the protein sequence information. In this study, a novel method called iDNA-KACC is presented, which combines the Support Vector Machine (SVM) and the auto-cross covariance transformation. The protein sequences are first converted into profile-based protein representation, and then converted into a series of fixed-length vectors by the auto-cross covariance transformation with Kmer composition. The sequence order effect can be effectively captured by this scheme. These vectors are then fed into Support Vector Machine (SVM) to discriminate the DNA-binding proteins from the non DNA-binding ones. iDNA-KACC achieves an overall accuracy of 75.16% and Matthew correlation coefficient of 0.5 by a rigorous jackknife test. Its performance is further improved by employing an ensemble learning approach, and the improved predictor is called iDNA-KACC-EL. Experimental results on an independent dataset shows that iDNA-KACC-EL outperforms all the other state-of-the-art predictors, indicating that it would be a useful computational tool for DNA binding protein identification. .
Collapse
|
9
|
Xiang S, Yan Z, Liu K, Zhang Y, Sun Z. AthMethPre: a web server for the prediction and query of mRNA m6A sites in Arabidopsis thaliana. MOLECULAR BIOSYSTEMS 2016; 12:3333-3337. [DOI: 10.1039/c6mb00536e] [Citation(s) in RCA: 40] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/27/2022]
Abstract
The first web server that provides a user-friendly tool for the prediction and query of A. thaliana mRNA m6A sites.
Collapse
Affiliation(s)
- Shunian Xiang
- MOE Key Laboratory of Bioinformatics
- School of Life Sciences
- Tsinghua University
- Beijing 100084
- P. R. China
| | - Zhangming Yan
- MOE Key Laboratory of Bioinformatics
- School of Life Sciences
- Tsinghua University
- Beijing 100084
- P. R. China
| | - Ke Liu
- Key Lab in Healthy Science and Technology
- Division of Life Science
- Graduate School at Shenzhen
- Tsinghua University
- Shenzhen
| | - Yaou Zhang
- Department of Statistics
- University of California
- Berkeley
- USA
| | - Zhirong Sun
- MOE Key Laboratory of Bioinformatics
- School of Life Sciences
- Tsinghua University
- Beijing 100084
- P. R. China
| |
Collapse
|
10
|
Li Q, Dahl DB, Vannucci M, Hyun Joo, Tsai JW. Bayesian model of protein primary sequence for secondary structure prediction. PLoS One 2014; 9:e109832. [PMID: 25314659 PMCID: PMC4196994 DOI: 10.1371/journal.pone.0109832] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/02/2014] [Accepted: 09/02/2014] [Indexed: 01/26/2023] Open
Abstract
Determining the primary structure (i.e., amino acid sequence) of a protein has become cheaper, faster, and more accurate. Higher order protein structure provides insight into a protein's function in the cell. Understanding a protein's secondary structure is a first step towards this goal. Therefore, a number of computational prediction methods have been developed to predict secondary structure from just the primary amino acid sequence. The most successful methods use machine learning approaches that are quite accurate, but do not directly incorporate structural information. As a step towards improving secondary structure reduction given the primary structure, we propose a Bayesian model based on the knob-socket model of protein packing in secondary structure. The method considers the packing influence of residues on the secondary structure determination, including those packed close in space but distant in sequence. By performing an assessment of our method on 2 test sets we show how incorporation of multiple sequence alignment data, similarly to PSIPRED, provides balance and improves the accuracy of the predictions. Software implementing the methods is provided as a web application and a stand-alone implementation.
Collapse
Affiliation(s)
- Qiwei Li
- Department of Statistics, Rice University, Houston, Texas, United States of America
| | - David B. Dahl
- Department of Statistics, Brigham Young University, Provo, Utah, United States of America
| | - Marina Vannucci
- Department of Statistics, Rice University, Houston, Texas, United States of America
| | - Hyun Joo
- Department of Chemistry, University of the Pacific, Stockton, California, United States of America
| | - Jerry W. Tsai
- Department of Chemistry, University of the Pacific, Stockton, California, United States of America
| |
Collapse
|
11
|
Saraswathi S, Fernández-Martínez JL, Koliński A, Jernigan RL, Kloczkowski A. Distributions of amino acids suggest that certain residue types more effectively determine protein secondary structure. J Mol Model 2013; 19:4337-48. [PMID: 23907551 DOI: 10.1007/s00894-013-1911-z] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/20/2013] [Accepted: 06/05/2013] [Indexed: 11/27/2022]
Abstract
Exponential growth in the number of available protein sequences is unmatched by the slower growth in the number of structures. As a result, the development of efficient and fast protein secondary structure prediction methods is essential for the broad comprehension of protein structures. Computational methods that can efficiently determine secondary structure can in turn facilitate protein tertiary structure prediction, since most methods rely initially on secondary structure predictions. Recently, we have developed a fast learning optimized prediction methodology (FLOPRED) for predicting protein secondary structure (Saraswathi et al. in JMM 18:4275, 2012). Data are generated by using knowledge-based potentials combined with structure information from the CATH database. A neural network-based extreme learning machine (ELM) and advanced particle swarm optimization (PSO) are used with this data to obtain better and faster convergence to more accurate secondary structure predicted results. A five-fold cross-validated testing accuracy of 83.8 % and a segment overlap (SOV) score of 78.3 % are obtained in this study. Secondary structure predictions and their accuracy are usually presented for three secondary structure elements: α-helix, β-strand and coil but rarely have the results been analyzed with respect to their constituent amino acids. In this paper, we use the results obtained with FLOPRED to provide detailed behaviors for different amino acid types in the secondary structure prediction. We investigate the influence of the composition, physico-chemical properties and position specific occurrence preferences of amino acids within secondary structure elements. In addition, we identify the correlation between these properties and prediction accuracy. The present detailed results suggest several important ways that secondary structure predictions can be improved in the future that might lead to improved protein design and engineering.
Collapse
Affiliation(s)
- S Saraswathi
- Battelle Center for Mathematical Medicine, The Research Institute at Nationwide Children's Hospital, 700 Children's Drive, Columbus, OH, USA
| | | | | | | | | |
Collapse
|
12
|
Li Y, Liu H, Rata I, Jakobsson E. Building a knowledge-based statistical potential by capturing high-order inter-residue interactions and its applications in protein secondary structure assessment. J Chem Inf Model 2013; 53:500-8. [PMID: 23336295 DOI: 10.1021/ci300207x] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/30/2022]
Abstract
The rapidly increasing number of protein crystal structures available in the Protein Data Bank (PDB) has naturally made statistical analyses feasible in studying complex high-order inter-residue correlations. In this paper, we report a context-based secondary structure potential (CSSP) for assessing the quality of predicted protein secondary structures generated by various prediction servers. CSSP is a sequence-position-specific knowledge-based potential generated based on the potentials of mean force approach, where high-order inter-residue interactions are taken into consideration. The CSSP potential is effective in identifying secondary structure predictions with good quality. In 56% of the targets in the CB513 benchmark, the optimal CSSP potential is able to recognize the native secondary structure or a prediction with Q3 accuracy higher than 90% as best scored in the predicted secondary structures generated by 10 popularly used secondary structure prediction servers. In more than 80% of the CB513 targets, the predicted secondary structures with the lowest CSSP potential values yield higher than 80% Q3 accuracy. Similar performance of CSSP is found on the CASP9 targets as well. Moreover, our computational results also show that the CSSP potential using triplets outperforms the CSSP potential using doublets and is currently better than the CSSP potential using quartets.
Collapse
Affiliation(s)
- Yaohang Li
- Department of Computer Science, Old Dominion University, Norfolk, Virginia, USA.
| | | | | | | |
Collapse
|
13
|
Yan J, Marcus M, Kurgan L. Comprehensively designed consensus of standalone secondary structure predictors improves Q3 by over 3%. J Biomol Struct Dyn 2013; 32:36-51. [PMID: 23298369 DOI: 10.1080/07391102.2012.746945] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/20/2022]
Abstract
Protein fold is defined by a spatial arrangement of three types of secondary structures (SSs) including helices, sheets, and coils/loops. Current methods that predict SS from sequences rely on complex machine learning-derived models and provide the three-state accuracy (Q3) at about 82%. Further improvements in predictive quality could be obtained with a consensus-based approach, which so far received limited attention. We perform first-of-its-kind comprehensive design of a SS consensus predictor (SScon), in which we consider 12 modern standalone SS predictors and utilize Support Vector Machine (SVM) to combine their predictions. Using a large benchmark data-set with 10 random training-test splits, we show that a simple, voting-based consensus of carefully selected base methods improves Q3 by 1.9% when compared to the best single predictor. Use of SVM provides additional 1.4% improvement with the overall Q3 at 85.6% and segment overlap (SOV3) at 83.7%, when compared to 82.3 and 80.9%, respectively, obtained by the best individual methods. We also show strong improvements when the consensus is based on ab-initio methods, with Q3 = 82.3% and SOV3 = 80.7% that match the results from the best template-based approaches. Our consensus reduces the number of significant errors where helix is confused with a strand, provides particularly good results for short helices and strands, and gives the most accurate estimates of the content of individual SSs in the chain. Case studies are used to visualize the improvements offered by the consensus at the residue level. A web-server and a standalone implementation of SScon are available at http://biomine.ece.ualberta.ca/SSCon/ .
Collapse
Affiliation(s)
- Jing Yan
- a Department of Electrical and Computer Engineering , University of Alberta , Edmonton , Canada
| | | | | |
Collapse
|
14
|
Saraswathi S, Fernández-Martínez JL, Kolinski A, Jernigan RL, Kloczkowski A. Fast learning optimized prediction methodology (FLOPRED) for protein secondary structure prediction. J Mol Model 2012; 18:4275-89. [PMID: 22562230 PMCID: PMC3694724 DOI: 10.1007/s00894-012-1410-7] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/14/2012] [Accepted: 03/19/2012] [Indexed: 10/28/2022]
Abstract
Computational methods are rapidly gaining importance in the field of structural biology, mostly due to the explosive progress in genome sequencing projects and the large disparity between the number of sequences and the number of structures. There has been an exponential growth in the number of available protein sequences and a slower growth in the number of structures. There is therefore an urgent need to develop computational methods to predict structures and identify their functions from the sequence. Developing methods that will satisfy these needs both efficiently and accurately is of paramount importance for advances in many biomedical fields, including drug development and discovery of biomarkers. A novel method called fast learning optimized prediction methodology (FLOPRED) is proposed for predicting protein secondary structure, using knowledge-based potentials combined with structure information from the CATH database. A neural network-based extreme learning machine (ELM) and advanced particle swarm optimization (PSO) are used with this data that yield better and faster convergence to produce more accurate results. Protein secondary structures are predicted reliably, more efficiently and more accurately using FLOPRED. These techniques yield superior classification of secondary structure elements, with a training accuracy ranging between 83 % and 87 % over a widerange of hidden neurons and a cross-validated testing accuracy ranging between 81 % and 84 % and a segment overlap (SOV) score of 78 % that are obtained with different sets of proteins. These results are comparable to other recently published studies, but are obtained with greater efficiencies, in terms of time and cost.
Collapse
Affiliation(s)
- Saras Saraswathi
- Battelle Center for Mathematical Medicine, The Research Institute at Nationwide Children’s Hospital, 700 Children’s Drive, Columbus, OH, USA
| | | | - Andrzej Kolinski
- Department of Mathematics, Laboratory of Theory of Biopolymers, Faculty of Chemistry, Warsaw University, Pasteura 1, 02-093 Warsaw
| | - Robert L. Jernigan
- Department of Biochemistry, Biophysics and Molecular Biology, Iowa State University, Ames, IA, USA
| | - Andrzej Kloczkowski
- Battelle Center for Mathematical Medicine, The Research Institute at Nationwide Children’s Hospital, 700 Children’s Drive, Columbus, OH, USA, Tel.: +1-614-722-3880 Fax: +1-614-355-2728
| |
Collapse
|
15
|
Chen K, Kurgan L. Computational prediction of secondary and supersecondary structures. Methods Mol Biol 2012; 932:63-86. [PMID: 22987347 DOI: 10.1007/978-1-62703-065-6_5] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/21/2022]
Abstract
The sequence-based prediction of the secondary and supersecondary structures enjoys strong interest and finds applications in numerous areas related to the characterization and prediction of protein structure and function. Substantial efforts in these areas over the last three decades resulted in the development of accurate predictors, which take advantage of modern machine learning models and availability of evolutionary information extracted from multiple sequence alignment. In this chapter, we first introduce and motivate both prediction areas and introduce basic concepts related to the annotation and prediction of the secondary and supersecondary structures, focusing on the β hairpin, coiled coil, and α-turn-α motifs. Next, we overview state-of-the-art prediction methods, and we provide details for 12 modern secondary structure predictors and 4 representative supersecondary structure predictors. Finally, we provide several practical notes for the users of these prediction tools.
Collapse
Affiliation(s)
- Ke Chen
- Department of Electrical and Computer Engineering, University of Alberta, Edmonton, AB, Canada
| | | |
Collapse
|
16
|
Wei Y, Thompson J, Floudas CA. CONCORD: a consensus method for protein secondary structure prediction via mixed integer linear optimization. Proc Math Phys Eng Sci 2011. [DOI: 10.1098/rspa.2011.0514] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
Abstract
Most of the protein structure prediction methods use a multi-step process, which often includes secondary structure prediction, contact prediction, fragment generation, clustering, etc. For many years, secondary structure prediction has been the workhorse for numerous methods aimed at predicting protein structure and function. This paper presents a new mixed integer linear optimization (MILP)-based consensus method: a Consensus scheme based On a mixed integer liNear optimization method for seCOndary stRucture preDiction (CONCORD). Based on seven secondary structure prediction methods, SSpro, DSC, PROF, PROFphd, PSIPRED, Predator and GorIV, the MILP-based consensus method combines the strengths of different methods, maximizes the number of correctly predicted amino acids and achieves a better prediction accuracy. The method is shown to perform well compared with the seven individual methods when tested on the PDBselect25 training protein set using sixfold cross validation. It also performs well compared with another set of 10 online secondary structure prediction servers (including several recent ones) when tested on the CASP9 targets (
http://predictioncenter.org/casp9/
). The average Q3 prediction accuracy is 83.04 per cent for the sixfold cross validation of the PDBselect25 set and 82.3 per cent for the CASP9 targets. We have developed a MILP-based consensus method for protein secondary structure prediction. A web server, CONCORD, is available to the scientific community at
http://helios.princeton.edu/CONCORD
.
Collapse
Affiliation(s)
- Y. Wei
- Department of Chemical and Biological Engineering, Princeton University, Princeton, NJ 08544, USA
| | - J. Thompson
- Department of Chemical and Biological Engineering, Princeton University, Princeton, NJ 08544, USA
| | - C. A. Floudas
- Department of Chemical and Biological Engineering, Princeton University, Princeton, NJ 08544, USA
| |
Collapse
|
17
|
Grahnen JA, Kubelka J, Liberles DA. Fast Side Chain Replacement in Proteins Using a Coarse-Grained Approach for Evaluating the Effects of Mutation During Evolution. J Mol Evol 2011; 73:23-33. [DOI: 10.1007/s00239-011-9454-3] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/09/2010] [Accepted: 07/14/2011] [Indexed: 11/28/2022]
|
18
|
Kedarisetti KD, Mizianty MJ, Dick S, Kurgan L. Improved sequence-based prediction of strand residues. J Bioinform Comput Biol 2011; 9:67-89. [PMID: 21328707 DOI: 10.1142/s0219720011005355] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/05/2010] [Revised: 11/19/2010] [Accepted: 11/19/2010] [Indexed: 01/02/2023]
Abstract
Accurate identification of strand residues aids prediction and analysis of numerous structural and functional aspects of proteins. We propose a sequence-based predictor, BETArPRED, which improves prediction of strand residues and β-strand segments. BETArPRED uses a novel design that accepts strand residues predicted by SSpro and predicts the remaining positions utilizing a logistic regression classifier with nine custom-designed features. These are derived from the primary sequence, the secondary structure (SS) predicted by SSpro, PSIPRED and SPINE, and residue depth as predicted by RDpred. Our features utilize certain local (window-based) patterns in the predicted SS and combine information about the predicted SS and residue depth. BETArPRED is evaluated on 432 sequences that share low identity with the training chains, and on the CASP8 dataset. We compare BETArPRED with seven modern SS predictors, and the top-performing automated structure predictor in CASP8, the ZHANG-server. BETArPRED provides statistically significant improvements over each of the SS predictors; it improves prediction of strand residues and β-strands, and it finds β-strands that were missed by the other methods. When compared with the ZHANG-server, we improve predictions of strand segments and predict more actual strand residues, while the other predictor achieves higher rate of correct strand residue predictions when under-predicting them.
Collapse
|
19
|
Species specific amino acid sequence–protein local structure relationships: An analysis in the light of a structural alphabet. J Theor Biol 2011; 276:209-17. [DOI: 10.1016/j.jtbi.2011.01.047] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/09/2010] [Revised: 01/28/2011] [Accepted: 01/31/2011] [Indexed: 11/24/2022]
|
20
|
Abstract
MOTIVATION Identifying piwi-interacting RNAs (piRNAs) of non-model organisms is a difficult and unsolved problem because piRNAs lack conservative secondary structure motifs and sequence homology in different species. RESULTS In this article, a k-mer scheme is proposed to identify piRNA sequences, relying on the training sets from non-piRNA and piRNA sequences of five model species sequenced: rat, mouse, human, fruit fly and nematode. Compared with the existing 'static' scheme based on the position-specific base usage, our novel 'dynamic' algorithm performs much better with a precision of over 90% and a sensitivity of over 60%, and the precision is verified by 5-fold cross-validation in these species. To test its validity, we use the algorithm to identify piRNAs of the migratory locust based on 603 607 deep-sequenced small RNA sequences. Totally, 87 536 piRNAs of the locust are predicted, and 4426 of them matched with existing locust transposons. The transcriptional difference between solitary and gregarious locusts was described. We also revisit the position-specific base usage of piRNAs and find the conservation in the end of piRNAs. Therefore, the method we developed can be used to identify piRNAs of non-model organisms without complete genome sequences. AVAILABILITY The web server for implementing the algorithm and the software code are freely available to the academic community at http://59.79.168.90/piRNA/index.php.
Collapse
Affiliation(s)
- Yi Zhang
- Institute of Zoology, Chinese Academy of Sciences, Beijing, China
| | | | | |
Collapse
|
21
|
Joosten RP, te Beek TAH, Krieger E, Hekkelman ML, Hooft RWW, Schneider R, Sander C, Vriend G. A series of PDB related databases for everyday needs. Nucleic Acids Res 2010; 39:D411-9. [PMID: 21071423 PMCID: PMC3013697 DOI: 10.1093/nar/gkq1105] [Citation(s) in RCA: 527] [Impact Index Per Article: 37.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/29/2022] Open
Abstract
The Protein Data Bank (PDB) is the world-wide repository of macromolecular structure information. We present a series of databases that run parallel to the PDB. Each database holds one entry, if possible, for each PDB entry. DSSP holds the secondary structure of the proteins. PDBREPORT holds reports on the structure quality and lists errors. HSSP holds a multiple sequence alignment for all proteins. The PDBFINDER holds easy to parse summaries of the PDB file content, augmented with essentials from the other systems. PDB_REDO holds re-refined, and often improved, copies of all structures solved by X-ray. WHY_NOT summarizes why certain files could not be produced. All these systems are updated weekly. The data sets can be used for the analysis of properties of protein structures in areas ranging from structural genomics, to cancer biology and protein design.
Collapse
Affiliation(s)
- Robbie P Joosten
- Department of Biochemistry, Netherlands Cancer Institute, Plesmanlaan 121, 1066 CX Amsterdam, The Netherlands
| | | | | | | | | | | | | | | |
Collapse
|