1
|
Yue T, Wang Y, Zhang L, Gu C, Xue H, Wang W, Lyu Q, Dun Y. Deep Learning for Genomics: From Early Neural Nets to Modern Large Language Models. Int J Mol Sci 2023; 24:15858. [PMID: 37958843 PMCID: PMC10649223 DOI: 10.3390/ijms242115858] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/30/2023] [Revised: 10/24/2023] [Accepted: 10/30/2023] [Indexed: 11/15/2023] Open
Abstract
The data explosion driven by advancements in genomic research, such as high-throughput sequencing techniques, is constantly challenging conventional methods used in genomics. In parallel with the urgent demand for robust algorithms, deep learning has succeeded in various fields such as vision, speech, and text processing. Yet genomics entails unique challenges to deep learning, since we expect a superhuman intelligence that explores beyond our knowledge to interpret the genome from deep learning. A powerful deep learning model should rely on the insightful utilization of task-specific knowledge. In this paper, we briefly discuss the strengths of different deep learning models from a genomic perspective so as to fit each particular task with proper deep learning-based architecture, and we remark on practical considerations of developing deep learning architectures for genomics. We also provide a concise review of deep learning applications in various aspects of genomic research and point out current challenges and potential research directions for future genomics applications. We believe the collaborative use of ever-growing diverse data and the fast iteration of deep learning models will continue to contribute to the future of genomics.
Collapse
Affiliation(s)
- Tianwei Yue
- School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA; (Y.W.); (L.Z.); (W.W.)
| | - Yuanxin Wang
- School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA; (Y.W.); (L.Z.); (W.W.)
| | - Longxiang Zhang
- School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA; (Y.W.); (L.Z.); (W.W.)
| | - Chunming Gu
- Department of Biomedical Engineering, School of Medicine, Johns Hopkins University, Baltimore, MD 21218, USA;
| | - Haoru Xue
- The Robotics Institute, Carnegie Mellon University, Pittsburgh, PA 15213, USA;
| | - Wenping Wang
- School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA; (Y.W.); (L.Z.); (W.W.)
| | - Qi Lyu
- Department of Computational Mathematics, Science, and Engineering, Michigan State University, East Lansing, MI 48824, USA;
| | - Yujie Dun
- School of Information and Communications Engineering, Xi’an Jiaotong University, Xi’an 710049, China;
| |
Collapse
|
2
|
Jing X, Dong Q, Hong D, Lu R. Amino Acid Encoding Methods for Protein Sequences: A Comprehensive Review and Assessment. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2020; 17:1918-1931. [PMID: 30998480 DOI: 10.1109/tcbb.2019.2911677] [Citation(s) in RCA: 35] [Impact Index Per Article: 8.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/20/2023]
Abstract
As the first step of machine-learning based protein structure and function prediction, the amino acid encoding play a fundamental role in the final success of those methods. Different from the protein sequence encoding, the amino acid encoding can be used in both residue-level and sequence-level prediction of protein properties by combining them with different algorithms. However, it has not attracted enough attention in the past decades, and there are no comprehensive reviews and assessments about encoding methods so far. In this article, we make a systematic classification and propose a comprehensive review and assessment for various amino acid encoding methods. Those methods are grouped into five categories according to their information sources and information extraction methodologies, including binary encoding, physicochemical properties encoding, evolution-based encoding, structure-based encoding, and machine-learning encoding. Then, 16 representative methods from five categories are selected and compared on protein secondary structure prediction and protein fold recognition tasks by using large-scale benchmark datasets. The results show that the evolution-based position-dependent encoding method PSSM achieved the best performance, and the structure-based and machine-learning encoding methods also show some potential for further application, the neural network based distributed representation of amino acids in particular may bring new light to this area. We hope that the review and assessment are useful for future studies in amino acid encoding.
Collapse
|
3
|
Playe B, Stoven V. Evaluation of deep and shallow learning methods in chemogenomics for the prediction of drugs specificity. J Cheminform 2020; 12:11. [PMID: 33431042 PMCID: PMC7011501 DOI: 10.1186/s13321-020-0413-0] [Citation(s) in RCA: 20] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/07/2019] [Accepted: 01/27/2020] [Indexed: 01/09/2023] Open
Abstract
Chemogenomics, also called proteochemometrics, covers a range of computational methods that can be used to predict protein–ligand interactions at large scales in the protein and chemical spaces. They differ from more classical ligand-based methods (also called QSAR) that predict ligands for a given protein receptor. In the context of drug discovery process, chemogenomics allows to tackle the question of predicting off-target proteins for drug candidates, one of the main causes of undesirable side-effects and failure within drugs development processes. The present study compares shallow and deep machine-learning approaches for chemogenomics, and explores data augmentation techniques for deep learning algorithms in chemogenomics. Shallow machine-learning algorithms rely on expert-based chemical and protein descriptors, while recent developments in deep learning algorithms enable to learn abstract numerical representations of molecular graphs and protein sequences, in order to optimise the performance of the prediction task. We first propose a formulation of chemogenomics with deep learning, called the chemogenomic neural network (CN), as a feed-forward neural network taking as input the combination of molecule and protein representations learnt by molecular graph and protein sequence encoders. We show that, on large datasets, the deep learning CN model outperforms state-of-the-art shallow methods, and competes with deep methods with expert-based descriptors. However, on small datasets, shallow methods present better prediction performance than deep learning methods. Then, we evaluate data augmentation techniques, namely multi-view and transfer learning, to improve the prediction performance of the chemogenomic neural network. We conclude that a promising research direction is to integrate heterogeneous sources of data such as auxiliary tasks for which large datasets are available, or independently, multiple molecule and protein attribute views.
Collapse
Affiliation(s)
- Benoit Playe
- Center for Computational Biology, Mines ParisTech, PSL Research University, 60 Bd Saint-Michel, 75006, Paris, France.,Institut Curie, 75248, Paris, France.,INSERM U900, 75248, Paris, France
| | - Veronique Stoven
- Center for Computational Biology, Mines ParisTech, PSL Research University, 60 Bd Saint-Michel, 75006, Paris, France. .,Institut Curie, 75248, Paris, France. .,INSERM U900, 75248, Paris, France.
| |
Collapse
|
4
|
Toussi CA, Haddadnia J. Improving protein secondary structure prediction: the evolutionary optimized classification algorithms. Struct Chem 2019. [DOI: 10.1007/s11224-018-1271-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/27/2022]
|
5
|
Wardah W, Khan M, Sharma A, Rashid MA. Protein secondary structure prediction using neural networks and deep learning: A review. Comput Biol Chem 2019; 81:1-8. [DOI: 10.1016/j.compbiolchem.2019.107093] [Citation(s) in RCA: 22] [Impact Index Per Article: 4.4] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/12/2018] [Revised: 12/28/2018] [Accepted: 07/10/2019] [Indexed: 02/02/2023]
|
6
|
Abstract
Since the 1980s, deep learning and biomedical data have been coevolving and feeding each other. The breadth, complexity, and rapidly expanding size of biomedical data have stimulated the development of novel deep learning methods, and application of these methods to biomedical data have led to scientific discoveries and practical solutions. This overview provides technical and historical pointers to the field, and surveys current applications of deep learning to biomedical data organized around five subareas, roughly of increasing spatial scale: chemoinformatics, proteomics, genomics and transcriptomics, biomedical imaging, and health care. The black box problem of deep learning methods is also briefly discussed.
Collapse
Affiliation(s)
- Pierre Baldi
- Department of Computer Science, Institute for Genomics and Bioinformatics, and Center for Machine Learning and Intelligent Systems, University of California, Irvine, California 92697, USA
| |
Collapse
|
7
|
Shen HB, Yi DL, Yao LX, Yang J, Chou KC. Knowledge-based computational intelligence development for predicting protein secondary structures from sequences. Expert Rev Proteomics 2014; 5:653-62. [DOI: 10.1586/14789450.5.5.653] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/25/2022]
|
8
|
A methodological review of data mining techniques in predictive medicine: An application in hemodynamic prediction for abdominal aortic aneurysm disease. Biocybern Biomed Eng 2014. [DOI: 10.1016/j.bbe.2014.03.003] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/17/2022]
|
9
|
|
10
|
Zangooei MH, Jalili S. Protein secondary structure prediction using DWKF based on SVR-NSGAII. Neurocomputing 2012. [DOI: 10.1016/j.neucom.2012.04.015] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/28/2022]
|
11
|
Armano G, Ledda F. Exploiting intrastructure information for secondary structure prediction with multifaceted pipelines. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2012; 9:799-808. [PMID: 22201070 DOI: 10.1109/tcbb.2011.159] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/31/2023]
Abstract
Predicting the secondary structure of proteins is still a typical step in several bioinformatic tasks, in particular, for tertiary structure prediction. Notwithstanding the impressive results obtained so far, mostly due to the advent of sequence encoding schemes based on multiple alignment, in our view the problem should be studied from a novel perspective, in which understanding how available information sources are dealt with plays a central role. After revisiting a well-known secondary structure predictor viewed from this perspective (with the goal of identifying which sources of information have been considered and which have not), we propose a generic software architecture designed to account for all relevant information sources. To demonstrate the validity of the approach, a predictor compliant with the proposed generic architecture has been implemented and compared with several state-of-the-art secondary structure predictors. Experiments have been carried out on standard data sets, and the corresponding results confirm the validity of the approach. The predictor is available at http://iasc.diee.unica.it/ssp2/ through the corresponding web application or as downloadable stand-alone portable unpack-and-run bundle.
Collapse
Affiliation(s)
- Giuliano Armano
- Department of Electrical and Electronic Engineering, University of Cagliari, Piazza d’Armi, Cagliari 09123, Italy.
| | | |
Collapse
|
12
|
Qi Y, Oja M, Weston J, Noble WS. A unified multitask architecture for predicting local protein properties. PLoS One 2012; 7:e32235. [PMID: 22461885 PMCID: PMC3312883 DOI: 10.1371/journal.pone.0032235] [Citation(s) in RCA: 38] [Impact Index Per Article: 3.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/27/2011] [Accepted: 01/25/2012] [Indexed: 01/27/2023] Open
Abstract
A variety of functionally important protein properties, such as secondary structure, transmembrane topology and solvent accessibility, can be encoded as a labeling of amino acids. Indeed, the prediction of such properties from the primary amino acid sequence is one of the core projects of computational biology. Accordingly, a panoply of approaches have been developed for predicting such properties; however, most such approaches focus on solving a single task at a time. Motivated by recent, successful work in natural language processing, we propose to use multitask learning to train a single, joint model that exploits the dependencies among these various labeling tasks. We describe a deep neural network architecture that, given a protein sequence, outputs a host of predicted local properties, including secondary structure, solvent accessibility, transmembrane topology, signal peptides and DNA-binding residues. The network is trained jointly on all these tasks in a supervised fashion, augmented with a novel form of semi-supervised learning in which the model is trained to distinguish between local patterns from natural and synthetic protein sequences. The task-independent architecture of the network obviates the need for task-specific feature engineering. We demonstrate that, for all of the tasks that we considered, our approach leads to statistically significant improvements in performance, relative to a single task neural network approach, and that the resulting model achieves state-of-the-art performance.
Collapse
Affiliation(s)
- Yanjun Qi
- Machine Learning Department, NEC Labs America, Princeton, New Jersey, United States of America
| | - Merja Oja
- Department of Genome Sciences, University of Washington, Seattle, Washington, United States of America
| | - Jason Weston
- Google, New York, New York, United States of America
| | - William Stafford Noble
- Department of Genome Sciences, University of Washington, Seattle, Washington, United States of America
- * E-mail:
| |
Collapse
|
13
|
|
14
|
YANG BO, SU XIAOHONG, WANG YADONG. DISTRIBUTED LEARNING STRATEGY BASED ON CHIPS FOR CLASSIFICATION WITH LARGE-SCALE DATASET. INT J PATTERN RECOGN 2011. [DOI: 10.1142/s0218001407005739] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022]
Abstract
Learning with very large-scale datasets is always necessary when handling real problems using artificial neural networks. However, it is still an open question how to balance computing efficiency and learning stability, when traditional neural networks spend a large amount of running time and memory to solve a problem with large-scale learning dataset. In this paper, we report the first evaluation of neural network distributed-learning strategies in large-scale classification over protein secondary structure. Our accomplishments include: (1) an architecture analysis on distributed-learning, (2) the development of scalable distributed system for large-scale dataset classification, (3) the description of a novel distributed-learning strategy based on chips, (4) a theoretical analysis of distributed-learning strategies for structure-distributed and data-distributed, (5) an investigation and experimental evaluation of distributed-learning strategy based-on chips with respect to time complexity and their effect on the classification accuracy of artificial neural networks. It is demonstrated that the novel distributed-learning strategy is better-balanced in parallel computing efficiency and stability as compared with the previous algorithms. The application of the protein secondary structure prediction demonstrates that this method is feasible and effective in practical applications.
Collapse
Affiliation(s)
- BO YANG
- IBM China Research Laboratory, Building 19 Zhongguancun Software Park, 8 Dongbeiwang West Road, Haidian District, Beijing 100094, P. R. China
- School of Computer Science and Technology, Harbin Institute of Technology, P.O.Box 319, Harbin 150001, P. R. China
| | - XIAOHONG SU
- School of Computer Science and Technology, Harbin Institute of Technology, P.O.Box 319, Harbin 150001, P. R. China
| | - YADONG WANG
- School of Computer Science and Technology, Harbin Institute of Technology, P.O.Box 319, Harbin 150001, P. R. China
| |
Collapse
|
15
|
Bouziane H, Messabih B, Chouarfia A. Profiles and majority voting-based ensemble method for protein secondary structure prediction. Evol Bioinform Online 2011; 7:171-89. [PMID: 22058650 PMCID: PMC3204938 DOI: 10.4137/ebo.s7931] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/05/2022] Open
Abstract
Machine learning techniques have been widely applied to solve the problem of predicting protein secondary structure from the amino acid sequence. They have gained substantial success in this research area. Many methods have been used including k-Nearest Neighbors (k-NNs), Hidden Markov Models (HMMs), Artificial Neural Networks (ANNs) and Support Vector Machines (SVMs), which have attracted attention recently. Today, the main goal remains to improve the prediction quality of the secondary structure elements. The prediction accuracy has been continuously improved over the years, especially by using hybrid or ensemble methods and incorporating evolutionary information in the form of profiles extracted from alignments of multiple homologous sequences. In this paper, we investigate how best to combine k-NNs, ANNs and Multi-class SVMs (M-SVMs) to improve secondary structure prediction of globular proteins. An ensemble method which combines the outputs of two feed-forward ANNs, k-NN and three M-SVM classifiers has been applied. Ensemble members are combined using two variants of majority voting rule. An heuristic based filter has also been applied to refine the prediction. To investigate how much improvement the general ensemble method can give rather than the individual classifiers that make up the ensemble, we have experimented with the proposed system on the two widely used benchmark datasets RS126 and CB513 using cross-validation tests by including PSI-BLAST position-specific scoring matrix (PSSM) profiles as inputs. The experimental results reveal that the proposed system yields significant performance gains when compared with the best individual classifier.
Collapse
Affiliation(s)
- Hafida Bouziane
- Department of Computer Science, USTO-MB University, BP 1505 El Mnaouer, Oran, Algeria
| | | | | |
Collapse
|
16
|
Nanni L, Lumini A, Gupta D, Garg A. Identifying bacterial virulent proteins by fusing a set of classifiers based on variants of Chou's pseudo amino acid composition and on evolutionary information. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2011; 9:467-475. [PMID: 21860064 DOI: 10.1109/tcbb.2011.117] [Citation(s) in RCA: 113] [Impact Index Per Article: 8.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/31/2023]
Abstract
The availability of a reliable prediction method for prediction of bacterial virulent proteins has several important applications in research efforts targeted aimed at finding novel drug targets, vaccine candidates, and understanding virulence mechanisms in pathogens. In this work, we have studied several feature extraction approaches for representing proteins and propose a novel bacterial virulent protein prediction method, based on an ensemble of classifiers where the features are extracted directly from the amino acid sequence and from the evolutionary information of a given protein. We have evaluated and compared several ensembles obtained by combining six feature extraction methods and several classification approaches based on two general purpose classifiers (i.e., Support Vector Machine and a variant of input decimated ensemble) and their random subspace version. An extensive evaluation was performed according to a blind testing protocol, where the parameters of the system are optimized using the training set and the system is validated in three different independent data sets, allowing selection of the most performing system and demonstrating the validity of the proposed method. Based on the results obtained using the blind test protocol, it is interesting to note that even if in each independent data set the most performing stand-alone method is not always the same, the fusion of different methods enhances prediction efficiency in all the tested independent data sets.
Collapse
|
17
|
Nguyen MN, Zurada JM, Rajapakse JC. Toward better understanding of protein secondary structure: extracting prediction rules. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2011; 8:858-864. [PMID: 21393657 DOI: 10.1109/tcbb.2010.16] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/30/2023]
Abstract
Although numerous computational techniques have been applied to predict protein secondary structure (PSS), only limited studies have dealt with discovery of logic rules underlying the prediction itself. Such rules offer interesting links between the prediction model and the underlying biology. In addition, they enhance interpretability of PSS prediction by providing a degree of transparency to the predicting model usually regarded as a black box. In this paper, we explore the generation and use of C4.5 decision trees to extract relevant rules from PSS predictions modeled with two-stage support vector machines (TS-SVM). The proposed rules were derived on the RS126 data set of 126 nonhomologous globular proteins and on the PSIPRED data set of 1,923 protein sequences. Our approach has produced sets of comprehensible, and often interpretable, rules underlying the PSS predictions. Moreover, many of the rules seem to be strongly supported by biological evidence. Further, our approach resulted in good prediction accuracy, few and usually compact rules, and rules that are generally of higher confidence levels than those generated by other rule extraction techniques.
Collapse
Affiliation(s)
- Minh N Nguyen
- BioInfomatics Institute, 30 Biopolis Street, #07-01 Matrix, Singapore.
| | | | | |
Collapse
|
18
|
Wu J, Li ML, Yu LZ, Wang C. An ensemble classifier of support vector machines used to predict protein structural classes by fusing auto covariance and pseudo-amino acid composition. Protein J 2010; 29:62-7. [PMID: 20049515 DOI: 10.1007/s10930-009-9222-z] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/25/2022]
Abstract
The purpose of this article is to identify protein structural classes by using support vector machine (SVM) ensemble classifier, which is very efficient in enhancing prediction performance. Firstly, auto covariance (AC) and pseudo-amino acid composition (PseAAC) were used in protein representation. AC focuses on adjacent effects and PseAA composition takes sequence order patterns into account. Secondly, SVMs were trained on the datasets represented by different descriptors. The last, ensemble classifier, which constructed on the individual classifiers through a voting strategy, gave the final prediction results. Meanwhile, very promising prediction accuracy 93.14% was obtained by Jackknife test. The experimental results showed that the ensemble system can improve the prediction performance greatly and generate more stable and safer predictors. The current method featured by fusing the protein primary sequence information transferred by AC and described by protein PseAA composition may play an important complementary role in other related applications.
Collapse
Affiliation(s)
- Jiang Wu
- College of Chemistry, Sichuan University, 610064, Chengdu, China
| | | | | | | |
Collapse
|
19
|
Bagos PG, Tsaousis GN, Hamodrakas SJ. How many 3D structures do we need to train a predictor? GENOMICS PROTEOMICS & BIOINFORMATICS 2010; 7:128-37. [PMID: 19944385 PMCID: PMC5054404 DOI: 10.1016/s1672-0229(08)60041-8] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 11/17/2022]
Abstract
It has been shown that the progress in the determination of membrane protein structure grows exponentially, with approximately the same growth rate as that of the water-soluble proteins. In order to investigate the effect of this, on the performance of prediction algorithms for both alpha-helical and beta-barrel membrane proteins, we conducted a prospective study based on historical records. We trained separate hidden Markov models with different sized training sets and evaluated their performance on topology prediction for the two classes of transmembrane proteins. We show that the existing top-scoring algorithms for predicting the transmembrane segments of alpha-helical membrane proteins perform slightly better than that of beta-barrel outer membrane proteins in all measures of accuracy. With the same rationale, a meta-analysis of the performance of the secondary structure prediction algorithms indicates that existing algorithmic techniques cannot be further improved by just adding more non-homologous sequences to the training sets. The upper limit for secondary structure prediction is estimated to be no more than 70% and 80% of correctly predicted residues for single sequence based methods and multiple sequence based ones, respectively. Therefore, we should concentrate our efforts on utilizing new techniques for the development of even better scoring predictors.
Collapse
Affiliation(s)
- Pantelis G Bagos
- Department of Cell Biology and Biophysics, Faculty of Biology, University of Athens, Athens 15701, Greece.
| | | | | |
Collapse
|
20
|
Kountouris P, Hirst JD. Prediction of backbone dihedral angles and protein secondary structure using support vector machines. BMC Bioinformatics 2009; 10:437. [PMID: 20025785 PMCID: PMC2811710 DOI: 10.1186/1471-2105-10-437] [Citation(s) in RCA: 44] [Impact Index Per Article: 2.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/09/2009] [Accepted: 12/22/2009] [Indexed: 11/26/2022] Open
Abstract
Background The prediction of the secondary structure of a protein is a critical step in the prediction of its tertiary structure and, potentially, its function. Moreover, the backbone dihedral angles, highly correlated with secondary structures, provide crucial information about the local three-dimensional structure. Results We predict independently both the secondary structure and the backbone dihedral angles and combine the results in a loop to enhance each prediction reciprocally. Support vector machines, a state-of-the-art supervised classification technique, achieve secondary structure predictive accuracy of 80% on a non-redundant set of 513 proteins, significantly higher than other methods on the same dataset. The dihedral angle space is divided into a number of regions using two unsupervised clustering techniques in order to predict the region in which a new residue belongs. The performance of our method is comparable to, and in some cases more accurate than, other multi-class dihedral prediction methods. Conclusions We have created an accurate predictor of backbone dihedral angles and secondary structure. Our method, called DISSPred, is available online at http://comp.chem.nottingham.ac.uk/disspred/.
Collapse
Affiliation(s)
- Petros Kountouris
- School of Chemistry, University of Nottingham, University Park, Nottingham NG7 2RD, UK.
| | | |
Collapse
|
21
|
Zhao Y, Gutshall L, Jiang H, Baker A, Beil E, Obmolova G, Carton J, Taudte S, Amegadzie B. Two routes for production and purification of Fab fragments in biopharmaceutical discovery research: Papain digestion of mAb and transient expression in mammalian cells. Protein Expr Purif 2009; 67:182-9. [DOI: 10.1016/j.pep.2009.04.012] [Citation(s) in RCA: 59] [Impact Index Per Article: 3.9] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/01/2009] [Revised: 04/24/2009] [Accepted: 04/28/2009] [Indexed: 01/10/2023]
|
22
|
Xue B, Faraggi E, Zhou Y. Predicting residue-residue contact maps by a two-layer, integrated neural-network method. Proteins 2009; 76:176-83. [PMID: 19137600 DOI: 10.1002/prot.22329] [Citation(s) in RCA: 34] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/07/2022]
Abstract
A neural network method (SPINE-2D) is introduced to provide a sequence-based prediction of residue-residue contact maps. This method is built on the success of SPINE in predicting secondary structure, residue solvent accessibility, and backbone torsion angles via large-scale training with overfit protection and a two-layer neural network. SPINE-2D achieved a 10-fold cross-validated accuracy of 47% (+/-2%) for top L/5 predicted contacts between two residues with sequence separation of six or more and an accuracy of 24 +/- 1% for nonlocal contacts with sequence separation of 24 residues or more. The accuracies of 23% and 26% for nonlocal contact predictions are achieved for two independent datasets of 500 proteins and 82 CASP 7 targets, respectively. A comparison with other methods indicates that SPINE-2D is among the most accurate methods for contact-map prediction. SPINE-2D is available as a webserver at http://sparks.informatics.iupui.edu.
Collapse
Affiliation(s)
- Bin Xue
- Indiana University School of Informatics, Indiana University-Purdue University, Indianapolis, Indiana 46202, USA
| | | | | |
Collapse
|
23
|
Walsh I, Baù D, Martin AJM, Mooney C, Vullo A, Pollastri G. Ab initio and template-based prediction of multi-class distance maps by two-dimensional recursive neural networks. BMC STRUCTURAL BIOLOGY 2009; 9:5. [PMID: 19183478 PMCID: PMC2654788 DOI: 10.1186/1472-6807-9-5] [Citation(s) in RCA: 36] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 07/29/2008] [Accepted: 01/30/2009] [Indexed: 11/17/2022]
Abstract
Background Prediction of protein structures from their sequences is still one of the open grand challenges of computational biology. Some approaches to protein structure prediction, especially ab initio ones, rely to some extent on the prediction of residue contact maps. Residue contact map predictions have been assessed at the CASP competition for several years now. Although it has been shown that exact contact maps generally yield correct three-dimensional structures, this is true only at a relatively low resolution (3–4 Å from the native structure). Another known weakness of contact maps is that they are generally predicted ab initio, that is not exploiting information about potential homologues of known structure. Results We introduce a new class of distance restraints for protein structures: multi-class distance maps. We show that Cα trace reconstructions based on 4-class native maps are significantly better than those from residue contact maps. We then build two predictors of 4-class maps based on recursive neural networks: one ab initio, or relying on the sequence and on evolutionary information; one template-based, or in which homology information to known structures is provided as a further input. We show that virtually any level of sequence similarity to structural templates (down to less than 10%) yields more accurate 4-class maps than the ab initio predictor. We show that template-based predictions by recursive neural networks are consistently better than the best template and than a number of combinations of the best available templates. We also extract binary residue contact maps at an 8 Å threshold (as per CASP assessment) from the 4-class predictors and show that the template-based version is also more accurate than the best template and consistently better than the ab initio one, down to very low levels of sequence identity to structural templates. Furthermore, we test both ab-initio and template-based 8 Å predictions on the CASP7 targets using a pre-CASP7 PDB, and find that both predictors are state-of-the-art, with the template-based one far outperforming the best CASP7 systems if templates with sequence identity to the query of 10% or better are available. Although this is not the main focus of this paper we also report on reconstructions of Cα traces based on both ab initio and template-based 4-class map predictions, showing that the latter are generally more accurate even when homology is dubious. Conclusion Accurate predictions of multi-class maps may provide valuable constraints for improved ab initio and template-based prediction of protein structures, naturally incorporate multiple templates, and yield state-of-the-art binary maps. Predictions of protein structures and 8 Å contact maps based on the multi-class distance map predictors described in this paper are freely available to academic users at the url .
Collapse
Affiliation(s)
- Ian Walsh
- School of Computer Science and Informatics, University College Dublin, Dublin, Ireland.
| | | | | | | | | | | |
Collapse
|
24
|
Bengio Y, Senecal JS. Adaptive importance sampling to accelerate training of a neural probabilistic language model. ACTA ACUST UNITED AC 2008; 19:713-22. [PMID: 18390314 DOI: 10.1109/tnn.2007.912312] [Citation(s) in RCA: 67] [Impact Index Per Article: 4.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/08/2022]
Abstract
Previous work on statistical language modeling has shown that it is possible to train a feedforward neural network to approximate probabilities over sequences of words, resulting in significant error reduction when compared to standard baseline models based on n-grams. However, training the neural network model with the maximum-likelihood criterion requires computations proportional to the number of words in the vocabulary. In this paper, we introduce adaptive importance sampling as a way to accelerate training of the model. The idea is to use an adaptive n-gram model to track the conditional distributions produced by the neural network. We show that a very significant speedup can be obtained on standard problems.
Collapse
Affiliation(s)
- Y Bengio
- Department of IRO, Universite de Montreal, Montreal, Canada.
| | | |
Collapse
|
25
|
Afonnikov DA, Morozov AV, Kolchanov NA. Prediction of contact numbers of amino acid residues using a neural network regression algorithm. Biophysics (Nagoya-shi) 2008. [DOI: 10.1134/s0006350906070128] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/23/2022] Open
|
26
|
Won KJ, Sandelin A, Marstrand TT, Krogh A. Modeling promoter grammars with evolving hidden Markov models. ACTA ACUST UNITED AC 2008; 24:1669-75. [PMID: 18535083 DOI: 10.1093/bioinformatics/btn254] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022]
Abstract
MOTIVATION Describing and modeling biological features of eukaryotic promoters remains an important and challenging problem within computational biology. The promoters of higher eukaryotes in particular display a wide variation in regulatory features, which are difficult to model. Often several factors are involved in the regulation of a set of co-regulated genes. If so, promoters can be modeled with connected regulatory features, where the network of connections is characteristic for a particular mode of regulation. RESULTS With the goal of automatically deciphering such regulatory structures, we present a method that iteratively evolves an ensemble of regulatory grammars using a hidden Markov Model (HMM) architecture composed of interconnected blocks representing transcription factor binding sites (TFBSs) and background regions of promoter sequences. The ensemble approach reduces the risk of overfitting and generally improves performance. We apply this method to identify TFBSs and to classify promoters preferentially expressed in macrophages, where it outperforms other methods due to the increased predictive power given by the grammar. AVAILABILITY The software and the datasets are available from http://modem.ucsd.edu/won/eHMM.tar.gz
Collapse
Affiliation(s)
- Kyoung-Jae Won
- The Bioinformatics Centre, Department of Biology & Biotech Research and Innovation Centre, University of Copenhagen, Ole Maaloes Vej 5, 2200 Copenhagen N, Denmark
| | | | | | | |
Collapse
|
27
|
An ensemble of reduced alphabets with protein encoding based on grouped weight for predicting DNA-binding proteins. Amino Acids 2008; 36:167-75. [DOI: 10.1007/s00726-008-0044-7] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/30/2007] [Accepted: 02/07/2008] [Indexed: 10/22/2022]
|
28
|
Durbin B, Dudoit S, van der Laan MJ. A deletion/substitution/addition algorithm for classification neural networks, with applications to biomedical data. J Stat Plan Inference 2008. [DOI: 10.1016/j.jspi.2007.06.002] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/23/2022]
|
29
|
Nanni L, Lumini A. Combing ontologies and dipeptide composition for predicting DNA-binding proteins. Amino Acids 2008; 34:635-41. [PMID: 18175049 DOI: 10.1007/s00726-007-0016-3] [Citation(s) in RCA: 26] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/18/2007] [Accepted: 12/06/2007] [Indexed: 12/11/2022]
Abstract
Given a novel protein it is very important to know if it is a DNA-binding protein, because DNA-binding proteins participate in the fundamental role to regulate gene expression. In this work, we propose a parallel fusion between a classifier trained using the features extracted from the gene ontology database and a classifier trained using the dipeptide composition of the protein. As classifiers the support vector machine (SVM) and the 1-nearest neighbour are used. Matthews's correlation coefficient obtained by our fusion method is approximately 0.97 when the jackknife cross-validation is used; this result outperforms the best performance obtained in the literature (0.924) using the same dataset where the SVM is trained using only the Chou's pseudo amino acid based features. In this work also the area under the ROC-curve (AUC) is reported and our results show that the fusion permits to obtain a very interesting 0.995 AUC. In particular we want to stress that our fusion obtains a 5% false negative with a 0% of false positive. Matthews's correlation coefficient obtained using the single best GO-number is only 0.7211 and hence it is not possible to use the gene ontology database as a simple lookup table. Finally, we test the complementarity of the two tested feature extraction methods using the Q-statistic. We obtain the very interesting result of 0.58, which means that the features extracted from the gene ontology database and the features extracted from the amino acid sequence are partially independent and that their parallel fusion should be studied more.
Collapse
Affiliation(s)
- Loris Nanni
- DEIS, IEIIT-CNR, Università di Bologna, Viale Risorgimento 2, 40136 Bologna, Italy.
| | | |
Collapse
|
30
|
Ghosh A, Parai B. Protein secondary structure prediction using distance based classifiers. Int J Approx Reason 2008. [DOI: 10.1016/j.ijar.2007.03.007] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/28/2022]
|
31
|
Hu HJ, Holley J, He J, Harrison RW, Yang H, Tai PC, Pan Y. To be or not to be: predicting soluble SecAs as membrane proteins. IEEE Trans Nanobioscience 2007; 6:168-79. [PMID: 17695753 DOI: 10.1109/tnb.2007.897486] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/07/2022]
Abstract
SecA is an important component of protein translocation in bacteria, and exists in soluble and membrane-integrated forms. Most membrane prediction programs predict SecA as being a soluble protein, with the exception of TMpred and Top-Pred. However, the membrane associated predicted segments by TMpred and TopPred are inconsistent across bacterial species in spite of high sequence homology. In this paper we describe a new method for membrane protein prediction, PSSM_SVM, which provides consistent results for integral membrane domains of SecAs across bacterial species. This PSSM encoding scheme demonstrates the highest accuracy in terms of Q2 among the common prediction methods, and produces consistent results on blind test data. None of the previously described methods showed this kind of consistency when tested against the same blind test set. This scheme predicts traditional transmembrane segments and most of the soluble proteins accurately. The PSSM scheme applied to the membrane-associated protein SecA shows characteristic features. In the set of 223 known SecA sequences, the PSSM_SVM prediction scheme predicts eight to nine residue embedded membrane segments. This predicted region is part of a 12 residue helix from known X-ray crystal structures of SecAs. This information could be important for determining the structure of SecA proteins in the membrane which have different conformational properties from other transmembrane proteins, as well as other soluble proteins that may similarly integrate into lipid bi-layers.
Collapse
Affiliation(s)
- Hae-Jin Hu
- Molecular Basis of Disease Program, Georgia State University, Atlanta, GA 30303, USA.
| | | | | | | | | | | | | |
Collapse
|
32
|
Pollastri G, Martin AJM, Mooney C, Vullo A. Accurate prediction of protein secondary structure and solvent accessibility by consensus combiners of sequence and structure information. BMC Bioinformatics 2007; 8:201. [PMID: 17570843 PMCID: PMC1913928 DOI: 10.1186/1471-2105-8-201] [Citation(s) in RCA: 85] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/31/2007] [Accepted: 06/14/2007] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Structural properties of proteins such as secondary structure and solvent accessibility contribute to three-dimensional structure prediction, not only in the ab initio case but also when homology information to known structures is available. Structural properties are also routinely used in protein analysis even when homology is available, largely because homology modelling is lower throughput than, say, secondary structure prediction. Nonetheless, predictors of secondary structure and solvent accessibility are virtually always ab initio. RESULTS Here we develop high-throughput machine learning systems for the prediction of protein secondary structure and solvent accessibility that exploit homology to proteins of known structure, where available, in the form of simple structural frequency profiles extracted from sets of PDB templates. We compare these systems to their state-of-the-art ab initio counterparts, and with a number of baselines in which secondary structures and solvent accessibilities are extracted directly from the templates. We show that structural information from templates greatly improves secondary structure and solvent accessibility prediction quality, and that, on average, the systems significantly enrich the information contained in the templates. For sequence similarity exceeding 30%, secondary structure prediction quality is approximately 90%, close to its theoretical maximum, and 2-class solvent accessibility roughly 85%. Gains are robust with respect to template selection noise, and significant for marginal sequence similarity and for short alignments, supporting the claim that these improved predictions may prove beneficial beyond the case in which clear homology is available. CONCLUSION The predictive system are publicly available at the address http://distill.ucd.ie.
Collapse
Affiliation(s)
- Gianluca Pollastri
- Complex and Adaptive Systems Laboratory, School of Computer Science and Informatics, University College Dublin, Belfield, Dublin 4, Ireland
| | - Alberto JM Martin
- Complex and Adaptive Systems Laboratory, School of Computer Science and Informatics, University College Dublin, Belfield, Dublin 4, Ireland
| | - Catherine Mooney
- Complex and Adaptive Systems Laboratory, School of Computer Science and Informatics, University College Dublin, Belfield, Dublin 4, Ireland
| | - Alessandro Vullo
- Complex and Adaptive Systems Laboratory, School of Computer Science and Informatics, University College Dublin, Belfield, Dublin 4, Ireland
| |
Collapse
|
33
|
Gassend B, O'Donnell CW, Thies W, Lee A, van Dijk M, Devadas S. Learning biophysically-motivated parameters for alpha helix prediction. BMC Bioinformatics 2007; 8 Suppl 5:S3. [PMID: 17570862 PMCID: PMC1892091 DOI: 10.1186/1471-2105-8-s5-s3] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022] Open
Abstract
Background Our goal is to develop a state-of-the-art protein secondary structure predictor, with an intuitive and biophysically-motivated energy model. We treat structure prediction as an optimization problem, using parameterizable cost functions representing biological "pseudo-energies". Machine learning methods are applied to estimate the values of the parameters to correctly predict known protein structures. Results Focusing on the prediction of alpha helices in proteins, we show that a model with 302 parameters can achieve a Qα value of 77.6% and an SOVα value of 73.4%. Such performance numbers are among the best for techniques that do not rely on external databases (such as multiple sequence alignments). Further, it is easier to extract biological significance from a model with so few parameters. Conclusion The method presented shows promise for the prediction of protein secondary structure. Biophysically-motivated elementary free-energies can be learned using SVM techniques to construct an energy cost function whose predictive performance rivals state-of-the-art. This method is general and can be extended beyond the all-alpha case described here.
Collapse
Affiliation(s)
- Blaise Gassend
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
| | - Charles W O'Donnell
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
| | - William Thies
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
| | - Andrew Lee
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
| | - Marten van Dijk
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
| | - Srinivas Devadas
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
| |
Collapse
|
34
|
Mooney C, Vullo A, Pollastri G. Protein structural motif prediction in multidimensional phi-psi space leads to improved secondary structure prediction. J Comput Biol 2007; 13:1489-502. [PMID: 17061924 DOI: 10.1089/cmb.2006.13.1489] [Citation(s) in RCA: 37] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
Abstract
A significant step towards establishing the structure and function of a protein is the prediction of the local conformation of the polypeptide chain. In this article, we present systems for the prediction of three new alphabets of local structural motifs. The motifs are built by applying multidimensional scaling (MDS) and clustering to pair-wise angular distances for multiple phi-psi angle values collected from high-resolution protein structures. The predictive systems, based on ensembles of bidirectional recurrent neural network architectures, and trained on a large non-redundant set of protein structures, achieve 72%, 66%, and 60% correct motif prediction on an independent test set for di-peptides (six classes), tri-peptides (eight classes) and tetra-peptides (14 classes), respectively, 28-30% above baseline statistical predictors. We then build a further system, based on ensembles of two-layered bidirectional recurrent neural networks, to map structural motif predictions into a traditional 3-class (helix, strand, coil) secondary structure. This system achieves 79.5% correct prediction using the "hard" CASP 3-class assignment, and 81.4% with a more lenient assignment, outperforming a sophisticated state-of-the-art predictor (Porter) trained in the same experimental conditions. The structural motif predictor is publicly available at: http://distill.ucd.ie/porter+/.
Collapse
Affiliation(s)
- Catherine Mooney
- School of Computer Science and Informatics, University College Dublin, Belfield, Dublin, Ireland
| | | | | |
Collapse
|
35
|
|
36
|
Pal S, Bandyopadhyay S, Ray S. Evolutionary computation in bioinformatics: a review. ACTA ACUST UNITED AC 2006. [DOI: 10.1109/tsmcc.2005.855515] [Citation(s) in RCA: 69] [Impact Index Per Article: 3.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/08/2022]
|
37
|
Sapay N, Guermeur Y, Deléage G. Prediction of amphipathic in-plane membrane anchors in monotopic proteins using a SVM classifier. BMC Bioinformatics 2006; 7:255. [PMID: 16704727 PMCID: PMC1564421 DOI: 10.1186/1471-2105-7-255] [Citation(s) in RCA: 109] [Impact Index Per Article: 6.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/21/2006] [Accepted: 05/16/2006] [Indexed: 11/12/2022] Open
Abstract
Background Membrane proteins are estimated to represent about 25% of open reading frames in fully sequenced genomes. However, the experimental study of proteins remains difficult. Considerable efforts have thus been made to develop prediction methods. Most of these were conceived to detect transmembrane helices in polytopic proteins. Alternatively, a membrane protein can be monotopic and anchored via an amphipathic helix inserted in a parallel way to the membrane interface, so-called in-plane membrane (IPM) anchors. This type of membrane anchor is still poorly understood and no suitable prediction method is currently available. Results We report here the "AmphipaSeeK" method developed to predict IPM anchors. It uses a set of 21 reported examples of IPM anchored proteins. The method is based on a pattern recognition Support Vector Machine with a dedicated kernel. Conclusion AmphipaSeeK was shown to be highly specific, in contrast with classically used methods (e.g. hydrophobic moment). Additionally, it has been able to retrieve IPM anchors in naively tested sets of transmembrane proteins (e.g. PagP). AmphipaSeek and the list of the 21 IPM anchored proteins is available on NPS@, our protein sequence analysis server.
Collapse
Affiliation(s)
- Nicolas Sapay
- Institut de Biologie et Chimie des Protéines, UMR 5086 CNRS-Univ. Lyon 1 – IFR128 BioSciences Lyon-Gerland, F-69367 Lyon Cedex 07, France
| | - Yann Guermeur
- LORIA-CNRS, Campus Scientifique – BP 239, 54506 Vandœuvre-lès-Nancy Cedex, France
| | - Gilbert Deléage
- Institut de Biologie et Chimie des Protéines, UMR 5086 CNRS-Univ. Lyon 1 – IFR128 BioSciences Lyon-Gerland, F-69367 Lyon Cedex 07, France
| |
Collapse
|
38
|
Ceroni A, Frasconi P, Pollastri G. Learning protein secondary structure from sequential and relational data. Neural Netw 2005; 18:1029-39. [PMID: 16182513 DOI: 10.1016/j.neunet.2005.07.001] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/23/2022]
Abstract
We propose a method for sequential supervised learning that exploits explicit knowledge of short- and long-range dependencies. The architecture consists of a recursive and bi-directional neural network that takes as input a sequence along with an associated interaction graph. The interaction graph models (partial) knowledge about long-range dependency relations. We tested the method on the prediction of protein secondary structure, a task in which relations due to beta-strand pairings and other spatial proximities are known to have a significant effect on the prediction accuracy. In this particular task, interactions can be derived from knowledge of protein contact maps at the residue level. Our results show that prediction accuracy can be significantly boosted by the integration of interaction graphs.
Collapse
Affiliation(s)
- Alessio Ceroni
- Dipartimento di Sistemi e Informatica, Università degli Studi di Firenze Via Santa Marta, Firenze, Italy
| | | | | |
Collapse
|
39
|
Chen J, Chaudhari N. Bidirectional segmented-memory recurrent neural network for protein secondary structure prediction. Soft comput 2005. [DOI: 10.1007/s00500-005-0489-5] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/29/2022]
|
40
|
Eghbalnia HR, Wang L, Bahrami A, Assadi A, Markley JL. Protein energetic conformational analysis from NMR chemical shifts (PECAN) and its use in determining secondary structural elements. JOURNAL OF BIOMOLECULAR NMR 2005; 32:71-81. [PMID: 16041485 DOI: 10.1007/s10858-005-5705-1] [Citation(s) in RCA: 62] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 12/28/2004] [Accepted: 03/08/2005] [Indexed: 05/03/2023]
Abstract
We present an energy model that combines information from the amino acid sequence of a protein and available NMR chemical shifts for the purposes of identifying low energy conformations and determining elements of secondary structure. The model ("PECAN", Protein Energetic Conformational Analysis from NMR chemical shifts) optimizes a combination of sequence information and residue-specific statistical energy function to yield energetic descriptions most favorable to predicting secondary structure. Compared to prior methods for secondary structure determination, PECAN provides increased accuracy and range, particularly in regions of extended structure. Moreover, PECAN uses the energetics to identify residues located at the boundaries between regions of predicted secondary structure that may not fit the stringent secondary structure class definitions. The energy model offers insights into the local energetic patterns that underlie conformational preferences. For example, it shows that the information content for defining secondary structure is localized about a residue and reaches a maximum when two residues on either side are considered. The current release of the PECAN software determines the well-defined regions of secondary structure in novel proteins with assigned chemical shifts with an overall accuracy of 90%, which is close to the practical limit of achievable accuracy in classifying the states.
Collapse
Affiliation(s)
- Hamid R Eghbalnia
- Biochemistry Department, National Magnetic Resonance Facility at Madison, 433 Babcock Drive, Madison, WI 53706, USA.
| | | | | | | | | |
Collapse
|
41
|
Appropriate Kernel Functions for Support Vector Machine Learning with Sequences of Symbolic Data. LECTURE NOTES IN COMPUTER SCIENCE 2005. [DOI: 10.1007/11559887_16] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 02/10/2023]
|
42
|
Mitra S. Computational Intelligence in Bioinformatics. TRANSACTIONS ON ROUGH SETS III 2005. [DOI: 10.1007/11427834_6] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 01/24/2023]
|
43
|
Hu HJ, Pan Y, Harrison R, Tai PC. Improved Protein Secondary Structure Prediction Using Support Vector Machine With a New Encoding Scheme and an Advanced Tertiary Classifier. IEEE Trans Nanobioscience 2004; 3:265-71. [PMID: 15631138 DOI: 10.1109/tnb.2004.837906] [Citation(s) in RCA: 67] [Impact Index Per Article: 3.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/08/2022]
Abstract
Prediction of protein secondary structures is an important problem in bioinformatics and has many applications. The recent trend of secondary structure prediction studies is mostly based on the neural network or the support vector machine (SVM). The SVM method is a comparatively new learning system which has mostly been used in pattern recognition problems. In this study, SVM is used as a machine learning tool for the prediction of secondary structure and several encoding schemes, including orthogonal matrix, hydrophobicity matrix, BLOSUM62 substitution matrix, and combined matrix of these, are applied and optimized to improve the prediction accuracy. Also, the optimal window length for six SVM binary classifiers is established by testing different window sizes and our new encoding scheme is tested based on this optimal window size via sevenfold cross validation tests. The results show 2% increase in the accuracy of the binary classifiers when compared with the instances in which the classical orthogonal matrix is used. Finally, to combine the results of the six SVM binary classifiers, a new tertiary classifier which combines the results of one-versus-one binary classifiers is introduced and the performance is compared with those of existing tertiary classifiers. According to the results, the Q3 prediction accuracy of new tertiary classifier reaches 78.8% and this is better than the best result reported in the literature.
Collapse
Affiliation(s)
- Hae-Jin Hu
- Department of Computer Science, Georgia State University, Atlanta, GA 30303-4110, USA.
| | | | | | | |
Collapse
|
44
|
A neural network based multi-classifier system for gene identification in DNA sequences. Neural Comput Appl 2004. [DOI: 10.1007/s00521-004-0447-7] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/26/2022]
|
45
|
Wu KP, Lin HN, Chang JM, Sung TY, Hsu WL. HYPROSP: a hybrid protein secondary structure prediction algorithm--a knowledge-based approach. Nucleic Acids Res 2004; 32:5059-65. [PMID: 15448186 PMCID: PMC521652 DOI: 10.1093/nar/gkh836] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open
Abstract
We develop a knowledge-based approach (called PROSP) for protein secondary structure prediction. The knowledge base contains small peptide fragments together with their secondary structural information. A quantitative measure M, called match rate, is defined to measure the amount of structural information that a target protein can extract from the knowledge base. Our experimental results show that proteins with a higher match rate will likely be predicted more accurately based on PROSP. That is, there is roughly a monotone correlation between the prediction accuracy and the amount of structure matching with the knowledge base. To fully utilize the strength of our knowledge base, a hybrid prediction method is proposed as follows: if the match rate of a target protein is at least 80%, we use the extracted information to make the prediction; otherwise, we adopt a popular machine-learning approach. This comprises our hybrid protein structure prediction (HYPROSP) approach. We use the DSSP and EVA data as our datasets and PSIPRED as our underlying machine-learning algorithm. For target proteins with match rate at least 80%, the average Q3 of PROSP is 3.96 and 7.2 better than that of PSIPRED on DSSP and EVA data, respectively.
Collapse
Affiliation(s)
- Kuen-Pin Wu
- Institute of Information Science, Academia Sinica, Taipei, Taiwan
| | | | | | | | | |
Collapse
|
46
|
Liu H, Wong L. Data mining tools for biological sequences. J Bioinform Comput Biol 2004; 1:139-67. [PMID: 15290785 DOI: 10.1142/s0219720003000216] [Citation(s) in RCA: 54] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/04/2002] [Revised: 04/07/2003] [Accepted: 04/07/2003] [Indexed: 11/18/2022]
Abstract
We describe a methodology, as well as some related data mining tools, for analyzing sequence data. The methodology comprises three steps: (a) generating candidate features from the sequences, (b) selecting relevant features from the candidates, and (c) integrating the selected features to build a system to recognize specific properties in sequence data. We also give relevant techniques for each of these three steps. For generating candidate features, we present various types of features based on the idea of k-grams. For selecting relevant features, we discuss signal-to-noise, t-statistics, and entropy measures, as well as a correlation-based feature selection method. For integrating selected features, we use machine learning methods, including C4.5, SVM, and Naive Bayes. We illustrate this methodology on the problem of recognizing translation initiation sites. We discuss how to generate and select features that are useful for understanding the distinction between ATG sites that are translation initiation sites and those that are not. We also discuss how to use such features to build reliable systems for recognizing translation initiation sites in DNA sequences.
Collapse
Affiliation(s)
- Huiqing Liu
- Institute for Infocomm Research, 21 Heng Mui Keng Terrace, Singapore 119613, Singapore.
| | | |
Collapse
|
47
|
Guo J, Chen H, Sun Z, Lin Y. A novel method for protein secondary structure prediction using dual-layer SVM and profiles. Proteins 2004; 54:738-43. [PMID: 14997569 DOI: 10.1002/prot.10634] [Citation(s) in RCA: 137] [Impact Index Per Article: 6.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/08/2022]
Abstract
A high-performance method was developed for protein secondary structure prediction based on the dual-layer support vector machine (SVM) and position-specific scoring matrices (PSSMs). SVM is a new machine learning technology that has been successfully applied in solving problems in the field of bioinformatics. The SVM's performance is usually better than that of traditional machine learning approaches. The performance was further improved by combining PSSM profiles with the SVM analysis. The PSSMs were generated from PSI-BLAST profiles, which contain important evolution information. The final prediction results were generated from the second SVM layer output. On the CB513 data set, the three-state overall per-residue accuracy, Q3, reached 75.2%, while segment overlap (SOV) accuracy increased to 80.0%. On the CB396 data set, the Q3 of our method reached 74.0% and the SOV reached 78.1%. A web server utilizing the method has been constructed and is available at http://www.bioinfo.tsinghua.edu.cn/pmsvm.
Collapse
Affiliation(s)
- Jian Guo
- Institute of Bioinformatics, State Key Laboratory of Biomembrane and Membrane Biotechnology, Department of Biological Sciences and Biotechnology, Tsinghua University, Beijing, China
| | | | | | | |
Collapse
|
48
|
Combining protein secondary structure prediction models with ensemble methods of optimal complexity. Neurocomputing 2004. [DOI: 10.1016/j.neucom.2003.10.004] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/22/2022]
|
49
|
A Combination of Support Vector Machines and Bidirectional Recurrent Neural Networks for Protein Secondary Structure Prediction. ACTA ACUST UNITED AC 2003. [DOI: 10.1007/978-3-540-39853-0_12] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register]
|
50
|
Pollastri G, Przybylski D, Rost B, Baldi P. Improving the prediction of protein secondary structure in three and eight classes using recurrent neural networks and profiles. Proteins 2002; 47:228-35. [PMID: 11933069 DOI: 10.1002/prot.10082] [Citation(s) in RCA: 545] [Impact Index Per Article: 24.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/08/2022]
Abstract
Secondary structure predictions are increasingly becoming the workhorse for several methods aiming at predicting protein structure and function. Here we use ensembles of bidirectional recurrent neural network architectures, PSI-BLAST-derived profiles, and a large nonredundant training set to derive two new predictors: (a) the second version of the SSpro program for secondary structure classification into three categories and (b) the first version of the SSpro8 program for secondary structure classification into the eight classes produced by the DSSP program. We describe the results of three different test sets on which SSpro achieved a sustained performance of about 78% correct prediction. We report confusion matrices, compare PSI-BLAST to BLAST-derived profiles, and assess the corresponding performance improvements. SSpro and SSpro8 are implemented as web servers, available together with other structural feature predictors at: http://promoter.ics.uci.edu/BRNN-PRED/.
Collapse
Affiliation(s)
- Gianluca Pollastri
- Department of Information and Computer Science, Institute for Genomics and Bioinformatics, University of California, Irvine, Irvine, California 92697-3425, USA
| | | | | | | |
Collapse
|