1
Heinzinger M, Elnaggar A, Wang Y, Dallago C, Nechaev D, Matthes F, Rost B. Modeling aspects of the language of life through transfer-learning protein sequences. BMC Bioinformatics 2019; 20:723. [PMID: 31847804 PMCID: PMC6918593 DOI: 10.1186/s12859-019-3220-8] [Citation(s) in RCA: 241] [Impact Index Per Article: 48.2] [Received: 05/03/2019] [Accepted: 11/13/2019] [Indexed: 12/15/2022]
Abstract
BACKGROUND Predicting protein function and structure from sequence is one important challenge for computational biology. For 26 years, most state-of-the-art approaches combined machine learning and evolutionary information. However, for some applications retrieving related proteins is becoming too time-consuming. Additionally, evolutionary information is less powerful for small families, e.g. for proteins from the Dark Proteome. Both these problems are addressed by the new methodology introduced here. RESULTS We introduced a novel way to represent protein sequences as continuous vectors (embeddings) by using the language model ELMo taken from natural language processing. By modeling protein sequences, ELMo effectively captured the biophysical properties of the language of life from unlabeled big data (UniRef50). We refer to these new embeddings as SeqVec (Sequence-to-Vector) and demonstrate their effectiveness by training simple neural networks for two different tasks. At the per-residue level, secondary structure (Q3 = 79% ± 1, Q8 = 68% ± 1) and regions with intrinsic disorder (MCC = 0.59 ± 0.03) were predicted significantly better than through one-hot encoding or through Word2vec-like approaches. At the per-protein level, subcellular localization was predicted in ten classes (Q10 = 68% ± 1) and membrane-bound were distinguished from water-soluble proteins (Q2 = 87% ± 1). Although SeqVec embeddings generated the best predictions from single sequences, no solution improved over the best existing method using evolutionary information. Nevertheless, our approach improved over some popular methods using evolutionary information and for some proteins even beat the best. Thus, the embeddings prove to condense the underlying principles of protein sequences. Overall, the important novelty is speed: where the lightning-fast HHblits needed on average about two minutes to generate the evolutionary information for a target protein, SeqVec created embeddings on average in 0.03 s. As this speed-up is independent of the size of growing sequence databases, SeqVec provides a highly scalable approach for the analysis of big data in proteomics, e.g. microbiome or metaproteome analysis. CONCLUSION Transfer-learning succeeded in extracting information from unlabeled sequence databases relevant for various protein prediction tasks. SeqVec modeled the language of life, namely the principles underlying protein sequences, better than any features suggested by textbooks and prediction methods. The exception is evolutionary information; however, that information is not available on the level of a single sequence.
Affiliation(s)
- Michael Heinzinger
  - Department of Informatics, Bioinformatics & Computational Biology - i12, TUM (Technical University of Munich), Boltzmannstr. 3, 85748, Garching/Munich, Germany
  - TUM Graduate School, Center of Doctoral Studies in Informatics and its Applications (CeDoSIA), Boltzmannstr. 11, 85748, Garching, Germany
- Ahmed Elnaggar
  - Department of Informatics, Bioinformatics & Computational Biology - i12, TUM (Technical University of Munich), Boltzmannstr. 3, 85748, Garching/Munich, Germany
  - TUM Graduate School, Center of Doctoral Studies in Informatics and its Applications (CeDoSIA), Boltzmannstr. 11, 85748, Garching, Germany
- Yu Wang
  - Leibniz Supercomputing Centre, Boltzmannstr. 1, 85748, Garching/Munich, Germany
- Christian Dallago
  - Department of Informatics, Bioinformatics & Computational Biology - i12, TUM (Technical University of Munich), Boltzmannstr. 3, 85748, Garching/Munich, Germany
  - TUM Graduate School, Center of Doctoral Studies in Informatics and its Applications (CeDoSIA), Boltzmannstr. 11, 85748, Garching, Germany
- Dmitrii Nechaev
  - Department of Informatics, Bioinformatics & Computational Biology - i12, TUM (Technical University of Munich), Boltzmannstr. 3, 85748, Garching/Munich, Germany
  - TUM Graduate School, Center of Doctoral Studies in Informatics and its Applications (CeDoSIA), Boltzmannstr. 11, 85748, Garching, Germany
- Florian Matthes
  - TUM Department of Informatics, Software Engineering and Business Information Systems, Boltzmannstr. 1, 85748, Garching/Munich, Germany
- Burkhard Rost
  - Department of Informatics, Bioinformatics & Computational Biology - i12, TUM (Technical University of Munich), Boltzmannstr. 3, 85748, Garching/Munich, Germany
  - Institute for Advanced Study (TUM-IAS), Lichtenbergstr. 2a, 85748, Garching/Munich, Germany
  - TUM School of Life Sciences Weihenstephan (WZW), Alte Akademie 8, Freising, Germany
  - Department of Biochemistry and Molecular Biophysics & New York Consortium on Membrane Protein Structure (NYCOMPS), Columbia University, 701 West, 168th Street, New York, NY, 10032, USA
2
Wardah W, Khan M, Sharma A, Rashid MA. Protein secondary structure prediction using neural networks and deep learning: A review. Comput Biol Chem 2019; 81:1-8. [DOI: 10.1016/j.compbiolchem.2019.107093] [Citation(s) in RCA: 22] [Impact Index Per Article: 4.4] [Received: 07/12/2018] [Revised: 12/28/2018] [Accepted: 07/10/2019] [Indexed: 02/02/2023]
3
Reaching optimized parameter set: protein secondary structure prediction using neural network. Neural Comput Appl 2016. [DOI: 10.1007/s00521-015-2150-2] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.3] [Indexed: 11/26/2022]
4
Fellner L, Simon S, Scherling C, Witting M, Schober S, Polte C, Schmitt-Kopplin P, Keim DA, Scherer S, Neuhaus K. Evidence for the recent origin of a bacterial protein-coding, overlapping orphan gene by evolutionary overprinting. BMC Evol Biol 2015; 15:283. [PMID: 26677845 PMCID: PMC4683798 DOI: 10.1186/s12862-015-0558-z] [Citation(s) in RCA: 28] [Impact Index Per Article: 3.1] [Received: 08/19/2015] [Accepted: 12/06/2015] [Indexed: 01/18/2023]
Abstract
BACKGROUND Gene duplication is believed to be the classical way to form novel genes, but overprinting may be an important alternative. Overprinting allows entirely novel proteins to evolve de novo, i.e., formerly non-coding open reading frames within functional genes become expressed. Only three cases have been described for Escherichia coli. Here, a fourth example is presented. RESULTS RNA sequencing revealed an open reading frame weakly transcribed in cow dung, coding for 101 residues and embedded completely in the -2 reading frame of citC in enterohemorrhagic E. coli. This gene is designated novel overlapping gene, nog1. The promoter region fused to gfp exhibits specific activities and 5' rapid amplification of cDNA ends indicated the transcriptional start 40-bp upstream of the start codon. nog1 was strand-specifically arrested in translation by a nonsense mutation silent in citC. This Nog1-mutant showed a phenotype in competitive growth against wild type in the presence of MgCl2. Small differences in metabolite concentrations were also found. Bioinformatic analyses propose Nog1 to be inner membrane-bound and to possess at least one membrane-spanning domain. A phylogenetic analysis suggests that the orphan gene nog1 arose by overprinting after Escherichia/Shigella separated from the other γ-proteobacteria. CONCLUSIONS Since nog1 is of recent origin, non-essential, short, weakly expressed and only marginally involved in E. coli's central metabolism, we propose that this gene is in an initial stage of evolution. While we present specific experimental evidence for the existence of a fourth overlapping gene in enterohemorrhagic E. coli, we believe that this may be an initial finding only and overlapping genes in bacteria may be more common than is currently assumed by microbiologists.
Affiliation(s)
- Lea Fellner
  - Lehrstuhl für Mikrobielle Ökologie, Wissenschaftszentrum Weihenstephan, Technische Universität München, Weihenstephaner Berg 3, 85350, Freising, Germany.
- Svenja Simon
  - Lehrstuhl für Datenanalyse und Visualisierung, Fachbereich Informatik und Informationswissenschaft, Universität Konstanz, Box 78, 78457, Constance, Germany.
- Christian Scherling
  - Lehrstuhl für Ernährungsphysiologie, Wissenschaftszentrum Weihenstephan, Technische Universität München, Gregor-Mendel-Straße 2, D-85354, Freising, Germany.
- Michael Witting
  - Research Unit Analytical BioGeoChemistry, Deutsches Forschungszentrum für Gesundheit und Umwelt GmbH, Helmholtz Zentrum München, Ingolstädter Landstraße 1, 85754, Neuherberg, Germany.
- Steffen Schober
  - Institute of Communications Engineering, Universität Ulm, Albert-Einstein-Allee 43, 89081, Ulm, Germany. Present address: Blue Yonder GmbH, Ohiostraße 8, Karlsruhe, Germany.
- Christine Polte
  - Lehrstuhl für Mikrobielle Ökologie, Wissenschaftszentrum Weihenstephan, Technische Universität München, Weihenstephaner Berg 3, 85350, Freising, Germany. Present address: Institut für Biochemie und Molekularbiologie, Universität Hamburg, Martin-Luther-King Platz 6, 20146, Hamburg, Germany.
- Philippe Schmitt-Kopplin
  - Research Unit Analytical BioGeoChemistry, Deutsches Forschungszentrum für Gesundheit und Umwelt GmbH, Helmholtz Zentrum München, Ingolstädter Landstraße 1, 85754, Neuherberg, Germany.
- Daniel A Keim
  - Lehrstuhl für Datenanalyse und Visualisierung, Fachbereich Informatik und Informationswissenschaft, Universität Konstanz, Box 78, 78457, Constance, Germany.
- Siegfried Scherer
  - Lehrstuhl für Mikrobielle Ökologie, Wissenschaftszentrum Weihenstephan, Technische Universität München, Weihenstephaner Berg 3, 85350, Freising, Germany.
- Klaus Neuhaus
  - Lehrstuhl für Mikrobielle Ökologie, Wissenschaftszentrum Weihenstephan, Technische Universität München, Weihenstephaner Berg 3, 85350, Freising, Germany.
5
Secondary and Tertiary Structure Prediction of Proteins: A Bioinformatic Approach. Complex System Modelling and Control Through Intelligent Soft Computations 2015. [DOI: 10.1007/978-3-319-12883-2_19] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Indexed: 12/27/2022]
6
Zangooei MH, Jalili S. Protein secondary structure prediction using DWKF based on SVR-NSGAII. Neurocomputing 2012. [DOI: 10.1016/j.neucom.2012.04.015] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Indexed: 10/28/2022]
7
8
Watts MJ, Li Y, Russell BD, Mellin C, Connell SD, Fordham DA. A novel method for mapping reefs and subtidal rocky habitats using artificial neural networks. Ecol Modell 2011. [DOI: 10.1016/j.ecolmodel.2011.04.024] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.1] [Indexed: 10/18/2022]
9
Sahu SS, Panda G. A novel feature representation method based on Chou's pseudo amino acid composition for protein structural class prediction. Comput Biol Chem 2010; 34:320-7. [DOI: 10.1016/j.compbiolchem.2010.09.002] [Citation(s) in RCA: 147] [Impact Index Per Article: 10.5] [Received: 05/29/2010] [Revised: 09/28/2010] [Accepted: 09/28/2010] [Indexed: 10/19/2022]
10
Song J, Tan H, Mahmood K, Law RHP, Buckle AM, Webb GI, Akutsu T, Whisstock JC. Prodepth: predict residue depth by support vector regression approach from protein sequences only. PLoS One 2009; 4:e7072. [PMID: 19759917 PMCID: PMC2742725 DOI: 10.1371/journal.pone.0007072] [Citation(s) in RCA: 35] [Impact Index Per Article: 2.3] [Received: 04/14/2009] [Accepted: 08/20/2009] [Indexed: 11/24/2022]
Abstract
Residue depth (RD) is a solvent exposure measure that complements the information provided by conventional accessible surface area (ASA) and describes to what extent a residue is buried in the protein structure space. Previous studies have established that RD is correlated with several protein properties, such as protein stability, residue conservation and amino acid types. Accurate prediction of RD has many potentially important applications in the field of structural bioinformatics, for example, facilitating the identification of functionally important residues, or residues in the folding nucleus, or enzyme active sites from sequence information. In this work, we introduce an efficient approach that uses support vector regression to quantify the relationship between RD and protein sequence. We systematically investigated eight different sequence encoding schemes including both local and global sequence characteristics and examined their respective prediction performances. For the objective evaluation of our approach, we used 5-fold cross-validation to assess the prediction accuracies and showed that the overall best performance could be achieved with a correlation coefficient (CC) of 0.71 between the observed and predicted RD values and a root mean square error (RMSE) of 1.74, after incorporating the relevant multiple sequence features. The results suggest that residue depth could be reliably predicted solely from protein primary sequences: local sequence environments are the major determinants, while global sequence features could influence the prediction performance marginally. We highlight two examples as a comparison in order to illustrate the applicability of this approach. We also discuss the potential implications of this new structural parameter in the field of protein structure prediction and homology modeling. This method might prove to be a powerful tool for sequence analysis.
Affiliation(s)
- Jiangning Song
  - Department of Biochemistry and Molecular Biology, Monash University, Clayton, Melbourne, Victoria, Australia
  - Bioinformatics Center, Institute for Chemical Research, Kyoto University, Gokasho, Uji, Kyoto, Japan
- Hao Tan
  - Department of Biochemistry and Molecular Biology, Monash University, Clayton, Melbourne, Victoria, Australia
- Khalid Mahmood
  - Department of Biochemistry and Molecular Biology, Monash University, Clayton, Melbourne, Victoria, Australia
  - ARC Centre of Excellence for Structural and Functional Microbial Genomics, Monash University, Clayton, Melbourne, Victoria, Australia
- Ruby H. P. Law
  - Department of Biochemistry and Molecular Biology, Monash University, Clayton, Melbourne, Victoria, Australia
- Ashley M. Buckle
  - Department of Biochemistry and Molecular Biology, Monash University, Clayton, Melbourne, Victoria, Australia
- Geoffrey I. Webb
  - Faculty of Information Technology, Monash University, Clayton, Melbourne, Victoria, Australia
- Tatsuya Akutsu
  - Bioinformatics Center, Institute for Chemical Research, Kyoto University, Gokasho, Uji, Kyoto, Japan
- James C. Whisstock
  - Department of Biochemistry and Molecular Biology, Monash University, Clayton, Melbourne, Victoria, Australia
  - ARC Centre of Excellence for Structural and Functional Microbial Genomics, Monash University, Clayton, Melbourne, Victoria, Australia
11
Matsuo K, Watanabe H, Gekko K. Improved sequence-based prediction of protein secondary structures by combining vacuum-ultraviolet circular dichroism spectroscopy with neural network. Proteins 2009; 73:104-12. [PMID: 18395813 DOI: 10.1002/prot.22055] [Citation(s) in RCA: 26] [Impact Index Per Article: 1.7] [Indexed: 11/06/2022]
Abstract
Synchrotron-radiation vacuum-ultraviolet circular dichroism (VUVCD) spectroscopy can significantly improve the predictive accuracy of the contents and segment numbers of protein secondary structures by extending the short-wavelength limit of the spectra. In the present study, we combined VUVCD spectra down to 160 nm with neural-network (NN) method to improve the sequence-based prediction of protein secondary structures. The secondary structures of 30 target proteins (test set) were assigned into alpha-helices, beta-strands, and others by the DSSP program based on their X-ray crystal structures. Combining the alpha-helix and beta-strand contents estimated from the VUVCD spectra of the target proteins improved the overall sequence-based predictive accuracy Q(3) for three secondary-structure components from 59.5 to 60.7%. Incorporating the position-specific scoring matrix in the NN method improved the predictive accuracy from 70.9 to 72.1% when combining the secondary-structure contents, to 72.5% when combining the numbers of segments, and finally to 74.9% when filtering the VUVCD data. Improvement in the sequence-based prediction of secondary structures was also apparent in two other indices of the overall performance: the correlation coefficient (C) and the segment overlap value (SOV). These results suggest that VUVCD data could enhance the predictive accuracy to over 80% when combined with the currently best sequence-prediction algorithms, greatly expanding the applicability of VUVCD spectroscopy to protein structural biology.
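The Q(3) accuracy quoted in this abstract is the percentage of residues whose predicted three-state label matches the observed one. A minimal sketch of that calculation follows, assuming both assignments are equal-length strings over H (alpha-helix), E (beta-strand) and C (other). The function name `q3` and the toy sequences are hypothetical and are not taken from the paper.

```python
def q3(observed: str, predicted: str) -> float:
    """Percentage of residues whose predicted state (H, E or C) matches the observed state."""
    if len(observed) != len(predicted):
        raise ValueError("sequences must have the same length")
    if not observed:
        raise ValueError("sequences must be non-empty")
    # Count positions where the two assignments agree.
    matches = sum(o == p for o, p in zip(observed, predicted))
    return 100.0 * matches / len(observed)


# Hypothetical toy assignments, for illustration only.
observed = "HHHHCCEEEE"
predicted = "HHHCCCEEEC"
print(q3(observed, predicted))  # 80.0 (8 of 10 residues agree)
```

Q(3) is computed per residue, so it says nothing about whether whole segments are placed correctly; that is why the abstract also reports the segment overlap value (SOV).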
Affiliation(s)
- Koichi Matsuo
  - Hiroshima Synchrotron Radiation Center, Hiroshima University, Higashi-Hiroshima, Japan
12
Xiao X, Lin WZ, Chou KC. Using grey dynamic modeling and pseudo amino acid composition to predict protein structural classes. J Comput Chem 2008; 29:2018-24. [PMID: 18381630 DOI: 10.1002/jcc.20955] [Citation(s) in RCA: 64] [Impact Index Per Article: 4.0] [Indexed: 11/10/2022]
Abstract
Using the pseudo amino acid (PseAA) composition to represent the sample of a protein can incorporate a considerable amount of sequence pattern information so as to improve the prediction quality for its structural or functional classification. However, how to optimally formulate the PseAA composition is an important problem yet to be solved. In this article the grey modeling approach is introduced that is particularly efficient in coping with complicated systems such as the one consisting of many proteins with different sequence orders and lengths. On the basis of the grey model, four coefficients derived from each of the protein sequences concerned are adopted for its PseAA components. The PseAA composition thus formulated is called the "grey-PseAA" composition that can catch the essence of a protein sequence and better reflect its overall pattern. In our study we have demonstrated that introduction of the grey-PseAA composition can remarkably enhance the success rates in predicting the protein structural class. It is anticipated that the concept of grey-PseAA composition can be also used to predict many other protein attributes, such as subcellular localization, membrane protein type, enzyme functional class, GPCR type, protease type, among many others.
Affiliation(s)
- Xuan Xiao
  - Computer Department, Jing-De-Zhen Ceramic Institute, Jing-De-Zhen 333000, China.
13
Xiao X, Wang P, Chou KC. Predicting protein structural classes with pseudo amino acid composition: an approach using geometric moments of cellular automaton image. J Theor Biol 2008; 254:691-6. [PMID: 18634802 DOI: 10.1016/j.jtbi.2008.06.016] [Citation(s) in RCA: 89] [Impact Index Per Article: 5.6] [Received: 04/02/2008] [Revised: 06/18/2008] [Accepted: 06/18/2008] [Indexed: 11/28/2022]
Abstract
A novel approach was developed for predicting the structural classes of proteins based on their sequences. It was assumed that proteins belonging to the same structural class must bear some sort of similar texture on the images generated by the cellular automaton evolving rule [Wolfram, S., 1984. Cellular automata as models of complexity. Nature 311, 419-424]. Based on this, two geometric invariant moment factors derived from the image functions were used as the pseudo amino acid components [Chou, K.C., 2001. Prediction of protein cellular attributes using pseudo amino acid composition. Proteins: Struct., Funct., Genet. (Erratum: ibid., 2001, vol. 44, 60) 43, 246-255] to formulate the protein samples for statistical prediction. The success rates thus obtained on a previously constructed benchmark dataset are quite promising, implying that the cellular automaton image can help to reveal some inherent and subtle features deeply hidden in a pile of long and complicated amino acid sequences.
Affiliation(s)
- Xuan Xiao
  - Computer Department, Jing-De-Zhen Ceramic Institute, Jing-De-Zhen 333000, China.
14
Feng J, Wang TM. Condensed Representations of Protein Secondary Structure Sequences and Their Application. J Biomol Struct Dyn 2008; 25:621-8. [DOI: 10.1080/07391102.2008.10507208] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 10/28/2022]
15
Song J, Tan H, Takemoto K, Akutsu T. HSEpred: predict half-sphere exposure from protein sequences. Bioinformatics 2008; 24:1489-97. [DOI: 10.1093/bioinformatics/btn222] [Citation(s) in RCA: 47] [Impact Index Per Article: 2.9] [Indexed: 11/12/2022]
16
Ghosh A, Parai B. Protein secondary structure prediction using distance based classifiers. Int J Approx Reason 2008. [DOI: 10.1016/j.ijar.2007.03.007] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.1] [Indexed: 11/28/2022]
17
Shen HB, Yang J, Chou KC. Methodology development for predicting subcellular localization and other attributes of proteins. Expert Rev Proteomics 2007; 4:453-63. [PMID: 17705704 DOI: 10.1586/14789450.4.4.453] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.4] [Indexed: 11/08/2022]
Abstract
Facing the explosion of newly generated protein sequences in the postgenomic age, we are challenged to develop computational methods for the fast and accurate identification of their subcellular localization and other attributes. This review summarizes recent methodology developments, with a focus on artificial neural networks, the statistical learning and support vector machine, the fuzzy logic-based algorithm and the evidence-theory-based algorithm, as well as the ensemble classifier approach. Meanwhile, an outline of the use of different descriptors for protein samples is given. In addition, a series of web servers established recently based on various ensemble classifiers are also briefly introduced.
Affiliation(s)
- Hong-Bin Shen
  - Shanghai Jiaotong University, Institute of Image Processing & Pattern Recognition, Shanghai, China.
18
Prediction protein structural classes with pseudo-amino acid composition: approximate entropy and hydrophobicity pattern. J Theor Biol 2007; 250:186-93. [PMID: 17959199 DOI: 10.1016/j.jtbi.2007.09.014] [Citation(s) in RCA: 132] [Impact Index Per Article: 7.8] [Received: 06/19/2007] [Revised: 09/08/2007] [Accepted: 09/10/2007] [Indexed: 11/21/2022]
Abstract
Compared with the conventional amino acid (AA) composition, the pseudo-amino acid (PseAA) composition as originally introduced for protein subcellular location prediction can incorporate much more information of a protein sequence, so as to remarkably enhance the power of using a discrete model to predict various attributes of a protein. In this study, based on the concept of PseAA composition, the approximate entropy and hydrophobicity pattern of a protein sequence are used to characterize the PseAA components. Also, the immune genetic algorithm (IGA) is applied to search the optimal weight factors in generating the PseAA composition. Thus, for a given protein sequence sample, a 27-D (dimensional) PseAA composition is generated as its descriptor. The fuzzy K nearest neighbors (FKNN) classifier is adopted as the prediction engine. The results thus obtained in predicting protein structural classification are quite encouraging, indicating that the current approach may also be used to improve the prediction quality of other protein attributes, or at least can play a complementary role to the existing methods in the relevant areas. Our algorithm is written in Matlab and is available by contacting the corresponding author.
19
Sivan S, Filo O, Siegelmann H. Application of expert networks for predicting proteins secondary structure. Biomol Eng 2007; 24:237-43. [PMID: 17236807 DOI: 10.1016/j.bioeng.2006.12.001] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Received: 11/05/2006] [Revised: 12/05/2006] [Accepted: 12/06/2006] [Indexed: 02/02/2023]
Abstract
The present study utilizes expert neural networks for the prediction of protein secondary structure. We use three independent networks, one for each structure (alpha, beta and coil) as the first-level processing unit; decision upon the chosen structure for each residue is carried out by a second-level, post-processing unit, which utilizes the Chou and Fasman frequency values Falpha and Fbeta in order to strengthen and/or deplete the probability of the specific structure under investigation. The highest prediction case was 76%. Our method requires primitive computational means and a relatively small training set, while still being comparable to previous work. It is not meant to be an alternative to the determination of secondary structure by means of free energy minimization, integration of dynamic equations of motion or crystallography, which are expensive, time-consuming and complicated, but to provide additional constraints, which might be considered and incorporated into larger computing setups in order to reduce the initial search space for the above methods.
Affiliation(s)
- Sarit Sivan
  - Department of Biomedical Engineering, Technion, Israel Institute of Technology, IIT, Haifa 32000, Israel.
20
Chandonia JM. StrBioLib: a Java library for development of custom computational structural biology applications. Bioinformatics 2007; 23:2018-20. [PMID: 17537750 PMCID: PMC4566930 DOI: 10.1093/bioinformatics/btm269] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.7] [Indexed: 11/14/2022]
Abstract
SUMMARY StrBioLib is a library of Java classes useful for developing software for computational structural biology research. StrBioLib contains classes to represent and manipulate protein structures, biopolymer sequences, sets of biopolymer sequences, and alignments between biopolymers based on either sequence or structure. Interfaces are provided to interact with commonly used bioinformatics applications, including (psi)-blast, modeller, muscle and Primer3, and tools are provided to read and write many file formats used to represent bioinformatic data. The library includes a general-purpose neural network object with multiple training algorithms, the Hooke and Jeeves non-linear optimization algorithm, and tools for efficient C-style string parsing and formatting. StrBioLib is the basis for the Pred2ary secondary structure prediction program, is used to build the astral compendium for sequence and structure analysis, and has been extensively tested through use in many smaller projects. Examples and documentation are available at the site below. AVAILABILITY StrBioLib may be obtained under the terms of the GNU LGPL license from http://strbio.sourceforge.net/
Affiliation(s)
- John-Marc Chandonia
  - Physical Biosciences Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA.
21
Abstract
Current plant genome sequencing projects have called for development of novel and powerful high throughput tools for the timely annotation of the subcellular location of uncharacterized plant proteins. In view of this, an ensemble classifier, Plant-PLoc, formed by fusing many basic individual classifiers, has been developed for large-scale subcellular location prediction for plant proteins. Each of the basic classifiers was engineered by the K-Nearest Neighbor (KNN) rule. Plant-PLoc discriminates plant proteins among the following 11 subcellular locations: (1) cell wall, (2) chloroplast, (3) cytoplasm, (4) endoplasmic reticulum, (5) extracell, (6) mitochondrion, (7) nucleus, (8) peroxisome, (9) plasma membrane, (10) plastid, and (11) vacuole. As a demonstration, predictions were performed on a stringent benchmark dataset in which none of the proteins included has ≥25% sequence identity to any other in the same subcellular location to avoid the homology bias. The overall success rate thus obtained was 32-51% higher than the rates obtained by the previous methods on the same benchmark dataset. The essence of Plant-PLoc in enhancing the prediction quality and its significance in biological applications are discussed. Plant-PLoc is accessible to the public as a free web-server at: (http://202.120.37.186/bioinf/plant). Furthermore, for public convenience, results predicted by Plant-PLoc have been provided in a downloadable file at the same website for all plant protein entries in the Swiss-Prot database that do not have subcellular location annotations, or are annotated as being uncertain. The large-scale results will be updated twice a year to include new entries of plant proteins and reflect the continuous development of Plant-PLoc.
Affiliation(s)
- Kuo-Chen Chou
  - Gordon Life Science Institute, San Diego, CA 92130, USA.
22
Liu N, Wang T. A simple method for protein structural classification. J Mol Graph Model 2007; 25:852-5. [PMID: 16997588 DOI: 10.1016/j.jmgm.2006.08.006] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Received: 04/10/2006] [Revised: 08/15/2006] [Accepted: 08/22/2006] [Indexed: 11/23/2022]
Abstract
Since the concept of structural classes of proteins was proposed, the problem of protein classification has been tackled by many groups. Most of their classification criteria are based only on the helix/strand contents of proteins. In this paper, we proposed a method for protein structural classification based on their secondary structure sequences. It is a classification scheme that can confirm existing classifications. Here a mathematical model is constructed to describe protein secondary structure sequences, in which each protein secondary structure sequence corresponds to a transition probability matrix that characterizes and differentiates protein structure numerically. Its application to a set of real data has indicated that our method can classify protein structures correctly. The final classification result is shown schematically. So it is visual to observe the structural classifications, which is different from traditional methods.
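The abstract does not spell out how the transition probability matrix is built, so the following is only a sketch under stated assumptions: a three-letter alphabet (H, E, C), first-order transitions between adjacent positions, and row-normalised counts. The names `STATES` and `transition_matrix` are hypothetical and do not come from the paper.

```python
from collections import Counter

STATES = "HEC"  # assumed alphabet: helix, strand, coil


def transition_matrix(ss: str) -> list[list[float]]:
    """Row-normalised first-order transition probabilities between adjacent states."""
    # Count every adjacent pair (state at i, state at i + 1).
    counts = Counter(zip(ss, ss[1:]))
    matrix = []
    for a in STATES:
        row_total = sum(counts[(a, b)] for b in STATES)
        # A state with no outgoing transitions gets an all-zero row.
        matrix.append([counts[(a, b)] / row_total if row_total else 0.0 for b in STATES])
    return matrix


# Hypothetical secondary structure string, for illustration only.
for state, row in zip(STATES, transition_matrix("HHHCCEEC")):
    print(state, row)
```

Each sequence, whatever its length, is reduced to a fixed 3 x 3 matrix under these assumptions, which is what makes sequences of different lengths numerically comparable.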
Affiliation(s)
- Na Liu
- Department of Applied Mathematics, Dalian University of Technology, Dalian 116024, China.
|
23
|
Liu N, Wang T. Graphical representations for protein secondary structure sequences and their application. Chem Phys Lett 2007. [DOI: 10.1016/j.cplett.2006.12.041] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/25/2022]
|
24
|
Abstract
BACKGROUND Protein secondary structure prediction is a fundamental and important component in the analytical study of protein structure and function. Prediction techniques have been developed for several decades. The Chou-Fasman algorithm, one of the earliest methods, has been successfully applied to the prediction. However, this method has its limitations due to low accuracy, unreliable parameters, and over-prediction. Thanks to recent developments in protein folding type-specific structure propensities and wavelet transformation, the shortcomings of the Chou-Fasman method can be overcome. RESULTS We improved the Chou-Fasman method in three aspects. (a) Replace the nucleation regions with extreme values of coefficients calculated by the continuous wavelet transform. (b) Substitute the original secondary structure conformational parameters with folding type-specific secondary structure propensities. (c) Modify the Chou-Fasman rules. The CB396 data set was tested using the improved Chou-Fasman method, and three indices (Q3, Qpre, SOV) were used to measure its performance. We compared the indices with those obtained from the original Chou-Fasman method and four other popular methods. The results showed that our improved Chou-Fasman method performs better than the original one in all indices, with about 10-18% improvement. It is also comparable to other currently popular methods considering all the indices. CONCLUSION Our method has greatly improved the Chou-Fasman method. It is able to predict protein secondary structure as well as currently popular methods. By locating nucleation regions with refined wavelet transform technology and by calculating propensity factors with a larger data set, it is likely to get a better result.
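Of the three indices named in this abstract, Q3 is the simplest: the percentage of residues whose predicted three-state label matches the observed one. A minimal sketch follows, assuming both structures are given as equal-length strings; Qpre and SOV are not defined in the abstract and are not attempted here.

```python
def q3(predicted, observed):
    """Percentage of residues whose predicted state equals the observed state."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not observed:
        raise ValueError("empty sequences have no defined Q3")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)


if __name__ == "__main__":
    # One mismatch (position 6) out of eight residues.
    print(q3("HHHEECCC", "HHHEEECC"))
```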
Affiliation(s)
- Hang Chen
- Department of Biomedical Engineering, College of Biomedical Engineering and Instrument Science, Zhejiang University, Hangzhou, 310027, China
- Fei Gu
- Department of Biotechnology, College of Life Sciences, Zhejiang University, Hangzhou, 310027, China
- Zhengge Huang
- Department of Computer Science, Center for engineering and scientific computation, Zhejiang University, Hangzhou, 310027, China
|
25
|
Huang WL, Chen HM, Hwang SF, Ho SY. Accurate prediction of enzyme subfamily class using an adaptive fuzzy k-nearest neighbor method. Biosystems 2006; 90:405-13. [PMID: 17140725 DOI: 10.1016/j.biosystems.2006.10.004] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/15/2006] [Revised: 10/15/2006] [Accepted: 10/22/2006] [Indexed: 10/24/2022]
Abstract
Amphiphilic pseudo-amino acid composition (Am-Pse-AAC) with extra sequence-order information is a useful feature for representing enzymes. This study first utilizes the k-nearest neighbor (k-NN) rule to analyze the distribution of enzymes in the Am-Pse-AAC feature space. This analysis indicates that the distributions of multiple classes of enzymes overlap heavily. To cope with the overlap problem, this study proposes an efficient non-parametric classifier for predicting enzyme subfamily class using an adaptive fuzzy k-nearest neighbor (AFK-NN) method, where k and a fuzzy strength parameter m are adaptively specified. The fuzzy membership values of a query sample Q are dynamically determined according to the position of Q and its weighted distances to the k nearest neighbors. Using the same enzymes of the oxidoreductases family for comparisons, the prediction accuracy of AFK-NN is 76.6%, which is better than those of Support Vector Machine (73.6%), the decision tree method C5.0 (75.4%) and the existing covariant-discriminant algorithm (70.6%) using a jackknife test. To evaluate the generalization ability of AFK-NN, the datasets for all six families of entirely sequenced enzymes are established from the newly updated SWISS-PROT and ENZYME database. The accuracy of AFK-NN on the new large-scale dataset of the oxidoreductases family is 83.3%, and the mean accuracy of the six families is 92.1%.
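The distance-weighted membership idea can be sketched with the classic fuzzy k-NN rule, in which each of the k nearest neighbors votes with weight 1 / d^(2/(m-1)). This is an illustration only, not the authors' AFK-NN: the adaptive choice of k and m is not described in the abstract, so both are plain arguments here, neighbor labels are treated as crisp, and the function name `fuzzy_knn` is hypothetical.

```python
import math
from collections import defaultdict


def fuzzy_knn(query, samples, labels, k=3, m=2.0):
    """Return {class: membership} for `query` using distance-weighted votes.

    `m` > 1 is the fuzzy strength parameter; larger values flatten the weights.
    """
    if m <= 1:
        raise ValueError("m must be greater than 1")
    # Keep the k nearest training samples as (distance, label) pairs.
    nearest = sorted(
        (math.dist(query, x), y) for x, y in zip(samples, labels)
    )[:k]
    exponent = 2.0 / (m - 1.0)
    # Guard against a zero distance when the query coincides with a sample.
    weights = [1.0 / max(d, 1e-12) ** exponent for d, _ in nearest]
    total = sum(weights)
    memberships = defaultdict(float)
    for w, (_, label) in zip(weights, nearest):
        memberships[label] += w / total
    return dict(memberships)


if __name__ == "__main__":
    samples = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]
    labels = ["a", "a", "b"]
    print(fuzzy_knn((0.5, 0.0), samples, labels, k=3, m=2.0))
```

Memberships sum to 1, so a query lying in an overlap region receives graded support for several classes rather than a single hard vote.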
Affiliation(s)
- Wen-Lin Huang
- Institute of Information Engineering and Computer Science, Feng Chia University, Taichung, Taiwan
|
26
|
|
27
|
Miyazaki S, Kuroda Y, Yokoyama S. Identification of putative domain linkers by a neural network - application to a large sequence database. BMC Bioinformatics 2006; 7:323. [PMID: 16800897 PMCID: PMC1538634 DOI: 10.1186/1471-2105-7-323] [Citation(s) in RCA: 15] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/24/2006] [Accepted: 06/27/2006] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND The reliable dissection of large proteins into structural domains represents an important issue for structural genomics/proteomics projects. To provide a practical approach to this issue, we tested the ability of a neural network to identify domain linkers from the SWISSPROT database (101602 sequences). RESULTS Our search detected 3009 putative domain linkers adjacent to or overlapping with domains, as defined by sequence similarity to either Protein Data Bank (PDB) or Conserved Domain Database (CDD) sequences. Among these putative linkers, 75% were "correctly" located within 20 residues of a domain terminus, and the remaining 25% were found in the middle of a domain, and probably represented failed predictions. Moreover, our neural network predicted 5124 putative domain linkers in structurally un-annotated regions without sequence similarity to PDB or CDD sequences, which suggests the possible existence of novel structural domains. As a comparison, we performed the same analysis by identifying low-complexity regions (LCR), which are known to encode unstructured polypeptide segments, and observed that the fraction of LCRs that correlate with domain termini is similar to that of domain linkers. However, domain linkers and LCRs appeared to identify different types of domain boundary regions, as only 32% of the putative domain linkers overlapped with LCRs. CONCLUSION Overall, our study indicates that the two methods detect independent and complementary regions, and that the combination of these methods can substantially improve the sensitivity of domain boundary prediction. This finding should enable the identification of novel structural domains, yielding new targets for large-scale protein analyses.
Affiliation(s)
- Satoshi Miyazaki
- Department of Biophysics and Biochemistry, Graduate School of Science, University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo, 113-0033, Japan
- RIKEN Genomic Sciences Center, 1-7-22, Suehiro-cho, Tsurumi, Yokohama 230-0045, Japan
- Yutaka Kuroda
- Department of Biotechnology and Life Science, Graduate School of Technology, Tokyo University of Agriculture and Technology, 2-24-16, Nakamachi, Koganei, 184-8588, Tokyo, Japan
- Shigeyuki Yokoyama
- Department of Biophysics and Biochemistry, Graduate School of Science, University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo, 113-0033, Japan
- RIKEN Genomic Sciences Center, 1-7-22, Suehiro-cho, Tsurumi, Yokohama 230-0045, Japan
|
28
|
Sun XD, Huang RB. Prediction of protein structural classes using support vector machines. Amino Acids 2006; 30:469-75. [PMID: 16622605 DOI: 10.1007/s00726-005-0239-0] [Citation(s) in RCA: 100] [Impact Index Per Article: 5.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/01/2005] [Accepted: 07/12/2005] [Indexed: 11/24/2022]
Abstract
The support vector machine, a machine-learning method, is used to predict the four structural classes, i.e. mainly alpha, mainly beta, alpha-beta and fss (few secondary structures), from the topology level of the CATH protein structure database. For binary classification, any two structural classes that do not share any secondary structure elements, such as alpha and beta, could be classified with as high as 90% accuracy. The accuracy, however, decreases to less than 70% if the structural classes to be classified contain structure elements in common. Our study also shows that feature spaces of dimension 20^2 = 400 (for dipeptides) and 20^3 = 8000 (for tripeptides) give nearly the same prediction accuracy. Among these four structural classes, multi-class classification gives an overall accuracy of about 52%, indicating that the multi-class classification technique in support vector machines may still need to be further improved in future investigation.
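The feature-space dimensions quoted in this abstract follow directly from counting all possible k-residue words over the 20 standard amino acids. The sketch below builds such a composition vector; the normalisation to frequencies and the skipping of windows with non-standard residues are assumptions, as the abstract does not describe either.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def kmer_composition(sequence, k=2):
    """Return the frequency of every possible k-mer (20**k entries)."""
    kmers = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    counts = [0] * len(kmers)
    windows = 0
    for i in range(len(sequence) - k + 1):
        pos = index.get(sequence[i:i + k])
        if pos is not None:  # skip windows containing non-standard letters
            counts[pos] += 1
            windows += 1
    if windows == 0:
        return [0.0] * len(kmers)
    return [c / windows for c in counts]


if __name__ == "__main__":
    print(len(kmer_composition("ACDA", k=2)))  # dipeptide feature count
    print(len(kmer_composition("ACDA", k=3)))  # tripeptide feature count
```

The tripeptide vector is twenty times larger and correspondingly sparser for a protein of typical length, which is consistent with the abstract's observation that it brings little extra accuracy.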
Affiliation(s)
- X-D Sun
- College of Life Science and Biotechnology, Guangxi University, Nanning, Guangxi, China
|
29
|
Aydin Z, Altunbasak Y, Borodovsky M. Protein secondary structure prediction for a single-sequence using hidden semi-Markov models. BMC Bioinformatics 2006; 7:178. [PMID: 16571137 PMCID: PMC1479840 DOI: 10.1186/1471-2105-7-178] [Citation(s) in RCA: 74] [Impact Index Per Article: 4.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/16/2005] [Accepted: 03/30/2006] [Indexed: 11/10/2022] Open
Abstract
Background The accuracy of protein secondary structure prediction has been improving steadily towards the 88% estimated theoretical limit. There are two types of prediction algorithms: Single-sequence prediction algorithms imply that information about other (homologous) proteins is not available, while algorithms of the second type imply that information about homologous proteins is available, and use it intensively. The single-sequence algorithms could make an important contribution to studies of proteins with no detected homologs; however, the accuracy of protein secondary structure prediction from a single sequence is not as high as when the additional evolutionary information is present. Results In this paper, we further refine and extend the hidden semi-Markov model (HSMM) initially considered in the BSPSS algorithm. We introduce an improved residue dependency model by considering the patterns of statistically significant amino acid correlation at structural segment borders. We also derive models that specialize on different sections of the dependency structure and incorporate them into HSMM. In addition, we implement an iterative training method to refine estimates of HSMM parameters. The three-state-per-residue accuracy and other accuracy measures of the new method, IPSSP, are shown to be comparable to or better than those for BSPSS as well as for PSIPRED, tested under the single-sequence condition. Conclusions We have shown that new dependency models and training methods bring further improvements to single-sequence protein secondary structure prediction. The results are obtained under cross-validation conditions using a dataset with no pair of sequences having significant sequence similarity. As new sequences are added to the database it is possible to augment the dependency structure and obtain even higher accuracy. Current and future advances should contribute to the improvement of function prediction for orphan proteins inscrutable to current similarity search methods.
Affiliation(s)
- Zafer Aydin
- School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA 30332-0250, USA
- Yucel Altunbasak
- School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA 30332-0250, USA
- Mark Borodovsky
- School of Biology, the Wallace H. Coulter Department of Biomedical Engineering and the Center for Bioinformatics and Computational Biology, Georgia Institute of Technology, Atlanta, GA 30332-0230, USA
|
30
|
Chou KC, Cai YD. Prediction of protease types in a hybridization space. Biochem Biophys Res Commun 2005; 339:1015-20. [PMID: 16325146 DOI: 10.1016/j.bbrc.2005.10.196] [Citation(s) in RCA: 55] [Impact Index Per Article: 2.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/18/2005] [Accepted: 10/30/2005] [Indexed: 11/21/2022]
Abstract
Regulating most physiological processes by controlling the activation, synthesis, and turnover of proteins, proteases play pivotal regulatory roles in conception, birth, digestion, growth, maturation, ageing, and death of all organisms. Different types of proteases have different functions and biological processes. Therefore, it is important for both basic research and drug discovery to consider the following two problems. (1) Given the sequence of a protein, can we identify whether it is a protease or non-protease? (2) If it is, what protease type does it belong to? Although the two problems can be solved by various experimental means, it is both time-consuming and costly to do so. The avalanche of protein sequences generated in the post-genomic era has challenged us to develop an automated method for making a fast and reliable identification. By hybridizing the functional domain composition and pseudo-amino acid composition, we have introduced a new method called "FunD-PseAA predictor" that is operated in a hybridization space. To avoid redundancy and bias, demonstrations were performed on a dataset where none of the proteins has ≥25% sequence identity to any other. The overall success rate thus obtained by the jackknife cross-validation test in identifying protease and non-protease was 92.95%, and that in identifying the protease type was 94.75% among the following six types: (1) aspartic, (2) cysteine, (3) glutamic, (4) metallo, (5) serine, and (6) threonine. Demonstration was also made on an independent dataset, and the corresponding overall success rates were 98.36% and 97.11%, respectively, suggesting the FunD-PseAA predictor is very powerful and may become a useful tool in bioinformatics and proteomics.
Affiliation(s)
- Kuo-Chen Chou
- Gordon Life Science Institute, 13784 Torrey Del Mar, San Diego, CA 92130, USA.
|
31
|
Cai YD, Chou KC. Predicting membrane protein type by functional domain composition and pseudo-amino acid composition. J Theor Biol 2005; 238:395-400. [PMID: 16040052 DOI: 10.1016/j.jtbi.2005.05.035] [Citation(s) in RCA: 72] [Impact Index Per Article: 3.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/11/2005] [Revised: 05/25/2005] [Accepted: 05/26/2005] [Indexed: 10/25/2022]
Abstract
Given the sequence of a protein, how can we predict whether it is a membrane protein or non-membrane protein? If it is, what membrane protein type does it belong to? Since these questions are closely relevant to the function of an uncharacterized protein, their importance is self-evident. In particular, with the explosion of protein sequences entering databanks and the much slower progress in using biochemical experiments to determine their functions, it is highly desirable to develop an automated method that can give fast answers to these questions. By hybridizing the functional domain (FunD) and pseudo-amino acid composition (PseAA), a new strategy called FunD-PseAA predictor was introduced. To test the power of the predictor, a highly non-homologous data set was constructed where none of the proteins has ≥25% sequence identity to any other. The overall success rates obtained with the FunD-PseAA predictor on such a data set by the jackknife cross-validation test were 85% in identifying membrane protein and non-membrane protein, and 91% in identifying the membrane protein type among the following 5 categories: (1) type-1 membrane protein, (2) type-2 membrane protein, (3) multipass transmembrane protein, (4) lipid chain-anchored membrane protein, and (5) GPI-anchored membrane protein. These rates are much higher than those obtained by the other methods on the same stringent data set, indicating that the FunD-PseAA predictor may become a useful high throughput tool in bioinformatics and proteomics.
Affiliation(s)
- Yu-Dong Cai
- Biomolecular Sciences Department, University of Manchester Institute of Science & Technology, P.O. Box 88, Manchester, M60 1QD, UK.
|
32
|
Peng K, Vucetic S, Radivojac P, Brown CJ, Dunker AK, Obradovic Z. Optimizing long intrinsic disorder predictors with protein evolutionary information. J Bioinform Comput Biol 2005; 3:35-60. [PMID: 15751111 DOI: 10.1142/s0219720005000886] [Citation(s) in RCA: 380] [Impact Index Per Article: 20.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/12/2004] [Revised: 02/05/2004] [Accepted: 05/14/2004] [Indexed: 11/18/2022]
Abstract
Protein existing as an ensemble of structures, called intrinsically disordered, has been shown to be responsible for a wide variety of biological functions and to be common in nature. Here we focus on improving sequence-based predictions of long (>30 amino acid residues) regions lacking specific 3-D structure by means of four new neural-network-based Predictors Of Natural Disordered Regions (PONDRs): VL3, VL3H, VL3P, and VL3E. PONDR VL3 used several features from a previously introduced PONDR VL2, but benefitted from optimized predictor models and a slightly larger (152 vs. 145) set of disordered proteins that were cleaned of mislabeling errors found in the smaller set. PONDR VL3H utilized homologues of the disordered proteins in the training stage, while PONDR VL3P used attributes derived from sequence profiles obtained by PSI-BLAST searches. The measure of accuracy was the average between accuracies on disordered and ordered protein regions. By this measure, the 30-fold cross-validation accuracies of VL3, VL3H, and VL3P were, respectively, 83.6 +/- 1.4%, 85.3 +/- 1.4%, and 85.2 +/- 1.5%. By combining VL3H and VL3P, the resulting PONDR VL3E achieved an accuracy of 86.7 +/- 1.4%. This is a significant improvement over our previous PONDRs VLXT (71.6 +/- 1.3%) and VL2 (80.9 +/- 1.4%). The new disorder predictors with the corresponding datasets are freely accessible through the web server at http://www.ist.temple.edu/disprot.
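The accuracy measure described in this abstract, the average of the accuracies on disordered and ordered regions, is what is often called balanced accuracy. A minimal sketch follows, assuming per-residue labels "D" (disordered) and "O" (ordered); the label symbols and the function name are illustrative choices, not taken from the paper.

```python
def balanced_accuracy(predicted, observed, disorder="D", order="O"):
    """Mean of the per-class accuracies on disordered and ordered residues (%)."""
    per_class = []
    for state in (disorder, order):
        positions = [i for i, o in enumerate(observed) if o == state]
        if not positions:
            raise ValueError(f"no observed residues with state {state!r}")
        hits = sum(predicted[i] == state for i in positions)
        per_class.append(hits / len(positions))
    return 100.0 * sum(per_class) / len(per_class)


if __name__ == "__main__":
    observed = "DDOOOOOO"
    predicted = "DOOOOOOO"
    # Half of the disordered residues and all ordered residues are correct.
    print(balanced_accuracy(predicted, observed))
```

Averaging the two class accuracies prevents a predictor from scoring well simply by labelling everything as the majority (ordered) class.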
Affiliation(s)
- Kang Peng
- Center for Information Science and Technology, Temple University, Philadelphia, PA 19122, USA
|
33
|
Chou KC, Cai YD. Predicting protein structural class by functional domain composition. Biochem Biophys Res Commun 2004; 321:1007-9. [PMID: 15358128 DOI: 10.1016/j.bbrc.2004.07.059] [Citation(s) in RCA: 144] [Impact Index Per Article: 7.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/02/2004] [Indexed: 11/16/2022]
Abstract
The functional domain composition is introduced to predict the structural class of a protein or domain according to the following classification: all-alpha, all-beta, alpha/beta, alpha+beta, mu (multi-domain), sigma (small protein), and rho (peptide). The advantage of doing so is that both the sequence-order-related features and the function-related features are naturally incorporated in the predictor. As a demonstration, the jackknife cross-validation test was performed on a dataset that consists of proteins and domains with less than 20% sequence identity to each other in order to remove any homology bias. The overall success rate thus obtained was 98%. In contrast to this, the corresponding rates obtained by the simple geometry approaches based on the amino acid composition were only 36-39%. This indicates that using the functional domain composition to represent the sample of a protein for statistical prediction is very promising, and that the functional type of a domain is closely correlated with its structural class.
Affiliation(s)
- Kuo-Chen Chou
- Gordon Life Science Institute, San Diego, CA 92130, USA.
|
34
|
Abstract
MOTIVATION With protein sequences entering databanks at an explosive pace, the early determination of the family or subfamily class for a newly found enzyme molecule becomes important because this is directly related to the detailed information about which specific target it acts on, as well as to its catalytic process and biological function. Unfortunately, it is both time-consuming and costly to do so by experiments alone. In a previous study, the covariant-discriminant algorithm was introduced to identify the 16 subfamily classes of oxidoreductases. Although the results were quite encouraging, the entire prediction process was based on the amino acid composition alone without including any sequence-order information. Therefore, it is worthy of further investigation. RESULTS To incorporate the sequence-order effects into the predictor, the 'amphiphilic pseudo amino acid composition' is introduced to represent the statistical sample of a protein. The novel representation contains 20 + 2λ discrete numbers: the first 20 numbers are the components of the conventional amino acid composition; the next 2λ numbers are a set of correlation factors that reflect different hydrophobicity and hydrophilicity distribution patterns along a protein chain. Based on such a concept and formulation scheme, a new predictor is developed. It is shown by the self-consistency test, jackknife test and independent dataset tests that the success rates obtained by the new predictor are all significantly higher than those by the previous predictors. The significant enhancement in success rates also implies that the distribution of hydrophobicity and hydrophilicity of the amino acid residues along a protein chain plays a very important role in its structure and function.
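The 20 + 2λ layout described in this abstract can be sketched as follows. The abstract fixes only the layout (20 composition terms followed by 2λ correlation factors); the product-based correlation over residue pairs k positions apart, the joint normalisation, and the weight `weight` are assumptions made for illustration. The hydrophobicity and hydrophilicity scales must be supplied by the caller, since no values appear in the abstract; the demo uses placeholder values, not a published scale.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def am_pse_aac(sequence, hydrophobicity, hydrophilicity, lam=2, weight=0.5):
    """Return a 20 + 2*lam feature vector for `sequence`.

    `hydrophobicity` and `hydrophilicity` map each residue letter to a number
    (caller-supplied, ideally standardised scales). `weight` balances the
    correlation factors against the composition terms (assumed parameter).
    """
    n = len(sequence)
    if lam >= n:
        raise ValueError("lam must be smaller than the sequence length")
    freqs = [sequence.count(a) / n for a in AMINO_ACIDS]
    taus = []
    for k in range(1, lam + 1):
        # For each separation k: one hydrophobicity and one hydrophilicity factor.
        for scale in (hydrophobicity, hydrophilicity):
            products = [
                scale[sequence[i]] * scale[sequence[i + k]] for i in range(n - k)
            ]
            taus.append(sum(products) / (n - k))
    denom = sum(freqs) + weight * sum(taus)
    return [f / denom for f in freqs] + [weight * t / denom for t in taus]


if __name__ == "__main__":
    # Placeholder scales for demonstration only.
    toy_h1 = {a: 1.0 for a in AMINO_ACIDS}
    toy_h2 = {a: 0.0 for a in AMINO_ACIDS}
    vec = am_pse_aac("ACDEFG", toy_h1, toy_h2, lam=2)
    print(len(vec))  # 20 + 2 * 2 entries
```

Setting `lam=0` is not supported by the range check's intent but would reduce the vector to the plain amino acid composition, which shows how the correlation factors extend rather than replace the conventional representation.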
Affiliation(s)
- Kuo-Chen Chou
- Gordon Life Science Institute, San Diego, CA 92130, USA.
|
35
|
|
36
|
Meiler J, Baker D. Coupled prediction of protein secondary and tertiary structure. Proc Natl Acad Sci U S A 2003; 100:12105-10. [PMID: 14528006 PMCID: PMC218720 DOI: 10.1073/pnas.1831973100] [Citation(s) in RCA: 147] [Impact Index Per Article: 7.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/04/2003] [Indexed: 11/18/2022] Open
Abstract
The strong coupling between secondary and tertiary structure formation in protein folding is neglected in most structure prediction methods. In this work we investigate the extent to which nonlocal interactions in predicted tertiary structures can be used to improve secondary structure prediction. The architecture of a neural network for secondary structure prediction that utilizes multiple sequence alignments was extended to accept low-resolution nonlocal tertiary structure information as an additional input. By using this modified network, together with tertiary structure information from native structures, the Q3-prediction accuracy is increased by 7-10% on average and by up to 35% in individual cases for independent test data. By using tertiary structure information from models generated with the ROSETTA de novo tertiary structure prediction method, the Q3-prediction accuracy is improved by 4-5% on average for small and medium-sized single-domain proteins. Analysis of proteins with particularly large improvements in secondary structure prediction using tertiary structure information provides insight into the feedback from tertiary to secondary structure.
Affiliation(s)
- Jens Meiler
- Department of Biochemistry, University of Washington, Box 357350, Seattle, WA 98195-7350, USA
|
37
|
Miyazaki S, Kuroda Y, Yokoyama S. Characterization and prediction of linker sequences of multi-domain proteins by a neural network. JOURNAL OF STRUCTURAL AND FUNCTIONAL GENOMICS 2003; 2:37-51. [PMID: 12836673 DOI: 10.1023/a:1014418700858] [Citation(s) in RCA: 39] [Impact Index Per Article: 1.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/12/2022]
Abstract
In this paper, we describe a neural network analysis of sequences connecting two protein domains (domain linkers). The neural network was trained to distinguish between domain linker sequences and non-linker sequences, using a SCOP-defined domain library. The analysis indicated that a significant difference existed between domain linkers and non-linker regions, including intra-domain loop regions. Moreover, the resulting Hinton diagram showed a position-dependent amino acid preference of the domain linker sequences, and implied their non-random nature. We then applied the neural network to predict domain linkers in multi-domain protein sequences. As the result of a Jack-knife test, 58% of the predicted regions matched actual linker regions (specificity), and 36% of the SCOP-derived domain linkers were predicted (sensitivity). This prediction efficiency is superior to simpler methods derived from secondary structure prediction that assume that long loop regions are putative domain linkers. Altogether, these results suggest that domain linkers possess local characteristics different from those of loop regions.
Affiliation(s)
- Satoshi Miyazaki
- Department of Biophysics and Biochemistry, Graduate School of Science, University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo, 113-0033, Japan
|
38
|
Ahmad S, Gromiha MM, Sarai A. Real value prediction of solvent accessibility from amino acid sequence. Proteins 2003; 50:629-35. [PMID: 12577269 DOI: 10.1002/prot.10328] [Citation(s) in RCA: 159] [Impact Index Per Article: 7.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022]
Abstract
The solvent accessibility of amino acid residues has been predicted in the past by classifying them into exposure states with varying thresholds. This classification provides a wide range of values for the accessible surface area (ASA) within which a residue may fall. Thus far, no attempt has been made to predict real values of ASA from the sequence information without a priori classification into exposure states. Here, we present a new method with which to predict real value ASAs for residues, based on neighborhood information. Our real value prediction neural network could estimate the ASA for four different nonhomologous, nonredundant data sets of varying size, with 18.0-19.5% mean absolute error, defined as per residue absolute difference between the predicted and experimental values of relative ASA. Correlation between the predicted and experimental values ranged from 0.47 to 0.50. It was observed that the ASA of a residue could be predicted within a 23.7% mean absolute error, even when no information about its neighbors is included. Prediction of real values answers the issue of arbitrary choice of ASA state thresholds, and carries more information than category prediction. Prediction error for each residue type strongly correlates with the variability in its experimental ASA values.
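The two evaluation quantities in this abstract, the mean absolute error of relative ASA and the correlation between predicted and experimental values, can be computed as in the sketch below. It assumes relative ASA values expressed in percent and uses the Pearson coefficient for the correlation; the abstract does not name the coefficient, so that choice is an assumption.

```python
import math


def mean_absolute_error(predicted, observed):
    """Average per-residue absolute difference between two value lists."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(observed)


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


if __name__ == "__main__":
    predicted = [10.0, 50.0, 90.0]   # illustrative relative ASA values (%)
    observed = [20.0, 40.0, 100.0]
    print(mean_absolute_error(predicted, observed))
    print(round(pearson_r(predicted, observed), 3))
```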
|
39
|
Shepherd AJ, Gorse D, Thornton JM. A novel approach to the recognition of protein architecture from sequence using Fourier analysis and neural networks. Proteins 2003; 50:290-302. [PMID: 12486723 DOI: 10.1002/prot.10290] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/07/2022]
Abstract
A novel method is presented for the prediction of protein architecture from sequence using neural networks. The method involves the preprocessing of protein sequence data by numerically encoding it and then applying a Fourier transform. The encoded and transformed data are then used to train a neural network to recognize a number of different protein architectures. The method proved significantly better than comparable alternative strategies such as percentage dipeptide frequency, but is still limited by the size of the data set and the input demands of a neural network. Its main potential is as a complement to existing fold recognition techniques, with its ability to identify global symmetries within protein structures its greatest strength.
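The preprocessing pipeline outlined in this abstract (numerical encoding followed by a Fourier transform) can be sketched in a few lines. The abstract does not say which numerical encoding was used, so the encoding here is a caller-supplied mapping and the two-letter demo mapping is a placeholder; the discrete Fourier transform is written out directly to keep the example dependency-free.

```python
import cmath


def encode(sequence, scale):
    """Map each residue to a number using a caller-supplied scale."""
    return [scale[residue] for residue in sequence]


def power_spectrum(values):
    """Power at frequencies 0..n/2 of the mean-centred signal (direct DFT)."""
    n = len(values)
    mean = sum(values) / n
    centred = [v - mean for v in values]
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(
            v * cmath.exp(-2j * cmath.pi * k * t / n)
            for t, v in enumerate(centred)
        )
        spectrum.append(abs(coeff) ** 2 / n)
    return spectrum


if __name__ == "__main__":
    toy_scale = {"L": 1.0, "K": 0.0}  # placeholder encoding, not a real scale
    spectrum = power_spectrum(encode("LKLKLKLK", toy_scale))
    print(spectrum.index(max(spectrum)))  # frequency index of the strongest peak
```

A strictly alternating pattern concentrates all power at the highest frequency, which illustrates how periodicities along the chain become single peaks that a downstream classifier can use as input features.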
Affiliation(s)
- Adrian J Shepherd
- Department of Biochemistry and Molecular Biology, University College London, London, United Kingdom.
|
40
|
Huang JT, Wang MT. Secondary structural wobble: the limits of protein prediction accuracy. Biochem Biophys Res Commun 2002; 294:621-5. [PMID: 12056813 DOI: 10.1016/s0006-291x(02)00545-4] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/27/2022]
Abstract
At present, accuracies of secondary structural prediction scarcely go beyond 70-75%. Secondary structural comparison is carried out among sequence-identified proteins. The results show that natural wobble between different secondary structural types is possible in homologous families, and that the best prediction accuracy will rarely be 100%. Besides shortcomings of the prediction approaches, secondary structural wobble is found to be responsible for nearly all secondary structural prediction limits. On average, only 73.2% of amino acid residues are conserved in secondary structural type. The wobble allows alpha-class/coil and beta-class/coil transitions but not direct alpha-class/beta-class transitions. Propensity values representing the statistical occurrence of the 20 amino acid residues in secondary structural wobbles are given.
Affiliation(s)
- Ji-Tao Huang
- Department of Biochemistry, Tianjin Institute of Technology, Tianjin 300191, China.
|
41
|
Cai YD, Liu XJ, Xu XB, Chou KC. Prediction of protein structural classes by support vector machines. COMPUTERS & CHEMISTRY 2002; 26:293-6. [PMID: 11868916 DOI: 10.1016/s0097-8485(01)00113-9] [Citation(s) in RCA: 195] [Impact Index Per Article: 8.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/24/2022]
Abstract
In this paper, we apply a new machine learning method, the support vector machine, to the prediction of protein structural class. The support vector machine method is applied to a database derived from SCOP, which is based upon domains of known structure, their evolutionary relationships, and the principles that govern their 3D structure. As a result, high rates in both self-consistency and jackknife tests are obtained. This indicates that the structural class of a protein is considerably correlated with its amino acid composition, and that the support vector machine can serve as a powerful computational tool for predicting the structural classes of proteins.
Affiliation(s)
- Yu-Dong Cai
- Shanghai Research Centre of Biotechnology, Chinese Academy of Sciences.
|
42
|
Cai YD, Liu XJ, Xu XB, Zhou GP. Support vector machines for predicting protein structural class. BMC Bioinformatics 2001; 2:3. [PMID: 11483157 PMCID: PMC35360 DOI: 10.1186/1471-2105-2-3] [Citation(s) in RCA: 74] [Impact Index Per Article: 3.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/24/2001] [Accepted: 06/29/2001] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND We apply a new machine learning method, the so-called Support Vector Machine method, to predict protein structural class. The Support Vector Machine method is applied to a database derived from SCOP, in which protein domains are classified based on known structures, evolutionary relationships, and the principles that govern their 3-D structure. RESULTS High rates are obtained in both self-consistency and jackknife tests. The good results indicate that the structural class of a protein is considerably correlated with its amino acid composition. CONCLUSIONS It is expected that the Support Vector Machine method and the elegant component-coupled method, also known as the covariant discrimination algorithm, can, if they complement each other, provide a powerful computational tool for predicting the structural classes of proteins.
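The abstract above ties structural class to amino acid composition. As a minimal sketch of that input representation only (not the authors' code), the composition of a sequence can be computed as 20 fractional occurrences. The function name `amino_acid_composition`, the residue ordering in `AMINO_ACIDS`, and the choice to ignore non-standard letters are illustrative assumptions; the classifier itself is not shown.

```python
from collections import Counter

# Assumed ordering of the 20 standard amino acids (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return 20 fractions, one per standard amino acid, summing to 1.

    Letters outside the 20 standard residues (e.g. X, B, Z) are ignored,
    which is an assumption of this sketch.
    """
    counts = Counter(res for res in sequence.upper() if res in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


# Example call on a made-up peptide; the result is a 20-element feature vector.
features = amino_acid_composition("MKTAYIAKQR")
```

A vector of this kind is what a classifier such as a support vector machine would take as input for each protein domain.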
Affiliation(s)
- Yu-Dong Cai
- Shanghai Research Centre of Biotechnology, Chinese Academy of Sciences, Shanghai, 200233, China
- Xiao-Jun Liu
- Institute of Cell, Animal and Population Biology, University of Edinburgh, West Mains Road, Edinburgh EH9 3JT, U.K.
- Xue-biao Xu
- Department of Computing Science, University of Wales, College of Cardiff, Queens Buildings, Newport Road, PO Box 916, Cardiff CF2 3XF, U.K.
- Guo-Ping Zhou
- Department of Structural Biology, Burnham Institute, La Jolla, California 92037, USA
|
43
|
Paci E, Smith LJ, Dobson CM, Karplus M. Exploration of partially unfolded states of human alpha-lactalbumin by molecular dynamics simulation. J Mol Biol 2001; 306:329-47. [PMID: 11237603 DOI: 10.1006/jmbi.2000.4337] [Citation(s) in RCA: 56] [Impact Index Per Article: 2.4] [Indexed: 11/22/2022]
Abstract
Molecular dynamics simulations are used to probe the properties of non-native states of the protein human alpha-lactalbumin (human alpha-LA) with a detailed atomistic model in an implicit aqueous solvent environment. To sample the conformational space, a biasing force is introduced that increases the radius of gyration relative to the native state and generates a large number of low-energy conformers that differ in their root-mean-square deviation for a given radius of gyration. The resulting structures are relaxed by unbiased simulations and used as models of the molten globule and partly denatured states of human alpha-LA, based on measured radii of gyration obtained from nuclear magnetic resonance experiments. The ensembles of structures agree in their overall properties with experimental data available for the human alpha-LA molten globule and its more denatured states. In particular, the simulation results show that the native-like fold of the alpha-domain is preserved in the molten globule. Further, a considerable proportion of the antiparallel beta-strand in the beta-domain is present. This indicates that the lack of hydrogen exchange protection found experimentally for the beta-domain is due to rearrangement of the beta-sheet involving transient populations of non-native beta-structures. The simulations also provide details concerning the ensemble of structures that contribute as the molten globule unfolds and show, in accord with experimental data, that unfolding is not cooperative; i.e., the various structural elements do not unfold simultaneously.
Affiliation(s)
- E Paci
- Oxford Centre for Molecular Sciences, New Chemistry Laboratory, University of Oxford, South Parks Road, Oxford, OX1 3QT, UK
|
44
|
Elkin CD, Zuccola HJ, Hogle JM, Joseph-McCarthy D. Computational design of D-peptide inhibitors of hepatitis delta antigen dimerization. J Comput Aided Mol Des 2000; 14:705-18. [PMID: 11131965 DOI: 10.1023/a:1008146015629] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.4] [Indexed: 11/12/2022]
Abstract
Hepatitis delta virus (HDV) encodes a single polypeptide called hepatitis delta antigen (DAg). Dimerization of DAg is required for viral replication. The structure of the dimerization region, residues 12 to 60, consists of an anti-parallel coiled coil [Zuccola et al., Structure, 6(1998)821]. Multiple Copy Simultaneous Searches (MCSS) of the hydrophobic core region formed by the bend in the helix of one monomer of this structure were carried out for many diverse functional groups. Six critical interaction sites were identified. The Protein Data Bank was searched for backbone templates to use in the subsequent design process by matching to these sites. A 14-residue helix expected to bind to the D-isomer of the target structure was selected as the template. Over 200,000 mutant sequences of this peptide were generated based on the MCSS results. A secondary structure prediction algorithm was used to screen all sequences, and in general only those that were predicted to be highly helical were retained. Approximately 100 of these 14-mers were model built as D-peptides and docked with the L-isomer of the target monomer. Based on calculated interaction energies, predicted helicity, and intrahelical salt bridge patterns, a small number of peptides were selected as the most promising candidates. The ligand design approach presented here is the computational analogue of mirror image phage display. The results have been used to characterize the interactions responsible for formation of this model anti-parallel coiled coil and to suggest potential ligands to disrupt it.
Affiliation(s)
- C D Elkin
- Committee on Higher Degrees in Biophysics, Harvard University, Cambridge, MA 02139, USA
|
45
|
Abstract
A tight turn in protein structure is defined as a site where (i) a polypeptide chain reverses its overall direction, i.e., leads the chain to fold back on itself by nearly 180 degrees, and (ii) the amino acid residues directly involved in forming the turn number no more than six. Tight turns are generally categorized as delta-turn, gamma-turn, beta-turn, alpha-turn, and pi-turn, which are formed by two, three, four, five, and six amino acid residues, respectively. According to the folding mode, each of these tight turns can be further classified into several different types. Tight turns play an important role in globular proteins from both the structural and functional points of view. In view of this, various efforts have been made to predict tight turns and their types. This Review summarizes the development in this area, with an emphasis on the most recent work, which features the sequence-coupled model. The future challenges in this area are also briefly addressed.
Affiliation(s)
- K C Chou
- Computer-Aided Drug Discovery, Pharmacia & Upjohn, Kalamazoo, Michigan, 49007-4940, USA
|
46
|
Affiliation(s)
- Y Cai
- Shanghai Research Centre of Biotechnology, Chinese Academy of Sciences, 200233, Shanghai, China.
|
47
|
Zhang CT, Zhang R. A graphic approach to evaluate algorithms of secondary structure prediction. J Biomol Struct Dyn 2000; 17:829-42. [PMID: 10798528 DOI: 10.1080/07391102.2000.10506572] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.3] [Indexed: 10/28/2022]
Abstract
Algorithms for secondary structure prediction have been under development for nearly 30 years. However, the problem of how to evaluate and compare algorithms appropriately has not yet been completely solved. A graphic method to evaluate algorithms of secondary structure prediction is proposed here. Traditionally, the performance of an algorithm is evaluated by a number, i.e., accuracy under various definitions. Instead of a number, we use a graph to evaluate an algorithm completely, in which the mapping points are distributed in a three-dimensional space. Each point represents the predictive result for the secondary structure of a protein. Because the distribution of mapping points in the 3D space generally contains more information than a number or a set of numbers, it is expected that algorithms may be evaluated and compared more objectively by the proposed graphic method. Based on the point distribution, six evaluation parameters are proposed, which describe the overall performance of the algorithm evaluated. Furthermore, the graphic method is simple and intuitive. As an example of application, two advanced algorithms, i.e., the PHD and NNpredict methods, are evaluated and compared. It is shown that there is still much room for improvement in both algorithms. It is pointed out that the accuracy of predicting either the alpha-helix or beta-strand in proteins with higher alpha-helix or beta-strand content, respectively, should be greatly improved for both algorithms.
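For context on the "single number" that the abstract contrasts with its graphic method, a minimal sketch of the usual three-state per-residue accuracy follows. It assumes secondary structure is given as equal-length strings over H (helix), E (strand), and C (coil); the function names are hypothetical, and the paper's three-dimensional mapping and its six evaluation parameters are not reproduced here.

```python
def q3_accuracy(observed, predicted):
    """Percentage of residues whose predicted state matches the observed one."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    if not observed:
        raise ValueError("empty secondary structure strings")
    correct = sum(o == p for o, p in zip(observed, predicted))
    return 100.0 * correct / len(observed)


def per_state_accuracy(observed, predicted, state):
    """Percentage of residues observed in `state` that are predicted as `state`."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    positions = [i for i, o in enumerate(observed) if o == state]
    if not positions:
        raise ValueError(f"no residues observed in state {state!r}")
    hits = sum(predicted[i] == state for i in positions)
    return 100.0 * hits / len(positions)


# Made-up strings for one protein: a single strand residue is mispredicted as coil.
observed = "HHHEEECCC"
predicted = "HHHEECCCC"
overall = q3_accuracy(observed, predicted)
strand = per_state_accuracy(observed, predicted, "E")
```

Computing one such value per protein, rather than pooling all residues, gives the per-protein results that a graphic evaluation would then plot.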
Affiliation(s)
- C T Zhang
- Department of Physics, Tianjin University, China.
|
48
|
Abstract
Proteins of known structure are usually classified into four structural classes: all-alpha, all-beta, alpha+beta, and alpha/beta proteins. A number of methods for predicting the structural class of a protein based on its amino acid composition have been developed during the past few years. Recently, a component-coupled method was developed for predicting protein structural class according to amino acid composition. This method is based on the least Mahalanobis distance principle and yields much better prediction results than the previous methods. However, the success rates reported for structural class prediction by different investigators are contradictory. The highest reported accuracies by this method are near 100%, but the lowest one is only about 60%. The goal of this study is to resolve this paradox and to determine the possible upper limit of the prediction rate for structural classes. In this paper, based on the normality assumption and the Bayes decision rule for minimum error, a new method is proposed for predicting the structural class of a protein according to its amino acid composition. The detailed theoretical analysis indicates that if the four protein folding classes are governed by normal distributions, the present method will yield the optimum predictive result in a statistical sense. A non-redundant data set of 1,189 protein domains is used to evaluate the performance of the new method. Our results demonstrate that 60% correctness is the upper limit for a 4-type class prediction from amino acid composition alone for an unknown query protein. The apparently high accuracy (more than 90%) attained in previous studies was due to the preselection of test sets, which may not be adequately representative of all unrelated proteins.
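The two decision rules named in this abstract can be written in their standard textbook form. This is a sketch in assumed notation, not taken from the paper: x is the composition vector of the query protein, and mu_k, Sigma_k, and P(omega_k) are the mean vector, covariance matrix, and prior probability of structural class k.

```latex
% Squared Mahalanobis distance of query x from class k (assumed notation)
\[
D_k^2(x) = (x - \mu_k)^{\top} \Sigma_k^{-1} (x - \mu_k)
\]
% Bayes decision rule for minimum error when each class is multivariate normal
\[
\hat{k} = \arg\max_{k} \Bigl[ \ln P(\omega_k) - \tfrac{1}{2} \ln \lvert \Sigma_k \rvert - \tfrac{1}{2} D_k^2(x) \Bigr]
\]
```

When the priors and the covariance determinants are equal across classes, the second rule reduces to choosing the class with the least Mahalanobis distance. Because the 20 composition fractions sum to one, the covariance of the full 20-component vector is singular, so a rule of this kind has to work in a reduced space, for example by dropping one component.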
Affiliation(s)
- Z X Wang
- National Laboratory of Biomacromolecules, Institute of Biophysics, Academia Sinica, Beijing, Peoples Republic of China.
|
49
|
Abstract
The three-dimensional structure of a protein is uniquely dictated by its primary sequence. However, owing to the highly degenerate nature of the sequence-structure relationship, proteins are generally folded into one of only a few structural classes that are closely correlated with the amino acid composition. This suggests that the interaction among the components of amino acid composition may play a considerable role in determining the structural class of a protein. To quantitatively test this hypothesis at a deeper level, three potential functions, U^(0), U^(1), and U^(2), were formulated that respectively represent the 0th-order, 1st-order, and 2nd-order approximations for the interaction among the components of the amino acid composition in a protein. It was observed that the correct rates in recognizing protein structural classes by U^(2) are significantly higher than those by U^(0) and U^(1). This indicates that an algorithm that more completely incorporates the interaction contributions will yield better recognition quality, and hence further demonstrates that the interaction among the components of amino acid composition is an important driving force in determining the structural class of a protein during the sequence folding process.
Affiliation(s)
- K C Chou
- Computer-Aided Drug Discovery, Pharmacia and Upjohn, Kalamazoo, Michigan, 49007-4940, USA
|
50
|
|