1
Long S, Tian P. Protein secondary structure prediction with context convolutional neural network. RSC Adv 2019; 9:38391-38396. [PMID: 35540205 PMCID: PMC9075825 DOI: 10.1039/c9ra05218f] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 07/09/2019] [Accepted: 11/18/2019] [Indexed: 11/21/2022] [Open Access]
Abstract
Protein secondary structure (SS) prediction is important for studying protein structure and function. Both traditional machine learning methods and deep learning neural networks have been utilized, and great progress has been achieved in approaching the theoretical limit. Convolutional and recurrent neural networks are two major types of deep learning architectures with comparable prediction accuracy but different training procedures to achieve optimal performance. We are interested in seeking a novel architectural style with competitive performance and in understanding the performance of different architectures with similar training procedures. We constructed a context convolutional neural network (Contextnet) and compared its performance with popular models (e.g. convolutional neural network, recurrent neural network, conditional neural fields…) under similar training procedures on a Jpred dataset. The Contextnet proved to be highly competitive. Additionally, we retrained the network with the Cullpdb dataset and compared it with Jpred, ReportX, the Spider3 server and the MUFold-SS method; the Contextnet achieved higher Q3 accuracy on a CASP13 dataset. Training procedures were found to have a significant impact on the accuracy of the Contextnet.
Affiliation(s)
- Pu Tian
  - School of Life Science, School of Artificial Intelligence, Jilin University, 2699 Qian-jin Street, Changchun, China 130012
2
Kandoi G, Leelananda SP, Jernigan RL, Sen TZ. Predicting Protein Secondary Structure Using Consensus Data Mining (CDM) Based on Empirical Statistics and Evolutionary Information. Methods Mol Biol 2017; 1484:35-44. [PMID: 27787818 DOI: 10.1007/978-1-4939-6406-2_4] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Indexed: 06/06/2023]
Abstract
Predicting the secondary structure of a protein from its sequence remains a challenging problem. Prediction accuracies remain around 80% across very diverse methods. Using evolutionary information and machine learning algorithms in particular has had the most impact. In this chapter, we will first define secondary structures, then we will review the Consensus Data Mining (CDM) technique based on the robust GOR algorithm and Fragment Database Mining (FDM) approach. GOR V is an empirical method utilizing a sliding window approach to model the secondary structural elements of a protein by making use of generalized evolutionary information. FDM uses data mining from experimental structure fragments, and is able to successfully predict the secondary structure of a protein by combining experimentally determined structural fragments based on sequence similarities of the fragments. The CDM method combines predictions from GOR V and FDM in a hierarchical manner to produce consensus predictions for secondary structure. In other words, if sequence fragments are not available, then it uses GOR V to make the secondary structure prediction. The online server of CDM is available at http://gor.bb.iastate.edu/cdm/.
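The hierarchical fallback described in this abstract can be illustrated with a minimal sketch. It assumes, purely for illustration, that the fragment-based prediction is available per residue and is `None` wherever no usable fragment was found; the function name `cdm_consensus` and this per-residue rule are hypothetical and are not taken from the CDM server, whose actual decision criterion is not given in the abstract.

```python
from typing import Optional, Sequence


def cdm_consensus(fdm_pred: Sequence[Optional[str]], gor_pred: str) -> str:
    """Combine two per-residue predictions hierarchically.

    fdm_pred: fragment-based states ('H', 'E', 'C'), or None where no
              fragment was available (an assumption of this sketch).
    gor_pred: GOR-style states for every residue, used as the fallback.
    """
    if len(fdm_pred) != len(gor_pred):
        raise ValueError("predictions must cover the same residues")
    # Prefer the fragment-based state; fall back to GOR where it is missing.
    return "".join(
        fdm if fdm is not None else gor
        for fdm, gor in zip(fdm_pred, gor_pred)
    )


if __name__ == "__main__":
    fdm = ["H", None, "E", None]
    gor = "CCCC"
    print(cdm_consensus(fdm, gor))
```

The fragment-based prediction takes priority in this sketch because the abstract describes GOR V as the method used when fragments are unavailable.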
Affiliation(s)
- Gaurav Kandoi
  - Bioinformatics and Computational Biology Program, Iowa State University, Ames, IA, USA
  - Department of Electrical and Computer Engineering, Iowa State University, Ames, IA, USA
- Sumudu P Leelananda
  - Battelle Center for Mathematical Medicine, The Research Institute at Nationwide Children's Hospital, Columbus, OH, USA
- Robert L Jernigan
  - Bioinformatics and Computational Biology Program, Iowa State University, Ames, IA, USA
  - Department of Biochemistry, Biophysics, and Molecular Biology, Iowa State University, Ames, IA, USA
- Taner Z Sen
  - Bioinformatics and Computational Biology Program, Iowa State University, Ames, IA, USA
  - Department of Genetics, Development and Cell Biology, Iowa State University, 1025 Crop Genome Informatics Lab, Ames, IA, 50011, USA
3
Rashid S, Saraswathi S, Kloczkowski A, Sundaram S, Kolinski A. Protein secondary structure prediction using a small training set (compact model) combined with a Complex-valued neural network approach. BMC Bioinformatics 2016; 17:362. [PMID: 27618812 PMCID: PMC5020447 DOI: 10.1186/s12859-016-1209-0] [Citation(s) in RCA: 24] [Impact Index Per Article: 3.0] [Received: 10/07/2015] [Accepted: 08/25/2016] [Indexed: 11/17/2022] [Open Access]
Abstract
BACKGROUND Protein secondary structure prediction (SSP) has been an area of intense research interest. Despite advances in recent methods conducted on large datasets, the estimated upper limit accuracy is yet to be reached. Since the predictions of SSP methods are applied as input to higher-level structure prediction pipelines, even small errors may have large perturbations in final models. Previous works relied on cross validation as an estimate of classifier accuracy. However, training on large numbers of protein chains compromises the classifier ability to generalize to new sequences. This prompts a novel approach to training and an investigation into the possible structural factors that lead to poor predictions. Here, a small group of 55 proteins termed the compact model is selected from the CB513 dataset using a heuristics-based approach. In a prior work, all sequences were represented as probability matrices of residues adopting each of Helix, Sheet and Coil states, based on energy calculations using the C-Alpha, C-Beta, Side-chain (CABS) algorithm. The functional relationship between the conformational energies computed with CABS force-field and residue states is approximated using a classifier termed the Fully Complex-valued Relaxation Network (FCRN). The FCRN is trained with the compact model proteins. RESULTS The performance of the compact model is compared with traditional cross-validated accuracies and blind-tested on a dataset of G Switch proteins, obtaining accuracies of ∼81 %. The model demonstrates better results when compared to several techniques in the literature. A comparative case study of the worst performing chain identifies hydrogen bond contacts that lead to Coil ⇔ Sheet misclassifications. Overall, mispredicted Coil residues have a higher propensity to participate in backbone hydrogen bonding than correctly predicted Coils. 
CONCLUSIONS The implications of these findings are: (i) the choice of training proteins is important in preserving the generalization of a classifier to predict new sequences accurately and (ii) SSP techniques sensitive in distinguishing between backbone hydrogen bonding and side-chain or water-mediated hydrogen bonding might be needed in the reduction of Coil ⇔ Sheet misclassifications.
Affiliation(s)
- Shamima Rashid
  - School of Computer Science and Engineering, Nanyang Technological University, 50 Nanyang Ave, Singapore, 639798 Singapore
- Saras Saraswathi
  - Battelle Center for Mathematical Medicine, The Research Institute at Nationwide Children’s Hospital, 700 Children’s Drive, Columbus, USA
  - Sidra Medical and Research Center, Al Dafna, Doha, Qatar
- Andrzej Kloczkowski
  - Battelle Center for Mathematical Medicine, The Research Institute at Nationwide Children’s Hospital, 700 Children’s Drive, Columbus, USA
  - Department of Paediatrics, College of Medicine, The Ohio State University, 370 W. 9th Avenue, Columbus, USA
- Suresh Sundaram
  - School of Computer Science and Engineering, Nanyang Technological University, 50 Nanyang Ave, Singapore, 639798 Singapore
- Andrzej Kolinski
  - Laboratory of Theory of Biopolymers, Faculty of Chemistry, University of Warsaw, Pasteura 1, Warsaw, 02-093 Poland
4
Maier K, He Y, Esser PR, Thriene K, Sarca D, Kohlhase J, Dengjel J, Martin L, Has C. Single Amino Acid Deletion in Kindlin-1 Results in Partial Protein Degradation Which Can Be Rescued by Chaperone Treatment. J Invest Dermatol 2016; 136:920-929. [DOI: 10.1016/j.jid.2015.12.039] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.4] [Received: 07/31/2015] [Revised: 11/30/2015] [Accepted: 12/19/2015] [Indexed: 10/22/2022]
5
Pinilla G, Muñoz LC, Salazar LM, Navarrete J, Guevara A. DISEÑO DE PÉPTIDOS BASADO EN LA SECUENCIA ANÁLOGA AL REPRESOR NEGATIVO icaR DE Staphylococcus sp. [Peptide design based on the sequence analogous to the negative repressor icaR of Staphylococcus sp.] REVISTA COLOMBIANA DE QUÍMICA 2016. [DOI: 10.15446/rev.colomb.quim.v44n2.55213] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/09/2022] [Open Access]
Abstract
Biofilm, a virulence mechanism of Staphylococcus involved in hospital-acquired infections, is regulated by a negative repressor responsible for the complete transcription of the operon. A search for functional domains by computational modeling identified peptide sequences with biological activity analogous to that of the icaR protein. Peptides were designed by computational biology using the AntiBP prediction program (http://www.imtech.res.in/raghava/antibp/); chemical synthesis was carried out by the Nα-Fmoc method, and three molecules were characterized and purified by RP-HPLC and MALDI-TOF. Their biological safety was evaluated with a cytotoxicity assay on murine macrophages of the J774 line, and hemolytic activity was determined using red blood cells. The three characterized peptides, IR1, IR2 and IR3, showed predominantly alpha-helical secondary structure, a high degree of purity and a high antimicrobial score; in addition, they showed low toxicity, as evidenced by their cytotoxic and hemolytic activity at the concentrations tested and in comparison with the controls used, which would allow their potential use as candidate molecules or active ingredients with activity analogous to the native repressor against Staphylococcus biofilm.
6
Shirani A, Shahbazi Mojarrad J, Mussa Farkhani S, Yari Khosroshahi A, Zakeri-Milani P, Samadi N, Sharifi S, Mohammadi S, Valizadeh H. The Relation Between Thermodynamic and Structural Properties and Cellular Uptake of Peptides Containing Tryptophan and Arginine. Adv Pharm Bull 2015; 5:161-8. [PMID: 26236653 DOI: 10.15171/apb.2015.023] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.9] [Received: 10/26/2014] [Revised: 11/20/2014] [Accepted: 11/24/2014] [Indexed: 01/31/2023] [Open Access]
Abstract
PURPOSE Cell-penetrating peptides (CPPs) are used for delivering drugs and other macromolecular cargo into living cells. In this paper, we investigated the relationship between the structural/physicochemical properties of four new synthetic peptides containing arginine-tryptophan and their cell membrane penetration efficiency. METHODS The peptides were prepared by solid-phase synthesis using FMOC-protected amino acids. Fluorescence-activated cell sorting and fluorescence imaging were used to evaluate uptake efficiency. Prediction of the peptide secondary structure and estimation of physicochemical properties were performed using the GOR V method and MPEx 3.2 software (Wimley-White scale, helical wheel projection and total hydrophobic moment). RESULTS Our data showed that the uptake efficiency of peptides with two tryptophans at the C- and N-termini was significantly higher (about 4-fold) than that of peptides containing three tryptophans at both ends. The distribution of arginine at both ends also increased the uptake efficiency 2.52- and 7.18-fold, compared with arginine distribution at the middle of peptides. CONCLUSION According to the obtained results, the transfer free energy of peptides from the aqueous phase to the membrane bilayer could be a good predictor of the cellular uptake efficiency of CPPs.
Affiliation(s)
- Ali Shirani
  - Research Center for Pharmaceutical Nanotechnology and Department of Medical Nanotechnology, Faculty of Advanced Medical Sciences, Tabriz University of Medical Sciences, Tabriz, Iran
- Javid Shahbazi Mojarrad
  - Biotechnology Research Center and Faculty of Pharmacy, Tabriz University of Medical Sciences, Tabriz, Iran
- Samad Mussa Farkhani
  - Research Center for Pharmaceutical Nanotechnology and Department of Medical Nanotechnology, Faculty of Advanced Medical Sciences, Tabriz University of Medical Sciences, Tabriz, Iran
- Ahmad Yari Khosroshahi
  - Biotechnology Research Center and Faculty of Pharmacy, Tabriz University of Medical Sciences, Tabriz, Iran
- Parvin Zakeri-Milani
  - Liver and Gastrointestinal Diseases Research Center and Faculty of Pharmacy, Tabriz University of Medical Sciences, Tabriz, Iran
- Naser Samadi
  - Faculty of Advanced Medical Sciences, Tabriz University of Medical Sciences, Tabriz, Iran
- Simin Sharifi
  - Drug Applied Research Center and Faculty of Pharmacy, Tabriz University of Medical Sciences, Tabriz, Iran
- Samaneh Mohammadi
  - Research Center for Pharmaceutical Nanotechnology and Department of Medical Nanotechnology, Faculty of Advanced Medical Sciences, Tabriz University of Medical Sciences, Tabriz, Iran
- Hadi Valizadeh
  - Drug Applied Research Center and Faculty of Pharmacy, Tabriz University of Medical Sciences, Tabriz, Iran
7
Ahmed MH, Kellogg GE, Selley DE, Safo MK, Zhang Y. Predicting the molecular interactions of CRIP1a-cannabinoid 1 receptor with integrated molecular modeling approaches. Bioorg Med Chem Lett 2014; 24:1158-65. [PMID: 24461351 PMCID: PMC4353595 DOI: 10.1016/j.bmcl.2013.12.119] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.2] [Received: 10/11/2013] [Revised: 12/26/2013] [Accepted: 12/29/2013] [Indexed: 12/14/2022]
Abstract
Cannabinoid receptors are a family of G-protein coupled receptors that are involved in a wide variety of physiological processes and diseases. One of the key regulators that are unique to cannabinoid receptors is the cannabinoid receptor interacting proteins (CRIPs). Among them CRIP1a was found to decrease the constitutive activity of the cannabinoid type-1 receptor (CB1R). The aim of this study is to gain an understanding of the interaction between CRIP1a and CB1R through using different computational techniques. The generated model demonstrated several key putative interactions between CRIP1a and CB1R, including the critical involvement of Lys130 in CRIP1a.
Affiliation(s)
- Mostafa H Ahmed
  - Department of Medicinal Chemistry, Virginia Commonwealth University, Richmond, VA 23298, USA; Institute for Structural Biology and Drug Discovery, Virginia Commonwealth University, Richmond, VA 23298, USA
- Glen E Kellogg
  - Department of Medicinal Chemistry, Virginia Commonwealth University, Richmond, VA 23298, USA; Institute for Structural Biology and Drug Discovery, Virginia Commonwealth University, Richmond, VA 23298, USA; Center for the Study of Biological Complexity, Virginia Commonwealth University, Richmond, VA 23298, USA
- Dana E Selley
  - Department of Pharmacology and Toxicology, Virginia Commonwealth University, Richmond, VA 23298, USA
- Martin K Safo
  - Department of Medicinal Chemistry, Virginia Commonwealth University, Richmond, VA 23298, USA; Institute for Structural Biology and Drug Discovery, Virginia Commonwealth University, Richmond, VA 23298, USA
- Yan Zhang
  - Department of Medicinal Chemistry, Virginia Commonwealth University, Richmond, VA 23298, USA
8
Saraswathi S, Fernández-Martínez JL, Koliński A, Jernigan RL, Kloczkowski A. Distributions of amino acids suggest that certain residue types more effectively determine protein secondary structure. J Mol Model 2013; 19:4337-48. [PMID: 23907551 DOI: 10.1007/s00894-013-1911-z] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 03/20/2013] [Accepted: 06/05/2013] [Indexed: 11/27/2022]
Abstract
Exponential growth in the number of available protein sequences is unmatched by the slower growth in the number of structures. As a result, the development of efficient and fast protein secondary structure prediction methods is essential for the broad comprehension of protein structures. Computational methods that can efficiently determine secondary structure can in turn facilitate protein tertiary structure prediction, since most methods rely initially on secondary structure predictions. Recently, we have developed a fast learning optimized prediction methodology (FLOPRED) for predicting protein secondary structure (Saraswathi et al. in JMM 18:4275, 2012). Data are generated by using knowledge-based potentials combined with structure information from the CATH database. A neural network-based extreme learning machine (ELM) and advanced particle swarm optimization (PSO) are used with this data to obtain better and faster convergence to more accurate secondary structure predicted results. A five-fold cross-validated testing accuracy of 83.8 % and a segment overlap (SOV) score of 78.3 % are obtained in this study. Secondary structure predictions and their accuracy are usually presented for three secondary structure elements: α-helix, β-strand and coil but rarely have the results been analyzed with respect to their constituent amino acids. In this paper, we use the results obtained with FLOPRED to provide detailed behaviors for different amino acid types in the secondary structure prediction. We investigate the influence of the composition, physico-chemical properties and position specific occurrence preferences of amino acids within secondary structure elements. In addition, we identify the correlation between these properties and prediction accuracy. The present detailed results suggest several important ways that secondary structure predictions can be improved in the future that might lead to improved protein design and engineering.
Affiliation(s)
- S Saraswathi
  - Battelle Center for Mathematical Medicine, The Research Institute at Nationwide Children's Hospital, 700 Children's Drive, Columbus, OH, USA
9
Yardeni T, Jacobs K, Niethamer TK, Ciccone C, Anikster Y, Kurochkina N, Gahl WA, Huizing M. Murine isoforms of UDP-GlcNAc 2-epimerase/ManNAc kinase: Secondary structures, expression profiles, and response to ManNAc therapy. Glycoconj J 2013; 30:609-18. [PMID: 23266873 PMCID: PMC3622838 DOI: 10.1007/s10719-012-9459-1] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 10/16/2012] [Revised: 11/27/2012] [Accepted: 11/28/2012] [Indexed: 11/25/2022]
Abstract
The bifunctional enzyme UDP-GlcNAc 2-epimerase/ManNAc kinase (GNE) catalyzes the first two committed steps in sialic acid synthesis. Non-allosteric GNE gene mutations cause the muscular disorder GNE myopathy (also known as hereditary inclusion body myopathy), whose exact pathology remains unknown. Increased knowledge of GNE regulation, including isoform regulation, may help elucidate the pathology of GNE myopathy. While eight mRNA transcripts encoding human GNE isoforms are described, we only identified two mouse Gne mRNA transcripts, encoding mGne1 and mGne2, homologous to human hGNE1 and hGNE2. Orthologs of the other human isoforms were not identified in mice. mGne1 appeared as the ubiquitously expressed, major mouse isoform. The mGne2 encoding transcript is differentially expressed and may act as a tissue-specific regulator of sialylation. mGne2 expression appeared significantly increased the first 2 days of life, possibly reflecting the high sialic acid demand during this period. Tissues of the knock-in Gne p.M712T mouse model had similar mGne transcript expression levels among genotypes, indicating no effect of the mutation on mRNA expression. However, upon treatment of these mice with N-acetylmannosamine (ManNAc, a Gne substrate, sialic acid precursor, and proposed therapy for GNE myopathy), Gne transcript expression, in particular mGne2, increased significantly, likely resulting in increased Gne enzymatic activities. This dual effect of ManNAc supplementation (increased flux through the sialic acid pathway and increased Gne activity) needs to be considered when treating GNE myopathy patients with ManNAc. In addition, the existence and expression of GNE isoforms needs consideration when designing other therapeutic strategies for GNE myopathy.
Affiliation(s)
- Tal Yardeni
  - Medical Genetics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD 20895, USA
  - Sackler Faculty of Medicine, Tel Aviv University, Tel Aviv, 69978 Israel
- Katherine Jacobs
  - Medical Genetics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD 20895, USA
- Terren K. Niethamer
  - Medical Genetics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD 20895, USA
- Carla Ciccone
  - Medical Genetics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD 20895, USA
- Yair Anikster
  - Sackler Faculty of Medicine, Tel Aviv University, Tel Aviv, 69978 Israel
- Natalya Kurochkina
  - The School of Theoretical Modeling, Department of Biophysics, Chevy Chase, MD 20825, USA
- William A. Gahl
  - Medical Genetics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD 20895, USA
- Marjan Huizing
  - Medical Genetics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD 20895, USA
10
Gribble KE, Mark Welch DB. The mate recognition protein gene mediates reproductive isolation and speciation in the Brachionus plicatilis cryptic species complex. BMC Evol Biol 2012; 12:134. [PMID: 22852831 PMCID: PMC3495898 DOI: 10.1186/1471-2148-12-134] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.3] [Received: 04/23/2012] [Accepted: 07/23/2012] [Indexed: 12/15/2022] [Open Access]
Abstract
Background Chemically mediated prezygotic barriers to reproduction likely play an important role in speciation. In facultatively sexual monogonont rotifers from the Brachionus plicatilis cryptic species complex, mate recognition of females by males is mediated by the Mate Recognition Protein (MRP), a globular glycoprotein on the surface of females, encoded by the mmr-b gene family. In this study, we sequenced mmr-b copies from 27 isolates representing 11 phylotypes of the B. plicatilis species complex, examined the mode of evolution and selection of mmr-b, and determined the relationship between mmr-b genetic distance and mate recognition among isolates. Results Isolates of the B. plicatilis species complex have 1–4 copies of mmr-b, each composed of 2–9 nearly identical tandem repeats. The repeats within a gene copy are generally more similar than are gene copies among phylotypes, suggesting concerted evolution. Compared to housekeeping genes from the same isolates, mmr-b has accumulated only half as many synonymous differences but twice as many non-synonymous differences. Most of the amino acid differences between repeats appear to occur on the outer face of the protein, and these often result in changes in predicted patterns of phosphorylation. However, we found no evidence of positive selection driving these differences. Isolates with the most divergent copies were unable to mate with other isolates and rarely self-crossed. Overall the degree of mate recognition was significantly correlated with the genetic distance of mmr-b. Conclusions Discrimination of compatible mates in the B. plicatilis species complex is determined by proteins encoded by closely related copies of a single gene, mmr-b. While concerted evolution of the tandem repeats in mmr-b may function to maintain identity, it can also lead to the rapid spread of a mutation through all copies in the genome and thus to reproductive isolation. 
The mmr-b gene is evolving rapidly, and novel alleles may be maintained and increase in frequency via asexual reproduction. Our analyses indicate that mate recognition, controlled by MMR-B, may drive reproductive isolation and allow saltational sympatric speciation within the B. plicatilis cryptic species complex, and that this process may be largely neutral.
Affiliation(s)
- Kristin E Gribble
  - Marine Biological Laboratory, 7 MBL Street, Woods Hole, MA 02543, USA
11
Wei Y, Thompson J, Floudas CA. CONCORD: a consensus method for protein secondary structure prediction via mixed integer linear optimization. Proc Math Phys Eng Sci 2011. [DOI: 10.1098/rspa.2011.0514] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.1] [Indexed: 11/12/2022] [Open Access]
Abstract
Most of the protein structure prediction methods use a multi-step process, which often includes secondary structure prediction, contact prediction, fragment generation, clustering, etc. For many years, secondary structure prediction has been the workhorse for numerous methods aimed at predicting protein structure and function. This paper presents a new mixed integer linear optimization (MILP)-based consensus method: a Consensus scheme based On a mixed integer liNear optimization method for seCOndary stRucture preDiction (CONCORD). Based on seven secondary structure prediction methods, SSpro, DSC, PROF, PROFphd, PSIPRED, Predator and GorIV, the MILP-based consensus method combines the strengths of different methods, maximizes the number of correctly predicted amino acids and achieves a better prediction accuracy. The method is shown to perform well compared with the seven individual methods when tested on the PDBselect25 training protein set using sixfold cross validation. It also performs well compared with another set of 10 online secondary structure prediction servers (including several recent ones) when tested on the CASP9 targets (http://predictioncenter.org/casp9/). The average Q3 prediction accuracy is 83.04 per cent for the sixfold cross validation of the PDBselect25 set and 82.3 per cent for the CASP9 targets. We have developed a MILP-based consensus method for protein secondary structure prediction. A web server, CONCORD, is available to the scientific community at http://helios.princeton.edu/CONCORD.
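The Q3 accuracy quoted in this abstract is the percentage of residues whose three-state label (helix, strand or coil) is predicted correctly. A minimal sketch follows, assuming both strings use the same single-letter alphabet such as 'H', 'E' and 'C'; the function name `q3_accuracy` is illustrative and not part of CONCORD.

```python
def q3_accuracy(predicted: str, observed: str) -> float:
    """Return the percentage of residues whose state matches the reference."""
    if len(predicted) != len(observed):
        raise ValueError("sequences must have the same length")
    if not observed:
        raise ValueError("sequences must not be empty")
    # Count positions where the predicted state equals the observed state.
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)


if __name__ == "__main__":
    # One of eight residues differs, so seven are correct.
    print(q3_accuracy("HHHEECCC", "HHHEEECC"))
```

Averaging over a test set can be done per protein or per residue; the abstract does not say which convention was used, so this sketch scores a single sequence only.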
Affiliation(s)
- Y. Wei
  - Department of Chemical and Biological Engineering, Princeton University, Princeton, NJ 08544, USA
- J. Thompson
  - Department of Chemical and Biological Engineering, Princeton University, Princeton, NJ 08544, USA
- C. A. Floudas
  - Department of Chemical and Biological Engineering, Princeton University, Princeton, NJ 08544, USA
12
Yardeni T, Choekyi T, Jacobs K, Ciccone C, Patzel K, Anikster Y, Gahl WA, Kurochkina N, Huizing M. Identification, tissue distribution, and molecular modeling of novel human isoforms of the key enzyme in sialic acid synthesis, UDP-GlcNAc 2-epimerase/ManNAc kinase. Biochemistry 2011; 50:8914-25. [PMID: 21910480 DOI: 10.1021/bi201050u] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.7] [Indexed: 12/23/2022]
Abstract
UDP-GlcNAc 2-epimerase/ManNAc kinase (GNE) catalyzes the first two committed steps in sialic acid synthesis. In addition to the three previously described human GNE isoforms (hGNE1-hGNE3), our database and polymerase chain reaction analysis yielded five additional human isoforms (hGNE4-hGNE8). hGNE1 is the ubiquitously expressed major isoform, while the hGNE2-hGNE8 isoforms are differentially expressed and may act as tissue-specific regulators of sialylation. hGNE2 and hGNE7 display a 31-residue N-terminal extension compared to hGNE1. On the basis of similarities to kinases and helicases, this extension does not seem to hinder the epimerase enzymatic active site. hGNE3 and hGNE8 contain a 55-residue N-terminal deletion and a 50-residue N-terminal extension compared to hGNE1. The size and secondary structures of these fragments are similar, and modeling predicted that these modifications do not affect the overall fold compared to that of hGNE1. However, the epimerase enzymatic activity of GNE3 and GNE8 is likely absent, because the deleted fragment contains important substrate binding residues in homologous bacterial epimerases. hGNE5-hGNE8 have a 53-residue deletion, which was assigned a role in substrate (UDP-GlcNAc) binding. Deletion of this fragment likely eliminates epimerase enzymatic activity. Our findings imply that GNE is subject to evolutionary mechanisms to improve cellular functions, without increasing the number of genes. Our expression and modeling data contribute to elucidation of the complex functional and regulatory mechanisms of human GNE and may contribute to further elucidating the pathology and treatment strategies of the human GNE-opathies sialuria and hereditary inclusion body myopathy.
Affiliation(s)
- Tal Yardeni
  - Medical Genetics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland 20892, USA
13
Saidemberg DM, Baptista-Saidemberg NB, Palma MS. Chemometric analysis of Hymenoptera toxins and defensins: A model for predicting the biological activity of novel peptides from venoms and hemolymph. Peptides 2011; 32:1924-33. [PMID: 21855589 DOI: 10.1016/j.peptides.2011.08.001] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.7] [Received: 06/10/2011] [Revised: 07/29/2011] [Accepted: 08/01/2011] [Indexed: 11/22/2022]
Abstract
When searching for prospective novel peptides, it is difficult to determine the biological activity of a peptide based only on its sequence. The "trial and error" approach is generally laborious, expensive and time consuming due to the large number of different experimental setups required to cover a reasonable number of biological assays. To simulate a virtual model for Hymenoptera insects, 166 peptides were selected from the venoms and hemolymphs of wasps, bees and ants and applied to a mathematical model of multivariate analysis, with nine different chemometric components: GRAVY, aliphaticity index, number of disulfide bonds, total residues, net charge, pI value, Boman index, percentage of alpha helix, and flexibility prediction. Principal component analysis (PCA) with non-linear iterative projections by alternating least-squares (NIPALS) algorithm was performed, without including any information about the biological activity of the peptides. This analysis permitted the grouping of peptides in a way that strongly correlated to the biological function of the peptides. Six different groupings were observed, which seemed to correspond to the following groups: chemotactic peptides, mastoparans, tachykinins, kinins, antibiotic peptides, and a group of long peptides with one or two disulfide bonds and with biological activities that are not yet clearly defined. The partial overlap between the mastoparans group and the chemotactic peptides, tachykinins, kinins and antibiotic peptides in the PCA score plot may be used to explain the frequent reports in the literature about the multifunctionality of some of these peptides. The mathematical model used in the present investigation can be used to predict the biological activities of novel peptides in this system, and it may also be easily applied to other biological systems.
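Of the nine descriptors listed in this abstract, GRAVY (grand average of hydropathy) is the simplest to reproduce: it is the mean Kyte-Doolittle hydropathy value over all residues of a sequence. The sketch below is illustrative only; the function name `gravy` is an assumption, and the abstract does not say which software the authors used to compute the descriptor.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence: str) -> float:
    """Mean hydropathy of a peptide; positive values are more hydrophobic."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    unknown = sorted(set(seq) - set(KYTE_DOOLITTLE))
    if unknown:
        raise ValueError(f"unsupported residue codes: {unknown}")
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)


if __name__ == "__main__":
    # Isoleucine (+4.5) and arginine (-4.5) cancel out.
    print(gravy("IR"))
```

Each peptide would contribute one such value as a single column of the descriptor matrix that is then passed to the PCA step.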
Affiliation(s)
- Daniel M Saidemberg
- Center of Study of Social Insects (CEIS)/Dept. Biology, Institute of Biosciences of Rio Claro, São Paulo State University (UNESP), Rio Claro, SP 13506-900, Brazil
|
14
|
Estimating the acidity of singly and multiply substituted benzoic acids via electrostatic potential at the nucleus. Chem Phys Lett 2011. [DOI: 10.1016/j.cplett.2011.07.038] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 11/19/2022]
|
15
|
Burger SK, Liu S, Ayers PW. Practical Calculation of Molecular Acidity with the Aid of a Reference Molecule. J Phys Chem A 2011; 115:1293-304. [DOI: 10.1021/jp111148q] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.8] [Indexed: 11/28/2022]
Affiliation(s)
- Steven K. Burger
- Department of Chemistry and Chemical Biology, McMaster University, 1280 Main Street West, Hamilton, Ontario, Canada L8S 4L8
- Shubin Liu
- Research Computing Center, University of North Carolina, Chapel Hill, North Carolina 27599-3420, United States
- Paul W. Ayers
- Department of Chemistry and Chemical Biology, McMaster University, 1280 Main Street West, Hamilton, Ontario, Canada L8S 4L8
|
16
|
Lin HN, Sung TY, Ho SY, Hsu WL. Improving protein secondary structure prediction based on short subsequences with local structure similarity. BMC Genomics 2010; 11 Suppl 4:S4. [PMID: 21143813 PMCID: PMC3005913 DOI: 10.1186/1471-2164-11-s4-s4] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.7] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: When characterizing the structural topology of proteins, protein secondary structure (PSS) plays an important role in analyzing and modeling protein structures because it represents the local conformation of amino acids into regular structures. Although PSS prediction has been studied for decades, the prediction accuracy reaches a bottleneck at around 80%, and further improvement is very difficult. RESULTS: In this paper, we present an improved dictionary-based PSS prediction method called SymPred, and a meta-predictor called SymPsiPred. We adopt the concept behind natural language processing techniques and propose synonymous words to capture local sequence similarities in a group of similar proteins. A synonymous word is an n-gram pattern of amino acids that reflects the sequence variation in a protein's evolution. We generate a protein-dependent synonymous dictionary from a set of protein sequences for PSS prediction. On a large non-redundant dataset of 8,297 protein chains (DsspNr-25), the average Q3 values of SymPred and SymPsiPred are 81.0% and 83.9%, respectively. On the two latest independent test sets (EVA_Set1 and EVA_Set2), the average Q3 of SymPred is 78.8% and 79.2%, respectively. SymPred outperforms other existing methods by 1.4% to 5.4%. We study two factors that may affect the performance of SymPred and find that it is very sensitive to the number of proteins of both known and unknown structures. This finding implies that SymPred and SymPsiPred have the potential to achieve higher accuracy as the number of protein sequences in the NCBInr and PDB databases increases. CONCLUSIONS: Our experimental results show that local similarities in protein sequences typically exhibit conserved structures, which can be used to improve the accuracy of secondary structure prediction. For the application of synonymous words, we demonstrate an example of a sequence alignment which is generated by the distribution of shared synonymous words of a pair of protein sequences. We can align the two sequences nearly perfectly, even though they are very dissimilar at the sequence level but very similar at the structural level. The SymPred and SymPsiPred prediction servers are available at http://bio-cluster.iis.sinica.edu.tw/SymPred/.
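The Q3 figures quoted in this and the neighbouring abstracts are three-state per-residue accuracies: the percentage of residues whose predicted state (helix, strand or coil) matches the observed one. A minimal sketch of that calculation follows. The function name `q3_accuracy` and the one-letter H/E/C string encoding are assumptions for illustration, not part of any of the cited tools.

```python
def q3_accuracy(predicted: str, observed: str) -> float:
    """Percentage of residues whose three-state label is predicted correctly."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed strings must have equal length")
    if not observed:
        raise ValueError("sequences must not be empty")
    # Count positions where the predicted and observed states agree.
    matches = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * matches / len(observed)


# Hypothetical 10-residue example: one helix residue is mispredicted as coil.
print(q3_accuracy("CHHHCCEEEC", "CHHHHCEEEC"))  # 90.0
```

Dataset-level values such as those reported above are averages of this quantity. Whether the average is taken per chain or over all residues varies between studies and is not specified here.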
Affiliation(s)
- Hsin-Nan Lin
- Bioinformatics Program, Taiwan International Graduate Program, Academia Sinica, Taipei, Taiwan
|
17
|
Cheng H, Sen TZ, Jernigan RL, Kloczkowski A. Consensus Data Mining (CDM) Protein Secondary Structure Prediction Server: combining GOR V and Fragment Database Mining (FDM). Bioinformatics 2007; 23:2628-30. [PMID: 17660202 PMCID: PMC2553684 DOI: 10.1093/bioinformatics/btm379] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.4] [Indexed: 11/14/2022] Open
Abstract
One of the challenges in protein secondary structure prediction is to overcome the cross-validated 80% prediction accuracy barrier. Here, we propose a novel approach to surpass this barrier. Instead of using a single algorithm that relies on a limited data set for training, we combine two complementary methods having different strengths: Fragment Database Mining (FDM) and GOR V. FDM harnesses the availability of the known protein structures in the Protein Data Bank and provides highly accurate secondary structure predictions when sequentially similar structural fragments are identified. In contrast, the GOR V algorithm is based on information theory, Bayesian statistics, and PSI-BLAST multiple sequence alignments to predict the secondary structure of residues inside a sliding window along a protein chain. A combination of these two different methods benefits from the large number of structures in the PDB and significantly improves the secondary structure prediction accuracy, resulting in Q3 ranging from 67.5 to 93.2%, depending on the availability of highly similar fragments in the Protein Data Bank.
Affiliation(s)
- Haitao Cheng
- Department of Biochemistry, Biophysics and Molecular Biology, Iowa State University, Ames, IA 50011, USA
|
18
|
Bondugula R, Xu D. MUPRED: a tool for bridging the gap between template based methods and sequence profile based methods for protein secondary structure prediction. Proteins 2007; 66:664-70. [PMID: 17109407 DOI: 10.1002/prot.21177] [Citation(s) in RCA: 33] [Impact Index Per Article: 1.9] [Indexed: 11/11/2022]
Abstract
Predicting secondary structures from a protein sequence is an important step for characterizing the structural properties of a protein. Existing methods for protein secondary structure prediction can be broadly classified into template based or sequence profile based methods. We propose a novel framework that bridges the gap between the two fundamentally different approaches. Our framework integrates the information from the fuzzy k-nearest neighbor algorithm and position-specific scoring matrices using a neural network. It combines the strengths of the two methods and has a better potential to use the information in both the sequence and structure databases than existing methods. We implemented the framework into a software system MUPRED. MUPRED has achieved three-state prediction accuracy (Q3) ranging from 79.2 to 80.14%, depending on which benchmark dataset is used. A higher Q3 can be achieved if a query protein has a significant sequence identity (>25%) to a template in PDB. MUPRED also estimates the prediction accuracy at the individual residue level more quantitatively than existing methods. The MUPRED web server and executables are freely available at http://digbio.missouri.edu/mupred.
Affiliation(s)
- Rajkumar Bondugula
- Digital Biology Laboratory, Department of Computer Science, Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, Missouri 65211, USA
|
19
|
Zhang N, Ruan J, Wu J, Zhang T. SHEETSPAIR: A Database of Amino Acid Pairs in Protein Sheet Structures. DATA SCIENCE JOURNAL 2007. [DOI: 10.2481/dsj.6.s589] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.4] [Indexed: 11/20/2022] Open
|
20
|
Sen TZ, Cheng H, Kloczkowski A, Jernigan RL. A Consensus Data Mining secondary structure prediction by combining GOR V and Fragment Database Mining. Protein Sci 2006; 15:2499-506. [PMID: 17001039 PMCID: PMC2242411 DOI: 10.1110/ps.062125306] [Citation(s) in RCA: 16] [Impact Index Per Article: 0.9] [Indexed: 10/24/2022]
Abstract
The major aim of tertiary structure prediction is to obtain protein models with the highest possible accuracy. Fold recognition, homology modeling, and de novo prediction methods typically use predicted secondary structures as input, and all of these methods may significantly benefit from more accurate secondary structure predictions. Although there are many different secondary structure prediction methods available in the literature, their cross-validated prediction accuracy is generally <80%. In order to increase the prediction accuracy, we developed a novel hybrid algorithm called Consensus Data Mining (CDM) that combines our two previous successful methods: (1) Fragment Database Mining (FDM), which exploits the Protein Data Bank structures, and (2) GOR V, which is based on information theory, Bayesian statistics, and multiple sequence alignments (MSA). In CDM, the target sequence is dissected into smaller fragments that are compared with fragments obtained from related sequences in the PDB. For fragments with a sequence identity above a certain sequence identity threshold, the FDM method is applied for the prediction. The remainder of the fragments are predicted by GOR V. The results of the CDM are provided as a function of the upper sequence identities of aligned fragments and the sequence identity threshold. We observe that the value 50% is the optimum sequence identity threshold, and that the accuracy of the CDM method measured by Q3 ranges from 67.5% to 93.2%, depending on the availability of known structural fragments with sufficiently high sequence identity. As the Protein Data Bank grows, it is anticipated that this consensus method will improve because it will rely more upon the structural fragments.
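The selection rule described in this abstract (use FDM where a sufficiently similar PDB fragment exists, GOR V otherwise) can be sketched as follows. This is a simplified illustration rather than the published implementation: the `Fragment` class, its field names, and the assumption that fragments are non-overlapping and listed in sequence order are all hypothetical. The default threshold of 50% is the optimum reported in the abstract.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Fragment:
    """Hypothetical container for one fragment of the target sequence."""
    fdm_prediction: str   # secondary structure string from FDM
    gor_prediction: str   # secondary structure string from GOR V
    best_identity: float  # highest sequence identity to a PDB fragment, in percent


def cdm_predict(fragments: List[Fragment], identity_threshold: float = 50.0) -> str:
    """Pick FDM above the identity threshold, otherwise fall back to GOR V."""
    parts = []
    for frag in fragments:
        if frag.best_identity > identity_threshold:
            parts.append(frag.fdm_prediction)   # similar template fragment found
        else:
            parts.append(frag.gor_prediction)   # no close fragment: use GOR V
    return "".join(parts)


# Hypothetical input: the first fragment has a close PDB match, the second does not.
example = [Fragment("HHHH", "CCCC", 80.0), Fragment("EEE", "CCE", 30.0)]
print(cdm_predict(example))  # HHHHCCE
```

The sketch makes the stated dependence visible: as more fragments exceed the identity threshold, more of the final prediction comes from FDM.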
Affiliation(s)
- Taner Z Sen
- Department of Biochemistry, Biophysics, and Molecular Biology, Iowa State University, Ames, Iowa 50011-3020, USA.
|
21
|
Sen TZ, Jernigan RL, Garnier J, Kloczkowski A. GOR V server for protein secondary structure prediction. Bioinformatics 2005; 21:2787-8. [PMID: 15797907 PMCID: PMC2553678 DOI: 10.1093/bioinformatics/bti408] [Citation(s) in RCA: 146] [Impact Index Per Article: 7.7] [Indexed: 11/14/2022] Open
Abstract
SUMMARY: We have created the GOR V web server for protein secondary structure prediction. The GOR V algorithm combines information theory, Bayesian statistics and evolutionary information. In its fifth version, the GOR method reached (with the full jack-knife procedure) an accuracy of prediction Q3 of 73.5%. Although GOR V has been among the most successful methods, its online unavailability has been a deterrent to its popularity. Here, we remedy this situation by creating the GOR V server.
Affiliation(s)
- Taner Z Sen
- L.H. Baker Center for Bioinformatics and Biological Statistics, Iowa State University Ames, IA 50011, USA
|