1
Jang YJ, Qin QQ, Huang SY, Peter ATJ, Ding XM, Kornmann B. Accurate prediction of protein function using statistics-informed graph networks. Nat Commun 2024; 15:6601. [PMID: 39097570 PMCID: PMC11297950 DOI: 10.1038/s41467-024-50955-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/17/2023] [Accepted: 07/15/2024] [Indexed: 08/05/2024]
Abstract
Understanding protein function is pivotal in comprehending the intricate mechanisms that underlie many crucial biological activities, with far-reaching implications in the fields of medicine, biotechnology, and drug development. However, more than 200 million proteins remain uncharacterized, and computational efforts heavily rely on protein structural information to predict annotations of varying quality. Here, we present PhiGnet, a method that utilizes statistics-informed graph networks to predict protein function solely from sequence. Our method inherently characterizes evolutionary signatures, allowing for a quantitative assessment of the significance of residues that carry out specific functions. PhiGnet not only demonstrates superior performance compared to alternative approaches but also narrows the sequence-function gap, even in the absence of structural information. Our findings indicate that applying deep learning to evolutionary data can highlight functional sites at the residue level, providing valuable support for interpreting both existing properties and new functionalities of proteins in research and biomedicine.
Affiliation(s)
- Yaan J Jang
- Department of Biochemistry, University of Oxford, Oxford, UK.
- AmoAi Technologies, Oxford, UK.
- Qi-Qi Qin
- AmoAi Technologies, Oxford, UK
- School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, Shanghai, China
- Si-Yu Huang
- AmoAi Technologies, Oxford, UK
- Oxford Martin School, University of Oxford, Oxford, UK
- School of Systems Science, Beijing Normal University, Beijing, China
- Xue-Ming Ding
- School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, Shanghai, China
- Benoît Kornmann
- Department of Biochemistry, University of Oxford, Oxford, UK.
2
Nordquist E, Zhang G, Barethiya S, Ji N, White KM, Han L, Jia Z, Shi J, Cui J, Chen J. Incorporating physics to overcome data scarcity in predictive modeling of protein function: A case study of BK channels. PLoS Comput Biol 2023; 19:e1011460. [PMID: 37713443 PMCID: PMC10529646 DOI: 10.1371/journal.pcbi.1011460] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/24/2023] [Revised: 09/27/2023] [Accepted: 08/24/2023] [Indexed: 09/17/2023]
Abstract
Machine learning has played transformative roles in numerous chemical and biophysical problems, such as protein folding, where large amounts of data exist. Nonetheless, many important problems remain challenging for data-driven machine learning approaches because of data scarcity. One approach to overcoming data scarcity is to incorporate physical principles, for example through molecular modeling and simulation. Here, we focus on the big potassium (BK) channels that play important roles in cardiovascular and neural systems. Many mutants of the BK channel are associated with various neurological and cardiovascular diseases, but the molecular effects are unknown. The voltage gating properties of BK channels have been characterized for 473 site-specific mutations experimentally over the last three decades; yet, these functional data by themselves remain far too sparse to derive a predictive model of BK channel voltage gating. Using physics-based modeling, we quantify the energetic effects of all single mutations on both open and closed states of the channel. Together with dynamic properties derived from atomistic simulations, these physical descriptors allow the training of random forest models that could reproduce unseen experimentally measured shifts in gating voltage, ΔV1/2, with an RMSE of ~32 mV and a correlation coefficient of R ~ 0.7. Importantly, the model appears capable of uncovering nontrivial physical principles underlying the gating of the channel, including a central role of hydrophobic gating. The model was further evaluated using four novel mutations of L235 and V236 on the S5 helix, mutations of which are predicted to have opposing effects on V1/2 and suggest a key role of S5 in mediating voltage sensor-pore coupling. The measured ΔV1/2 values agree quantitatively with the predictions for all four mutations, with a high correlation of R = 0.92 and RMSE = 18 mV. Therefore, the model can capture nontrivial voltage gating properties in regions where few mutations are known. The success of predictive modeling of BK voltage gating demonstrates the potential of combining physics and statistical learning for overcoming data scarcity in nontrivial protein function prediction.
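The two figures of merit quoted in this abstract, RMSE and the correlation coefficient R, can be illustrated with a short sketch. It assumes that R denotes the Pearson correlation, which the abstract does not state explicitly. The function names and the ΔV1/2 values below are invented for illustration and are not taken from the study.

```python
import math


def rmse(predicted, measured):
    """Root-mean-square error between paired predictions and measurements."""
    if len(predicted) != len(measured) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - m) ** 2 for p, m in zip(predicted, measured)]
    return math.sqrt(sum(squared) / len(squared))


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Hypothetical shifts in gating voltage (mV); NOT data from the paper.
predicted_dv = [-40.0, 25.0, 10.0, -15.0]
measured_dv = [-55.0, 30.0, 20.0, -5.0]

print(f"RMSE = {rmse(predicted_dv, measured_dv):.1f} mV")
print(f"R    = {pearson_r(predicted_dv, measured_dv):.2f}")
```

RMSE carries the units of the measurement (mV here), so it reports the typical size of a prediction error, while R only reports how well predictions and measurements co-vary.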
Affiliation(s)
- Erik Nordquist
- Department of Chemistry, University of Massachusetts Amherst, Amherst, Massachusetts, United States of America
- Guohui Zhang
- Department of Biomedical Engineering, Center for the Investigation of Membrane Excitability Disorders, Cardiac Bioelectricity and Arrhythmia Center, Washington University in St. Louis, St. Louis, Missouri, United States of America
- Shrishti Barethiya
- Department of Chemistry, University of Massachusetts Amherst, Amherst, Massachusetts, United States of America
- Nathan Ji
- Department of Biology, Boston College, Chestnut Hill, Massachusetts, United States of America
- Kelli M. White
- Department of Biomedical Engineering, Center for the Investigation of Membrane Excitability Disorders, Cardiac Bioelectricity and Arrhythmia Center, Washington University in St. Louis, St. Louis, Missouri, United States of America
- Lu Han
- Department of Biomedical Engineering, Center for the Investigation of Membrane Excitability Disorders, Cardiac Bioelectricity and Arrhythmia Center, Washington University in St. Louis, St. Louis, Missouri, United States of America
- Zhiguang Jia
- Department of Chemistry, University of Massachusetts Amherst, Amherst, Massachusetts, United States of America
- Jingyi Shi
- Department of Biomedical Engineering, Center for the Investigation of Membrane Excitability Disorders, Cardiac Bioelectricity and Arrhythmia Center, Washington University in St. Louis, St. Louis, Missouri, United States of America
- Jianmin Cui
- Department of Biomedical Engineering, Center for the Investigation of Membrane Excitability Disorders, Cardiac Bioelectricity and Arrhythmia Center, Washington University in St. Louis, St. Louis, Missouri, United States of America
- Jianhan Chen
- Department of Chemistry, University of Massachusetts Amherst, Amherst, Massachusetts, United States of America
3
Nordquist E, Zhang G, Barethiya S, Ji N, White KM, Han L, Jia Z, Shi J, Cui J, Chen J. Incorporating physics to overcome data scarcity in predictive modeling of protein function: a case study of BK channels. bioRxiv 2023:2023.06.24.546384. [PMID: 37425916 PMCID: PMC10327070 DOI: 10.1101/2023.06.24.546384] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/11/2023]
Abstract
Machine learning has played transformative roles in numerous chemical and biophysical problems, such as protein folding, where large amounts of data exist. Nonetheless, many important problems remain challenging for data-driven machine learning approaches because of data scarcity. One approach to overcoming data scarcity is to incorporate physical principles, for example through molecular modeling and simulation. Here, we focus on the big potassium (BK) channels that play important roles in cardiovascular and neural systems. Many mutants of the BK channel are associated with various neurological and cardiovascular diseases, but the molecular effects are unknown. The voltage gating properties of BK channels have been characterized for 473 site-specific mutations experimentally over the last three decades; yet, these functional data by themselves remain far too sparse to derive a predictive model of BK channel voltage gating. Using physics-based modeling, we quantify the energetic effects of all single mutations on both open and closed states of the channel. Together with dynamic properties derived from atomistic simulations, these physical descriptors allow the training of random forest models that could reproduce unseen experimentally measured shifts in gating voltage, ΔV1/2, with an RMSE of ~32 mV and a correlation coefficient of R ~ 0.7. Importantly, the model appears capable of uncovering nontrivial physical principles underlying the gating of the channel, including a central role of hydrophobic gating. The model was further evaluated using four novel mutations of L235 and V236 on the S5 helix, mutations of which are predicted to have opposing effects on V1/2 and suggest a key role of S5 in mediating voltage sensor-pore coupling. The measured ΔV1/2 values agree quantitatively with the predictions for all four mutations, with a high correlation of R = 0.92 and RMSE = 18 mV. Therefore, the model can capture nontrivial voltage gating properties in regions where few mutations are known. The success of predictive modeling of BK voltage gating demonstrates the potential of combining physics and statistical learning for overcoming data scarcity in nontrivial protein function prediction.
Author Summary
Deep machine learning has brought many exciting breakthroughs in chemistry, physics and biology. These models require large amounts of training data and struggle when the data is scarce. The latter is true for predictive modeling of the function of complex proteins such as ion channels, where only hundreds of mutational data may be available. Using the big potassium (BK) channel as a biologically important model system, we demonstrate that a reliable predictive model of its voltage gating property could be derived from only 473 mutational data by incorporating physics-derived features, which include dynamic properties from molecular dynamics simulations and energetic quantities from Rosetta mutation calculations. We show that the final random forest model captures key trends and hotspots in mutational effects of BK voltage gating, such as the important role of pore hydrophobicity. A particularly curious prediction is that mutations of two adjacent residues on the S5 helix would always have opposite effects on the gating voltage, which was confirmed by experimental characterization of four novel mutations. The current work demonstrates the importance and effectiveness of incorporating physics in predictive modeling of protein function with scarce data.
Affiliation(s)
- Erik Nordquist
- Department of Chemistry, University of Massachusetts Amherst, Amherst, Massachusetts, USA
- Guohui Zhang
- Department of Biomedical Engineering, Center for the Investigation of Membrane Excitability Disorders, Cardiac Bioelectricity and Arrhythmia Center, Washington University in St. Louis, St. Louis, Missouri, USA
- Shrishti Barethiya
- Department of Chemistry, University of Massachusetts Amherst, Amherst, Massachusetts, USA
- Nathan Ji
- Department of Biology, Boston College, Chestnut Hill, Massachusetts, USA
- Kelli M White
- Department of Biomedical Engineering, Center for the Investigation of Membrane Excitability Disorders, Cardiac Bioelectricity and Arrhythmia Center, Washington University in St. Louis, St. Louis, Missouri, USA
- Lu Han
- Department of Biomedical Engineering, Center for the Investigation of Membrane Excitability Disorders, Cardiac Bioelectricity and Arrhythmia Center, Washington University in St. Louis, St. Louis, Missouri, USA
- Zhiguang Jia
- Department of Chemistry, University of Massachusetts Amherst, Amherst, Massachusetts, USA
- Jingyi Shi
- Department of Biomedical Engineering, Center for the Investigation of Membrane Excitability Disorders, Cardiac Bioelectricity and Arrhythmia Center, Washington University in St. Louis, St. Louis, Missouri, USA
- Jianmin Cui
- Department of Biomedical Engineering, Center for the Investigation of Membrane Excitability Disorders, Cardiac Bioelectricity and Arrhythmia Center, Washington University in St. Louis, St. Louis, Missouri, USA
- Jianhan Chen
- Department of Chemistry, University of Massachusetts Amherst, Amherst, Massachusetts, USA
4
Winker M, Chauveau A, Smieško M, Potterat O, Areesanan A, Zimmermann-Klemd A, Gründemann C. Immunological evaluation of herbal extracts commonly used for treatment of mental diseases during pregnancy. Sci Rep 2023; 13:9630. [PMID: 37316493 DOI: 10.1038/s41598-023-35952-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/12/2022] [Accepted: 05/26/2023] [Indexed: 06/16/2023]
Abstract
Nonpsychotic mental diseases (NMDs) affect approximately 15% of pregnant women in the US. Herbal preparations are perceived as a safe alternative to placenta-crossing antidepressants or benzodiazepines in the treatment of nonpsychotic mental diseases. But are these drugs really safe for mother and foetus? This question is of great relevance to physicians and patients. Therefore, this study investigates the influence of St. John's wort, valerian, hops, lavender, and California poppy and their compounds hyperforin and hypericin, protopine, valerenic acid, and valtrate, as well as linalool, on immune modulating effects in vitro. For this purpose, a variety of methods was applied to assess the effects on viability and function of human primary lymphocytes. Viability was assessed via spectrometric assessment, flow cytometric detection of cell death markers and comet assay for possible genotoxicity. Functional assessment was conducted via flow cytometric assessment of proliferation, cell cycle and immunophenotyping. For California poppy, lavender, hops, and the compounds protopine, linalool, and valerenic acid, no effect was found on the viability, proliferation, and function of primary human lymphocytes. However, St. John's wort and valerian inhibited the proliferation of primary human lymphocytes. Hyperforin, hypericin, and valtrate inhibited viability, induced apoptosis, and inhibited cell division. Calculated maximum concentrations of compounds in the body fluid, as well as calculated concentrations based on pharmacokinetic data from the literature, were low and supported that the observed effects in vitro would probably have no relevance for patients. In-silico analyses comparing the structure of studied substances with the structure of relevant control substances and known immunosuppressants revealed structural similarities of hyperforin and valerenic acid to the glucocorticoids. Valtrate showed structural similarities to T cell signaling-modulating drugs.
Affiliation(s)
- Moritz Winker
- Translational Complementary Medicine, Department of Pharmaceutical Sciences, University of Basel, Basel, Switzerland
- Antoine Chauveau
- Division of Pharmaceutical Biology, Department of Pharmaceutical Sciences, University of Basel, Basel, Switzerland
- Martin Smieško
- Computational Pharmacy, Department of Pharmaceutical Sciences, University of Basel, Basel, Switzerland
- Olivier Potterat
- Division of Pharmaceutical Biology, Department of Pharmaceutical Sciences, University of Basel, Basel, Switzerland
- Alexander Areesanan
- Translational Complementary Medicine, Department of Pharmaceutical Sciences, University of Basel, Basel, Switzerland
- Amy Zimmermann-Klemd
- Translational Complementary Medicine, Department of Pharmaceutical Sciences, University of Basel, Basel, Switzerland.
- Carsten Gründemann
- Translational Complementary Medicine, Department of Pharmaceutical Sciences, University of Basel, Basel, Switzerland.
5
Lysine Methyltransferase EhPKMT2 Is Involved in the In Vitro Virulence of Entamoeba histolytica. Pathogens 2023; 12:474. [PMID: 36986396 PMCID: PMC10058465 DOI: 10.3390/pathogens12030474] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/03/2023] [Revised: 03/06/2023] [Accepted: 03/11/2023] [Indexed: 03/19/2023]
Abstract
Lysine methylation, a posttranslational modification catalyzed by protein lysine methyltransferases (PKMTs), is involved in epigenetics and several signaling pathways, including cell growth, cell migration and stress response, which in turn may participate in the virulence of protozoan parasites. Entamoeba histolytica, the etiologic agent of human amebiasis, has four PKMTs (EhPKMT1 to EhPKMT4), but their role in parasite biology is unknown. Here, to obtain insight into the role of EhPKMT2, we analyzed its expression level and localization in trophozoites subjected to heat shock and during phagocytosis, two events that are related to amoeba virulence. Moreover, the effect of EhPKMT2 knockdown on those activities and on cell growth, migration and cytopathic effect was investigated. The results indicate that this enzyme participates in all these cellular events, suggesting that it could be a potential target for development of novel therapeutic strategies against amebiasis.
6
Sicilia C, Corral-Lugo A, Smialowski P, McConnell MJ, Martín-Galiano AJ. Unsupervised Machine Learning Organization of the Functional Dark Proteome of Gram-Negative "Superbugs": Six Protein Clusters Amenable for Distinct Scientific Applications. ACS Omega 2022; 7:46131-46145. [PMID: 36570227 PMCID: PMC9774411 DOI: 10.1021/acsomega.2c04076] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/29/2022] [Accepted: 10/06/2022] [Indexed: 06/17/2023]
Abstract
Uncharacterized proteins have been underutilized as targets for the development of novel therapeutics for difficult-to-treat bacterial infections. To facilitate the exploration of these proteins, 2819 predicted, uncharacterized proteins (19.1% of the total) from reference strains of multidrug Acinetobacter baumannii, Klebsiella pneumoniae, and Pseudomonas aeruginosa species were organized using an unsupervised k-means machine learning algorithm. Classification using normalized values for protein length, pI, hydrophobicity, degree of conservation, structural disorder, and %AT of the coding gene rendered six natural clusters. Cluster proteins showed different trends regarding operon membership, expression, presence of unknown function domains, and interactomic relevance. Clusters 2, 4, and 5 were enriched with highly disordered proteins, nonworkable membrane proteins, and likely spurious proteins, respectively. Clusters 1, 3, and 6 showed closer distances to known antigens, antibiotic targets, and virulence factors. Up to 21.8% of proteins in these clusters were structurally covered by modeling, which allowed assessment of druggability and discontinuous B-cell epitopes. Five proteins (4 in Cluster 1) were potential druggable targets for antibiotherapy. Eighteen proteins (11 in Cluster 6) were strong B-cell and T-cell immunogen candidates for vaccine development. Conclusively, we provide a feature-based schema to fractionate the functional dark proteome of critical pathogens for fundamental and biomedical purposes.
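The clustering step described in this abstract (k-means on normalized per-protein features) can be sketched in plain Python as follows. This is a minimal illustration, not the authors' pipeline: z-score scaling is an assumption (the abstract only says "normalized values"), the function names are hypothetical, and the toy rows use two invented features and two clusters, whereas the study used six features and obtained six clusters.

```python
import random


def zscore_columns(rows):
    """Scale each feature column to zero mean and unit variance."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    # A constant column has zero spread; fall back to 1.0 to avoid dividing by zero.
    stds = [(sum((v - m) ** 2 for v in c) / len(c)) ** 0.5 or 1.0
            for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, n_iter=100, seed=0):
    """Lloyd's algorithm: alternate nearest-centroid assignment and centroid update."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(n_iter):
        new_labels = [min(range(k), key=lambda j: squared_distance(p, centroids[j]))
                      for p in points]
        if new_labels == labels:  # assignments stable, so centroids are too
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    return labels, centroids


# Invented toy rows: [protein length, pI]. NOT data from the study.
toy_features = [[120, 4.8], [135, 5.1], [110, 5.0],
                [640, 9.2], [700, 9.6], [655, 9.0]]
labels, _ = kmeans(zscore_columns(toy_features), k=2)
print(labels)
```

Scaling before clustering matters because k-means relies on Euclidean distance: without it, a feature measured in hundreds (length) would dominate one measured in single digits (pI).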
Affiliation(s)
- Carlos Sicilia
- Intrahospital Infections Laboratory, National Centre for Microbiology, Instituto de Salud Carlos III (ISCIII), Majadahonda, 28220 Madrid, Spain
- Andrés Corral-Lugo
- Intrahospital Infections Laboratory, National Centre for Microbiology, Instituto de Salud Carlos III (ISCIII), Majadahonda, 28220 Madrid, Spain
- Pawel Smialowski
- Core Facility Bioinformatics, Biomedical Center Munich, Faculty of Medicine, Ludwig Maximilians Universität München, Munich 80539, Germany
- Institute of Stem Cell Research, Helmholtz Center Munich, Planegg-Martinsried 82152, Germany
- Michael J. McConnell
- Intrahospital Infections Laboratory, National Centre for Microbiology, Instituto de Salud Carlos III (ISCIII), Majadahonda, 28220 Madrid, Spain
- Antonio J. Martín-Galiano
- Intrahospital Infections Laboratory, National Centre for Microbiology, Instituto de Salud Carlos III (ISCIII), Majadahonda, 28220 Madrid, Spain
7
In Silico Evaluation of Nonsynonymous SNPs in Human ADAM33: The Most Common Form of Genetic Association to Asthma Susceptibility. Computational and Mathematical Methods in Medicine 2022; 2022:1089722. [DOI: 10.1155/2022/1089722] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/17/2022] [Revised: 09/09/2022] [Accepted: 10/07/2022] [Indexed: 11/13/2022]
Abstract
ADAM33 is a zinc-dependent metalloprotease of the ADAM family, which plays a vital biological role as an activator of Th2 cytokines and growth factors. Moreover, this protein is crucial for the normal development of the lung in the fetus two months after gestation leading to determining lung functions all over life. In this regard, mutations in ADAM33 have been linked with asthma risk factors. Consequently, identifying ADAM33 pathogenic nonsynonymous single-nucleotide polymorphisms (nsSNPs) can be very important in asthma treatment. In the present study, 1055 nsSNPs of human ADAM33 were analyzed using biocomputational software, 31 of which were found to be detrimental mutations. Precise structural and stability analysis revealed D219V, C669G, and C606S as the most destabilizing SNPs. Furthermore, MD simulations disclosed higher overall fluctuation and alteration in intramolecular interactions compared with the wild-type structure. Overall, the results suggest D219V, C669G, and C606S detrimental mutations as a starting point for further case-control studies on the ADAM33 protein as well as an essential source for future targeted mechanisms.
8
Sengupta K, Saha S, Halder AK, Chatterjee P, Nasipuri M, Basu S, Plewczynski D. PFP-GO: Integrating protein sequence, domain and protein-protein interaction information for protein function prediction using ranked GO terms. Front Genet 2022; 13:969915. [PMID: 36246645 PMCID: PMC9556876 DOI: 10.3389/fgene.2022.969915] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/15/2022] [Accepted: 08/31/2022] [Indexed: 11/13/2022]
Abstract
Protein function prediction is gradually emerging as an essential field in biological and computational studies. Though the latter has clinched a significant footprint, it has been observed that the application of computational information gathered from multiple sources has more significant influence than the one derived from a single source. Considering this fact, a methodology, PFP-GO, is proposed where heterogeneous sources like Protein Sequence, Protein Domain, and Protein-Protein Interaction Network have been processed separately for ranking each individual functional GO term. Based on this ranking, GO terms are propagated to the target proteins. While Protein Sequence enriches the sequence-based information, Protein Domain and Protein-Protein Interaction Networks embed structural/functional and topological information, respectively, during the phase of GO ranking. Performance analysis of PFP-GO is based on Precision, Recall, and F-Score. PFP-GO was found to perform reasonably better than other existing state-of-the-art methods. PFP-GO has achieved an overall Precision, Recall, and F-Score of 0.67, 0.58, and 0.62, respectively. Furthermore, we check some of the top-ranked GO terms predicted by PFP-GO through multilayer network propagation that affect the 3D structure of the genome. The complete source code of PFP-GO is freely available at https://sites.google.com/view/pfp-go/.
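The three reported metrics are related by simple arithmetic, sketched below. The sketch assumes that the F-Score is the balanced F1 (the harmonic mean of precision and recall), an assumption the reported numbers are consistent with. The function names and the GO term labels are placeholders, not identifiers from PFP-GO.

```python
def precision_recall(predicted_terms, true_terms):
    """Precision and recall of a predicted set of GO terms against a reference set."""
    predicted_terms, true_terms = set(predicted_terms), set(true_terms)
    true_positives = len(predicted_terms & true_terms)
    precision = true_positives / len(predicted_terms) if predicted_terms else 0.0
    recall = true_positives / len(true_terms) if true_terms else 0.0
    return precision, recall


def f_score(precision, recall, beta=1.0):
    """F-beta score; beta=1 gives the harmonic mean of precision and recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Placeholder term labels, for illustration only.
p, r = precision_recall({"term_a", "term_b", "term_c"},
                        {"term_b", "term_c", "term_d", "term_e"})
print(p, r)

# Values reported in the abstract: precision 0.67, recall 0.58.
print(round(f_score(0.67, 0.58), 2))  # 0.62
```

With the abstract's values, 2 × 0.67 × 0.58 / (0.67 + 0.58) = 0.7772 / 1.25 ≈ 0.62, matching the reported F-Score.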
Affiliation(s)
- Kaustav Sengupta
- Laboratory of Functional and Structural Genomics, Center of New Technologies, University of Warsaw, Warsaw, Poland
- Department of Computer Science and Engineering, Jadavpur University, Kolkata, India
- Laboratory of Bioinformatics and Computational Genomics, Faculty of Mathematics and Information Science, Warsaw University of Technology, Warsaw, Poland
- Sovan Saha
- Department of Computer Science and Engineering, Institute of Engineering and Management, Kolkata, West Bengal, India
- Anup Kumar Halder
- Laboratory of Functional and Structural Genomics, Center of New Technologies, University of Warsaw, Warsaw, Poland
- Laboratory of Bioinformatics and Computational Genomics, Faculty of Mathematics and Information Science, Warsaw University of Technology, Warsaw, Poland
- Piyali Chatterjee
- Department of Computer Science and Engineering, Netaji Subhash Engineering College, Kolkata, India
- Mita Nasipuri
- Department of Computer Science and Engineering, Jadavpur University, Kolkata, India
- Subhadip Basu
- Department of Computer Science and Engineering, Jadavpur University, Kolkata, India
- *Correspondence: Subhadip Basu, Dariusz Plewczynski
- Dariusz Plewczynski
- Laboratory of Functional and Structural Genomics, Center of New Technologies, University of Warsaw, Warsaw, Poland
- Laboratory of Bioinformatics and Computational Genomics, Faculty of Mathematics and Information Science, Warsaw University of Technology, Warsaw, Poland
- *Correspondence: Subhadip Basu, Dariusz Plewczynski
9
Lee J, Song SB, Chung YK, Jang JH, Huh J. BoostSweet: Learning molecular perceptual representations of sweeteners. Food Chem 2022; 383:132435. [PMID: 35182866 DOI: 10.1016/j.foodchem.2022.132435] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 06/05/2021] [Revised: 09/16/2021] [Accepted: 02/09/2022] [Indexed: 11/28/2022]
Abstract
The development of safe artificial sweeteners has attracted considerable interest in the food industry. Previous machine learning (ML) studies based on quantitative structure-activity relationships have provided some molecular principles for predicting sweetness, but these models can be improved via the chemical recognition of sweetness active factors. Our ML model, a soft-vote ensemble model that has a light gradient boosting machine and uses both layered fingerprints and alvaDesc molecular descriptor features, demonstrates state-of-the-art performance, with an AUROC score of 0.961. Based on an analysis of feature importance and dataset, we identified that the number of nitrogen atoms that serve as hydrogen bond donors in molecules can play an essential role in determining sweetness. These results potentially provide an advanced understanding of the relationship between molecular structure and sweetness, which can be used to design new sweeteners based on molecular structural dependence.
Affiliation(s)
- Junho Lee
- Department of Chemistry, Sungkyunkwan University, Suwon 16419, Republic of Korea; SKKU Advanced Institute of Nanotechnology (SAINT), Sungkyunkwan University, Suwon 16419, Republic of Korea
- Seon Bin Song
- Department of Chemistry, Sungkyunkwan University, Suwon 16419, Republic of Korea
- You Kyoung Chung
- Department of Chemistry, Sungkyunkwan University, Suwon 16419, Republic of Korea
- Jee Hwan Jang
- Ucaretron Inc., Anyang 14057, Gyeonggi-do, Republic of Korea; School of Advanced Materials Science and Engineering, Sungkyunkwan University, Suwon 16419, Republic of Korea.
- Joonsuk Huh
- Department of Chemistry, Sungkyunkwan University, Suwon 16419, Republic of Korea; SKKU Advanced Institute of Nanotechnology (SAINT), Sungkyunkwan University, Suwon 16419, Republic of Korea; Institute of Quantum Biophysics, Sungkyunkwan University, Suwon 16419, Republic of Korea.
10
Bajaj P, Manjunath K, Varadarajan R. Structural and functional determinants inferred from deep mutational scans. Protein Sci 2022; 31:e4357. [PMID: 35762712 PMCID: PMC9202547 DOI: 10.1002/pro.4357] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 01/31/2022] [Revised: 04/04/2022] [Accepted: 05/11/2022] [Indexed: 11/08/2022]
Abstract
Mutations that affect protein binding to a cognate partner primarily occur either at buried residues or at exposed residues directly involved in partner binding. Distinguishing between these two categories based solely on mutational phenotypes is challenging. The bacterial toxin CcdB kills cells by binding to DNA Gyrase. Cell death is prevented by binding to its cognate antitoxin CcdA, at an extended interface that partially overlaps with the GyrA binding site. Using the CcdAB toxin-antitoxin (TA) system as a model, a comprehensive site-saturation mutagenesis library of CcdB was generated in its native operonic context. The mutational sensitivity of each mutant was estimated through deep sequencing, by evaluating the relative abundance of each mutant in two strains, one resistant and the other sensitive to the toxic activity of the CcdB toxin. The ability to bind CcdA was inferred through a RelE reporter gene assay, since the CcdAB complex binds to its own promoter, repressing transcription. By analyzing mutant phenotypes in the CcdB-sensitive, CcdB-resistant, and RelE reporter strains, it was possible to assign residues to buried, CcdA-interacting or GyrA-interacting sites. A few mutants were individually constructed, expressed, and biophysically characterized to validate molecular mechanisms responsible for the observed phenotypes. Residues inferred to be important for antitoxin binding are also likely to be important for rejuvenating CcdB from the CcdB-Gyrase complex. Therefore, even in the absence of structural information, when coupled to appropriate genetic screens, such high-throughput strategies can be deployed for predicting structural and functional determinants of proteins.
Affiliation(s)
- Priyanka Bajaj
- Molecular Biophysics Unit, Indian Institute of Science, Bangalore, India
- Kavyashree Manjunath
- Centre for Chemical Biology and Therapeutics, Institute for Stem Cell Science and Regenerative Medicine, Bangalore, India
11
Xia C, Feng SH, Xia Y, Pan X, Shen HB. Fast protein structure comparison through effective representation learning with contrastive graph neural networks. PLoS Comput Biol 2022; 18:e1009986. [PMID: 35324898 PMCID: PMC8982879 DOI: 10.1371/journal.pcbi.1009986] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 08/30/2021] [Revised: 04/05/2022] [Accepted: 03/03/2022] [Indexed: 12/03/2022]
Abstract
Protein structure alignment algorithms are often time-consuming, resulting in challenges for large-scale protein structure similarity-based retrieval. There is an urgent need for more efficient structure comparison approaches as the number of protein structures increases rapidly. In this paper, we propose an effective graph-based protein structure representation learning method, GraSR, for fast and accurate structure comparison. In GraSR, a graph is constructed based on the intra-residue distance derived from the tertiary structure. Then, deep graph neural networks (GNNs) with a short-cut connection learn graph representations of the tertiary structures under a contrastive learning framework. To further improve GraSR, a novel dynamic training data partition strategy and a length-scaling cosine distance are introduced. We objectively evaluate GraSR on SCOPe v2.07 and a newly released independent test set from the PDB database with a purpose-designed comprehensive performance metric. Compared with other state-of-the-art methods, GraSR achieves about 7%-10% improvement on two benchmark datasets. GraSR is also much faster than alignment-based methods. We dig into the model and observe that the superiority of GraSR mainly comes from the learned discriminative residue-level and global descriptors. The web-server and source code of GraSR are freely available at www.csbio.sjtu.edu.cn/bioinf/GraSR/ for academic use.
The size and shape of protein structures vary considerably. Accurate protein structure comparison usually relies on structure alignment algorithms. However, superimposing two protein structures is relatively time-consuming, which makes it unsuitable for large-scale protein structure retrieval. Alignment-free algorithms have been proposed for efficient protein structure comparison over the last few decades. These algorithms first transform the coordinates of atoms in two proteins to fixed-length vectors. Then, the comparison can be done by measuring the distance or similarity between the two vectors, which is much faster than alignment. In this study, we propose a novel protein structure representation method for efficient structure comparison. Compared with other state-of-the-art alignment-free methods, our method achieves better performance on both ranking and multi-class classification tasks due to the powerful representation ability of deep graph neural networks. We dig into the model and observe that the superiority of our method mainly comes from the learned discriminative residue-level and global descriptors.
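The alignment-free comparison described in this abstract (fixed-length vectors compared by a distance) can be illustrated with a minimal sketch. It uses plain cosine distance as a stand-in; the length-scaling variant introduced for GraSR is not defined in the abstract and is not reproduced here. The descriptor values and the names `cosine_distance`, `rank_by_distance`, `fold_a`, and `fold_b` are hypothetical.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("descriptors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("descriptor has zero norm")
    return 1.0 - dot / (norm_u * norm_v)

def rank_by_distance(query, database):
    """Sort (name, descriptor) pairs by increasing distance to the query."""
    return sorted(database, key=lambda item: cosine_distance(query, item[1]))

# Hypothetical 4-dimensional descriptors; real ones would come from a trained encoder.
query = [1.0, 0.0, 2.0, 1.0]
database = [
    ("fold_b", [-1.0, 2.0, 0.0, 0.5]),
    ("fold_a", [0.9, 0.1, 2.1, 1.0]),
]
ranked = [name for name, _ in rank_by_distance(query, database)]
```

Because each structure is reduced to a vector once, a query against a large database costs one distance evaluation per entry rather than one superposition per entry.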
Affiliation(s)
- Chunqiu Xia
- Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, and Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai, China
| | - Shi-Hao Feng
- Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, and Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai, China
| | - Ying Xia
- Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, and Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai, China
| | - Xiaoyong Pan
- Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, and Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai, China
- * E-mail: (XP); (HS)
| | - Hong-Bin Shen
- Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, and Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai, China
- * E-mail: (XP); (HS)
| |
|
12
|
Mabonga L, Masamba P, Kappo AP. Inhibitory potential of a benzoxazole derivative, 4FI against SNRPG∼RING finger domain protein complex as a lead compound in the discovery of anti-cancer drugs: A molecular dynamics simulation approach. Informatics in Medicine Unlocked 2022. [DOI: 10.1016/j.imu.2022.100993] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/18/2022]
|
13
|
Questing functions and structures of hypothetical proteins from Campylobacter jejuni: a computer-aided approach. Biosci Rep 2021; 40:225019. [PMID: 32458979 PMCID: PMC7284324 DOI: 10.1042/bsr20193939] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 11/20/2019] [Revised: 05/17/2020] [Accepted: 05/26/2020] [Indexed: 12/12/2022]
Abstract
Campylobacter jejuni (C. jejuni) is considered to be one of the most frequent causes of bacterial gastroenteritis globally, especially in young children. The genome of C. jejuni encodes many proteins with unknown functions, termed hypothetical proteins (HPs). These proteins may play essential biological roles, so characterizing them is needed to understand the full spectrum of this bacterium's biology. Hence, our study aimed to determine the functions of HPs encoded in the genome of C. jejuni. An in silico workflow integrating various tools was applied for functional assignment, three-dimensional structure determination, domain architecture prediction, subcellular localization, physicochemical characterization, and protein-protein interactions (PPIs). Sequences of 267 HPs of C. jejuni were analyzed, and functions were successfully assigned to 49 HPs with higher confidence. Here, we found proteins with enzymatic activity, transporters, binding and regulatory proteins, as well as proteins of biotechnological interest. Assessment of the performance of the various tools used in this analysis revealed an accuracy of 95% using receiver operating characteristic (ROC) curve analysis. Functional and structural predictions and the results from ROC analyses supported the validity of the in silico tools used in the present study. The approach used for this analysis allows us to assign functions to unknown proteins and relate them to functions that have already been described in previous literature.
|
14
|
Bhasin M, Varadarajan R. Prediction of Function Determining and Buried Residues Through Analysis of Saturation Mutagenesis Datasets. Front Mol Biosci 2021; 8:635425. [PMID: 33778004 PMCID: PMC7991590 DOI: 10.3389/fmolb.2021.635425] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 11/30/2020] [Accepted: 01/25/2021] [Indexed: 11/13/2022]
Abstract
Mutational scanning can be used to probe the effects of large numbers of point mutations on protein function. Positions affected by mutation are primarily either buried residues or exposed residues directly involved in function, hereafter designated as active-site residues. In the absence of prior structural information, it has not been easy to distinguish between these two categories of residues. We curated and analyzed a set of twelve published deep mutational scanning datasets. The analysis revealed differential patterns of mutational sensitivity and substitution preferences at buried and exposed positions. Prediction of buried sites solely from the mutational sensitivity data was facilitated by incorporating predicted sequence-based accessibility values. For active-site residues we observed mean sensitivity, specificity and accuracy of 61, 90 and 88% respectively. For buried residues the corresponding figures were 59, 90 and 84%, while for exposed non-active-site residues these were 98, 44 and 82% respectively. We also identified positions which did not follow these general trends and might require further experimental re-validation. This analysis highlights the ability of deep mutational scans to provide important structural and functional insights, even in the absence of three-dimensional structures determined using conventional structure determination techniques, and also discusses some limitations of the methodology.
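For readers unfamiliar with the three figures quoted above, the following minimal sketch shows how sensitivity, specificity, and accuracy are derived from a binary confusion matrix. The counts are hypothetical and are not taken from the study; `binary_metrics` is an illustrative name.

```python
def binary_metrics(tp, fp, tn, fn):
    """Compute sensitivity, specificity and accuracy from confusion counts."""
    positives = tp + fn          # residues that truly belong to the class
    negatives = tn + fp          # residues that truly do not
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative example")
    return {
        "sensitivity": tp / positives,               # true-positive rate
        "specificity": tn / negatives,               # true-negative rate
        "accuracy": (tp + tn) / (positives + negatives),
    }

# Hypothetical counts for one residue class in one protein.
metrics = binary_metrics(tp=8, fp=10, tn=80, fn=2)
```

With imbalanced classes, as is typical when active-site residues are a small minority, accuracy is dominated by specificity, which is why all three numbers are reported together.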
Affiliation(s)
- Munmun Bhasin
- Molecular Biophysics Unit, Indian Institute of Science, Bangalore, India
| | - Raghavan Varadarajan
- Molecular Biophysics Unit, Indian Institute of Science, Bangalore, India
- Jawaharlal Nehru Centre for Advanced Scientific Research, Bangalore, India
| |
|
15
|
Zohra Smaili F, Tian S, Roy A, Alazmi M, Arold ST, Mukherjee S, Scott Hefty P, Chen W, Gao X. QAUST: Protein Function Prediction Using Structure Similarity, Protein Interaction, and Functional Motifs. Genomics Proteomics & Bioinformatics 2021; 19:998-1011. [PMID: 33631427 PMCID: PMC9403031 DOI: 10.1016/j.gpb.2021.02.001] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 07/11/2018] [Revised: 04/03/2019] [Accepted: 05/17/2019] [Indexed: 11/25/2022]
Abstract
The number of available protein sequences in public databases is increasing exponentially. However, a significant percentage of these sequences lack functional annotation, which is essential for the understanding of how biological systems operate. Here, we propose a novel method, Quantitative Annotation of Unknown STructure (QAUST), to infer protein functions, specifically Gene Ontology (GO) terms and Enzyme Commission (EC) numbers. QAUST uses three sources of information: structure information encoded by global and local structure similarity search, biological network information inferred by protein–protein interaction data, and sequence information extracted from functionally discriminative sequence motifs. These three pieces of information are combined by consensus averaging to make the final prediction. Our approach has been tested on 500 protein targets from the Critical Assessment of Functional Annotation (CAFA) benchmark set. The results show that our method provides accurate functional annotation and outperforms other prediction methods based on sequence similarity search or threading. We further demonstrate that a previously unknown function of human tripartite motif-containing 22 (TRIM22) protein predicted by QAUST can be experimentally validated.
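The abstract states that three information sources are combined by consensus averaging but gives no formula. The sketch below shows one plausible reading under stated assumptions: each source yields a score in [0, 1] per candidate term, sources are weighted equally by default, and a term missing from a source contributes 0. The function name, the scores, and the use of the two GO identifiers as placeholders are all illustrative, not QAUST's actual implementation.

```python
def consensus_average(score_sources, weights=None):
    """Average per-term scores across sources; a missing term counts as 0."""
    if not score_sources:
        raise ValueError("at least one score source is required")
    if weights is None:
        weights = [1.0] * len(score_sources)   # assumption: equal weighting
    if len(weights) != len(score_sources):
        raise ValueError("one weight per source is required")
    total_weight = sum(weights)
    terms = set().union(*score_sources)        # every term seen in any source
    return {
        term: sum(w * src.get(term, 0.0) for w, src in zip(weights, score_sources))
        / total_weight
        for term in terms
    }

# Hypothetical per-source confidence scores for two candidate terms.
structure_scores = {"GO:0005515": 0.9, "GO:0003677": 0.3}
network_scores = {"GO:0005515": 0.6}
motif_scores = {"GO:0005515": 0.3, "GO:0003677": 0.6}
consensus = consensus_average([structure_scores, network_scores, motif_scores])
```

Treating a missing term as 0 penalizes predictions supported by only one source; a different choice, such as averaging only over sources that report the term, would rank single-source predictions higher.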
Affiliation(s)
- Fatima Zohra Smaili
- Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955, Saudi Arabia
| | - Shuye Tian
- Department of Biology, Southern University of Science and Technology of China (SUSTC), Shenzhen 518055, China
| | - Ambrish Roy
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA
| | - Meshari Alazmi
- Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955, Saudi Arabia; College of Computer Science and Engineering, University of Hail, Hail 55476, Saudi Arabia
| | - Stefan T Arold
- Biological and Environmental Sciences and Engineering (BESE) Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955, Saudi Arabia
| | - Srayanta Mukherjee
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA
| | - P Scott Hefty
- Department of Molecular Bioscience, University of Kansas, Lawrence, KS 66047, USA
| | - Wei Chen
- Department of Biology, Southern University of Science and Technology of China (SUSTC), Shenzhen 518055, China.
| | - Xin Gao
- Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955, Saudi Arabia.
| |
|
16
|
Supo-Escalante RR, Médico A, Gushiken E, Olivos-Ramírez GE, Quispe Y, Torres F, Zamudio M, Antiparra R, Amzel LM, Gilman RH, Sheen P, Zimic M. Prediction of Mycobacterium tuberculosis pyrazinamidase function based on structural stability, physicochemical and geometrical descriptors. PLoS One 2020; 15:e0235643. [PMID: 32735615 PMCID: PMC7394417 DOI: 10.1371/journal.pone.0235643] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/30/2020] [Accepted: 06/19/2020] [Indexed: 12/02/2022]
Abstract
BACKGROUND: Pyrazinamide is an important drug against the latent stage of tuberculosis and is used in both first- and second-line treatment regimens. The pyrazinamide-susceptibility test usually takes a week to yield a diagnosis to guide initial therapy, implying a delay in receiving appropriate therapy. The continued increase in multi-drug resistant tuberculosis and the prevalence of pyrazinamide resistance in several countries make the development of assays for prompt identification of resistance necessary. The main cause of pyrazinamide resistance is the impairment of pyrazinamidase function attributed to mutations in the promoter and/or pncA coding gene. However, not all pncA mutations necessarily affect the pyrazinamidase function.
OBJECTIVE: To develop a methodology to predict pyrazinamidase function from detected mutations in the pncA gene.
METHODS: We measured the catalytic constant (kcat), KM, enzymatic efficiency, and enzymatic activity of 35 recombinant mutated pyrazinamidases and the wild type (Protein Data Bank ID = 3pl1). From all the 3D modeled structures, we extracted several predictors based on three categories: structural stability (estimated by normal mode analysis and molecular dynamics), physicochemical, and geometrical characteristics. We used a stepwise Akaike's information criterion forward multiple log-linear regression to model each kinetic parameter with each category of predictors. We also developed weighted models combining the three categories of predictive models for each kinetic parameter. We tested the robustness of the predictive ability of each model by 6-fold cross-validation against random models.
RESULTS: The stability, physicochemical, and geometrical descriptors explained most of the variability (R²) of the kinetic parameters. Our models are best suited to predict kcat, efficiency, and activity based on the root-mean-square error of prediction of the 6-fold cross-validation.
CONCLUSIONS: This study shows a quick approach to predict the pyrazinamidase function only from the pncA sequence when point mutations are present. This can be an important tool to detect pyrazinamide resistance.
Affiliation(s)
- Rydberg Roman Supo-Escalante
- Laboratorio de Bioinformática, Biología Molecular y Desarrollos Tecnológicos, Facultad de Ciencias y Filosofía, Universidad Peruana Cayetano Heredia, Lima, Peru
| | - Aldhair Médico
- Laboratorio de Bioinformática, Biología Molecular y Desarrollos Tecnológicos, Facultad de Ciencias y Filosofía, Universidad Peruana Cayetano Heredia, Lima, Peru
| | - Eduardo Gushiken
- Laboratorio de Bioinformática, Biología Molecular y Desarrollos Tecnológicos, Facultad de Ciencias y Filosofía, Universidad Peruana Cayetano Heredia, Lima, Peru
| | - Gustavo E. Olivos-Ramírez
- Laboratorio de Bioinformática, Biología Molecular y Desarrollos Tecnológicos, Facultad de Ciencias y Filosofía, Universidad Peruana Cayetano Heredia, Lima, Peru
| | - Yaneth Quispe
- Laboratorio de Bioinformática, Biología Molecular y Desarrollos Tecnológicos, Facultad de Ciencias y Filosofía, Universidad Peruana Cayetano Heredia, Lima, Peru
| | - Fiorella Torres
- Laboratorio de Bioinformática, Biología Molecular y Desarrollos Tecnológicos, Facultad de Ciencias y Filosofía, Universidad Peruana Cayetano Heredia, Lima, Peru
| | - Melissa Zamudio
- Laboratorio de Bioinformática, Biología Molecular y Desarrollos Tecnológicos, Facultad de Ciencias y Filosofía, Universidad Peruana Cayetano Heredia, Lima, Peru
| | - Ricardo Antiparra
- Laboratorio de Bioinformática, Biología Molecular y Desarrollos Tecnológicos, Facultad de Ciencias y Filosofía, Universidad Peruana Cayetano Heredia, Lima, Peru
| | - L. Mario Amzel
- Department of Biophysics and Biophysical Chemistry, Johns Hopkins University, Baltimore, MD, United States of America
| | - Robert H. Gilman
- International Health Department, Johns Hopkins School of Public Health, Baltimore, MD, United States of America
| | - Patricia Sheen
- Laboratorio de Bioinformática, Biología Molecular y Desarrollos Tecnológicos, Facultad de Ciencias y Filosofía, Universidad Peruana Cayetano Heredia, Lima, Peru
| | - Mirko Zimic
- Laboratorio de Bioinformática, Biología Molecular y Desarrollos Tecnológicos, Facultad de Ciencias y Filosofía, Universidad Peruana Cayetano Heredia, Lima, Peru
| |
|
17
|
In Silico Elucidation of Deleterious Non-synonymous SNPs in SHANK3, the Autism Spectrum Disorder Gene. J Mol Neurosci 2020; 70:1649-1667. [DOI: 10.1007/s12031-020-01552-5] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 12/01/2019] [Accepted: 04/13/2020] [Indexed: 12/11/2022]
|
18
|
Pu L, Govindaraj RG, Lemoine JM, Wu HC, Brylinski M. DeepDrug3D: Classification of ligand-binding pockets in proteins with a convolutional neural network. PLoS Comput Biol 2019; 15:e1006718. [PMID: 30716081 PMCID: PMC6375647 DOI: 10.1371/journal.pcbi.1006718] [Citation(s) in RCA: 64] [Impact Index Per Article: 12.8] [Received: 08/13/2018] [Revised: 02/14/2019] [Accepted: 12/16/2018] [Indexed: 01/19/2023]
Abstract
Comprehensive characterization of ligand-binding sites is invaluable to infer molecular functions of hypothetical proteins, trace evolutionary relationships between proteins, engineer enzymes to achieve a desired substrate specificity, and develop drugs with improved selectivity profiles. These research efforts pose significant challenges because similar pockets are commonly observed across different folds, leading to a high degree of promiscuity of ligand-protein interactions at the system level. On that account, novel algorithms to accurately classify binding sites are needed. Deep learning is attracting significant attention due to its successful applications in a wide range of disciplines. In this communication, we present DeepDrug3D, a new approach to characterize and classify binding pockets in proteins with deep learning. It employs a state-of-the-art convolutional neural network in which biomolecular structures are represented as voxels assigned interaction energy-based attributes. The current implementation of DeepDrug3D, trained to detect and classify nucleotide- and heme-binding sites, not only achieves a high accuracy of 95%, but also has the ability to generalize to unseen data as demonstrated for steroid-binding proteins and peptidase enzymes. Interestingly, the analysis of strongly discriminative regions of binding pockets reveals that this high classification accuracy arises from learning the patterns of specific molecular interactions, such as hydrogen bonds, aromatic and hydrophobic contacts. DeepDrug3D is available as an open-source program at https://github.com/pulimeng/DeepDrug3D with the accompanying TOUGH-C1 benchmarking dataset accessible from https://osf.io/enz69/.
Affiliation(s)
- Limeng Pu
- Division of Electrical & Computer Engineering, Louisiana State University, Baton Rouge, LA, United States of America
| | - Rajiv Gandhi Govindaraj
- Department of Biological Sciences, Louisiana State University, Baton Rouge, LA, United States of America
| | - Jeffrey Mitchell Lemoine
- Department of Biological Sciences, Louisiana State University, Baton Rouge, LA, United States of America
- Division of Computer Science and Engineering, Louisiana State University, Baton Rouge, LA, United States of America
| | - Hsiao-Chun Wu
- Division of Electrical & Computer Engineering, Louisiana State University, Baton Rouge, LA, United States of America
| | - Michal Brylinski
- Department of Biological Sciences, Louisiana State University, Baton Rouge, LA, United States of America
- Center for Computation & Technology, Louisiana State University, Baton Rouge, LA, United States of America
| |
|
19
|
Han M, Song Y, Qian J, Ming D. Sequence-based prediction of physicochemical interactions at protein functional sites using a function-and-interaction-annotated domain profile database. BMC Bioinformatics 2018; 19:204. [PMID: 29859055 PMCID: PMC5984826 DOI: 10.1186/s12859-018-2206-2] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 07/19/2017] [Accepted: 05/15/2018] [Indexed: 01/16/2023]
Abstract
Background: Identifying protein functional sites (PFSs) and, particularly, the physicochemical interactions at these sites is critical to understanding protein functions and the biochemical reactions involved. Several knowledge-based methods have been developed for the prediction of PFSs; however, accurate methods for predicting the physicochemical interactions associated with PFSs are still lacking.
Results: In this paper, we present a sequence-based method for the prediction of physicochemical interactions at PFSs. The method is based on a functional site and physicochemical interaction-annotated domain profile database, called fiDPD, which was built using protein domains found in the Protein Data Bank. This method was applied to 13 target proteins from the very recent Critical Assessment of Structure Prediction (CASP10/11), and our calculations gave a Matthews correlation coefficient (MCC) value of 0.66 for PFS prediction and an 80% recall in the prediction of the associated physicochemical interactions.
Conclusions: Our results show that, in addition to the PFSs, the physical interactions at these sites are also conserved in the evolution of proteins. This work provides a valuable sequence-based tool for rational drug design and side-effect assessment. The method is freely available and can be accessed at http://202.119.249.49.
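The Matthews correlation coefficient quoted in this abstract summarizes a binary prediction in a single value between -1 and 1. A minimal sketch of the standard formula follows; the counts in the example are hypothetical and unrelated to the study's data, and returning 0 for an undefined denominator is a common convention rather than something the abstract specifies.

```python
import math

def matthews_corrcoef(tp, fp, tn, fn):
    """Matthews correlation coefficient from binary confusion counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denominator

# Hypothetical residue-level counts: 8 true functional sites, 12 non-sites.
mcc = matthews_corrcoef(tp=6, fp=2, tn=10, fn=2)
```

Unlike accuracy, the coefficient stays near 0 for a predictor that labels every residue as non-functional, which matters when functional sites are rare.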
Affiliation(s)
- Min Han
- Department of Physiology and Biophysics, School of Life Science, Fudan University, Shanghai, 200438, People's Republic of China
| | - Yifan Song
- Department of Physiology and Biophysics, School of Life Science, Fudan University, Shanghai, 200438, People's Republic of China
| | - Jiaqiang Qian
- Department of Physiology and Biophysics, School of Life Science, Fudan University, Shanghai, 200438, People's Republic of China
| | - Dengming Ming
- College of Biotechnology and Pharmaceutical Engineering, Nanjing Tech University, Biotech Building Room B1-404, 30 South Puzhu Road, Jiangsu, 211816, Nanjing, People's Republic of China.
| |
|
20
|
Mills CL, Garg R, Lee JS, Tian L, Suciu A, Cooperman GD, Beuning PJ, Ondrechen MJ. Functional classification of protein structures by local structure matching in graph representation. Protein Sci 2018; 27:1125-1135. [PMID: 29604149 PMCID: PMC5980557 DOI: 10.1002/pro.3416] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 12/21/2017] [Revised: 03/21/2018] [Accepted: 03/26/2018] [Indexed: 11/08/2022]
Abstract
As a result of high‐throughput protein structure initiatives, over 14,400 protein structures have been solved by Structural Genomics (SG) centers and participating research groups. While the totality of SG data represents a tremendous contribution to genomics and structural biology, reliable functional information for these proteins is generally lacking. Better functional predictions for SG proteins will add substantial value to the structural information already obtained. Our method described herein, Graph Representation of Active Sites for Prediction of Function (GRASP‐Func), predicts quickly and accurately the biochemical function of proteins by representing residues at the predicted local active site as graphs rather than in Cartesian coordinates. We compare the GRASP‐Func method to our previously reported method, Structurally Aligned Local Sites of Activity (SALSA), using the Ribulose Phosphate Binding Barrel (RPBB), 6‐Hairpin Glycosidase (6‐HG), and Concanavalin A‐like Lectins/Glucanase (CAL/G) superfamilies as test cases. In each of the superfamilies, SALSA and the much faster method GRASP‐Func yield similar correct classification of previously characterized proteins, providing a validated benchmark for the new method. In addition, we analyzed SG proteins using our SALSA and GRASP‐Func methods to predict function. Forty‐one SG proteins in the RPBB superfamily, nine SG proteins in the 6‐HG superfamily, and one SG protein in the CAL/G superfamily were successfully classified into one of the functional families in their respective superfamily by both methods. This improved, faster, validated computational method can yield more reliable predictions of function that can be used for a wide variety of applications by the community.
Affiliation(s)
- Caitlyn L Mills
- Department of Chemistry and Chemical Biology, Northeastern University, Boston, Massachusetts
| | - Rohan Garg
- College of Computer and Information Science, Northeastern University, Boston, Massachusetts
| | - Joslynn S Lee
- Department of Chemistry and Chemical Biology, Northeastern University, Boston, Massachusetts
| | - Liang Tian
- Department of Mathematics, Northeastern University, Boston, Massachusetts
| | - Alexandru Suciu
- Department of Mathematics, Northeastern University, Boston, Massachusetts
| | - Gene D Cooperman
- College of Computer and Information Science, Northeastern University, Boston, Massachusetts
| | - Penny J Beuning
- Department of Chemistry and Chemical Biology, Northeastern University, Boston, Massachusetts
| | - Mary Jo Ondrechen
- Department of Chemistry and Chemical Biology, Northeastern University, Boston, Massachusetts
| |
|
21
|
Abstract
The increasing number of protein structures with uncharacterized function necessitates the development of in silico prediction methods for the functional annotation of proteins. In this chapter, different kinds of computational approaches for predicting DNA-binding residues on the surface of DNA-binding proteins are briefly introduced, and the merits and limitations of these methods are discussed. This chapter focuses on structure-based approaches and mainly discusses the framework of machine learning methods as applied to the DNA-binding prediction task.
|
22
|
Jiao D, Han W, Ye Y. Functional association prediction by community profiling. Methods 2017; 129:8-17. [PMID: 28454776 PMCID: PMC5643221 DOI: 10.1016/j.ymeth.2017.04.018] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Received: 01/18/2017] [Revised: 03/31/2017] [Accepted: 04/20/2017] [Indexed: 11/27/2022]
Abstract
Recent years have witnessed unprecedented accumulation of DNA sequences and therefore protein sequences (predicted from DNA sequences), due to the advances of sequencing technology. One of the major sources of the hypothetical proteins is the metagenomics research. Current annotation of metagenomes (collections of short metagenomic sequences or assemblies) relies on similarity searches against known gene/protein families, based on which functional profiles of microbial communities can be built. This practice, however, leaves out the hypothetical proteins, which may outnumber the known proteins for many microbial communities. On the other hand, we may ask: what can we gain from the large number of metagenomes made available by the metagenomic studies, for the annotation of metagenomic sequences as well as functional annotation of hypothetical proteins in general? Here we propose a community profiling approach for predicting functional associations between proteins: two proteins are predicted to be associated if they share similar presence and absence profiles (called community profiles) across microbial communities. Community profiling is conceptually similar to the phylogenetic profiling approach to functional prediction, however with fundamental differences. We tested different profile construction methods, the selection of reference metagenomes, and correlation metrics, among others, to optimize the performance of this new approach. We demonstrated that the community profiling approach alone slightly outperforms the phylogenetic profiling approach for associating proteins in species that are well represented by sequenced genomes, and combining phylogenetic and community profiling further improves (though only marginally) the prediction of functional association. Further we showed that community profiling method significantly outperforms phylogenetic profiling, revealing more functional associations, when applied to a more recently sequenced bacterial genome.
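The core idea of community profiling, two proteins being associated when their presence/absence patterns across communities agree, can be sketched as follows. Pearson correlation is used here only as one example of the "correlation metrics" the abstract mentions; the profiles, protein names, and any threshold a user might apply are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric profiles."""
    n = len(x)
    if n == 0 or n != len(y):
        raise ValueError("profiles must be non-empty and of equal length")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant profile carries no association signal
    return cov / math.sqrt(var_x * var_y)

# Hypothetical profiles: 1 = protein family detected in that metagenome, 0 = absent.
profiles = {
    "protein_a": [1, 1, 0, 0, 1, 0],
    "protein_b": [1, 1, 0, 0, 1, 0],
    "protein_c": [0, 0, 1, 1, 0, 1],
}
same = pearson(profiles["protein_a"], profiles["protein_b"])
opposite = pearson(profiles["protein_a"], profiles["protein_c"])
```

Pairs with a correlation close to 1 would be candidate functional associations; proteins present in every community (or none) yield constant profiles and are uninformative, which is why the sketch returns 0 for them.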
Affiliation(s)
- Dazhi Jiao
- Indiana University, 150 S. Woodlawn Ave, Bloomington, IN 47405, United States
| | - Wontack Han
- Indiana University, 150 S. Woodlawn Ave, Bloomington, IN 47405, United States
| | - Yuzhen Ye
- Indiana University, 150 S. Woodlawn Ave, Bloomington, IN 47405, United States.
| |
|
23
|
Zinati Z, Alemzadeh A, KayvanJoo AH. Computational approaches for classification and prediction of P-type ATPase substrate specificity in Arabidopsis. Physiology and Molecular Biology of Plants: An International Journal of Functional Plant Biology 2016; 22:163-174. [PMID: 27186030 PMCID: PMC4840148 DOI: 10.1007/s12298-016-0351-5] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.0] [Received: 12/02/2015] [Revised: 03/15/2016] [Accepted: 03/28/2016] [Indexed: 06/05/2023]
Abstract
As an extended gamut of integral membrane (extrinsic) proteins, and based on their transporting specificities, P-type ATPases include five subfamilies in Arabidopsis, inter alia, P4ATPases (phospholipid-transporting ATPase), P3AATPases (plasma membrane H(+) pumps), P2A and P2BATPases (Ca(2+) pumps) and P1B ATPases (heavy metal pumps). Although many different computational methods have been developed to predict the substrate specificity of unknown proteins, further investigation is needed to improve the efficiency and performance of the predictors. In this study, various attribute weighting and supervised clustering algorithms were employed to identify the main amino acid composition attributes that can influence the substrate specificity of ATPase pumps, to classify protein pumps, and to predict the substrate specificity of uncharacterized ATPase pumps. The results of this study indicate that both non-reduced coefficients pertaining to absorption and Cys extinction within 280 nm, the frequencies of hydrogen, Ala, Val, carbon, hydrophilic residues, the counts of Val, Asn, Ser, Arg, Phe, Tyr, hydrophilic residues, Phe-Phe, Ala-Ile, Phe-Leu, Val-Ala and length are specified as the most important amino acid attributes by applying all of the attribute weighting models. Here, a learning algorithm implemented in a predictive model (Naive Bayes) is proposed to predict the substrate specificities of Q9LVV1 and O22180 (P-type ATPase-like proteins) with 100% prediction confidence. For the first time, our analysis demonstrated a promising application of bioinformatics algorithms in classifying ATPase pumps. Moreover, we suggest predictive systems that can assist in predicting the substrate specificity of any new ATPase pump with the maximum possible prediction confidence.
Affiliation(s)
- Zahra Zinati
- Department of Agroecology, College of Agriculture and Natural Resources of Darab, Shiraz University, Shiraz, Iran
| | - Abbas Alemzadeh
- Department of Crop Production and Plant Breeding, College of Agriculture, Shiraz University, Shiraz, Iran
| | - Amir Hossein KayvanJoo
- Bonn-Aachen International Center for Information Technology B-IT, University of Bonn, Bonn, Germany
| |
|
24
|
Parasuram R, Mills CL, Wang Z, Somasundaram S, Beuning PJ, Ondrechen MJ. Local structure based method for prediction of the biochemical function of proteins: Applications to glycoside hydrolases. Methods 2016; 93:51-63. [DOI: 10.1016/j.ymeth.2015.11.010] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Received: 06/01/2015] [Revised: 11/05/2015] [Accepted: 11/09/2015] [Indexed: 01/07/2023]
|
25
|
Carvalho HF, Roque ACA, Iranzo O, Branco RJF. Comparison of the Internal Dynamics of Metalloproteases Provides New Insights on Their Function and Evolution. PLoS One 2015; 10:e0138118. [PMID: 26397984 PMCID: PMC4580569 DOI: 10.1371/journal.pone.0138118] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 04/03/2015] [Accepted: 08/25/2015] [Indexed: 11/20/2022]
Abstract
Metalloproteases have evolved in a vast number of biological systems, being one of the most diverse types of proteases and presenting a wide range of folds and catalytic metal ions. Given the increasing understanding of protein internal dynamics and its role in enzyme function, we are interested in assessing how the structural heterogeneity of metalloproteases translates into their dynamics. Therefore, the dynamical profile of the clan MA type protein thermolysin, derived from an Elastic Network Model of protein structure, was evaluated against those obtained from a set of experimental structures and molecular dynamics simulation trajectories. A close correspondence was obtained between modes derived from the coarse-grained model and the subspace of functionally relevant motions observed experimentally, the latter being shown to be encoded in the internal dynamics of the protein. This prompted the use of dynamics-based comparison methods that employ such coarse-grained models in a representative set of clan members, allowing for its quantitative description in terms of structural and dynamical variability. Although members show structural similarity, they nonetheless present distinct dynamical profiles, with no apparent correlation between structural and dynamical relatedness. However, previously unnoticed dynamical similarity was found between the relevant members Carboxypeptidase Pfu, Leishmanolysin, and Botulinum Neurotoxin Type A, despite sharing no structural similarity. Inspection of the respective alignments shows that dynamical similarity has a functional basis, namely the need for maintaining proper intermolecular interactions with the respective substrates. These results suggest that distinct selective pressure mechanisms act on metalloproteases at structural and dynamical levels through the course of their evolution. This work shows how new insights on metalloprotease function and evolution can be assessed with comparison schemes that incorporate information on protein dynamics. The integration of these newly developed tools, if applied to other protein families, can lead to more accurate and descriptive protein classification systems.
Collapse
Affiliation(s)
- Henrique F. Carvalho
- UCIBIO-REQUIMTE, Department of Chemistry, Faculty of Science and Technology, Universidade NOVA de Lisboa, 2829-516 Caparica, Portugal
- Instituto de Tecnologia Química e Biológica António Xavier, Universidade Nova de Lisboa, Av. da República, 2780–157 Oeiras, Portugal
- Ana C. A. Roque
- UCIBIO-REQUIMTE, Department of Chemistry, Faculty of Science and Technology, Universidade NOVA de Lisboa, 2829-516 Caparica, Portugal
- Olga Iranzo
- Aix Marseille Université, Centrale Marseille, CNRS, iSm2 UMR 7313, 13397, Marseille, France
- Ricardo J. F. Branco
- UCIBIO-REQUIMTE, Department of Chemistry, Faculty of Science and Technology, Universidade NOVA de Lisboa, 2829-516 Caparica, Portugal
26
Khan IK, Wei Q, Chapman S, KC DB, Kihara D. The PFP and ESG protein function prediction methods in 2014: effect of database updates and ensemble approaches. Gigascience 2015; 4:43. [PMID: 26380077 PMCID: PMC4570625 DOI: 10.1186/s13742-015-0083-4] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.6] [Received: 12/31/2014] [Accepted: 08/27/2015] [Indexed: 12/29/2022]
Abstract
BACKGROUND Functional annotation of novel proteins is one of the central problems in bioinformatics. With the ever-increasing development of genome sequencing technologies, more and more sequence information is becoming available to analyze and annotate. To achieve fast and automatic function annotation, many computational (automated) function prediction (AFP) methods have been developed. To objectively evaluate the performance of such methods on a large scale, community-wide assessment experiments have been conducted. The second round of the Critical Assessment of Function Annotation (CAFA) experiment was held in 2013-2014. Evaluation of participating groups was reported in a special interest group meeting at the Intelligent Systems in Molecular Biology (ISMB) conference in Boston in 2014. Our group participated in both CAFA1 and CAFA2 using multiple, in-house AFP methods. Here, we report benchmark results of our methods obtained in the course of preparation for CAFA2 prior to submitting function predictions for CAFA2 targets. RESULTS For CAFA2, we updated the annotation databases used by our methods, protein function prediction (PFP) and extended similarity group (ESG), and benchmarked their function prediction performances using the original (older) and updated databases. Performance evaluation for PFP with different settings and ESG are discussed. We also developed two ensemble methods that combine function predictions from six independent, sequence-based AFP methods. We further analyzed the performances of our prediction methods by enriching the predictions with prior distribution of gene ontology (GO) terms. Examples of predictions by the ensemble methods are discussed. CONCLUSIONS Updating the annotation database was successful, improving the Fmax prediction accuracy score for both PFP and ESG. Adding the prior distribution of GO terms did not make much improvement. 
Both of the ensemble methods we developed improved the average Fmax score over all individual component methods except for ESG. Our benchmark results will not only complement the overall assessment that will be done by the CAFA organizers, but also help elucidate the predictive powers of sequence-based function prediction methods in general.
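For readers unfamiliar with the Fmax score reported in this abstract, the following is a minimal sketch of the protein-centric Fmax as it is commonly defined in the CAFA evaluations. It is not the authors' evaluation code. The function name `fmax`, the dictionary layout, the threshold grid, and the toy GO identifiers are illustrative assumptions; propagation of terms up the GO hierarchy is omitted, and every benchmark protein is assumed to have at least one true term.

```python
from typing import Dict, Iterable, Optional, Set


def fmax(
    predictions: Dict[str, Dict[str, float]],
    truth: Dict[str, Set[str]],
    thresholds: Optional[Iterable[float]] = None,
) -> float:
    """Return the maximum F-measure over score thresholds.

    predictions: protein -> {GO term -> confidence score in [0, 1]}
    truth:       protein -> set of true GO terms (assumed non-empty)
    """
    if thresholds is None:
        # Hypothetical grid: 0.01, 0.02, ..., 1.00
        thresholds = [i / 100 for i in range(1, 101)]

    best = 0.0
    for t in thresholds:
        precisions = []   # only proteins with at least one prediction at t
        recall_sum = 0.0  # summed over all benchmark proteins
        for protein, true_terms in truth.items():
            scored = predictions.get(protein, {})
            predicted = {term for term, s in scored.items() if s >= t}
            hits = len(predicted & true_terms)
            if predicted:
                precisions.append(hits / len(predicted))
            recall_sum += hits / len(true_terms)
        if not precisions:
            continue  # no protein has a prediction at this threshold
        pr = sum(precisions) / len(precisions)
        rc = recall_sum / len(truth)
        if pr + rc > 0:
            best = max(best, 2 * pr * rc / (pr + rc))
    return best


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    toy_predictions = {
        "P1": {"GO:A": 0.9, "GO:B": 0.4},
        "P2": {"GO:C": 0.8},
    }
    toy_truth = {"P1": {"GO:A"}, "P2": {"GO:C", "GO:D"}}
    print(round(fmax(toy_predictions, toy_truth), 3))
```

Precision is averaged only over proteins that receive at least one prediction at a given threshold, while recall is averaged over all benchmark proteins, which is why a method that predicts for few proteins can show high precision but low Fmax.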
Affiliation(s)
- Ishita K. Khan
- Department of Computer Sciences, Purdue University, West Lafayette, IN 47907 USA
- Qing Wei
- Department of Computer Sciences, Purdue University, West Lafayette, IN 47907 USA
- Samuel Chapman
- Department of Computational Science and Engineering, North Carolina A & T State University, Greensboro, NC 27411 USA
- Dukka B. KC
- Department of Computational Science and Engineering, North Carolina A & T State University, Greensboro, NC 27411 USA
- Daisuke Kihara
- Department of Computer Sciences, Purdue University, West Lafayette, IN 47907 USA
- Department of Biological Sciences, Purdue University, West Lafayette, IN 47907 USA
27
Mudgal R, Sandhya S, Chandra N, Srinivasan N. De-DUFing the DUFs: Deciphering distant evolutionary relationships of Domains of Unknown Function using sensitive homology detection methods. Biol Direct 2015; 10:38. [PMID: 26228684 PMCID: PMC4520260 DOI: 10.1186/s13062-015-0069-2] [Citation(s) in RCA: 30] [Impact Index Per Article: 3.3] [Received: 03/10/2015] [Accepted: 07/20/2015] [Indexed: 12/23/2022]
Abstract
Background In the post-genomic era where sequences are being determined at a rapid rate, we are highly reliant on computational methods for their tentative biochemical characterization. The Pfam database currently contains 3,786 families corresponding to “Domains of Unknown Function” (DUF) or “Uncharacterized Protein Family” (UPF), of which 3,087 families have no reported three-dimensional structure, constituting almost one-fourth of the known protein families in search for both structure and function. Results We applied a ‘computational structural genomics’ approach using five state-of-the-art remote similarity detection methods to detect the relationship between uncharacterized DUFs and domain families of known structures. The association with a structural domain family could serve as a start point in elucidating the function of a DUF. Amongst these five methods, searches in SCOP-NrichD database have been applied for the first time. Predictions were classified into high, medium and low- confidence based on the consensus of results from various approaches and also annotated with enzyme and Gene ontology terms. 614 uncharacterized DUFs could be associated with a known structural domain, of which high confidence predictions, involving at least four methods, were made for 54 families. These structure-function relationships for the 614 DUF families can be accessed on-line at http://proline.biochem.iisc.ernet.in/RHD_DUFS/. For potential enzymes in this set, we assessed their compatibility with the associated fold and performed detailed structural and functional annotation by examining alignments and extent of conservation of functional residues. Detailed discussion is provided for interesting assignments for DUF3050, DUF1636, DUF1572, DUF2092 and DUF659. Conclusions This study provides insights into the structure and potential function for nearly 20 % of the DUFs. 
Use of different computational approaches enables us to reliably recognize distant relationships, especially when they converge to a common assignment because the methods are often complementary. We observe that while pointers to the structural domain can offer the right clues to the function of a protein, recognition of its precise functional role is still ‘non-trivial’ with many DUF domains conserving only some of the critical residues. It is not clear whether these are functional vestiges or instances involving alternate substrates and interacting partners. Reviewers This article was reviewed by Drs Eugene Koonin, Frank Eisenhaber and Srikrishna Subramanian. Electronic supplementary material The online version of this article (doi:10.1186/s13062-015-0069-2) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Richa Mudgal
- IISc Mathematics Initiative, Indian Institute of Science, Bangalore, 560 012, India.
- Sankaran Sandhya
- Molecular Biophysics Unit, Indian Institute of Science, Bangalore, 560 012, India.
- Nagasuma Chandra
- Department of Biochemistry, Indian Institute of Science, Bangalore, 560 012, India.
28
Mills CL, Beuning PJ, Ondrechen MJ. Biochemical functional predictions for protein structures of unknown or uncertain function. Comput Struct Biotechnol J 2015; 13:182-91. [PMID: 25848497 PMCID: PMC4372640 DOI: 10.1016/j.csbj.2015.02.003] [Citation(s) in RCA: 62] [Impact Index Per Article: 6.9] [Received: 12/03/2014] [Revised: 02/06/2015] [Accepted: 02/11/2015] [Indexed: 01/07/2023]
Abstract
With the exponential growth in the determination of protein sequences and structures via genome sequencing and structural genomics efforts, there is a growing need for reliable computational methods to determine the biochemical function of these proteins. This paper reviews the efforts to address the challenge of annotating the function at the molecular level of uncharacterized proteins. While sequence- and three-dimensional-structure-based methods for protein function prediction have been reviewed previously, the recent trends in local structure-based methods have received less attention. These local structure-based methods are the primary focus of this review. Computational methods have been developed to predict the residues important for catalysis and the local spatial arrangements of these residues can be used to identify protein function. In addition, the combination of different types of methods can help obtain more information and better predictions of function for proteins of unknown function. Global initiatives, including the Enzyme Function Initiative (EFI), COMputational BRidges to EXperiments (COMBREX), and the Critical Assessment of Function Annotation (CAFA), are evaluating and testing the different approaches to predicting the function of proteins of unknown function. These initiatives and global collaborations will increase the capability and reliability of methods to predict biochemical function computationally and will add substantial value to the current volume of structural genomics data by reducing the number of absent or inaccurate functional annotations.
Affiliation(s)
- Caitlyn L Mills
- Department of Chemistry and Chemical Biology, Northeastern University, Boston, MA 02115, United States
- Penny J Beuning
- Department of Chemistry and Chemical Biology, Northeastern University, Boston, MA 02115, United States
- Mary Jo Ondrechen
- Department of Chemistry and Chemical Biology, Northeastern University, Boston, MA 02115, United States
29
Bianchi V, Mangone I, Ferrè F, Helmer-Citterich M, Ausiello G. webPDBinder: a server for the identification of ligand binding sites on protein structures. Nucleic Acids Res 2013; 41:W308-13. [PMID: 23737450 PMCID: PMC3692056 DOI: 10.1093/nar/gkt457] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 11/28/2022]
Abstract
The webPDBinder (http://pdbinder.bio.uniroma2.it/PDBinder) is a web server for the identification of small ligand-binding sites in a protein structure. webPDBinder searches a protein structure against a library of known binding sites and a collection of control non-binding pockets. The number of similarities identified with the residues in the two sets is then used to derive a propensity value for each residue of the query protein associated to the likelihood that the residue is part of a ligand binding site. The predicted binding residues can be further refined using conservation scores derived from the multiple alignment of the PFAM protein family. webPDBinder correctly identifies residues belonging to the binding site in 77% of the cases and is able to identify binding pockets starting from holo or apo structures with comparable performances. This is important for all the real world cases where the query protein has been crystallized without a ligand and is also difficult to obtain clear similarities with bound pockets from holo pocket libraries. The input is either a PDB code or a user-submitted structure. The output is a list of predicted binding pocket residues with propensity and conservation values both in text and graphical format.
Affiliation(s)
- Valerio Bianchi
- Centre for Molecular Bioinformatics, Department of Biology, University of Rome Tor Vergata, Via della Ricerca Scientifica snc, 00133 Rome, Italy
30
Chitale M, Khan IK, Kihara D. In-depth performance evaluation of PFP and ESG sequence-based function prediction methods in CAFA 2011 experiment. BMC Bioinformatics 2013; 14 Suppl 3:S2. [PMID: 23514353 PMCID: PMC3584938 DOI: 10.1186/1471-2105-14-s3-s2] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Indexed: 01/23/2023]
Abstract
BACKGROUND Many Automatic Function Prediction (AFP) methods were developed to cope with an increasing growth of the number of gene sequences that are available from high throughput sequencing experiments. To support the development of AFP methods, it is essential to have community wide experiments for evaluating performance of existing AFP methods. Critical Assessment of Function Annotation (CAFA) is one such community experiment. The meeting of CAFA was held as a Special Interest Group (SIG) meeting at the Intelligent Systems in Molecular Biology (ISMB) conference in 2011. Here, we perform a detailed analysis of two sequence-based function prediction methods, PFP and ESG, which were developed in our lab, using the predictions submitted to CAFA. RESULTS We evaluate PFP and ESG using four different measures in comparison with BLAST, Prior, and GOtcha. In addition to the predictions submitted to CAFA, we further investigate performance of a different scoring function to rank order predictions by PFP as well as PFP/ESG predictions enriched with Priors that simply adds frequently occurring Gene Ontology terms as a part of predictions. Prediction accuracies of each method were also evaluated separately for different functional categories. Successful and unsuccessful predictions by PFP and ESG are also discussed in comparison with BLAST. CONCLUSION The in-depth analysis discussed here will complement the overall assessment by the CAFA organizers. Since PFP and ESG are based on sequence database search results, our analyses are not only useful for PFP and ESG users but will also shed light on the relationship of the sequence similarity space and functions that can be inferred from the sequences.
Affiliation(s)
- Meghana Chitale
- Department of Computer Science, Purdue University, 305 N, University Street, West Lafayette, Indiana 47907, USA
31
Nam HJ, Han SK, Bowie JU, Kim S. Rampant exchange of the structure and function of extramembrane domains between membrane and water soluble proteins. PLoS Comput Biol 2013; 9:e1002997. [PMID: 23555228 PMCID: PMC3605051 DOI: 10.1371/journal.pcbi.1002997] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Received: 07/18/2012] [Accepted: 02/04/2013] [Indexed: 11/19/2022]
Abstract
Of the membrane proteins of known structure, we found that a remarkable 67% of the water soluble domains are structurally similar to water soluble proteins of known structure. Moreover, 41% of known water soluble protein structures share a domain with an already known membrane protein structure. We also found that functional residues are frequently conserved between extramembrane domains of membrane and soluble proteins that share structural similarity. These results suggest membrane and soluble proteins readily exchange domains and their attendant functionalities. The exchanges between membrane and soluble proteins are particularly frequent in eukaryotes, indicating that this is an important mechanism for increasing functional complexity. The high level of structural overlap between the two classes of proteins provides an opportunity to employ the extensive information on soluble proteins to illuminate membrane protein structure and function, for which much less is known. To this end, we employed structure guided sequence alignment to elucidate the functions of membrane proteins in the human genome. Our results bridge the gap of fold space between membrane and water soluble proteins and provide a resource for the prediction of membrane protein function. A database of predicted structural and functional relationships for proteins in the human genome is provided at sbi.postech.ac.kr/emdmp.
Affiliation(s)
- Hyun-Jun Nam
- School of Interdisciplinary Bioscience and Bioengineering, Department of Life Science, Division of IT Convergence Engineering, Pohang University of Science and Technology, Pohang, Korea
- Seong Kyu Han
- School of Interdisciplinary Bioscience and Bioengineering, Department of Life Science, Division of IT Convergence Engineering, Pohang University of Science and Technology, Pohang, Korea
- James U. Bowie
- Department of Chemistry and Biochemistry, UCLA-DOE Institute of Genomics and Proteomics, Molecular Biology Institute, University of California Los Angeles, Los Angeles, California, United States of America
- Sanguk Kim
- School of Interdisciplinary Bioscience and Bioengineering, Department of Life Science, Division of IT Convergence Engineering, Pohang University of Science and Technology, Pohang, Korea
- Department of Chemistry and Biochemistry, UCLA-DOE Institute of Genomics and Proteomics, Molecular Biology Institute, University of California Los Angeles, Los Angeles, California, United States of America
32
Khan I, Chitale M, Rayon C, Kihara D. Evaluation of function predictions by PFP, ESG, and PSI-BLAST for moonlighting proteins. BMC Proc 2012; 6 Suppl 7:S5. [PMID: 23173871 PMCID: PMC3504920 DOI: 10.1186/1753-6561-6-s7-s5] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.5] [Indexed: 01/03/2023]
Abstract
Background Advancements in function prediction algorithms are enabling large scale computational annotation for newly sequenced genomes. With the increase in the number of functionally well characterized proteins it has been observed that there are many proteins involved in more than one function. These proteins characterized as moonlighting proteins show varied functional behavior depending on the cell type, localization in the cell, oligomerization, multiple binding sites, etc. The functional diversity shown by moonlighting proteins may have significant impact on the traditional sequence based function prediction methods. Here we investigate how well diverse functions of moonlighting proteins can be predicted by some existing function prediction methods. Results We have analyzed the performances of three major sequence based function prediction methods, PSI-BLAST, the Protein Function Prediction (PFP), and the Extended Similarity Group (ESG) on predicting diverse functions of moonlighting proteins. In predicting discrete functions of a set of 19 experimentally identified moonlighting proteins, PFP showed overall highest recall among the three methods. Although ESG showed the highest precision, its recall was lower than PSI-BLAST. Recall by PSI-BLAST greatly improved when BLOSUM45 was used instead of BLOSUM62. Conclusion We have analyzed the performances of PFP, ESG, and PSI-BLAST in predicting the functional diversity of moonlighting proteins. PFP shows overall better performance in predicting diverse moonlighting functions as compared with PSI-BLAST and ESG. Recall by PSI-BLAST greatly improved when BLOSUM45 was used. This analysis indicates that considering weakly similar sequences in prediction enhances the performance of sequence based AFP methods in predicting functional diversity of moonlighting proteins. The current study will also motivate development of novel computational frameworks for automatic identification of such proteins.
Affiliation(s)
- Ishita Khan
- Department of Computer Science, College of Science, Purdue University, West Lafayette, IN 47907, USA.
33
Wass MN, Barton G, Sternberg MJE. CombFunc: predicting protein function using heterogeneous data sources. Nucleic Acids Res 2012; 40:W466-70. [PMID: 22641853 PMCID: PMC3394346 DOI: 10.1093/nar/gks489] [Citation(s) in RCA: 43] [Impact Index Per Article: 3.6] [Indexed: 12/19/2022]
Abstract
Only a small fraction of known proteins have been functionally characterized, making protein function prediction essential to propose annotations for uncharacterized proteins. In recent years many function prediction methods have been developed using various sources of biological data from protein sequence and structure to gene expression data. Here we present the CombFunc web server, which makes Gene Ontology (GO)-based protein function predictions. CombFunc incorporates ConFunc, our existing function prediction method, with other approaches for function prediction that use protein sequence, gene expression and protein–protein interaction data. In benchmarking on a set of 1686 proteins CombFunc obtains precision and recall of 0.71 and 0.64 respectively for gene ontology molecular function terms. For biological process GO terms precision of 0.74 and recall of 0.41 is obtained. CombFunc is available at http://www.sbg.bio.ic.ac.uk/combfunc.
Affiliation(s)
- Mark N Wass
- Centre for Bioinformatics, Imperial College London, London, SW7 2AZ, UK.
34
Roy A, Yang J, Zhang Y. COFACTOR: an accurate comparative algorithm for structure-based protein function annotation. Nucleic Acids Res 2012; 40:W471-7. [PMID: 22570420 PMCID: PMC3394312 DOI: 10.1093/nar/gks372] [Citation(s) in RCA: 460] [Impact Index Per Article: 38.3] [Indexed: 11/12/2022]
Abstract
We have developed a new COFACTOR webserver for automated structure-based protein function annotation. Starting from a structural model, given by either experimental determination or computational modeling, COFACTOR first identifies template proteins of similar folds and functional sites by threading the target structure through three representative template libraries that have known protein-ligand binding interactions, Enzyme Commission number or Gene Ontology terms. The biological function insights in these three aspects are then deduced from the functional templates, the confidence of which is evaluated by a scoring function that combines both global and local structural similarities. The algorithm has been extensively benchmarked by large-scale benchmarking tests and demonstrated significant advantages compared to traditional sequence-based methods. In the recent community-wide CASP9 experiment, COFACTOR was ranked as the best method for protein-ligand binding site predictions. The COFACTOR server and the template libraries are freely available at http://zhanglab.ccmb.med.umich.edu/COFACTOR.
Affiliation(s)
- Ambrish Roy
- Department of Computational Medicine and Bioinformatics, University of Michigan, 100 Washtenaw Avenue, Ann Arbor, MI 48109-2218, USA
35
Bianchi V, Gherardini PF, Helmer-Citterich M, Ausiello G. Identification of binding pockets in protein structures using a knowledge-based potential derived from local structural similarities. BMC Bioinformatics 2012; 13 Suppl 4:S17. [PMID: 22536963 PMCID: PMC3434446 DOI: 10.1186/1471-2105-13-s4-s17] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.0] [Indexed: 02/04/2023]
Abstract
Background The identification of ligand binding sites is a key task in the annotation of proteins with known structure but uncharacterized function. Here we describe a knowledge-based method exploiting the observation that unrelated binding sites share small structural motifs that bind the same chemical fragments irrespective of the nature of the ligand as a whole. Results PDBinder compares a query protein against a library of binding and non-binding protein surface regions derived from the PDB. The results of the comparison are used to derive a propensity value for each residue which is correlated with the likelihood that the residue is part of a ligand binding site. The method was applied to two different problems: i) the prediction of ligand binding residues and ii) the identification of which surface cleft harbours the binding site. In both cases PDBinder performed consistently better than existing methods. PDBinder has been trained on a non-redundant set of 1356 high-quality protein-ligand complexes and tested on a set of 239 holo and apo complex pairs. We obtained an MCC of 0.313 on the holo set with a PPV of 0.413 while on the apo set we achieved an MCC of 0.271 and a PPV of 0.372. Conclusions We show that PDBinder performs better than existing methods. The good performance on the unbound proteins is extremely important for real-world applications where the location of the binding site is unknown. Moreover, since our approach is orthogonal to those used in other programs, the PDBinder propensity value can be integrated in other algorithms further increasing the final performance.
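The MCC and PPV figures quoted in this abstract are standard per-residue classification metrics. As a rough sketch of how such values are obtained from a set of predicted and known binding residues (not PDBinder's own code), the example below computes both from a confusion matrix. The function name `binding_site_metrics` and the toy residue numbers are assumptions made for illustration.

```python
import math
from typing import Set, Tuple


def binding_site_metrics(
    predicted: Set[int], actual: Set[int], all_residues: Set[int]
) -> Tuple[float, float]:
    """Return (MCC, PPV) for a per-residue binding-site prediction.

    predicted:    residue numbers predicted as binding
    actual:       residue numbers known to bind the ligand
    all_residues: every residue number in the protein
    """
    tp = len(predicted & actual)                 # correctly predicted binders
    fp = len(predicted - actual)                 # predicted, but not binding
    fn = len(actual - predicted)                 # binding, but missed
    tn = len(all_residues - predicted - actual)  # correctly left out

    # Positive predictive value: fraction of predictions that are correct.
    ppv = tp / (tp + fp) if (tp + fp) else 0.0

    # Matthews correlation coefficient; defined as 0 when any margin is empty.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return mcc, ppv


if __name__ == "__main__":
    # Toy 10-residue protein, invented for illustration only.
    residues = set(range(1, 11))
    known_site = {1, 2, 3, 4}
    prediction = {3, 4, 5}
    mcc, ppv = binding_site_metrics(prediction, known_site, residues)
    print(f"MCC={mcc:.3f} PPV={ppv:.3f}")
```

Because binding residues are a small minority of any protein, MCC is usually preferred over plain accuracy here: it stays near zero for a predictor that labels everything as non-binding.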
Affiliation(s)
- Valerio Bianchi
- Centre for Molecular Bioinformatics, Department of Biology, University of Rome Tor Vergata, Via della Ricerca Scientifica snc, Rome 00133, Italy
36
Sehnal D, Vařeková RS, Huber HJ, Geidl S, Ionescu CM, Wimmerová M, Koča J. SiteBinder: an improved approach for comparing multiple protein structural motifs. J Chem Inf Model 2012; 52:343-59. [PMID: 22296449 DOI: 10.1021/ci200444d] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Indexed: 12/21/2022]
Abstract
There is a paramount need to develop new techniques and tools that will extract as much information as possible from the ever growing repository of protein 3D structures. We report here on the development of a software tool for the multiple superimposition of large sets of protein structural motifs. Our superimposition methodology performs a systematic search for the atom pairing that provides the best fit. During this search, the RMSD values for all chemically relevant pairings are calculated by quaternion algebra. The number of evaluated pairings is markedly decreased by using PDB annotations for atoms. This approach guarantees that the best fit will be found and can be applied even when sequence similarity is low or does not exist at all. We have implemented this methodology in the Web application SiteBinder, which is able to process up to thousands of protein structural motifs in a very short time, and which provides an intuitive and user-friendly interface. Our benchmarking analysis has shown the robustness, efficiency, and versatility of our methodology and its implementation by the successful superimposition of 1000 experimentally determined structures for each of 32 eukaryotic linear motifs. We also demonstrate the applicability of SiteBinder using three case studies. We first compared the structures of 61 PA-IIL sugar binding sites containing nine different sugars, and we found that the sugar binding sites of PA-IIL and its mutants have a conserved structure despite their binding different sugars. We then superimposed over 300 zinc finger central motifs and revealed that the molecular structure in the vicinity of the Zn atom is highly conserved. Finally, we superimposed 12 BH3 domains from pro-apoptotic proteins. Our findings come to support the hypothesis that there is a structural basis for the functional segregation of BH3-only proteins into activators and enablers.
Affiliation(s)
- David Sehnal
- National Centre for Biomolecular Research, Faculty of Science and CEITEC-Central European Institute of Technology, Masaryk University Brno, Kamenice 5, 62500 Brno-Bohunice, Czech Republic
37
Sael L, Chitale M, Kihara D. Structure- and sequence-based function prediction for non-homologous proteins. J Struct Funct Genomics 2012; 13:111-23. [PMID: 22270458 DOI: 10.1007/s10969-012-9126-6] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.5] [Received: 09/12/2011] [Accepted: 01/10/2012] [Indexed: 01/14/2023]
Abstract
The structural genomics projects have been accumulating an increasing number of protein structures, many of which remain functionally unknown. In parallel effort to experimental methods, computational methods are expected to make a significant contribution for functional elucidation of such proteins. However, conventional computational methods that transfer functions from homologous proteins do not help much for these uncharacterized protein structures because they do not have apparent structural or sequence similarity with the known proteins. Here, we briefly review two avenues of computational function prediction methods, i.e. structure-based methods and sequence-based methods. The focus is on our recent developments of local structure-based and sequence-based methods, which can effectively extract function information from distantly related proteins. Two structure-based methods, Pocket-Surfer and Patch-Surfer, identify similar known ligand binding sites for pocket regions in a query protein without using global protein fold similarity information. Two sequence-based methods, protein function prediction and extended similarity group, make use of weakly similar sequences that are conventionally discarded in homology based function annotation. Combined together with experimental methods we hope that computational methods will make leading contribution in functional elucidation of the protein structures.
Affiliation(s)
- Lee Sael
- Department of Computer Science, Purdue University, West Lafayette, IN 47907, USA
38
Schmidt T, Haas J, Gallo Cassarino T, Schwede T. Assessment of ligand-binding residue predictions in CASP9. Proteins 2011; 79 Suppl 10:126-36. [PMID: 21987472 DOI: 10.1002/prot.23174] [Citation(s) in RCA: 67] [Impact Index Per Article: 5.2] [Received: 05/23/2011] [Revised: 07/29/2011] [Accepted: 08/04/2011] [Indexed: 11/06/2022]
Abstract
Interactions between proteins and their ligands play central roles in many physiological processes. The structural details for most of these interactions, however, have not yet been characterized experientially. Therefore, various computational tools have been developed to predict the location of binding sites and the amino acid residues interacting with ligands. In this manuscript, we assess the performance of 33 methods participating in the ligand-binding site prediction category in CASP9. The overall accuracy of ligand-binding site predictions in CASP9 appears rather high (average Matthews correlation coefficient of 0.62 for the 10 top performing groups) and compared to previous experiments more groups performed equally well. However, this should be seen in context of a strong bias in the test data toward easy template-based models. Overall, the top performing methods have converged to a similar approach using ligand-binding site inference from related homologous structures, which limits their applicability for difficult de novo prediction targets. Here, we present the results of the CASP9 assessment of the ligand-binding site category, discuss examples for successful and challenging prediction targets in CASP9, and finally suggest changes in the format of the experiment to overcome the current limitations of the assessment.
Affiliation(s)
- Tobias Schmidt
- Biozentrum, University of Basel, SIB Swiss Institute of Bioinformatics, Klingelbergstrasse 50-70, Basel, Switzerland
|
39
|
Gamliel R, Kedem K, Kolodny R, Keasar C. A library of protein surface patches discriminates between native structures and decoys generated by structure prediction servers. BMC Struct Biol 2011; 11:20. [PMID: 21542935 PMCID: PMC3114701 DOI: 10.1186/1472-6807-11-20] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Received: 10/03/2010] [Accepted: 05/04/2011] [Indexed: 11/10/2022]
Abstract
Background: Protein surfaces serve as an interface with the molecular environment and are thus tightly bound to protein function. On the surface, geometric and chemical complementarity to other molecules provides interaction specificity for ligand binding, docking of bio-macromolecules, and enzymatic catalysis. As of today, there is no accepted general scheme to represent protein surfaces. Furthermore, most of the research on protein surface focuses on regions of specific interest such as interaction, ligand binding, and docking sites. We present a first step toward a general purpose representation of protein surfaces: a novel surface patch library that represents most surface patches (~98%) in a data set regardless of their functional roles.
Results: Surface patches, in this work, are small fractions of the protein surface. Using a measure of inter-patch distance, we clustered patches extracted from a data set of high quality, non-redundant, proteins. The surface patch library is the collection of all the cluster centroids; thus, each of the data set patches is close to one of the elements in the library. We demonstrate the biological significance of our method through the ability of the library to capture surface characteristics of native protein structures as opposed to those of decoy sets generated by state-of-the-art protein structure prediction methods. The patches of the decoys are significantly less compatible with the library than their corresponding native structures, allowing us to reliably distinguish native models from models generated by servers. This trend, however, does not extend to the decoys themselves, as their similarity to the native structures does not correlate with compatibility with the library.
Conclusions: We expect that this high-quality, generic surface patch library will add a new perspective to the description of protein structures and improve our ability to predict them. In particular, we expect that it will help improve the prediction of surface features that are apparently neglected by current techniques. The surface patch libraries are publicly available at http://www.cs.bgu.ac.il/~keasar/patchLibrary.
Affiliation(s)
- Roi Gamliel
- Department of Computer Science, Ben-Gurion University of the Negev, Beer-Sheva, Israel
|
40
|
Bertran AGM, Oliveira AS, Nagata T, Resende RO. Molecular characterization of the RNA-dependent RNA polymerase from groundnut ringspot virus (genus Tospovirus, family Bunyaviridae). Arch Virol 2011; 156:1425-9. [PMID: 21442231 DOI: 10.1007/s00705-011-0973-4] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Received: 01/11/2011] [Accepted: 03/07/2011] [Indexed: 11/26/2022]
Abstract
Groundnut ringspot virus is a negative-sense single-stranded RNA virus that belongs to the genus Tospovirus and is the prevalent member of this genus in Brazil. This work presents the nucleotide sequence of the L RNA, with a single open reading frame of 2873 amino acids in the complementary strand corresponding to the RNA-dependent RNA polymerase (L protein), as well as the characterization of conserved domains of the L protein by in silico analysis. Phylogenetic analysis of different L protein domains confirmed that GRSV is a member of the American clade, and comparison with a N-protein indicates that phylogeny based on L protein sequences may be more reliable than that based on the N protein.
Affiliation(s)
- A G M Bertran
- Laboratory of Plant Virology, Department of Cellular Biology, Biological Sciences Institute, University of Brasília, Campus Universitário Darcy Ribeiro, Asa Norte, Brasília, Distrito Federal, 70910-900, Brazil
|
41
|
Somarowthu S, Yang H, Hildebrand DG, Ondrechen MJ. High-performance prediction of functional residues in proteins with machine learning and computed input features. Biopolymers 2011; 95:390-400. [DOI: 10.1002/bip.21589] [Citation(s) in RCA: 34] [Impact Index Per Article: 2.6] [Indexed: 11/08/2022]
|
42
|
Grant MA. Integrating computational protein function prediction into drug discovery initiatives. Drug Dev Res 2010; 72:4-16. [PMID: 25530654 DOI: 10.1002/ddr.20397] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Indexed: 12/24/2022]
Abstract
Pharmaceutical researchers must evaluate vast numbers of protein sequences and formulate innovative strategies for identifying valid targets and discovering leads against them as a way of accelerating drug discovery. The ever increasing number and diversity of novel protein sequences identified by genomic sequencing projects and the success of worldwide structural genomics initiatives have spurred great interest and impetus in the development of methods for accurate, computationally empowered protein function prediction and active site identification. Previously, in the absence of direct experimental evidence, homology-based protein function annotation remained the gold-standard for in silico analysis and prediction of protein function. However, with the continued exponential expansion of sequence databases, this approach is not always applicable, as fewer query protein sequences demonstrate significant homology to protein gene products of known function. As a result, several non-homology based methods for protein function prediction that are based on sequence features, structure, evolution, biochemical and genetic knowledge have emerged. Herein, we review current bioinformatic programs and approaches for protein function prediction/annotation and discuss their integration into drug discovery initiatives. The development of such methods to annotate protein functional sites and their application to large protein functional families is crucial to successfully utilizing the vast amounts of genomic sequence information available to drug discovery and development processes.
Affiliation(s)
- Marianne A Grant
- Division of Molecular and Vascular Medicine and Center for Vascular Biology Research, Beth Israel Deaconess Medical Center, Department of Medicine, Harvard Medical School, Boston, Massachusetts, 02215
|
43
|
Venner E, Lisewski AM, Erdin S, Ward RM, Amin SR, Lichtarge O. Accurate protein structure annotation through competitive diffusion of enzymatic functions over a network of local evolutionary similarities. PLoS One 2010; 5:e14286. [PMID: 21179190 PMCID: PMC3001439 DOI: 10.1371/journal.pone.0014286] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.9] [Received: 06/29/2010] [Accepted: 11/10/2010] [Indexed: 12/24/2022] Open
Abstract
High-throughput Structural Genomics yields many new protein structures without known molecular function. This study aims to uncover these missing annotations by globally comparing select functional residues across the structural proteome. First, Evolutionary Trace Annotation, or ETA, identifies which proteins have local evolutionary and structural features in common; next, these proteins are linked together into a proteomic network of ETA similarities; then, starting from proteins with known functions, competing functional labels diffuse link-by-link over the entire network. Every node is thus assigned a likelihood z-score for every function, and the most significant one at each node wins and defines its annotation. In high-throughput controls, this competitive diffusion process recovered enzyme activity annotations with 99% and 97% accuracy at half-coverage for the third and fourth Enzyme Commission (EC) levels, respectively. This corresponds to false positive rates 4-fold lower than nearest-neighbor and 5-fold lower than sequence-based annotations. In practice, experimental validation of the predicted carboxylesterase activity in a protein from Staphylococcus aureus illustrated the effectiveness of this approach in the context of an increasingly drug-resistant microbe. This study further links molecular function to a small number of evolutionarily important residues recognizable by Evolutionary Tracing and it points to the specificity and sensitivity of functional annotation by competitive global network diffusion. A web server is at http://mammoth.bcm.tmc.edu/networks.
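The idea of labels spreading link-by-link from annotated proteins and competing at each node can be illustrated with a generic label-propagation sketch. This is not the ETA network method itself: the update rule (diffusion with restart to the seed labels), the parameters `alpha` and `n_iter`, and the use of raw diffusion scores in place of the paper's likelihood z-scores are all assumptions chosen for illustration.

```python
import numpy as np


def diffuse_labels(adjacency, seed_labels, n_labels, alpha=0.85, n_iter=100):
    """Spread competing labels over a similarity network (illustrative only).

    adjacency:   (n, n) symmetric, non-negative link weights
    seed_labels: {node index: label index} for nodes with known function
    Returns (winning label per node, (n, n_labels) score matrix).
    """
    adjacency = np.asarray(adjacency, dtype=float)
    n = adjacency.shape[0]
    # Row-normalise so that each node averages its neighbours' scores.
    degree = adjacency.sum(axis=1, keepdims=True)
    transition = np.divide(adjacency, degree,
                           out=np.zeros_like(adjacency), where=degree > 0)
    seeds = np.zeros((n, n_labels))
    for node, label in seed_labels.items():
        seeds[node, label] = 1.0
    scores = seeds.copy()
    for _ in range(n_iter):
        # One diffusion step over the links, then re-inject the known labels.
        scores = alpha * (transition @ scores) + (1 - alpha) * seeds
    # Competition: the highest-scoring label at each node wins.
    return scores.argmax(axis=1), scores


# Chain of four proteins; the two ends carry different known functions.
chain = np.array([[0, 1, 0, 0],
                  [1, 0, 1, 0],
                  [0, 1, 0, 1],
                  [0, 0, 1, 0]])
winners, _ = diffuse_labels(chain, {0: 0, 3: 1}, n_labels=2)
print(winners)
```

In this toy chain each unlabelled protein takes the function of the seed it is closer to; a real application would additionally need a significance measure, such as the z-score described in the abstract, before accepting an annotation.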
Affiliation(s)
- Eric Venner
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas, United States of America
- Graduate Program in Structural and Computational Biology and Molecular Biophysics, Baylor College of Medicine, Houston, Texas, United States of America
- W. M. Keck Center for Interdisciplinary Bioscience Training, Houston, Texas, United States of America
- Andreas Martin Lisewski
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas, United States of America
- Serkan Erdin
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas, United States of America
- W. M. Keck Center for Interdisciplinary Bioscience Training, Houston, Texas, United States of America
- R. Matthew Ward
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas, United States of America
- Graduate Program in Structural and Computational Biology and Molecular Biophysics, Baylor College of Medicine, Houston, Texas, United States of America
- W. M. Keck Center for Interdisciplinary Bioscience Training, Houston, Texas, United States of America
- Shivas R. Amin
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas, United States of America
- Olivier Lichtarge
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas, United States of America
- Graduate Program in Structural and Computational Biology and Molecular Biophysics, Baylor College of Medicine, Houston, Texas, United States of America
- W. M. Keck Center for Interdisciplinary Bioscience Training, Houston, Texas, United States of America
|
44
|
Moll M, Bryant DH, Kavraki LE. The LabelHash algorithm for substructure matching. BMC Bioinformatics 2010; 11:555. [PMID: 21070651 PMCID: PMC2996407 DOI: 10.1186/1471-2105-11-555] [Citation(s) in RCA: 30] [Impact Index Per Article: 2.1] [Received: 07/08/2010] [Accepted: 11/11/2010] [Indexed: 08/30/2023] Open
Abstract
Background: There is an increasing number of proteins with known structure but unknown function. Determining their function would have a significant impact on understanding diseases and designing new therapeutics. However, experimental protein function determination is expensive and very time-consuming. Computational methods can facilitate function determination by identifying proteins that have high structural and chemical similarity.
Results: We present LabelHash, a novel algorithm for matching substructural motifs to large collections of protein structures. The algorithm consists of two phases. In the first phase the proteins are preprocessed in a fashion that allows for instant lookup of partial matches to any motif. In the second phase, partial matches for a given motif are expanded to complete matches. The general applicability of the algorithm is demonstrated with three different case studies. First, we show that we can accurately identify members of the enolase superfamily with a single motif. Next, we demonstrate how LabelHash can complement SOIPPA, an algorithm for motif identification and pairwise substructure alignment. Finally, a large collection of Catalytic Site Atlas motifs is used to benchmark the performance of the algorithm. LabelHash runs very efficiently in parallel; matching a motif against all proteins in the 95% sequence identity filtered non-redundant Protein Data Bank typically takes no more than a few minutes. The LabelHash algorithm is available through a web server and as a suite of standalone programs at http://labelhash.kavrakilab.org. The output of the LabelHash algorithm can be further analyzed with Chimera through a plugin that we developed for this purpose.
Conclusions: LabelHash is an efficient, versatile algorithm for large-scale substructure matching. When LabelHash is running in parallel, motifs can typically be matched against the entire PDB on the order of minutes. The algorithm is able to identify functional homologs beyond the twilight zone of sequence identity and even beyond fold similarity. The three case studies presented in this paper illustrate the versatility of the algorithm.
Affiliation(s)
- Mark Moll
- Department of Computer Science, Rice University, Houston, TX 77005, USA.
|
45
|
Parca L, Gherardini PF, Helmer-Citterich M, Ausiello G. Phosphate binding sites identification in protein structures. Nucleic Acids Res 2010; 39:1231-42. [PMID: 20974634 PMCID: PMC3045618 DOI: 10.1093/nar/gkq987] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.1] [Indexed: 01/17/2023] Open
Abstract
Nearly half of known protein structures interact with phosphate-containing ligands, such as nucleotides and other cofactors. Many methods have been developed for the identification of metal-ion binding sites and some for bigger ligands such as carbohydrates, but none is yet available for the prediction of phosphate-binding sites. Here we describe Pfinder, a method that predicts binding sites for phosphate groups, either in the form of ions or as parts of other non-peptide ligands, in proteins of known structure. Pfinder uses the Query3D local structural comparison algorithm to scan a protein structure for the presence of a number of structural motifs identified for their ability to bind the phosphate chemical group. Pfinder has been tested on a data set of 52 proteins for which both the apo and holo forms were available. We obtained at least one correct prediction in 63% of the holo structures and in 62% of the apo structures. The ability of Pfinder to recognize a phosphate-binding site in unbound protein structures makes it an ideal tool for functional annotation and for complementing docking and drug design methods. The Pfinder program is available at http://pdbfun.uniroma2.it/pfinder.
Affiliation(s)
- Luca Parca
- Department of Biology, Centre for Molecular Bioinformatics, University of Rome Tor Vergata, Via della Ricerca Scientifica snc, 00133 Rome, Italy
|
46
|
Unmet challenges of structural genomics. Curr Opin Struct Biol 2010; 20:587-97. [PMID: 20810277 DOI: 10.1016/j.sbi.2010.08.001] [Citation(s) in RCA: 41] [Impact Index Per Article: 2.9] [Received: 06/05/2010] [Revised: 07/30/2010] [Accepted: 08/03/2010] [Indexed: 11/22/2022]
Abstract
Over the last decade, structural genomics (SG) programs have developed many novel methodologies for faster and more accurate structure determination. These new tools and approaches led to the determination of thousands of protein structures. The generation of enormous amounts of experimental data resulted in significant improvements in the understanding of many biological processes at the molecular level. However, the amount of data collected so far is so large that traditional analysis methods are limiting the rate of extraction of biological and biochemical information from 3D models. This situation has prompted us to review the challenges that remain unmet by SG, as well as the areas in which the potential impact of SG could exceed what has been achieved so far.
|
47
|
Li GH, Huang JF. CMASA: an accurate algorithm for detecting local protein structural similarity and its application to enzyme catalytic site annotation. BMC Bioinformatics 2010; 11:439. [PMID: 20796320 PMCID: PMC2936402 DOI: 10.1186/1471-2105-11-439] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.1] [Received: 11/26/2009] [Accepted: 08/27/2010] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: The rapid development of structural genomics has resulted in many proteins of unknown function being deposited in the Protein Data Bank (PDB); the functional prediction of these proteins has thus become a challenge for structural bioinformatics. Several sequence-based and structure-based methods have been developed to predict protein function, but they need further improvement in accuracy, sensitivity, and computational speed. Here, an accurate algorithm, CMASA (Contact MAtrix based local Structural Alignment algorithm), has been developed to predict unknown functions of proteins based on local protein structural similarity. The algorithm has been evaluated on a test set of 164 enzyme families and compared to other methods.
RESULTS: The evaluation shows that CMASA is highly accurate (0.96), sensitive (0.86), and fast enough to be used in large-scale functional annotation. Compared to both sequence-based and global structure-based methods, CMASA can find not only remote homologous proteins but also active site convergence. Compared to other local structure comparison-based methods, CMASA performs better than both FFF (a method using geometry to predict protein function) and SPASM (a local structure alignment method), is more sensitive than PINTS, and is more accurate than JESS (both local structure alignment methods). CMASA was applied to annotate the enzyme catalytic sites of the non-redundant PDB, and at least 166 putative catalytic sites were suggested that are not found in the Catalytic Site Atlas (CSA).
CONCLUSIONS: CMASA is an accurate algorithm for detecting local protein structural similarity, and it holds several advantages in predicting enzyme active sites. CMASA can be used in large-scale enzyme active site annotation. CMASA is available through a mail-based server (http://159.226.149.45/other1/CMASA/CMASA.htm).
Affiliation(s)
- Gong-Hua Li
- State Key Laboratory of Genetic Resources and Evolution, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming, Yunnan, China
|
48
|
Gherardini PF, Ausiello G, Helmer-Citterich M. Superpose3D: a local structural comparison program that allows for user-defined structure representations. PLoS One 2010; 5:e11988. [PMID: 20700534 PMCID: PMC2916828 DOI: 10.1371/journal.pone.0011988] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.0] [Received: 04/27/2010] [Accepted: 07/08/2010] [Indexed: 11/19/2022] Open
Abstract
Local structural comparison methods can be used to find structural similarities involving functional protein patches such as enzyme active sites and ligand binding sites. The outcome of such analyses is critically dependent on the representation used to describe the structure. Indeed different categories of functional sites may require the comparison program to focus on different characteristics of the protein residues. We have therefore developed superpose3D, a novel structural comparison software that lets users specify, with a powerful and flexible syntax, the structure description most suited to the requirements of their analysis. Input proteins are processed according to the user's directives and the program identifies sets of residues (or groups of atoms) that have a similar 3D position in the two structures. The advantages of using such a general purpose program are demonstrated with several examples. These test cases show that no single representation is appropriate for every analysis, hence the usefulness of having a flexible program that can be tailored to different needs. Moreover we also discuss how to interpret the results of a database screening where a known structural motif is searched against a large ensemble of structures. The software is written in C++ and is released under the open source GPL license. Superpose3D does not require any external library, runs on Linux, Mac OSX, Windows and is available at http://cbm.bio.uniroma2.it/superpose3D.
Affiliation(s)
- Pier Federico Gherardini
- Centre for Molecular Bioinformatics, Department of Biology, University of Rome “Tor Vergata”, Rome, Italy
- Gabriele Ausiello
- Centre for Molecular Bioinformatics, Department of Biology, University of Rome “Tor Vergata”, Rome, Italy
- Manuela Helmer-Citterich
- Centre for Molecular Bioinformatics, Department of Biology, University of Rome “Tor Vergata”, Rome, Italy
|
49
|
Wass MN, Kelley LA, Sternberg MJE. 3DLigandSite: predicting ligand-binding sites using similar structures. Nucleic Acids Res 2010; 38:W469-73. [PMID: 20513649 PMCID: PMC2896164 DOI: 10.1093/nar/gkq406] [Citation(s) in RCA: 457] [Impact Index Per Article: 32.6] [Indexed: 12/01/2022] Open
Abstract
3DLigandSite is a web server for the prediction of ligand-binding sites. It is based upon successful manual methods used in the eighth round of the Critical Assessment of techniques for protein Structure Prediction (CASP8). 3DLigandSite utilizes protein-structure prediction to provide structural models for proteins that have not been solved. Ligands bound to structures similar to the query are superimposed onto the model and used to predict the binding site. In benchmarking against the CASP8 targets, 3DLigandSite obtains a Matthews correlation coefficient (MCC) of 0.64, and coverage and accuracy of 71% and 60%, respectively, similar to our manual performance in CASP8. In further benchmarking using a large set of protein structures, 3DLigandSite obtains an MCC of 0.68. The web server enables users to submit either a query sequence or structure. Predictions are visually displayed via an interactive Jmol applet. 3DLigandSite is available for use at http://www.sbg.bio.ic.ac.uk/3dligandsite.
Affiliation(s)
- Mark N Wass
- Structural Bioinformatics Group, Centre for Bioinformatics, Imperial College London, London, SW7 2AZ, UK
|
50
|
Cilia E, Passerini A. Automatic prediction of catalytic residues by modeling residue structural neighborhood. BMC Bioinformatics 2010; 11:115. [PMID: 20199672 PMCID: PMC2844391 DOI: 10.1186/1471-2105-11-115] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.1] [Received: 02/06/2009] [Accepted: 03/03/2010] [Indexed: 02/05/2023] Open
Abstract
BACKGROUND: Prediction of catalytic residues is a major step in characterizing the function of enzymes. In its simpler formulation, the problem can be cast into a binary classification task at the residue level, by predicting whether the residue is directly involved in the catalytic process. The task remains hard even when structural information is available, due to the rather wide range of roles a functional residue can play and to the large imbalance between the number of catalytic and non-catalytic residues.
RESULTS: We developed an effective representation of structural information by modeling spherical regions around candidate residues, and extracting statistics on the properties of their content such as physico-chemical properties, atomic density, flexibility, and presence of water molecules. We trained an SVM classifier combining our features with sequence-based information and previously developed 3D features, and compared its performance with the most recent state-of-the-art approaches on different benchmark datasets. We further analyzed the discriminant power of the information provided by the presence of heterogens in the residue neighborhood.
CONCLUSIONS: Our structure-based method achieves consistent improvements on all tested datasets over both sequence-based and structure-based state-of-the-art approaches. Structural neighborhood information is shown to be responsible for such results, and predicting the presence of nearby heterogens seems to be a promising direction for further improvements.
Affiliation(s)
- Elisa Cilia
- Information Engineering and Computer Science Department, via Sommarive 14 - I38100 (Povo) Trento, Italy
- Andrea Passerini
- Information Engineering and Computer Science Department, via Sommarive 14 - I38100 (Povo) Trento, Italy
|