1
|
Wei J, Xiao J, Chen S, Zong L, Gao X, Li Y. ProNet DB: a proteome-wise database for protein surface property representations and RNA-binding profiles. Database (Oxford) 2024; 2024:baae012. [PMID: 38557634 PMCID: PMC10984565 DOI: 10.1093/database/baae012] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/11/2023] [Revised: 01/08/2024] [Accepted: 02/17/2024] [Indexed: 04/04/2024]
Abstract
The rapid growth in the number of experimental and predicted protein structures and more complicated protein structures poses a significant challenge for computational biology in leveraging structural information and accurate representation of protein surface properties. Recently, AlphaFold2 released the comprehensive proteomes of various species, and protein surface property representation plays a crucial role in protein-molecule interaction predictions, including those involving proteins, nucleic acids and compounds. Here, we proposed the first extensive database, namely ProNet DB, that integrates multiple protein surface representations and RNA-binding landscape for 326 175 protein structures. This collection encompasses the 16 model organism proteomes from the AlphaFold Protein Structure Database and experimentally validated structures from the Protein Data Bank. For each protein, ProNet DB provides access to the original protein structures along with the detailed surface property representations encompassing hydrophobicity, charge distribution and hydrogen bonding potential as well as interactive features such as the interacting face and RNA-binding sites and preferences. To facilitate an intuitive interpretation of these properties and the RNA-binding landscape, ProNet DB incorporates visualization tools like Mol* and an Online 3D Viewer, allowing for the direct observation and analysis of these representations on protein surfaces. The availability of pre-computed features enables instantaneous access for users, significantly advancing computational biology research in areas such as molecular mechanism elucidation, geometry-based drug discovery and the development of novel therapeutic approaches. Database URL: https://proj.cse.cuhk.edu.hk/aihlab/pronet/.
Affiliation(s)
- Junkang Wei
  - Department of Computer Science and Engineering (CSE), The Chinese University of Hong Kong (CUHK), Chung Chi Rd, Ma Liu Shui, Hong Kong SAR 999077, China
- Jin Xiao
  - Department of Computer Science and Engineering (CSE), The Chinese University of Hong Kong (CUHK), Chung Chi Rd, Ma Liu Shui, Hong Kong SAR 999077, China
- Siyuan Chen
  - Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955, Kingdom of Saudi Arabia
  - KAUST Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology, Thuwal 23955, Kingdom of Saudi Arabia
- Licheng Zong
  - Department of Computer Science and Engineering (CSE), The Chinese University of Hong Kong (CUHK), Chung Chi Rd, Ma Liu Shui, Hong Kong SAR 999077, China
- Xin Gao
  - Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955, Kingdom of Saudi Arabia
  - KAUST Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology, Thuwal 23955, Kingdom of Saudi Arabia
- Yu Li
  - Department of Computer Science and Engineering (CSE), The Chinese University of Hong Kong (CUHK), Chung Chi Rd, Ma Liu Shui, Hong Kong SAR 999077, China
  - The CUHK Shenzhen Research Institute, 4 Gaoxin Ave Nanshan, Shenzhen 518057, China
  - Institute for Medical Engineering and Science, Massachusetts Institute of Technology, 45 Carleton Street, Cambridge, MA 02142, USA
  - Wyss Institute for Biologically Inspired Engineering, Harvard University, 201 Brookline Avenue, Boston, MA 02215, USA
  - Broad Institute of MIT and Harvard, Merkin Building, 415 Main Street, Cambridge, MA 02142, USA
|
2
|
Draizen EJ, Readey J, Mura C, Bourne PE. Prop3D: A flexible, Python-based platform for machine learning with protein structural properties and biophysical data. BMC Bioinformatics 2024; 25:11. [PMID: 38177985 PMCID: PMC10768222 DOI: 10.1186/s12859-023-05586-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/18/2023] [Accepted: 11/27/2023] [Indexed: 01/06/2024]
Abstract
BACKGROUND Machine learning (ML) has a rich history in structural bioinformatics, and modern approaches, such as deep learning, are revolutionizing our knowledge of the subtle relationships between biomolecular sequence, structure, function, dynamics and evolution. As with any advance that rests upon statistical learning approaches, the recent progress in biomolecular sciences is enabled by the availability of vast volumes of sufficiently-variable data. To be useful, such data must be well-structured, machine-readable, intelligible and manipulable. These and related requirements pose challenges that become especially acute at the computational scales typical in ML. Furthermore, in structural bioinformatics such data generally relate to protein three-dimensional (3D) structures, which are inherently more complex than sequence-based data. A significant and recurring challenge concerns the creation of large, high-quality, openly-accessible datasets that can be used for specific training and benchmarking tasks in ML pipelines for predictive modeling projects, along with reproducible splits for training and testing. RESULTS Here, we report 'Prop3D', a platform that allows for the creation, sharing and extensible reuse of libraries of protein domains, featurized with biophysical and evolutionary properties that can range from detailed, atomically-resolved physicochemical quantities (e.g., electrostatics) to coarser, residue-level features (e.g., phylogenetic conservation). As a community resource, we also supply a 'Prop3D-20sf' protein dataset, obtained by applying our approach to CATH. We have developed and deployed the Prop3D framework, both in the cloud and on local HPC resources, to systematically and reproducibly create comprehensive datasets via the Highly Scalable Data Service (HSDS). Our datasets are freely accessible via a public HSDS instance, or they can be used with accompanying Python wrappers for popular ML frameworks.
CONCLUSION Prop3D and its associated Prop3D-20sf dataset can be of broad utility in at least three ways. Firstly, the Prop3D workflow code can be customized and deployed on various cloud-based compute platforms, with scalability achieved largely by saving the results to distributed HDF5 files via HSDS. Secondly, the linked Prop3D-20sf dataset provides a hand-crafted, already-featurized dataset of protein domains for 20 highly-populated CATH families; importantly, provision of this pre-computed resource can aid the more efficient development (and reproducible deployment) of ML pipelines. Thirdly, Prop3D-20sf's construction explicitly takes into account (in creating datasets and data-splits) the enigma of 'data leakage', stemming from the evolutionary relationships between proteins.
Affiliation(s)
- Eli J Draizen
  - Department of Biomedical Engineering, University of Virginia, Charlottesville, VA, USA
  - School of Data Science, University of Virginia, Charlottesville, VA, USA
- Cameron Mura
  - Department of Biomedical Engineering, University of Virginia, Charlottesville, VA, USA
  - School of Data Science, University of Virginia, Charlottesville, VA, USA
- Philip E Bourne
  - Department of Biomedical Engineering, University of Virginia, Charlottesville, VA, USA
  - School of Data Science, University of Virginia, Charlottesville, VA, USA
|
3
|
Xia Y, Xia C, Pan X, Shen H. BindWeb: A web server for ligand binding residue and pocket prediction from protein structures. Protein Sci 2022; 31:e4462. [PMID: 36190332 PMCID: PMC9667820 DOI: 10.1002/pro.4462] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 07/29/2022] [Revised: 09/27/2022] [Accepted: 09/28/2022] [Indexed: 12/13/2022]
Abstract
Knowledge of protein-ligand interactions is beneficial for biological process analysis and drug design. Given the complexity of the interactions and the inadequacy of experimental data, accurate ligand binding residue and pocket prediction remains challenging. In this study, we introduce an easy-to-use web server BindWeb for ligand-specific and ligand-general binding residue and pocket prediction from protein structures. BindWeb integrates a graph neural network GraphBind with a hybrid convolutional neural network and bidirectional long short-term memory network DELIA to identify binding residues. Furthermore, BindWeb clusters the predicted binding residues to binding pockets with mean shift clustering. The experimental results and case study demonstrate that BindWeb benefits from the complementarity of two base methods. BindWeb is freely available for academic use at http://www.csbio.sjtu.edu.cn/bioinf/BindWeb/.
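The clustering step mentioned in this abstract can be illustrated with a minimal, standard-library sketch of flat-kernel mean shift. This is not BindWeb's code: the abstract does not state the kernel, the bandwidth, or which coordinates are clustered, so the function name `mean_shift`, the `bandwidth` value, and the toy residue coordinates below are all assumptions made for illustration.

```python
import math


def mean_shift(points, bandwidth, max_iter=100, tol=1e-4):
    """Flat-kernel mean shift; returns (labels, centers). Illustrative only."""
    modes = [tuple(p) for p in points]
    for _ in range(max_iter):
        moved = 0.0
        new_modes = []
        for m in modes:
            # Points within the bandwidth of the current mode estimate.
            near = [p for p in points if math.dist(m, p) <= bandwidth]
            centroid = tuple(sum(c) / len(near) for c in zip(*near))
            moved = max(moved, math.dist(m, centroid))
            new_modes.append(centroid)
        modes = new_modes
        if moved < tol:
            break
    # Merge converged modes that lie within half a bandwidth of each other.
    centers, labels = [], []
    for m in modes:
        for i, c in enumerate(centers):
            if math.dist(m, c) <= bandwidth / 2:
                labels.append(i)
                break
        else:
            centers.append(m)
            labels.append(len(centers) - 1)
    return labels, centers


# Hypothetical coordinates (in angstroms) of predicted binding residues.
residues = [(0, 0, 0), (1, 0, 0), (0, 1, 0),
            (20, 20, 20), (21, 20, 20), (20, 21, 20)]
labels, centers = mean_shift(residues, bandwidth=4.0)
print(labels)  # one pocket label per residue
```

Mean shift does not need the number of pockets in advance, which is presumably why it suits this task; the bandwidth controls how far apart two residues may be while still being assigned to the same pocket.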
Affiliation(s)
- Ying Xia
  - Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, and Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai, China
- Chunqiu Xia
  - Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, and Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai, China
- Xiaoyong Pan
  - Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, and Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai, China
- Hong‐Bin Shen
  - Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, and Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai, China
|
4
|
Esperante S, Alvarez-Paggi D, Salgueiro M, Desimone M, de Oliveira G, Arán M, García-Pardo J, Aptekmann A, Ventura S, Alonso L, de Prat-Gay G. A finely tuned interplay between calcium binding, ionic strength and pH modulates conformational and oligomerization equilibria in the Respiratory Syncytial Virus Matrix (M) protein. Arch Biochem Biophys 2022; 731:109424. [DOI: 10.1016/j.abb.2022.109424] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/30/2022] [Revised: 09/14/2022] [Accepted: 09/29/2022] [Indexed: 11/30/2022]
|
5
|
Yang L, He W, Yun Y, Gao Y, Zhu Z, Teng M, Liang Z, Niu L. Defining A Global Map of Functional Group-based 3D Ligand-binding Motifs. Genomics, Proteomics & Bioinformatics 2022; 20:765-779. [PMID: 35288344 PMCID: PMC9881048 DOI: 10.1016/j.gpb.2021.08.014] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/22/2020] [Revised: 06/30/2021] [Accepted: 09/27/2021] [Indexed: 01/31/2023]
Abstract
Uncovering conserved 3D protein-ligand binding patterns on the basis of functional groups (FGs) shared by a variety of small molecules can greatly expand our knowledge of protein-ligand interactions. Although conserved binding patterns for a few commonly used FGs have been reported in the literature, large-scale identification and evaluation of FG-based 3D binding motifs are still lacking. Here, we propose a computational method, Automatic FG-based Three-dimensional Motif Extractor (AFTME), for automatic mapping of 3D motifs to different FGs of a specific ligand. Applying our method to 233 naturally-occurring ligands, we define 481 FG-binding motifs that are highly conserved across different ligand-binding pockets. Systematic analysis further reveals four main classes of binding motifs corresponding to distinct sets of FGs. Combinations of FG-binding motifs facilitate the binding of proteins to a wide spectrum of ligands with various binding affinities. Finally, we show that our FG-motif map can be used to nominate FGs that potentially bind to specific drug targets, thus providing useful insights and guidance for rational design of small-molecule drugs.
Affiliation(s)
- Liu Yang
  - School of Life Sciences, Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei 230026, China; Division of Molecular and Cellular Biophysics, Hefei National Laboratory for Physical Sciences at the Microscale, Hefei 230026, China
- Wei He
  - School of Life Sciences, Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei 230026, China; Division of Molecular and Cellular Biophysics, Hefei National Laboratory for Physical Sciences at the Microscale, Hefei 230026, China
- Yuehui Yun
  - School of Life Sciences, Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei 230026, China; Division of Molecular and Cellular Biophysics, Hefei National Laboratory for Physical Sciences at the Microscale, Hefei 230026, China
- Yongxiang Gao
  - School of Life Sciences, Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei 230026, China; Division of Molecular and Cellular Biophysics, Hefei National Laboratory for Physical Sciences at the Microscale, Hefei 230026, China
- Zhongliang Zhu
  - School of Life Sciences, Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei 230026, China; Division of Molecular and Cellular Biophysics, Hefei National Laboratory for Physical Sciences at the Microscale, Hefei 230026, China
- Maikun Teng
  - School of Life Sciences, Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei 230026, China; Division of Molecular and Cellular Biophysics, Hefei National Laboratory for Physical Sciences at the Microscale, Hefei 230026, China
- Zhi Liang
  - School of Life Sciences, Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei 230026, China; Division of Molecular and Cellular Biophysics, Hefei National Laboratory for Physical Sciences at the Microscale, Hefei 230026, China
- Liwen Niu
  - School of Life Sciences, Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei 230026, China; Division of Molecular and Cellular Biophysics, Hefei National Laboratory for Physical Sciences at the Microscale, Hefei 230026, China
|
6
|
Wei J, Chen S, Zong L, Gao X, Li Y. Protein-RNA interaction prediction with deep learning: structure matters. Brief Bioinform 2022; 23:bbab540. [PMID: 34929730 PMCID: PMC8790951 DOI: 10.1093/bib/bbab540] [Citation(s) in RCA: 28] [Impact Index Per Article: 14.0] [Received: 09/29/2021] [Revised: 11/14/2021] [Accepted: 11/22/2021] [Indexed: 12/11/2022]
Abstract
Protein-RNA interactions are of vital importance to a variety of cellular activities. Both experimental and computational techniques have been developed to study the interactions. Because of the limitation of the previous database, especially the lack of protein structure data, most of the existing computational methods rely heavily on the sequence data, with only a small portion of the methods utilizing the structural information. Recently, AlphaFold has revolutionized the entire protein and biology field. Foreseeably, the protein-RNA interaction prediction will also be promoted significantly in the upcoming years. In this work, we give a thorough review of this field, surveying both the binding site and binding preference prediction problems and covering the commonly used datasets, features and models. We also point out the potential challenges and opportunities in this field. This survey summarizes the development of the RNA-binding protein-RNA interaction field in the past and foresees its future development in the post-AlphaFold era.
Affiliation(s)
- Junkang Wei
  - Department of Computer Science and Engineering (CSE), The Chinese University of Hong Kong (CUHK), 999077, Hong Kong SAR, China
- Siyuan Chen
  - Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), 23955-6900, Thuwal, Saudi Arabia
- Licheng Zong
  - Department of Computer Science and Engineering (CSE), The Chinese University of Hong Kong (CUHK), 999077, Hong Kong SAR, China
- Xin Gao
  - Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), 23955-6900, Thuwal, Saudi Arabia
- Yu Li
  - Department of Computer Science and Engineering (CSE), The Chinese University of Hong Kong (CUHK), 999077, Hong Kong SAR, China
  - The CUHK Shenzhen Research Institute, Hi-Tech Park, 518057, Shenzhen, China
|
7
|
Bromberg Y, Aptekmann AA, Mahlich Y, Cook L, Senn S, Miller M, Nanda V, Ferreiro DU, Falkowski PG. Quantifying structural relationships of metal-binding sites suggests origins of biological electron transfer. Science Advances 2022; 8:eabj3984. [PMID: 35030025 PMCID: PMC8759750 DOI: 10.1126/sciadv.abj3984] [Citation(s) in RCA: 15] [Impact Index Per Article: 7.5] [Received: 05/10/2021] [Accepted: 11/22/2021] [Indexed: 06/07/2023]
Abstract
Biological redox reactions drive planetary biogeochemical cycles. Using a novel, structure-guided sequence analysis of proteins, we explored the patterns of evolution of enzymes responsible for these reactions. Our analysis reveals that the folds that bind transition metal–containing ligands have similar structural geometry and amino acid sequences across the full diversity of proteins. Similarity across folds reflects the availability of key transition metals over geological time and strongly suggests that transition metal–ligand binding had a small number of common peptide origins. We observe that structures central to our similarity network come primarily from oxidoreductases, suggesting that ancestral peptides may have also facilitated electron transfer reactions. Last, our results reveal that the earliest biologically functional peptides were likely available before the assembly of fully functional protein domains over 3.8 billion years ago.
Thus, life is a special, very complex form of motion of matter, but this form did not always exist, and it is not separated from inorganic nature by an impassable abyss; rather, it arose from inorganic nature as a new property in the process of evolution of the world. We must study the history of this evolution if we want to solve the problem of the origin of life. [A. I. Oparin (1)]
Affiliation(s)
- Yana Bromberg
  - Department of Biochemistry and Microbiology, Rutgers University, 76 Lipman Dr, New Brunswick, NJ 08873, USA
- Ariel A. Aptekmann
  - Department of Biochemistry and Microbiology, Rutgers University, 76 Lipman Dr, New Brunswick, NJ 08873, USA
- Yannick Mahlich
  - Department of Biochemistry and Microbiology, Rutgers University, 76 Lipman Dr, New Brunswick, NJ 08873, USA
- Linda Cook
  - Program in Applied and Computational Math, Princeton University, Princeton, NJ 08540, USA
- Stefan Senn
  - Department of Biochemistry and Microbiology, Rutgers University, 76 Lipman Dr, New Brunswick, NJ 08873, USA
- Maximillian Miller
  - Department of Biochemistry and Microbiology, Rutgers University, 76 Lipman Dr, New Brunswick, NJ 08873, USA
- Vikas Nanda
  - Department of Biochemistry and Molecular Biology, Robert Wood Johnson Medical School, and Center for Advanced Biotechnology and Medicine, Rutgers University, Piscataway, NJ 08854, USA
- Diego U. Ferreiro
  - Protein Physiology Lab, Departamento de Química Biológica, Instituto de Química Biológica de la Facultad de Ciencias Exactas y Naturales (IQUIBICEN-CONICET), Universidad de Buenos Aires, Buenos Aires, Argentina
- Paul G. Falkowski
  - Environmental Biophysics and Molecular Ecology Program, Department of Marine and Coastal Sciences, Rutgers University, New Brunswick, NJ 08901, USA
|
8
|
Shah HA, Liu J, Yang Z, Feng J. Review of Machine Learning Methods for the Prediction and Reconstruction of Metabolic Pathways. Front Mol Biosci 2021; 8:634141. [PMID: 34222327 PMCID: PMC8247443 DOI: 10.3389/fmolb.2021.634141] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 11/27/2020] [Accepted: 06/01/2021] [Indexed: 11/13/2022]
Abstract
Prediction and reconstruction of metabolic pathways play significant roles in many fields, such as genetic engineering, metabolic engineering and drug discovery, and are becoming some of the most active research topics in synthetic biology. With the increase in related data and the development of machine learning techniques, many machine learning-based methods have been proposed for the prediction or reconstruction of metabolic pathways. Machine learning techniques are showing state-of-the-art performance in handling the rapidly increasing volume of data in synthetic biology. To support researchers in this field, we briefly review the research progress of metabolic pathway reconstruction and prediction based on machine learning. Some challenging issues in the reconstruction of metabolic pathways are also discussed in this paper.
Affiliation(s)
- Hayat Ali Shah
  - Institute of Artificial Intelligence, School of Computer Science, Wuhan University, Wuhan, China
- Juan Liu
  - Institute of Artificial Intelligence, School of Computer Science, Wuhan University, Wuhan, China
- Zhihui Yang
  - Institute of Artificial Intelligence, School of Computer Science, Wuhan University, Wuhan, China
- Jing Feng
  - Institute of Artificial Intelligence, School of Computer Science, Wuhan University, Wuhan, China
|
9
|
Toth JM, DePietro PJ, Haas J, McLaughlin WA. ResiRole: residue-level functional site predictions to gauge the accuracies of protein structure prediction techniques. Bioinformatics 2021; 37:351-359. [PMID: 32780798 PMCID: PMC8058773 DOI: 10.1093/bioinformatics/btaa712] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 12/12/2019] [Revised: 07/31/2020] [Accepted: 08/05/2020] [Indexed: 11/25/2022]
Abstract
Motivation Methods to assess the quality of protein structure models are needed for user applications. To aid with the selection of structure models and further inform the development of structure prediction techniques, we describe the ResiRole method for the assessment of the quality of structure models. Results Structure prediction techniques are ranked according to the results of round-robin, head-to-head comparisons using difference scores. Each difference score was defined as the absolute value of the cumulative probability for a functional site prediction made with the FEATURE program for the reference structure minus that for the structure model. Overall, the difference scores correlate well with other model quality metrics; and based on benchmarking studies with NaïveBLAST, they are found to detect additional local structural similarities between the structure models and reference structures. Availability and implementation Automated analyses of models addressed in CAMEO are available via the ResiRole server, URL http://protein.som.geisinger.edu/ResiRole/. Interactive analyses with user-provided models and reference structures are also enabled. Code is available at github.com/wamclaughlin/ResiRole. Supplementary information Supplementary data are available at Bioinformatics online.
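The difference score defined in this abstract, and a round-robin ranking built on it, can be sketched as follows. Only the score definition (absolute difference of two cumulative probabilities) comes from the abstract; the win rule (lower mean difference over shared sites), the function names, and the probability values are assumptions for illustration, not the ResiRole implementation.

```python
from itertools import combinations


def difference_score(p_reference, p_model):
    """|P(reference) - P(model)| for one functional site prediction."""
    return abs(p_reference - p_model)


def round_robin_wins(scores_by_technique):
    """Count head-to-head wins; input is {technique: {site_id: score}}.

    Assumed rule: for each pair, the technique with the lower mean
    difference score over the sites both predicted gets one win.
    """
    wins = {name: 0 for name in scores_by_technique}
    for a, b in combinations(scores_by_technique, 2):
        shared = scores_by_technique[a].keys() & scores_by_technique[b].keys()
        if not shared:
            continue  # no common sites, so no comparison is possible
        mean_a = sum(scores_by_technique[a][s] for s in shared) / len(shared)
        mean_b = sum(scores_by_technique[b][s] for s in shared) / len(shared)
        if mean_a < mean_b:
            wins[a] += 1
        elif mean_b < mean_a:
            wins[b] += 1
    return wins


# Hypothetical cumulative probabilities for two sites.
reference = {"site1": 0.9, "site2": 0.2}
models = {
    "A": {"site1": 0.8, "site2": 0.3},
    "B": {"site1": 0.5, "site2": 0.6},
}
scores = {
    name: {s: difference_score(reference[s], p) for s, p in preds.items()}
    for name, preds in models.items()
}
wins = round_robin_wins(scores)
print(wins)
```

Restricting each comparison to shared sites keeps a head-to-head result from being driven by targets that only one technique modeled.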
Affiliation(s)
- Joshua M Toth
  - Department of Medical Education, Geisinger Commonwealth School of Medicine, Scranton, PA 18510, USA
- Paul J DePietro
  - Department of Medical Education, Geisinger Commonwealth School of Medicine, Scranton, PA 18510, USA
- Juergen Haas
  - Biozentrum, University of Basel and SIB Swiss Institute of Bioinformatics, CH-4056 Basel, Switzerland
- William A McLaughlin
  - Department of Medical Education, Geisinger Commonwealth School of Medicine, Scranton, PA 18510, USA
|
10
|
Babbi G, Savojardo C, Martelli PL, Casadio R. Huntingtin: A Protein with a Peculiar Solvent Accessible Surface. Int J Mol Sci 2021; 22:ijms22062878. [PMID: 33809039 PMCID: PMC8001614 DOI: 10.3390/ijms22062878] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 01/29/2021] [Revised: 03/04/2021] [Accepted: 03/04/2021] [Indexed: 11/30/2022]
Abstract
Taking advantage of the last cryogenic electron microscopy structure of human huntingtin, we explored with computational methods its physicochemical properties, focusing on the solvent accessible surface of the protein and highlighting a quite interesting mix of hydrophobic and hydrophilic patterns, with the prevalence of the latter ones. We then evaluated the probability of exposed residues to be in contact with other proteins, discovering that they tend to cluster in specific regions of the protein. We then found that the remaining portions of the protein surface can contain calcium-binding sites that we propose here as putative mediators for the protein to interact with membranes. Our findings are justified in relation to the present knowledge of huntingtin functional annotation.
Affiliation(s)
- Giulia Babbi
  - Biocomputing Group, University of Bologna, Via San Giacomo 9/2, 40126 Bologna, Italy
- Castrense Savojardo
  - Biocomputing Group, University of Bologna, Via San Giacomo 9/2, 40126 Bologna, Italy
- Pier Luigi Martelli
  - Biocomputing Group, University of Bologna, Via San Giacomo 9/2, 40126 Bologna, Italy
  - Correspondence: Tel.: +39-051-2094005
- Rita Casadio
  - Biocomputing Group, University of Bologna, Via San Giacomo 9/2, 40126 Bologna, Italy
  - Institute of Biomembranes, Bioenergetics and Molecular Biotechnologies, National Research Council, Via Giovanni Amendola 122/O, 70126 Bari, Italy
|
11
|
Ponzoni L, Peñaherrera DA, Oltvai ZN, Bahar I. Rhapsody: predicting the pathogenicity of human missense variants. Bioinformatics 2020; 36:3084-3092. [PMID: 32101277 PMCID: PMC7214033 DOI: 10.1093/bioinformatics/btaa127] [Citation(s) in RCA: 48] [Impact Index Per Article: 12.0] [Received: 08/06/2019] [Revised: 12/27/2019] [Accepted: 02/21/2020] [Indexed: 12/22/2022]
Abstract
MOTIVATION The biological effects of human missense variants have been studied experimentally for decades but predicting their effects in clinical molecular diagnostics remains challenging. Available computational tools are usually based on the analysis of sequence conservation and structural properties of the mutant protein. We recently introduced a new machine learning method that demonstrated for the first time the significance of protein dynamics in determining the pathogenicity of missense variants. RESULTS Here, we present a new interface (Rhapsody) that enables fully automated assessment of pathogenicity, incorporating both sequence coevolution data and structure- and dynamics-based features. Benchmarked against a dataset of about 20 000 annotated variants, the methodology is shown to outperform well-established and/or advanced prediction tools. We illustrate the utility of Rhapsody by in silico saturation mutagenesis studies of human H-Ras, phosphatase and tensin homolog and thiopurine S-methyltransferase. AVAILABILITY AND IMPLEMENTATION The new tool is available both as an online webserver at http://rhapsody.csb.pitt.edu and as an open-source Python package (GitHub repository: https://github.com/prody/rhapsody; PyPI package installation: pip install prody-rhapsody). Links to additional resources, tutorials and package documentation are provided in the 'Python package' section of the website. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Luca Ponzoni
  - Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, PA 15260, USA
- Daniel A Peñaherrera
  - Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, PA 15260, USA
- Zoltán N Oltvai
  - Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, PA 15260, USA; Department of Pathology, University of Pittsburgh, Pittsburgh, PA 15261, USA; Department of Laboratory Medicine and Pathology, University of Minnesota, Minneapolis, MN 55455, USA
- Ivet Bahar
  - Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, PA 15260, USA
|
12
|
Du Z, He Y, Li J, Uversky VN. DeepAdd: Protein function prediction from k-mer embedding and additional features. Comput Biol Chem 2020; 89:107379. [PMID: 33011616 DOI: 10.1016/j.compbiolchem.2020.107379] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 10/13/2019] [Revised: 09/15/2020] [Accepted: 09/17/2020] [Indexed: 10/23/2022]
Abstract
With the application of new high-throughput sequencing technology, a large number of protein sequences are becoming available. Determining the functional characteristics of these proteins by experiment is an expensive endeavor that requires a lot of time. Furthermore, at the organismal level, such experimental functional analyses can be conducted only for a very few selected model organisms. Computational function prediction methods can be used to fill this gap. The functions of proteins are classified by Gene Ontology (GO), which contains more than 40,000 classifications in three domains: Molecular Function (MF), Biological Process (BP), and Cellular Component (CC). Additionally, since proteins have many functions, function prediction represents a multi-label and multi-class problem. We developed a new method to predict protein function from sequence. To this end, a natural language model was used to generate word embeddings of the sequence, from which features were learned by deep learning, together with additional features to locate every protein. Our method uses the dependencies between GO classes as background information to construct a deep learning model. We evaluate our method using the standards established by the Computational Assessment of Function Annotation (CAFA) and observe a noticeable improvement over several algorithms, such as FFPred, DeepGO, GoFDR and other methods compared on the CAFA3 datasets.
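The first step implied by a k-mer embedding, turning a protein sequence into overlapping k-mer "words" and integer indices that an embedding layer could consume, can be sketched as below. The abstract does not give the k-mer length, vocabulary handling, or embedding model, so `k=3`, the function names, the reserved index 0, and the example sequences are all assumptions; this is not the DeepAdd code.

```python
def kmers(sequence, k=3):
    """Split a sequence into overlapping k-mers (stride 1)."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def build_vocab(sequences, k=3):
    """Assign each distinct k-mer an index; 0 is reserved for padding/unknown."""
    vocab = {}
    for seq in sequences:
        for kmer in kmers(seq, k):
            vocab.setdefault(kmer, len(vocab) + 1)
    return vocab


def encode(sequence, vocab, k=3):
    """Map a sequence to k-mer indices; unseen k-mers become 0."""
    return [vocab.get(kmer, 0) for kmer in kmers(sequence, k)]


# Hypothetical training sequence and query.
vocab = build_vocab(["MKTAY"])
encoded = encode("KTAYW", vocab)
print(kmers("MKTAY"))
print(encoded)
```

Treating k-mers as words lets a protein sequence be handled like a sentence, so the resulting index lists could be passed to any standard word-embedding method.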
Affiliation(s)
- Zhihua Du
  - Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen University, Guangdong Province, PR China
- Yufeng He
  - Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen University, Guangdong Province, PR China
- Jianqiang Li
  - Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen University, Guangdong Province, PR China
- Vladimir N Uversky
  - Department of Molecular Medicine, Morsani College of Medicine, University of South Florida, 12901 Bruce B. Downs Blvd. MDC07, Tampa, FL, USA; USF Health Byrd Alzheimer's Research Institute, Morsani College of Medicine, University of South Florida, 12901 Bruce B. Downs Blvd. MDC07, Tampa, FL, USA; Laboratory of New Methods in Biology, Institute for Biological Instrumentation, Russian Academy of Sciences, Institutskaya Str., 7, Pushchino, Moscow Region, 142290, Russia
| |
Collapse
|
13
|
A deep learning framework to predict binding preference of RNA constituents on protein surface. Nat Commun 2019; 10:4941. [PMID: 31666519 PMCID: PMC6821705 DOI: 10.1038/s41467-019-12920-0] [Citation(s) in RCA: 58] [Impact Index Per Article: 11.6] [Received: 05/10/2019] [Accepted: 10/08/2019] [Indexed: 12/21/2022] Open
Abstract
Protein-RNA interaction plays important roles in post-transcriptional regulation. However, the task of predicting these interactions given a protein structure is difficult. Here we show that, by leveraging a deep learning model NucleicNet, attributes such as binding preference of RNA backbone constituents and different bases can be predicted from local physicochemical characteristics of protein structure surface. On a diverse set of challenging RNA-binding proteins, including Fem-3-binding-factor 2, Argonaute 2 and Ribonuclease III, NucleicNet can accurately recover interaction modes discovered by structural biology experiments. Furthermore, we show that, without seeing any in vitro or in vivo assay data, NucleicNet can still achieve consistency with experiments, including RNAcompete, Immunoprecipitation Assay, and siRNA Knockdown Benchmark. NucleicNet can thus serve to provide quantitative fitness of RNA sequences for given binding pockets or to predict potential binding pockets and binding RNAs for previously unknown RNA binding proteins.
14
Gao R, Wang M, Zhou J, Fu Y, Liang M, Guo D, Nie J. Prediction of Enzyme Function Based on Three Parallel Deep CNN and Amino Acid Mutation. Int J Mol Sci 2019; 20:E2845. [PMID: 31212665 PMCID: PMC6600291 DOI: 10.3390/ijms20112845] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Received: 05/05/2019] [Revised: 06/03/2019] [Accepted: 06/04/2019] [Indexed: 01/28/2023] Open
Abstract
During the past decade, as the number of proteins in the PDB database has gradually increased, traditional methods have struggled to characterize the function of newly discovered enzymes in chemical reactions. Computational models and protein feature representations for predicting enzymatic function have therefore become more important. Most existing methods for predicting enzymatic function have used protein geometric structure or protein sequence alone. In this paper, the functions of enzymes are predicted from multiple sources of biological information, including sequence information and structure information. First, we extract the mutation information from the amino acid sequence using the position scoring matrix and express structure information with amino acid distances and angles. Then, we use histograms to represent the extracted sequence and structural features, respectively. Meanwhile, we establish a network model of three parallel Deep Convolutional Neural Networks (DCNN) to learn three features of the enzyme for function prediction simultaneously, and the outputs are fused through two different architectures. Finally, the proposed model was evaluated on a large dataset of 43,843 enzymes from the PDB and achieved 92.34% correct classification when sequence information is considered, demonstrating an improvement compared with the previous result.
Affiliation(s)
- Ruibo Gao
  - School of Information Science and Engineering, Yanshan University, Qinhuangdao 066004, Hebei, China.
- Mengmeng Wang
  - School of Information Science and Engineering, Yanshan University, Qinhuangdao 066004, Hebei, China.
- Jiaoyan Zhou
  - School of Information Science and Engineering, Yanshan University, Qinhuangdao 066004, Hebei, China.
- Yuhang Fu
  - School of Information Science and Engineering, Yanshan University, Qinhuangdao 066004, Hebei, China.
- Meng Liang
  - School of Information Science and Engineering, Yanshan University, Qinhuangdao 066004, Hebei, China.
- Dongliang Guo
  - School of Information Science and Engineering, Yanshan University, Qinhuangdao 066004, Hebei, China.
- Junlan Nie
  - School of Information Science and Engineering, Yanshan University, Qinhuangdao 066004, Hebei, China.
15
Romano JD, Tatonetti NP. Informatics and Computational Methods in Natural Product Drug Discovery: A Review and Perspectives. Front Genet 2019; 10:368. [PMID: 31114606 PMCID: PMC6503039 DOI: 10.3389/fgene.2019.00368] [Citation(s) in RCA: 59] [Impact Index Per Article: 11.8] [Received: 12/12/2018] [Accepted: 04/05/2019] [Indexed: 12/17/2022] Open
Abstract
The discovery of new pharmaceutical drugs is one of the preeminent tasks (scientifically, economically, and socially) in biomedical research. Advances in informatics and computational biology have increased productivity at many stages of the drug discovery pipeline. Nevertheless, drug discovery has slowed, largely due to the reliance on small molecules as the primary source of novel hypotheses. Natural products (such as plant metabolites, animal toxins, and immunological components) comprise a vast and diverse source of bioactive compounds, some of which are supported by thousands of years of traditional medicine, and are largely disjoint from the set of small molecules used commonly for discovery. However, natural products possess unique characteristics that distinguish them from traditional small molecule drug candidates, requiring new methods and approaches for assessing their therapeutic potential. In this review, we investigate a number of state-of-the-art techniques in bioinformatics, cheminformatics, and knowledge engineering for data-driven drug discovery from natural products. We focus on methods that aim to bridge the gap between traditional small-molecule drug candidates and different classes of natural products. We also explore the current informatics knowledge gaps and other barriers that need to be overcome to fully leverage these compounds for drug discovery. Finally, we conclude with a "road map" of research priorities that seeks to realize this goal.
Affiliation(s)
- Joseph D. Romano
  - Department of Biomedical Informatics, Columbia University, New York, NY, United States
  - Department of Systems Biology, Columbia University, New York, NY, United States
  - Department of Medicine, Columbia University, New York, NY, United States
  - Data Science Institute, Columbia University, New York, NY, United States
- Nicholas P. Tatonetti
  - Department of Biomedical Informatics, Columbia University, New York, NY, United States
  - Department of Systems Biology, Columbia University, New York, NY, United States
  - Department of Medicine, Columbia University, New York, NY, United States
  - Data Science Institute, Columbia University, New York, NY, United States
16
ProLanGO: Protein Function Prediction Using Neural Machine Translation Based on a Recurrent Neural Network. Molecules 2017; 22:molecules22101732. [PMID: 29039790 PMCID: PMC6151571 DOI: 10.3390/molecules22101732] [Citation(s) in RCA: 114] [Impact Index Per Article: 16.3] [Received: 08/30/2017] [Revised: 10/11/2017] [Accepted: 10/11/2017] [Indexed: 11/25/2022] Open
Abstract
With the development of next-generation sequencing techniques, it is fast and cheap to determine protein sequences but relatively slow and expensive to extract useful information from them because of the limitations of traditional biological experimental techniques. Protein function prediction has been a long-standing challenge in closing the gap between the huge number of protein sequences and the known functions. In this paper, we propose a novel method that converts the protein function prediction problem into a language translation problem, from the newly proposed protein sequence language “ProLan” to the protein function language “GOLan”, and build a neural machine translation model based on recurrent neural networks to translate “ProLan” into “GOLan”. We blindly tested our method by participating in the third Critical Assessment of Function Annotation (CAFA 3) in 2016, and also evaluated its performance on selected proteins whose functions were released after the CAFA competition. The good performance on the training and testing datasets demonstrates that our newly proposed method is a promising direction for protein function prediction. In summary, we propose, for the first time, a method that converts the protein function prediction problem into a language translation problem and applies a neural machine translation model to protein function prediction.
17
Torng W, Altman RB. 3D deep convolutional neural networks for amino acid environment similarity analysis. BMC Bioinformatics 2017; 18:302. [PMID: 28615003 PMCID: PMC5472009 DOI: 10.1186/s12859-017-1702-0] [Citation(s) in RCA: 75] [Impact Index Per Article: 10.7] [Received: 02/25/2017] [Accepted: 05/22/2017] [Indexed: 01/08/2023] Open
Abstract
Background Central to protein biology is the understanding of how structural elements give rise to observed function. The surfeit of protein structural data enables development of computational methods to systematically derive rules governing structural-functional relationships. However, performance of these methods depends critically on the choice of protein structural representation. Most current methods rely on features that are manually selected based on knowledge about protein structures. These are often general-purpose but not optimized for the specific application of interest. In this paper, we present a general framework that applies 3D convolutional neural network (3DCNN) technology to structure-based protein analysis. The framework automatically extracts task-specific features from the raw atom distribution, driven by supervised labels. As a pilot study, we use our network to analyze local protein microenvironments surrounding the 20 amino acids, and predict the amino acids most compatible with environments within a protein structure. To further validate the power of our method, we construct two amino acid substitution matrices from the prediction statistics and use them to predict effects of mutations in T4 lysozyme structures. Results Our deep 3DCNN achieves a two-fold increase in prediction accuracy compared to models that employ conventional hand-engineered features and successfully recapitulates known information about similar and different microenvironments. Models built from our predictions and substitution matrices achieve an 85% accuracy predicting outcomes of the T4 lysozyme mutation variants. Our substitution matrices contain rich information relevant to mutation analysis compared to well-established substitution matrices. Finally, we present a visualization method to inspect the individual contributions of each atom to the classification decisions. 
Conclusions End-to-end trained deep learning networks consistently outperform methods using hand-engineered features, suggesting that the 3DCNN framework is well suited for analysis of protein microenvironments and may be useful for other protein structural analyses. Electronic supplementary material The online version of this article (doi:10.1186/s12859-017-1702-0) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Wen Torng
  - Department of Bioengineering, Stanford University, Stanford, CA, 94305, USA.
- Russ B Altman
  - Department of Bioengineering, Stanford University, Stanford, CA, 94305, USA.
  - Department of Genetics, Stanford University, Stanford, CA, 94305, USA.
18
Moll M, Finn PW, Kavraki LE. Structure-guided selection of specificity determining positions in the human Kinome. BMC Genomics 2016; 17 Suppl 4:431. [PMID: 27556159 PMCID: PMC5001202 DOI: 10.1186/s12864-016-2790-3] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 11/25/2022] Open
Abstract
Background The human kinome contains many important drug targets. It is well-known that inhibitors of protein kinases bind with very different selectivity profiles. This is also the case for inhibitors of many other protein families. The increased availability of protein 3D structures has provided much information on the structural variation within a given protein family. However, the relationship between structural variations and binding specificity is complex and incompletely understood. We have developed a structural bioinformatics approach which provides an analysis of key determinants of binding selectivity as a tool to enhance the rational design of drugs with a specific selectivity profile. Results We propose a greedy algorithm that computes a subset of residue positions in a multiple sequence alignment such that structural and chemical variation in those positions helps explain known binding affinities. By providing this information, the main purpose of the algorithm is to provide experimentalists with possible insights into how the selectivity profile of certain inhibitors is achieved, which is useful for lead optimization. In addition, the algorithm can also be used to predict binding affinities for structures whose affinity for a given inhibitor is unknown. The algorithm’s performance is demonstrated using an extensive dataset for the human kinome. Conclusion We show that the binding affinity of 38 different kinase inhibitors can be explained with consistently high precision and accuracy using the variation of at most six residue positions in the kinome binding site. We show for several inhibitors that we are able to identify residues that are known to be functionally important.
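The greedy selection idea described in this abstract can be illustrated with a minimal Python sketch. This is not the authors' algorithm: the scoring function (within-group squared error of affinities after grouping sequences by their residues at the chosen alignment columns), the stopping rule, and the names `within_group_sse`, `greedy_select`, `toy_alignment` and `toy_affinities` are all assumptions made for illustration. Only the cap of six positions is taken from the abstract.

```python
from collections import defaultdict


def within_group_sse(alignment, affinities, positions):
    """Group sequences by their residues at `positions` and return the sum of
    squared deviations of each affinity from its group mean (hypothetical score)."""
    groups = defaultdict(list)
    for seq, aff in zip(alignment, affinities):
        groups[tuple(seq[p] for p in positions)].append(aff)
    sse = 0.0
    for values in groups.values():
        mean = sum(values) / len(values)
        sse += sum((v - mean) ** 2 for v in values)
    return sse


def greedy_select(alignment, affinities, max_positions=6):
    """Greedily add the alignment column that most reduces the score,
    stopping at `max_positions` or when no column improves it."""
    selected = []
    best = within_group_sse(alignment, affinities, selected)
    n_columns = len(alignment[0])
    while len(selected) < max_positions:
        candidates = [p for p in range(n_columns) if p not in selected]
        if not candidates:
            break
        score, position = min(
            (within_group_sse(alignment, affinities, selected + [p]), p)
            for p in candidates
        )
        if score >= best:  # no further improvement: stop early
            break
        selected.append(position)
        best = score
    return selected, best


# Toy data (invented): four aligned three-residue "binding sites" with affinities.
toy_alignment = ["AKT", "AKS", "GKT", "GKS"]
toy_affinities = [9.0, 9.1, 5.0, 5.2]
selected_positions, final_score = greedy_select(toy_alignment, toy_affinities)
```

On the toy data, column 0 is chosen first because it separates the high-affinity from the low-affinity sequences. A realistic score would also need to penalize overfitting, since splitting every sequence into its own group trivially drives the error to zero.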
Affiliation(s)
- Mark Moll
  - Department of Computer Science, Rice University, PO Box 1892, Houston, 77251, TX, USA.
- Paul W Finn
  - University of Buckingham, Hunter St, Buckingham, UK.
- Lydia E Kavraki
  - Department of Computer Science, Rice University, PO Box 1892, Houston, 77251, TX, USA.
19
Meng J, Wekesa JS, Shi GL, Luan YS. Protein function prediction based on data fusion and functional interrelationship. Math Biosci 2016; 274:25-32. [DOI: 10.1016/j.mbs.2016.02.001] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 09/22/2015] [Revised: 01/08/2016] [Accepted: 02/01/2016] [Indexed: 10/22/2022]
20
Zhou W, Tang GW, Altman RB. High Resolution Prediction of Calcium-Binding Sites in 3D Protein Structures Using FEATURE. J Chem Inf Model 2015; 55:1663-72. [PMID: 26226489 PMCID: PMC4731830 DOI: 10.1021/acs.jcim.5b00367] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.1] [Indexed: 11/29/2022]
Abstract
Metal-binding proteins are ubiquitous in biological systems ranging from enzymes to cell surface receptors. Among the various biologically active metal ions, calcium plays a large role in regulating cellular and physiological changes. With the increasing number of high-quality crystal structures of proteins associated with their metal ion ligands, many groups have built models to identify Ca(2+) sites in proteins, utilizing information such as structure, geometry, or homology to do the inference. We present a FEATURE-based approach in building such a model and show that our model is able to discriminate between nonsites and calcium-binding sites with a very high precision of more than 98%. We demonstrate the high specificity of our model by applying it to test sets constructed from other ions. We also introduce an algorithm to convert high scoring regions into specific site predictions and demonstrate the usage by scanning a test set of 91 calcium-binding protein structures (190 calcium sites). The algorithm has a recall of more than 93% on the test set with predictions found within 3 Å of the actual sites.
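The distance-based recall quoted at the end of this abstract (a true site counts as recovered when a prediction lies within 3 Å of it) can be sketched as follows. The function name `recall_within` and the coordinates are hypothetical. This shows only the evaluation criterion, not the FEATURE-based model or its region-to-site conversion algorithm.

```python
import math


def recall_within(true_sites, predicted_sites, cutoff=3.0):
    """Fraction of true sites that have at least one predicted site within
    `cutoff` (same length unit as the coordinates, e.g. angstroms)."""
    if not true_sites:
        raise ValueError("at least one true site is required")
    recovered = sum(
        1
        for site in true_sites
        if any(math.dist(site, pred) <= cutoff for pred in predicted_sites)
    )
    return recovered / len(true_sites)


# Invented coordinates: the first true site has a prediction about 1.7 units
# away, the second has none nearby.
example_true = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
example_pred = [(1.0, 1.0, 1.0), (20.0, 0.0, 0.0)]
example_recall = recall_within(example_true, example_pred)
```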
Affiliation(s)
- Weizhuang Zhou
  - Department of Bioengineering, Stanford University, 443 Via Ortega, Stanford, California 94305-4145, United States
- Grace W Tang
  - Department of Bioengineering, Stanford University, 443 Via Ortega, Stanford, California 94305-4145, United States
- Russ B Altman
  - Department of Bioengineering, Stanford University, 443 Via Ortega, Stanford, California 94305-4145, United States
21
Chartier M, Najmanovich R. Detection of Binding Site Molecular Interaction Field Similarities. J Chem Inf Model 2015; 55:1600-15. [PMID: 26158641 DOI: 10.1021/acs.jcim.5b00333] [Citation(s) in RCA: 33] [Impact Index Per Article: 3.7] [Indexed: 01/20/2023]
Abstract
Protein binding-site similarity detection methods can be used to predict protein function and understand molecular recognition, as a tool in drug design for drug repurposing and polypharmacology, and for the prediction of the molecular determinants of drug toxicity. Here, we present IsoMIF, a method able to identify binding site molecular interaction field similarities across protein families. IsoMIF utilizes six chemical probes and the detection of subgraph isomorphisms to identify geometrically and chemically equivalent sections of protein cavity pairs. The method is validated using six distinct data sets, four of those previously used in the validation of other methods. The mean area under the receiver operator curve (AUC) obtained across data sets for IsoMIF is higher than those of other methods. Furthermore, while IsoMIF obtains consistently high AUC values across data sets, other methods perform more erratically across data sets. IsoMIF can be used to predict function from structure, to detect potential cross-reactivity or polypharmacology targets, and to help suggest bioisosteric replacements to known binding molecules. Given that IsoMIF detects spatial patterns of molecular interaction field similarities, its predictions are directly related to pharmacophores and may be readily translated into modeling decisions in structure-based drug design. IsoMIF may in principle detect similar binding sites with distinct amino acid arrangements that lead to equivalent interactions within the cavity. The source code to calculate and visualize MIFs and MIF similarities are freely available.
Affiliation(s)
- Matthieu Chartier
  - Department of Biochemistry, Faculty of Medicine and Health Sciences, University of Sherbrooke, 12e Avenue Nord, Sherbrooke, J1H 5N4 Québec, Canada
- Rafael Najmanovich
  - Department of Biochemistry, Faculty of Medicine and Health Sciences, University of Sherbrooke, 12e Avenue Nord, Sherbrooke, J1H 5N4 Québec, Canada
22
Tang GW, Altman RB. Knowledge-based fragment binding prediction. PLoS Comput Biol 2014; 10:e1003589. [PMID: 24762971 PMCID: PMC3998881 DOI: 10.1371/journal.pcbi.1003589] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.2] [Received: 10/30/2013] [Accepted: 03/11/2014] [Indexed: 11/18/2022] Open
Abstract
Target-based drug discovery must assess many drug-like compounds for potential activity. Focusing on low-molecular-weight compounds (fragments) can dramatically reduce the chemical search space. However, approaches for determining protein-fragment interactions have limitations. Experimental assays are time-consuming, expensive, and not always applicable. At the same time, computational approaches using physics-based methods have limited accuracy. With increasing high-resolution structural data for protein-ligand complexes, there is now an opportunity for data-driven approaches to fragment binding prediction. We present FragFEATURE, a machine learning approach to predict small molecule fragments preferred by a target protein structure. We first create a knowledge base of protein structural environments annotated with the small molecule substructures they bind. These substructures have low-molecular weight and serve as a proxy for fragments. FragFEATURE then compares the structural environments within a target protein to those in the knowledge base to retrieve statistically preferred fragments. It merges information across diverse ligands with shared substructures to generate predictions. Our results demonstrate FragFEATURE's ability to rediscover fragments corresponding to the ligand bound with 74% precision and 82% recall on average. For many protein targets, it identifies high scoring fragments that are substructures of known inhibitors. FragFEATURE thus predicts fragments that can serve as inputs to fragment-based drug design or serve as refinement criteria for creating target-specific compound libraries for experimental or computational screening.
Affiliation(s)
- Grace W. Tang
  - Department of Bioengineering, Stanford University, Stanford, California, United States of America
- Russ B. Altman
  - Department of Bioengineering, Stanford University, Stanford, California, United States of America
  - Department of Genetics, Stanford University, Stanford, California, United States of America
23
Buturovic L, Wong M, Tang GW, Altman RB, Petkovic D. High precision prediction of functional sites in protein structures. PLoS One 2014; 9:e91240. [PMID: 24632601 PMCID: PMC3954699 DOI: 10.1371/journal.pone.0091240] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.1] [Received: 09/26/2013] [Accepted: 02/11/2014] [Indexed: 11/29/2022] Open
Abstract
We address the problem of assigning biological function to solved protein structures. Computational tools play a critical role in identifying potential active sites and informing screening decisions for further lab analysis. A critical parameter in the practical application of computational methods is the precision, or positive predictive value. Precision measures the level of confidence the user should have in a particular computed functional assignment. Low precision annotations lead to futile laboratory investigations and waste scarce research resources. In this paper we describe an advanced version of the protein function annotation system FEATURE, which achieved 99% precision and average recall of 95% across 20 representative functional sites. The system uses a Support Vector Machine classifier operating on the microenvironment of physicochemical features around an amino acid. We also compared performance of our method with state-of-the-art sequence-level annotator Pfam in terms of precision, recall and localization. To our knowledge, no other functional site annotator has been rigorously evaluated against these key criteria. The software and predictive models are incorporated into the WebFEATURE service at http://feature.stanford.edu/wf4.0-beta.
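The two metrics this abstract relies on can be written out explicitly. The sketch below uses the standard definitions: precision (positive predictive value) is TP / (TP + FP) and recall is TP / (TP + FN). The counts are invented for illustration and are not taken from the paper.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) from true-positive, false-positive and
    false-negative counts; a metric with an empty denominator is reported as 0.0."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical counts: 99 correct site predictions, 1 false alarm, 5 missed sites.
example_precision, example_recall_value = precision_recall(tp=99, fp=1, fn=5)
```

With these invented counts, precision is 0.99 and recall is about 0.95: a high-precision annotator rarely sends the user to a site that is not real, while recall measures how many real sites it finds.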
Affiliation(s)
- Ljubomir Buturovic
  - Department of Computer Science, San Francisco State University, San Francisco, California, United States of America
- Mike Wong
  - Center for Computing for Life Sciences, San Francisco State University, San Francisco, California, United States of America
- Grace W. Tang
  - Department of Bioengineering, Stanford University, Stanford, California, United States of America
- Russ B. Altman
  - Department of Bioengineering, Stanford University, Stanford, California, United States of America
- Dragutin Petkovic
  - Department of Computer Science, San Francisco State University, San Francisco, California, United States of America
  - Center for Computing for Life Sciences, San Francisco State University, San Francisco, California, United States of America
24
Julfayev ES, McLaughlin RJ, Tao YP, McLaughlin WA. KB-Rank: efficient protein structure and functional annotation identification via text query. J Struct Funct Genomics 2012; 13:101-10. [PMID: 22270457 PMCID: PMC3375009 DOI: 10.1007/s10969-012-9125-7] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 10/28/2011] [Accepted: 01/07/2012] [Indexed: 12/12/2022]
Abstract
The KB-Rank tool was developed to help determine the functions of proteins. A user provides text query and protein structures are retrieved together with their functional annotation categories. Structures and annotation categories are ranked according to their estimated relevance to the queried text. The algorithm for ranking first retrieves matches between the query text and the text fields associated with the structures. The structures are next ordered by their relative content of annotations that are found to be prevalent across all the structures retrieved. An interactive web interface was implemented to navigate and interpret the relevance of the structures and annotation categories retrieved by a given search. The aim of the KB-Rank tool is to provide a means to quickly identify protein structures of interest and the annotations most relevant to the queries posed by a user. Informational and navigational searches regarding disease topics are described to illustrate the tool’s utilities. The tool is available at the URL http://protein.tcmedc.org/KB-Rank.
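The two-stage ranking described in this abstract (text matching first, then ordering by prevalent annotations) might be sketched roughly as below. This is an illustrative reading of the description, not the KB-Rank implementation. The substring matching, the normalized prevalence score, the record layout, and all identifiers and example records are assumptions.

```python
from collections import Counter


def kb_rank_sketch(query, structures):
    """structures maps an ID to {"text": str, "annotations": set of str}.
    Returns (ranked structure IDs, ranked annotation categories)."""
    terms = query.lower().split()
    # Stage 1: keep structures whose text fields match any query term.
    hits = {
        sid: rec
        for sid, rec in structures.items()
        if any(term in rec["text"].lower() for term in terms)
    }
    # Stage 2: count how prevalent each annotation is across the retrieved set.
    prevalence = Counter(a for rec in hits.values() for a in rec["annotations"])

    def score(rec):
        # Mean prevalence of a structure's annotations, scaled to [0, 1].
        if not rec["annotations"]:
            return 0.0
        total = sum(prevalence[a] for a in rec["annotations"])
        return total / (len(hits) * len(rec["annotations"]))

    ranked_structures = sorted(hits, key=lambda sid: (-score(hits[sid]), sid))
    ranked_annotations = [
        name for name, _ in sorted(prevalence.items(), key=lambda kv: (-kv[1], kv[0]))
    ]
    return ranked_structures, ranked_annotations


# Invented records for illustration only.
example_structures = {
    "S1": {"text": "kinase linked to cancer", "annotations": {"kinase activity", "ATP binding"}},
    "S2": {"text": "cancer related protease", "annotations": {"peptidase activity", "ATP binding"}},
    "S3": {"text": "ribosomal protein", "annotations": {"RNA binding"}},
    "S4": {"text": "cancer-associated kinase", "annotations": {"kinase activity", "ATP binding"}},
}
ranked_ids, ranked_terms = kb_rank_sketch("cancer", example_structures)
```

Structures whose annotations are shared by most of the retrieved set rise to the top, which matches the abstract's idea of ordering by annotations "prevalent across all the structures retrieved".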
Affiliation(s)
- Elchin S. Julfayev
  - Department of Basic Science, The Commonwealth Medical College, 525 Pine Street, Scranton, PA 18509 USA
- Ryan J. McLaughlin
  - Department of Basic Science, The Commonwealth Medical College, 525 Pine Street, Scranton, PA 18509 USA
- Yi-Ping Tao
  - Department of Chemistry and Chemical Biology, Rutgers, The State University of New Jersey, 610 Taylor Road, Piscataway, NJ 08854-8087 USA
- William A. McLaughlin
  - Department of Basic Science, The Commonwealth Medical College, 525 Pine Street, Scranton, PA 18509 USA
25
Human intuition in the quantitative age. The role of mathematics in biology is vital, but does it leave room for 'old-fashioned' observation and interpretation? EMBO Rep 2011; 12:401-4. [PMID: 21525944 DOI: 10.1038/embor.2011.57] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/09/2022] Open
26
Tang GW, Altman RB. Remote thioredoxin recognition using evolutionary conservation and structural dynamics. Structure 2011; 19:461-70. [PMID: 21481770 DOI: 10.1016/j.str.2011.02.007] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.0] [Received: 11/28/2010] [Revised: 02/06/2011] [Accepted: 02/16/2011] [Indexed: 12/25/2022]
Abstract
The thioredoxin family of oxidoreductases plays an important role in redox signaling and control of protein function. Not only are thioredoxins linked to a variety of disorders, but their stable structure has also seen application in protein engineering. Both sequence-based and structure-based tools exist for thioredoxin identification, but remote homolog detection remains a challenge. We developed a thioredoxin predictor using the approach of integrating sequence with structural information. We combined a sequence-based Hidden Markov Model (HMM) with a molecular dynamics enhanced structure-based recognition method (dynamic FEATURE, DF). This hybrid method (HMMDF) has high precision and recall (0.90 and 0.95, respectively) compared with HMM (0.92 and 0.87, respectively) and DF (0.82 and 0.97, respectively). Dynamic FEATURE is sensitive but struggles to resolve closely related protein families, while HMM identifies these evolutionary differences by compromising sensitivity. Our method applied to structural genomics targets makes a strong prediction of a novel thioredoxin.
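One way to compare the three precision and recall pairs reported in this abstract is to combine each pair into a single number. The sketch below uses the F1 score (harmonic mean of precision and recall). F1 is not reported in the abstract; it is introduced here only as an illustrative summary of the reported values.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs as reported in the abstract.
reported = {"HMMDF": (0.90, 0.95), "HMM": (0.92, 0.87), "DF": (0.82, 0.97)}

f1_by_method = {name: f1_score(p, r) for name, (p, r) in reported.items()}
ranked_methods = sorted(f1_by_method, key=f1_by_method.get, reverse=True)
```

Under this summary the hybrid method scores about 0.92, against about 0.89 for each single-source method. This is consistent with the abstract's claim that combining sequence and structure balances HMM's precision against dynamic FEATURE's sensitivity.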
Affiliation(s)
- Grace W Tang
  - Department of Bioengineering, Stanford University, Stanford, CA 94305, USA
27
Regad L, Martin J, Camproux AC. Dissecting protein loops with a statistical scalpel suggests a functional implication of some structural motifs. BMC Bioinformatics 2011; 12:247. [PMID: 21689388 PMCID: PMC3158783 DOI: 10.1186/1471-2105-12-247] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Received: 10/07/2010] [Accepted: 06/20/2011] [Indexed: 12/24/2022] Open
Abstract
Background One of the strategies for protein function annotation is to search particular structural motifs that are known to be shared by proteins with a given function. Results Here, we present a systematic extraction of structural motifs of seven residues from protein loops and we explore their correspondence with functional sites. Our approach is based on the structural alphabet HMM-SA (Hidden Markov Model - Structural Alphabet), which allows simplification of protein structures into uni-dimensional sequences, and advanced pattern statistics adapted to short sequences. Structural motifs of interest are selected by looking for structural motifs significantly over-represented in SCOP superfamilies in protein loops. We discovered two types of structural motifs significantly over-represented in SCOP superfamilies: (i) ubiquitous motifs, shared by several superfamilies and (ii) superfamily-specific motifs, over-represented in few superfamilies. A comparison of ubiquitous words with known small structural motifs shows that they contain well-described motifs as turn, niche or nest motifs. A comparison between superfamily-specific motifs and biological annotations of Swiss-Prot reveals that some of them actually correspond to functional sites involved in the binding sites of small ligands, such as ATP/GTP, NAD(P) and SAH/SAM. Conclusions Our findings show that statistical over-representation in SCOP superfamilies is linked to functional features. The detection of over-represented motifs within structures simplified by HMM-SA is therefore a promising approach for prediction of functional sites and annotation of uncharacterized proteins.
28
Regad L, Saladin A, Maupetit J, Geneix C, Camproux AC. SA-Mot: a web server for the identification of motifs of interest extracted from protein loops. Nucleic Acids Res 2011; 39:W203-9. [PMID: 21665924 PMCID: PMC3125790 DOI: 10.1093/nar/gkr410] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.7] [Indexed: 01/29/2023] Open
Abstract
The detection of functional motifs is an important step for the determination of protein functions. We present here a new web server SA-Mot (Structural Alphabet Motif) for the extraction and location of structural motifs of interest from protein loops. Contrary to other methods, SA-Mot does not focus only on functional motifs, but it extracts recurrent and conserved structural motifs involved in structural redundancy of loops. SA-Mot uses the structural word notion to extract all structural motifs from uni-dimensional sequences corresponding to loop structures. Then, SA-Mot provides a description of these structural motifs using statistics computed in the loop data set and in SCOP superfamily, sequence and structural parameters. SA-Mot results correspond to an interactive table listing all structural motifs extracted from a target structure and their associated descriptors. Using this information, users can easily locate loop regions that are important for protein folding and function. The SA-Mot web server is available at http://sa-mot.mti.univ-paris-diderot.fr.
Affiliation(s)
- Leslie Regad
- INSERM, U973, Université Paris 7-Paris Diderot, UMR-S973, MTi F-75013 Paris, France.
|
29
|
Julfayev ES, McLaughlin RJ, Tao YP, McLaughlin WA. A new approach to assess and predict the functional roles of proteins across all known structures. J Struct Funct Genomics 2011; 12:9-20. [PMID: 21445639 PMCID: PMC3089730 DOI: 10.1007/s10969-011-9105-3] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Received: 12/17/2010] [Accepted: 03/14/2011] [Indexed: 12/11/2022]
Abstract
The three-dimensional atomic structures of proteins provide information regarding their function, and codified relationships between structure and function enable the assessment of function from structure. In the current study, a new data mining tool was implemented that checks current gene ontology (GO) annotations and predicts new ones across all the protein structures available in the Protein Data Bank (PDB). The tool overcomes some of the challenges of utilizing large amounts of protein annotation and measurement information to form correspondences between protein structure and function. Protein attributes were extracted from the Structural Biology Knowledgebase and open source biological databases. Based on the presence or absence of a given set of attributes, a given protein's functional annotations were inferred. The results show that attributes derived from the three-dimensional structures of proteins improved predictions over those using attributes derived only from the primary amino acid sequence. Some predictions reflected known but not completely documented GO annotations. For example, predictions for the GO term for copper ion binding reflected information, recorded in a ligand interaction database, that a copper ion was known to interact with the protein. Other predictions were novel and require further experimental validation. These include predictions for proteins labeled as unknown function in the PDB. Two examples are a role in the regulation of transcription for the protein AF1396 from Archaeoglobus fulgidus and a role in RNA metabolism for the protein psuG from Thermotoga maritima.
Affiliation(s)
- Elchin S. Julfayev
- Department of Basic Science, The Commonwealth Medical College, 525 Pine Street, Scranton, PA 18509 USA
- Ryan J. McLaughlin
- Department of Basic Science, The Commonwealth Medical College, 525 Pine Street, Scranton, PA 18509 USA
- Yi-Ping Tao
- Department of Chemistry and Chemical Biology, Rutgers, The State University of New Jersey, 610 Taylor Road, Piscataway, NJ 08854-8087 USA
- William A. McLaughlin
- Department of Basic Science, The Commonwealth Medical College, 525 Pine Street, Scranton, PA 18509 USA
|
30
|
Moll M, Bryant DH, Kavraki LE. The LabelHash algorithm for substructure matching. BMC Bioinformatics 2010; 11:555. [PMID: 21070651 PMCID: PMC2996407 DOI: 10.1186/1471-2105-11-555] [Citation(s) in RCA: 30] [Impact Index Per Article: 2.1] [Received: 07/08/2010] [Accepted: 11/11/2010] [Indexed: 08/30/2023] Open
Abstract
Background: There is an increasing number of proteins with known structure but unknown function. Determining their function would have a significant impact on understanding diseases and designing new therapeutics. However, experimental protein function determination is expensive and very time-consuming. Computational methods can facilitate function determination by identifying proteins that have high structural and chemical similarity. Results: We present LabelHash, a novel algorithm for matching substructural motifs to large collections of protein structures. The algorithm consists of two phases. In the first phase the proteins are preprocessed in a fashion that allows for instant lookup of partial matches to any motif. In the second phase, partial matches for a given motif are expanded to complete matches. The general applicability of the algorithm is demonstrated with three different case studies. First, we show that we can accurately identify members of the enolase superfamily with a single motif. Next, we demonstrate how LabelHash can complement SOIPPA, an algorithm for motif identification and pairwise substructure alignment. Finally, a large collection of Catalytic Site Atlas motifs is used to benchmark the performance of the algorithm. LabelHash runs very efficiently in parallel; matching a motif against all proteins in the 95% sequence identity filtered non-redundant Protein Data Bank typically takes no more than a few minutes. The LabelHash algorithm is available through a web server and as a suite of standalone programs at http://labelhash.kavrakilab.org. The output of the LabelHash algorithm can be further analyzed with Chimera through a plugin that we developed for this purpose. Conclusions: LabelHash is an efficient, versatile algorithm for large-scale substructure matching. When LabelHash is running in parallel, motifs can typically be matched against the entire PDB on the order of minutes. The algorithm is able to identify functional homologs beyond the twilight zone of sequence identity and even beyond fold similarity. The three case studies presented in this paper illustrate the versatility of the algorithm.
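The two-phase idea in this abstract (index partial matches first, expand them later) can be illustrated with a toy sketch. This is not the published LabelHash implementation: the residue representation as `(label, xyz)` pairs, the function names `build_index` and `match_motif`, the distance cutoff and the tolerance are all assumptions made for illustration, and pairwise-distance agreement stands in for whatever geometric criterion the real algorithm applies.

```python
from collections import defaultdict
from itertools import permutations
from math import dist


def build_index(residues, cutoff=10.0):
    """Phase 1: hash every ordered residue pair within `cutoff` by its label pair."""
    index = defaultdict(list)
    for i, j in permutations(range(len(residues)), 2):
        (label_i, pos_i), (label_j, pos_j) = residues[i], residues[j]
        if dist(pos_i, pos_j) <= cutoff:
            index[(label_i, label_j)].append((i, j))
    return index


def match_motif(motif, residues, index, tol=1.0):
    """Phase 2: grow pair seeds from the index into complete motif matches."""

    def fits(assigned, cand):
        k = len(assigned)  # motif position the candidate would fill
        if cand in assigned or residues[cand][0] != motif[k][0]:
            return False
        # Distances to already-matched residues must agree with the motif.
        return all(
            abs(dist(residues[cand][1], residues[a][1])
                - dist(motif[k][1], motif[m][1])) <= tol
            for m, a in enumerate(assigned)
        )

    matches = []
    # Seeds are looked up by the labels of the first two motif points.
    for i, j in index.get((motif[0][0], motif[1][0]), []):
        if not fits([i], j):
            continue
        partial = [[i, j]]
        for _ in range(2, len(motif)):
            partial = [p + [c] for p in partial
                       for c in range(len(residues)) if fits(p, c)]
        matches.extend(tuple(p) for p in partial)
    return matches


# Hypothetical toy data: a His-Asp-Ser triad plus one distant residue.
protein = [("H", (0, 0, 0)), ("D", (3, 0, 0)), ("S", (0, 4, 0)), ("G", (20, 20, 20))]
motif = [("H", (10, 10, 10)), ("D", (13, 10, 10)), ("S", (10, 14, 10))]
index = build_index(protein)
print(match_motif(motif, protein, index))
```

The index is built once per structure and reused for every motif, which is the property that makes lookups of partial matches cheap in a scheme of this kind. Note that distance agreement alone cannot tell a match from its mirror image.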
Affiliation(s)
- Mark Moll
- Department of Computer Science, Rice University, Houston, TX 77005, USA.
|
31
|
Doppelt-Azeroual O, Delfaud F, Moriaud F, de Brevern AG. Fast and automated functional classification with MED-SuMo: an application on purine-binding proteins. Protein Sci 2010; 19:847-67. [PMID: 20162627 DOI: 10.1002/pro.364] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.9] [Indexed: 11/10/2022]
Abstract
Ligand-protein interactions are essential for biological processes, and precise characterization of protein binding sites is crucial to understand protein functions. MED-SuMo is a powerful technology for localizing similar local regions on protein surfaces. Its heuristic is based on a 3D representation of macromolecules using specific surface chemical features associating chemical characteristics with geometrical properties. MED-SMA is an automated and fast method to classify binding sites. It is based on MED-SuMo technology, which builds a similarity graph, and it uses the Markov Clustering algorithm. Purine binding sites are well studied as drug targets. Here, purine binding sites of the Protein Data Bank (PDB) are classified. Proteins potentially inhibited or activated through the same mechanism are gathered. Results are analyzed according to PROSITE annotations and to carefully refined functional annotations extracted from the PDB. As expected, binding sites associated with related mechanisms are gathered, for example, the small GTPases. Nevertheless, protein kinases from different kinome families are also found together, for example, Aurora-A and CDK2 proteins, which are inhibited by the same drugs. Representative examples of different clusters are presented. The effectiveness of the MED-SMA approach is demonstrated as it gathers binding sites of proteins with similar structure-activity relationships. Moreover, an efficient new protocol associates structures lacking cocrystallized ligands to the purine clusters, enabling those structures to be associated with a specific binding mechanism. Applications of this classification by binding mode similarity include target-based drug design and prediction of cross-reactivity and therefore potential toxic side effects.
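For readers unfamiliar with Markov Clustering, the following is a minimal sketch of the generic expansion/inflation loop applied to a similarity graph. It is not the MED-SMA code: the function names, the inflation value, the iteration count and the toy adjacency matrix are assumptions, and a real binding-site similarity graph would carry weighted edges produced by the comparison step.

```python
def normalize(matrix):
    """Scale every column to sum to 1 (column-stochastic matrix)."""
    n = len(matrix)
    sums = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    return [[matrix[i][j] / sums[j] for j in range(n)] for i in range(n)]


def mcl(adjacency, inflation=2.0, iters=30):
    """Generic Markov Clustering: alternate expansion and inflation."""
    n = len(adjacency)
    # Self-loops keep every column sum positive and stabilise the iteration.
    m = normalize([[adjacency[i][j] + (1.0 if i == j else 0.0)
                    for j in range(n)] for i in range(n)])
    for _ in range(iters):
        # Expansion: square the matrix (two-step random walks).
        m = [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
             for i in range(n)]
        # Inflation: element-wise power, then renormalise the columns.
        m = normalize([[v ** inflation for v in row] for row in m])
    # Rows that keep mass act as attractors; their non-zero columns form a cluster.
    clusters = set()
    for row in m:
        members = frozenset(j for j, v in enumerate(row) if v > 1e-6)
        if members:
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)


# Hypothetical similarity graph: sites 0-2 resemble each other, as do sites 3-4.
similarity = [
    [0, 1, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 1, 0],
]
print(mcl(similarity))
```

Higher inflation values tend to produce more, smaller clusters, so in practice this parameter controls the granularity of the resulting classification.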
Affiliation(s)
- Olivia Doppelt-Azeroual
- INSERM UMR-S 665, Dynamique des Structures et Interactions des Macromolécules Biologiques (DSIMB), Université Paris Diderot-Paris 7, Institut National de la Transfusion Sanguine (INTS), 6, rue Alexandre Cabanel, 75739 Paris cedex 15, France.
|
32
|
Lee T, Min H, Kim SJ, Yoon S. Application of maximin correlation analysis to classifying protein environments for function prediction. Biochem Biophys Res Commun 2010; 400:219-24. [DOI: 10.1016/j.bbrc.2010.08.042] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Received: 08/08/2010] [Accepted: 08/11/2010] [Indexed: 10/19/2022]
|
33
|
Bryant DH, Moll M, Chen BY, Fofanov VY, Kavraki LE. Analysis of substructural variation in families of enzymatic proteins with applications to protein function prediction. BMC Bioinformatics 2010; 11:242. [PMID: 20459833 PMCID: PMC2885373 DOI: 10.1186/1471-2105-11-242] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.9] [Received: 09/13/2009] [Accepted: 05/11/2010] [Indexed: 12/02/2022] Open
Abstract
Background: Structural variations caused by a wide range of physico-chemical and biological sources directly influence the function of a protein. For enzymatic proteins, the structure and chemistry of the catalytic binding site residues can be loosely defined as a substructure of the protein. Comparative analysis of drug-receptor substructures across and within species has been used for lead evaluation. Substructure-level similarity between the binding sites of functionally similar proteins has also been used to identify instances of convergent evolution among proteins. In functionally homologous protein families, shared chemistry and geometry at catalytic sites provide a common, local point of comparison among proteins that may differ significantly at the sequence, fold, or domain topology levels. Results: This paper describes two key results that can be used separately or in combination for protein function analysis. The Family-wise Analysis of SubStructural Templates (FASST) method uses all-against-all substructure comparison to determine Substructural Clusters (SCs). SCs characterize the binding site substructural variation within a protein family. In this paper we focus on examples of automatically determined SCs that can be linked to phylogenetic distance between family members, segregation by conformation, and organization by homology among convergent protein lineages. The Motif Ensemble Statistical Hypothesis (MESH) framework constructs a representative motif for each protein cluster among the SCs determined by FASST to build motif ensembles that are shown through a series of function prediction experiments to improve the function prediction power of existing motifs. Conclusions: FASST contributes a critical feedback and assessment step to existing binding site substructure identification methods and can be used for the thorough investigation of structure-function relationships. The application of MESH allows for an automated, statistically rigorous procedure for incorporating structural variation data into protein function prediction pipelines. Our work provides an unbiased, automated assessment of the structural variability of identified binding site substructures among protein structure families and a technique for exploring the relation of substructural variation to protein function. As available proteomic data continues to expand, the techniques proposed will be indispensable for the large-scale analysis and interpretation of structural data.
Affiliation(s)
- Drew H Bryant
- Department of Computer Science, Rice University, Houston, TX, USA
|
34
|
Wu S, Liu T, Altman RB. Identification of recurring protein structure microenvironments and discovery of novel functional sites around CYS residues. BMC Struct Biol 2010; 10:4. [PMID: 20122268 PMCID: PMC2833161 DOI: 10.1186/1472-6807-10-4] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.7] [Received: 07/31/2009] [Accepted: 02/02/2010] [Indexed: 11/29/2022]
Abstract
Background: The emergence of structural genomics presents significant challenges in the annotation of biologically uncharacterized proteins. Unfortunately, our ability to analyze these proteins is restricted by the limited catalog of known molecular functions and their associated 3D motifs. Results: In order to identify novel 3D motifs that may be associated with molecular functions, we employ an unsupervised, two-phase clustering approach that combines k-means and hierarchical clustering with knowledge-informed cluster selection and annotation methods. We applied the approach to approximately 20,000 cysteine-based protein microenvironments (3D regions 7.5 Å in radius) and identified 70 interesting clusters, some of which represent known motifs (e.g. metal binding and phosphatase activity), and some of which are novel, including several zinc binding sites. Detailed annotation results are available online for all 70 clusters at http://feature.stanford.edu/clustering/cys. Conclusions: The use of microenvironments instead of backbone geometric criteria enables flexible exploration of protein function space, and detection of recurring motifs that are discontinuous in sequence and diverse in structure. Clustering microenvironments may thus help to functionally characterize novel proteins and better understand the protein structure-function relationship.
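A two-phase scheme that combines k-means with hierarchical clustering can be sketched as follows. This is only an illustration of the general pattern (coarse k-means partition, then agglomeration of the resulting centroids), not the authors' pipeline: the two-dimensional toy vectors, the deterministic seeding, the single-linkage criterion and the merge threshold are all assumptions, and the knowledge-informed selection and annotation steps are omitted.

```python
from math import dist


def kmeans(points, k, iters=20):
    """Phase 1: plain k-means, seeded with the first k points for reproducibility."""
    centroids = [tuple(p) for p in points[:k]]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: dist(p, centroids[c]))
            groups[nearest].append(p)
        # Move each centroid to the mean of its group (keep it if the group is empty).
        centroids = [
            tuple(sum(axis) / len(g) for axis in zip(*g)) if g else centroids[c]
            for c, g in enumerate(groups)
        ]
    return centroids


def merge_centroids(centroids, threshold):
    """Phase 2: single-linkage agglomeration of centroids closer than `threshold`."""
    clusters = [[i] for i in range(len(centroids))]

    def linkage(a, b):
        return min(dist(centroids[i], centroids[j]) for i in a for j in b)

    while len(clusters) > 1:
        d, a, b = min(
            (linkage(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        if d > threshold:
            break
        clusters[a].extend(clusters.pop(b))
    return clusters


# Hypothetical 2D feature vectors standing in for microenvironment descriptors.
vectors = [(0, 0), (1, 0), (10, 10), (0, 1), (1, 1), (10, 11), (11, 10)]
centroids = kmeans(vectors, k=3)
print(merge_centroids(centroids, threshold=2.0))
```

Running k-means first keeps the hierarchical step cheap, since it operates on k centroids rather than on every data point.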
Affiliation(s)
- Shirley Wu
- 23andMe, 1390 Shorebird Way, Mountain View, CA, USA
|
35
|
Glazer DS, Radmer RJ, Altman RB. Improving structure-based function prediction using molecular dynamics. Structure 2009; 17:919-29. [PMID: 19604472 DOI: 10.1016/j.str.2009.05.010] [Citation(s) in RCA: 30] [Impact Index Per Article: 2.0] [Received: 10/15/2008] [Revised: 05/06/2009] [Accepted: 05/06/2009] [Indexed: 10/20/2022]
Abstract
The number of molecules with solved three-dimensional structure but unknown function is increasing rapidly. Particularly problematic are novel folds with little detectable similarity to molecules of known function. Experimental assays can determine the functions of such molecules, but are time-consuming and expensive. Computational approaches can identify potential functional sites; however, these approaches generally rely on single static structures and do not use information about dynamics. In fact, structural dynamics can enhance function prediction: we coupled molecular dynamics simulations with structure-based function prediction algorithms that identify Ca²⁺ binding sites. When applied to 11 challenging proteins, both methods showed substantial improvement in performance, revealing 22 more sites in one case and 12 more in the other, with a modest increase in apparent false positives. Thus, we show that treating molecules as dynamic entities improves the performance of structure-based function prediction methods.
Affiliation(s)
- Dariya S Glazer
- Department of Genetics, Stanford University, Clark Center, Stanford, CA 94305, USA
|
36
|
Identification of protein functions using a machine-learning approach based on sequence-derived properties. Proteome Sci 2009; 7:27. [PMID: 19664241 PMCID: PMC2731080 DOI: 10.1186/1477-5956-7-27] [Citation(s) in RCA: 36] [Impact Index Per Article: 2.4] [Received: 03/13/2009] [Accepted: 08/09/2009] [Indexed: 02/07/2023] Open
Abstract
Background: Predicting the function of an unknown protein is an essential goal in bioinformatics. Sequence similarity-based approaches are widely used for function prediction; however, they are often inadequate in the absence of similar sequences or when the sequence similarity among known protein sequences is statistically weak. This study aimed to develop an accurate prediction method for identifying protein function, irrespective of sequence and structural similarities. Results: A highly accurate prediction method capable of identifying protein function, based solely on protein sequence properties, is described. This method analyses and identifies specific features of the protein sequence that are highly correlated with certain protein functions and determines the combination of protein sequence features that best characterises protein function. Thirty-three features that represent subtle differences in local regions and full regions of the protein sequences were introduced. On the basis of 484 features extracted solely from the protein sequence, models were built to predict the functions of 11 different proteins from a broad range of cellular components, molecular functions, and biological processes. The accuracy of protein function prediction using random forests with feature selection ranged from 94.23% to 100%. The local sequence information was found to have a broad range of applicability in predicting protein function. Conclusion: We present an accurate prediction method using a machine-learning approach based solely on protein sequence properties. The primary contribution of this paper is to propose new PNPRD features representing global and/or local differences in sequences, based on positively and/or negatively charged residues, to assist in predicting protein function. In addition, we identified a compact and useful feature subset for predicting the function of various proteins. Our results indicate that sequence-based classifiers can provide good results among a broad range of proteins, that the proposed features are useful in predicting several functions, and that the combination of our and traditional features may support the creation of a discriminative feature set for specific protein functions.
|
37
|
Genomics, molecular imaging, bioinformatics, and bio-nano-info integration are synergistic components of translational medicine and personalized healthcare research. BMC Genomics 2008; 9 Suppl 2:I1. [PMID: 18831773 PMCID: PMC3226104 DOI: 10.1186/1471-2164-9-s2-i1] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.4] [Indexed: 11/26/2022] Open
Abstract
Supported by the National Science Foundation (NSF), the International Society of Intelligent Biological Medicine (ISIBM), the International Journal of Computational Biology and Drug Design and the International Journal of Functional Informatics and Personalized Medicine, IEEE 7th Bioinformatics and Bioengineering attracted more than 600 papers and 500 researchers and medical doctors. It was the only synergistic inter/multidisciplinary IEEE conference with 24 Keynote Lectures, 7 Tutorials, 5 Cutting-Edge Research Workshops and 32 Scientific Sessions including 11 Special Research Interest Sessions that were designed dynamically at Harvard in response to the current research trends and advances. The committee was very grateful for the IEEE Plenary Keynote Lectures given by: Dr. A. Keith Dunker (Indiana), Dr. Jun Liu (Harvard), Dr. Brian Athey (Michigan), Dr. Mark Borodovsky (Georgia Tech and President of ISIBM), Dr. Hamid Arabnia (Georgia and Vice-President of ISIBM), Dr. Ruzena Bajcsy (Berkeley and Member of United States National Academy of Engineering and Member of United States Institute of Medicine of the National Academies), Dr. Mary Yang (United States National Institutes of Health and Oak Ridge, DOE), Dr. Chih-Ming Ho (UCLA and Member of United States National Academy of Engineering and Academician of Academia Sinica), Dr. Andy Baxevanis (United States National Institutes of Health), Dr. Arif Ghafoor (Purdue), Dr. John Quackenbush (Harvard), Dr. Eric Jakobsson (UIUC), Dr. Vladimir Uversky (Indiana), Dr. Laura Elnitski (United States National Institutes of Health) and other world-class scientific leaders. The Harvard meeting was a large academic event fully sponsored by IEEE, both financially and academically. After a rigorous peer-review process, the committee selected 27 high-quality research papers from 600 submissions. The committee is grateful for contributions from keynote speakers Dr. Russ Altman (IEEE BIBM conference keynote lecturer on combining simulation and machine learning to recognize function in 4D), Dr. Mary Qu Yang (IEEE BIBM workshop keynote lecturer on new initiatives of detecting microscopic disease using machine learning and molecular biology, http://ieeexplore.ieee.org/servlet/opac?punumber=4425386) and Dr. Jack Y. Yang (IEEE BIBM workshop keynote lecturer on data mining and knowledge discovery in translational medicine) from the first IEEE Computer Society BioInformatics and BioMedicine (IEEE BIBM) international conference and workshops, November 2-4, 2007, Silicon Valley, California, USA.
|