Reference Citation Analysis: Find an Article, Find a Category, Find a Journal, Find a Scholar

For: Halperin I, Glazer DS, Wu S, Altman RB. The FEATURE framework for protein function annotation: modeling new functions, improving performance, and extending to novel applications. BMC Genomics 2008;9 Suppl 2:S2. [PMID: 18831785 PMCID: PMC2559884 DOI: 10.1186/1471-2164-9-s2-s2] [Citation(s) in RCA: 36] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [What about the content of this article? (0)] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/03/2022] Open

For:	Halperin I, Glazer DS, Wu S, Altman RB. The FEATURE framework for protein function annotation: modeling new functions, improving performance, and extending to novel applications. BMC Genomics 2008;9 Suppl 2:S2. [PMID: 18831785 PMCID: PMC2559884 DOI: 10.1186/1471-2164-9-s2-s2] [Citation(s) in RCA: 36] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [What about the content of this article? (0)] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/03/2022] Open

Number

Cited by Other Article(s)

Wei J, Xiao J, Chen S, Zong L, Gao X, Li Y. ProNet DB: a proteome-wise database for protein surface property representations and RNA-binding profiles. Database (Oxford) 2024;2024:baae012. [PMID: 38557634 PMCID: PMC10984565 DOI: 10.1093/database/baae012] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/11/2023] [Revised: 01/08/2024] [Accepted: 02/17/2024] [Indexed: 04/04/2024]

Draizen EJ, Readey J, Mura C, Bourne PE. Prop3D: A flexible, Python-based platform for machine learning with protein structural properties and biophysical data. BMC Bioinformatics 2024;25:11. [PMID: 38177985 PMCID: PMC10768222 DOI: 10.1186/s12859-023-05586-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/18/2023] [Accepted: 11/27/2023] [Indexed: 01/06/2024] Open

Abstract

BACKGROUND

Machine learning (ML) has a rich history in structural bioinformatics, and modern approaches, such as deep learning, are revolutionizing our knowledge of the subtle relationships between biomolecular sequence, structure, function, dynamics and evolution. As with any advance that rests upon statistical learning approaches, the recent progress in biomolecular sciences is enabled by the availability of vast volumes of sufficiently-variable data. To be useful, such data must be well-structured, machine-readable, intelligible and manipulable. These and related requirements pose challenges that become especially acute at the computational scales typical in ML. Furthermore, in structural bioinformatics such data generally relate to protein three-dimensional (3D) structures, which are inherently more complex than sequence-based data. A significant and recurring challenge concerns the creation of large, high-quality, openly-accessible datasets that can be used for specific training and benchmarking tasks in ML pipelines for predictive modeling projects, along with reproducible splits for training and testing.

RESULTS

Here, we report 'Prop3D', a platform that allows for the creation, sharing and extensible reuse of libraries of protein domains, featurized with biophysical and evolutionary properties that can range from detailed, atomically-resolved physicochemical quantities (e.g., electrostatics) to coarser, residue-level features (e.g., phylogenetic conservation). As a community resource, we also supply a 'Prop3D-20sf' protein dataset, obtained by applying our approach to CATH . We have developed and deployed the Prop3D framework, both in the cloud and on local HPC resources, to systematically and reproducibly create comprehensive datasets via the Highly Scalable Data Service ( HSDS ). Our datasets are freely accessible via a public HSDS instance, or they can be used with accompanying Python wrappers for popular ML frameworks.

CONCLUSION

Prop3D and its associated Prop3D-20sf dataset can be of broad utility in at least three ways. Firstly, the Prop3D workflow code can be customized and deployed on various cloud-based compute platforms, with scalability achieved largely by saving the results to distributed HDF5 files via HSDS . Secondly, the linked Prop3D-20sf dataset provides a hand-crafted, already-featurized dataset of protein domains for 20 highly-populated CATH families; importantly, provision of this pre-computed resource can aid the more efficient development (and reproducible deployment) of ML pipelines. Thirdly, Prop3D-20sf's construction explicitly takes into account (in creating datasets and data-splits) the enigma of 'data leakage', stemming from the evolutionary relationships between proteins.

Collapse

Xia Y, Xia C, Pan X, Shen H. BindWeb: A web server for ligand binding residue and pocket prediction from protein structures. Protein Sci 2022;31:e4462. [PMID: 36190332 PMCID: PMC9667820 DOI: 10.1002/pro.4462] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/29/2022] [Revised: 09/27/2022] [Accepted: 09/28/2022] [Indexed: 12/13/2022]

Esperante S, Alvarez-Paggi D, Salgueiro M, Desimone M, de Oliveira G, Arán M, García-Pardo J, Aptekmann A, Ventura S, Alonso L, de Prat-Gay G. A finely tuned interplay between calcium binding, ionic strength and pH modulates conformational and oligomerization equilibria in the Respiratory Syncytial Virus Matrix (M) protein. Arch Biochem Biophys 2022;731:109424. [DOI: 10.1016/j.abb.2022.109424] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/30/2022] [Revised: 09/14/2022] [Accepted: 09/29/2022] [Indexed: 11/30/2022]

Yang L, He W, Yun Y, Gao Y, Zhu Z, Teng M, Liang Z, Niu L. Defining A Global Map of Functional Group-based 3D Ligand-binding Motifs. GENOMICS, PROTEOMICS & BIOINFORMATICS 2022;20:765-779. [PMID: 35288344 PMCID: PMC9881048 DOI: 10.1016/j.gpb.2021.08.014] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 10/22/2020] [Revised: 06/30/2021] [Accepted: 09/27/2021] [Indexed: 01/31/2023]

Affiliation(s)

Liu Yang School of Life Sciences, Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei 230026, China; Division of Molecular and Cellular Biophysics, Hefei National Laboratory for Physical Sciences at the Microscale, Hefei 230026, China
Wei He School of Life Sciences, Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei 230026, China; Division of Molecular and Cellular Biophysics, Hefei National Laboratory for Physical Sciences at the Microscale, Hefei 230026, China.
Yuehui Yun School of Life Sciences, Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei 230026, China; Division of Molecular and Cellular Biophysics, Hefei National Laboratory for Physical Sciences at the Microscale, Hefei 230026, China
Yongxiang Gao School of Life Sciences, Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei 230026, China; Division of Molecular and Cellular Biophysics, Hefei National Laboratory for Physical Sciences at the Microscale, Hefei 230026, China
Zhongliang Zhu School of Life Sciences, Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei 230026, China; Division of Molecular and Cellular Biophysics, Hefei National Laboratory for Physical Sciences at the Microscale, Hefei 230026, China
Maikun Teng School of Life Sciences, Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei 230026, China; Division of Molecular and Cellular Biophysics, Hefei National Laboratory for Physical Sciences at the Microscale, Hefei 230026, China
Zhi Liang School of Life Sciences, Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei 230026, China; Division of Molecular and Cellular Biophysics, Hefei National Laboratory for Physical Sciences at the Microscale, Hefei 230026, China.
Liwen Niu School of Life Sciences, Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei 230026, China; Division of Molecular and Cellular Biophysics, Hefei National Laboratory for Physical Sciences at the Microscale, Hefei 230026, China.

Collapse

Wei J, Chen S, Zong L, Gao X, Li Y. Protein-RNA interaction prediction with deep learning: structure matters. Brief Bioinform 2022;23:bbab540. [PMID: 34929730 PMCID: PMC8790951 DOI: 10.1093/bib/bbab540] [Citation(s) in RCA: 28] [Impact Index Per Article: 14.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/29/2021] [Revised: 11/14/2021] [Accepted: 11/22/2021] [Indexed: 12/11/2022] Open

Bromberg Y, Aptekmann AA, Mahlich Y, Cook L, Senn S, Miller M, Nanda V, Ferreiro DU, Falkowski PG. Quantifying structural relationships of metal-binding sites suggests origins of biological electron transfer. SCIENCE ADVANCES 2022;8:eabj3984. [PMID: 35030025 PMCID: PMC8759750 DOI: 10.1126/sciadv.abj3984] [Citation(s) in RCA: 15] [Impact Index Per Article: 7.5] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 05/10/2021] [Accepted: 11/22/2021] [Indexed: 06/07/2023]

Shah HA, Liu J, Yang Z, Feng J. Review of Machine Learning Methods for the Prediction and Reconstruction of Metabolic Pathways. Front Mol Biosci 2021;8:634141. [PMID: 34222327 PMCID: PMC8247443 DOI: 10.3389/fmolb.2021.634141] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [Key Words] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/27/2020] [Accepted: 06/01/2021] [Indexed: 11/13/2022] Open

Toth JM, DePietro PJ, Haas J, McLaughlin WA. ResiRole: residue-level functional site predictions to gauge the accuracies of protein structure prediction techniques. Bioinformatics 2021;37:351-359. [PMID: 32780798 PMCID: PMC8058773 DOI: 10.1093/bioinformatics/btaa712] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/12/2019] [Revised: 07/31/2020] [Accepted: 08/05/2020] [Indexed: 11/25/2022] Open

Babbi G, Savojardo C, Martelli PL, Casadio R. Huntingtin: A Protein with a Peculiar Solvent Accessible Surface. Int J Mol Sci 2021;22:ijms22062878. [PMID: 33809039 PMCID: PMC8001614 DOI: 10.3390/ijms22062878] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/29/2021] [Revised: 03/04/2021] [Accepted: 03/04/2021] [Indexed: 11/30/2022] Open

Ponzoni L, Peñaherrera DA, Oltvai ZN, Bahar I. Rhapsody: predicting the pathogenicity of human missense variants. Bioinformatics 2020;36:3084-3092. [PMID: 32101277 PMCID: PMC7214033 DOI: 10.1093/bioinformatics/btaa127] [Citation(s) in RCA: 48] [Impact Index Per Article: 12.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/06/2019] [Revised: 12/27/2019] [Accepted: 02/21/2020] [Indexed: 12/22/2022] Open

Du Z, He Y, Li J, Uversky VN. DeepAdd: Protein function prediction from k-mer embedding and additional features. Comput Biol Chem 2020;89:107379. [PMID: 33011616 DOI: 10.1016/j.compbiolchem.2020.107379] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/13/2019] [Revised: 09/15/2020] [Accepted: 09/17/2020] [Indexed: 10/23/2022]

A deep learning framework to predict binding preference of RNA constituents on protein surface. Nat Commun 2019;10:4941. [PMID: 31666519 PMCID: PMC6821705 DOI: 10.1038/s41467-019-12920-0] [Citation(s) in RCA: 58] [Impact Index Per Article: 11.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/10/2019] [Accepted: 10/08/2019] [Indexed: 12/21/2022] Open

Gao R, Wang M, Zhou J, Fu Y, Liang M, Guo D, Nie J. Prediction of Enzyme Function Based on Three Parallel Deep CNN and Amino Acid Mutation. Int J Mol Sci 2019;20:E2845. [PMID: 31212665 PMCID: PMC6600291 DOI: 10.3390/ijms20112845] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/05/2019] [Revised: 06/03/2019] [Accepted: 06/04/2019] [Indexed: 01/28/2023] Open

Romano JD, Tatonetti NP. Informatics and Computational Methods in Natural Product Drug Discovery: A Review and Perspectives. Front Genet 2019;10:368. [PMID: 31114606 PMCID: PMC6503039 DOI: 10.3389/fgene.2019.00368] [Citation(s) in RCA: 59] [Impact Index Per Article: 11.8] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/12/2018] [Accepted: 04/05/2019] [Indexed: 12/17/2022] Open

ProLanGO: Protein Function Prediction Using Neural Machine Translation Based on a Recurrent Neural Network. Molecules 2017;22:molecules22101732. [PMID: 29039790 PMCID: PMC6151571 DOI: 10.3390/molecules22101732] [Citation(s) in RCA: 114] [Impact Index Per Article: 16.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/30/2017] [Revised: 10/11/2017] [Accepted: 10/11/2017] [Indexed: 11/25/2022] Open

Torng W, Altman RB. 3D deep convolutional neural networks for amino acid environment similarity analysis. BMC Bioinformatics 2017;18:302. [PMID: 28615003 PMCID: PMC5472009 DOI: 10.1186/s12859-017-1702-0] [Citation(s) in RCA: 75] [Impact Index Per Article: 10.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/25/2017] [Accepted: 05/22/2017] [Indexed: 01/08/2023] Open

Abstract

Background

Central to protein biology is the understanding of how structural elements give rise to observed function. The surfeit of protein structural data enables development of computational methods to systematically derive rules governing structural-functional relationships. However, performance of these methods depends critically on the choice of protein structural representation. Most current methods rely on features that are manually selected based on knowledge about protein structures. These are often general-purpose but not optimized for the specific application of interest.

In this paper, we present a general framework that applies 3D convolutional neural network (3DCNN) technology to structure-based protein analysis. The framework automatically extracts task-specific features from the raw atom distribution, driven by supervised labels. As a pilot study, we use our network to analyze local protein microenvironments surrounding the 20 amino acids, and predict the amino acids most compatible with environments within a protein structure. To further validate the power of our method, we construct two amino acid substitution matrices from the prediction statistics and use them to predict effects of mutations in T4 lysozyme structures.

Results

Our deep 3DCNN achieves a two-fold increase in prediction accuracy compared to models that employ conventional hand-engineered features and successfully recapitulates known information about similar and different microenvironments. Models built from our predictions and substitution matrices achieve an 85% accuracy predicting outcomes of the T4 lysozyme mutation variants. Our substitution matrices contain rich information relevant to mutation analysis compared to well-established substitution matrices. Finally, we present a visualization method to inspect the individual contributions of each atom to the classification decisions.

Conclusions

End-to-end trained deep learning networks consistently outperform methods using hand-engineered features, suggesting that the 3DCNN framework is well suited for analysis of protein microenvironments and may be useful for other protein structural analyses.

Electronic supplementary material

The online version of this article (doi:10.1186/s12859-017-1702-0) contains supplementary material, which is available to authorized users.

Collapse

Moll M, Finn PW, Kavraki LE. Structure-guided selection of specificity determining positions in the human Kinome. BMC Genomics 2016;17 Suppl 4:431. [PMID: 27556159 PMCID: PMC5001202 DOI: 10.1186/s12864-016-2790-3] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/25/2022] Open

Meng J, Wekesa JS, Shi GL, Luan YS. Protein function prediction based on data fusion and functional interrelationship. Math Biosci 2016;274:25-32. [DOI: 10.1016/j.mbs.2016.02.001] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/22/2015] [Revised: 01/08/2016] [Accepted: 02/01/2016] [Indexed: 10/22/2022]

Zhou W, Tang GW, Altman RB. High Resolution Prediction of Calcium-Binding Sites in 3D Protein Structures Using FEATURE. J Chem Inf Model 2015;55:1663-72. [PMID: 26226489 PMCID: PMC4731830 DOI: 10.1021/acs.jcim.5b00367] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/29/2022]

Chartier M, Najmanovich R. Detection of Binding Site Molecular Interaction Field Similarities. J Chem Inf Model 2015;55:1600-15. [PMID: 26158641 DOI: 10.1021/acs.jcim.5b00333] [Citation(s) in RCA: 33] [Impact Index Per Article: 3.7] [Reference Citation Analysis] [Abstract] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/20/2023]

Tang GW, Altman RB. Knowledge-based fragment binding prediction. PLoS Comput Biol 2014;10:e1003589. [PMID: 24762971 PMCID: PMC3998881 DOI: 10.1371/journal.pcbi.1003589] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/30/2013] [Accepted: 03/11/2014] [Indexed: 11/18/2022] Open

Buturovic L, Wong M, Tang GW, Altman RB, Petkovic D. High precision prediction of functional sites in protein structures. PLoS One 2014;9:e91240. [PMID: 24632601 PMCID: PMC3954699 DOI: 10.1371/journal.pone.0091240] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/26/2013] [Accepted: 02/11/2014] [Indexed: 11/29/2022] Open

Julfayev ES, McLaughlin RJ, Tao YP, McLaughlin WA. KB-Rank: efficient protein structure and functional annotation identification via text query. ACTA ACUST UNITED AC 2012;13:101-10. [PMID: 22270457 PMCID: PMC3375009 DOI: 10.1007/s10969-012-9125-7] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/28/2011] [Accepted: 01/07/2012] [Indexed: 12/12/2022]

Human intuition in the quantitative age. The role of mathematics in biology is vital, but does it leave room for 'old-fashioned' observation and interpretation? EMBO Rep 2011;12:401-4. [PMID: 21525944 DOI: 10.1038/embor.2011.57] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/09/2022] Open

Tang GW, Altman RB. Remote thioredoxin recognition using evolutionary conservation and structural dynamics. Structure 2011;19:461-70. [PMID: 21481770 DOI: 10.1016/j.str.2011.02.007] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/28/2010] [Revised: 02/06/2011] [Accepted: 02/16/2011] [Indexed: 12/25/2022]

Regad L, Martin J, Camproux AC. Dissecting protein loops with a statistical scalpel suggests a functional implication of some structural motifs. BMC Bioinformatics 2011;12:247. [PMID: 21689388 PMCID: PMC3158783 DOI: 10.1186/1471-2105-12-247] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/07/2010] [Accepted: 06/20/2011] [Indexed: 12/24/2022] Open

Regad L, Saladin A, Maupetit J, Geneix C, Camproux AC. SA-Mot: a web server for the identification of motifs of interest extracted from protein loops. Nucleic Acids Res 2011;39:W203-9. [PMID: 21665924 PMCID: PMC3125790 DOI: 10.1093/nar/gkr410] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/29/2023] Open

Julfayev ES, McLaughlin RJ, Tao YP, McLaughlin WA. A new approach to assess and predict the functional roles of proteins across all known structures. JOURNAL OF STRUCTURAL AND FUNCTIONAL GENOMICS 2011;12:9-20. [PMID: 21445639 PMCID: PMC3089730 DOI: 10.1007/s10969-011-9105-3] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 12/17/2010] [Accepted: 03/14/2011] [Indexed: 12/11/2022]

Moll M, Bryant DH, Kavraki LE. The LabelHash algorithm for substructure matching. BMC Bioinformatics 2010;11:555. [PMID: 21070651 PMCID: PMC2996407 DOI: 10.1186/1471-2105-11-555] [Citation(s) in RCA: 30] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/08/2010] [Accepted: 11/11/2010] [Indexed: 08/30/2023] Open

Abstract

Background

There is an increasing number of proteins with known structure but unknown function. Determining their function would have a significant impact on understanding diseases and designing new therapeutics. However, experimental protein function determination is expensive and very time-consuming. Computational methods can facilitate function determination by identifying proteins that have high structural and chemical similarity.

Results

We present LabelHash, a novel algorithm for matching substructural motifs to large collections of protein structures. The algorithm consists of two phases. In the first phase the proteins are preprocessed in a fashion that allows for instant lookup of partial matches to any motif. In the second phase, partial matches for a given motif are expanded to complete matches. The general applicability of the algorithm is demonstrated with three different case studies. First, we show that we can accurately identify members of the enolase superfamily with a single motif. Next, we demonstrate how LabelHash can complement SOIPPA, an algorithm for motif identification and pairwise substructure alignment. Finally, a large collection of Catalytic Site Atlas motifs is used to benchmark the performance of the algorithm. LabelHash runs very efficiently in parallel; matching a motif against all proteins in the 95% sequence identity filtered non-redundant Protein Data Bank typically takes no more than a few minutes. The LabelHash algorithm is available through a web server and as a suite of standalone programs at http://labelhash.kavrakilab.org. The output of the LabelHash algorithm can be further analyzed with Chimera through a plugin that we developed for this purpose.

Conclusions

LabelHash is an efficient, versatile algorithm for large-scale substructure matching. When LabelHash is running in parallel, motifs can typically be matched against the entire PDB on the order of minutes. The algorithm is able to identify functional homologs beyond the twilight zone of sequence identity and even beyond fold similarity. The three case studies presented in this paper illustrate the versatility of the algorithm.

Collapse

Doppelt-Azeroual O, Delfaud F, Moriaud F, de Brevern AG. Fast and automated functional classification with MED-SuMo: an application on purine-binding proteins. Protein Sci 2010;19:847-67. [PMID: 20162627 DOI: 10.1002/pro.364] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022]

Lee T, Min H, Kim SJ, Yoon S. Application of maximin correlation analysis to classifying protein environments for function prediction. Biochem Biophys Res Commun 2010;400:219-24. [DOI: 10.1016/j.bbrc.2010.08.042] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/08/2010] [Accepted: 08/11/2010] [Indexed: 10/19/2022]

Bryant DH, Moll M, Chen BY, Fofanov VY, Kavraki LE. Analysis of substructural variation in families of enzymatic proteins with applications to protein function prediction. BMC Bioinformatics 2010;11:242. [PMID: 20459833 PMCID: PMC2885373 DOI: 10.1186/1471-2105-11-242] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/13/2009] [Accepted: 05/11/2010] [Indexed: 12/02/2022] Open

Abstract

Background

Structural variations caused by a wide range of physico-chemical and biological sources directly influence the function of a protein. For enzymatic proteins, the structure and chemistry of the catalytic binding site residues can be loosely defined as a substructure of the protein. Comparative analysis of drug-receptor substructures across and within species has been used for lead evaluation. Substructure-level similarity between the binding sites of functionally similar proteins has also been used to identify instances of convergent evolution among proteins. In functionally homologous protein families, shared chemistry and geometry at catalytic sites provide a common, local point of comparison among proteins that may differ significantly at the sequence, fold, or domain topology levels.

Results

This paper describes two key results that can be used separately or in combination for protein function analysis. The Family-wise Analysis of SubStructural Templates (FASST) method uses all-against-all substructure comparison to determine Substructural Clusters (SCs). SCs characterize the binding site substructural variation within a protein family. In this paper we focus on examples of automatically determined SCs that can be linked to phylogenetic distance between family members, segregation by conformation, and organization by homology among convergent protein lineages. The Motif Ensemble Statistical Hypothesis (MESH) framework constructs a representative motif for each protein cluster among the SCs determined by FASST to build motif ensembles that are shown through a series of function prediction experiments to improve the function prediction power of existing motifs.

Conclusions

FASST contributes a critical feedback and assessment step to existing binding site substructure identification methods and can be used for the thorough investigation of structure-function relationships. The application of MESH allows for an automated, statistically rigorous procedure for incorporating structural variation data into protein function prediction pipelines. Our work provides an unbiased, automated assessment of the structural variability of identified binding site substructures among protein structure families and a technique for exploring the relation of substructural variation to protein function. As available proteomic data continues to expand, the techniques proposed will be indispensable for the large-scale analysis and interpretation of structural data.

Collapse

Wu S, Liu T, Altman RB. Identification of recurring protein structure microenvironments and discovery of novel functional sites around CYS residues. BMC STRUCTURAL BIOLOGY 2010;10:4. [PMID: 20122268 PMCID: PMC2833161 DOI: 10.1186/1472-6807-10-4] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 07/31/2009] [Accepted: 02/02/2010] [Indexed: 11/29/2022]

Glazer DS, Radmer RJ, Altman RB. Improving structure-based function prediction using molecular dynamics. Structure 2009;17:919-29. [PMID: 19604472 DOI: 10.1016/j.str.2009.05.010] [Citation(s) in RCA: 30] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/15/2008] [Revised: 05/06/2009] [Accepted: 05/06/2009] [Indexed: 10/20/2022]

Identification of protein functions using a machine-learning approach based on sequence-derived properties. Proteome Sci 2009;7:27. [PMID: 19664241 PMCID: PMC2731080 DOI: 10.1186/1477-5956-7-27] [Citation(s) in RCA: 36] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/13/2009] [Accepted: 08/09/2009] [Indexed: 02/07/2023] Open

Abstract

Background

Predicting the function of an unknown protein is an essential goal in bioinformatics. Sequence similarity-based approaches are widely used for function prediction; however, they are often inadequate in the absence of similar sequences or when the sequence similarity among known protein sequences is statistically weak. This study aimed to develop an accurate prediction method for identifying protein function, irrespective of sequence and structural similarities.

Results

A highly accurate prediction method capable of identifying protein function, based solely on protein sequence properties, is described. This method analyses and identifies specific features of the protein sequence that are highly correlated with certain protein functions and determines the combination of protein sequence features that best characterises protein function. Thirty-three features that represent subtle differences in local regions and full regions of the protein sequences were introduced. On the basis of 484 features extracted solely from the protein sequence, models were built to predict the functions of 11 different proteins from a broad range of cellular components, molecular functions, and biological processes. The accuracy of protein function prediction using random forests with feature selection ranged from 94.23% to 100%. The local sequence information was found to have a broad range of applicability in predicting protein function.

Conclusion

We present an accurate prediction method using a machine-learning approach based solely on protein sequence properties. The primary contribution of this paper is to propose new PNPRD features representing global and/or local differences in sequences, based on positively and/or negatively charged residues, to assist in predicting protein function. In addition, we identified a compact and useful feature subset for predicting the function of various proteins. Our results indicate that sequence-based classifiers can provide good results among a broad range of proteins, that the proposed features are useful in predicting several functions, and that the combination of our and traditional features may support the creation of a discriminative feature set for specific protein functions.

Collapse

Genomics, molecular imaging, bioinformatics, and bio-nano-info integration are synergistic components of translational medicine and personalized healthcare research. BMC Genomics 2008;9 Suppl 2:I1. [PMID: 18831773 PMCID: PMC3226104 DOI: 10.1186/1471-2164-9-s2-i1] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/26/2022] Open

Abstract

Supported by National Science Foundation (NSF), International Society of Intelligent Biological Medicine (ISIBM), International Journal of Computational Biology and Drug Design and International Journal of Functional Informatics and Personalized Medicine, IEEE 7th Bioinformatics and Bioengineering attracted more than 600 papers and 500 researchers and medical doctors. It was the only synergistic inter/multidisciplinary IEEE conference with 24 Keynote Lectures, 7 Tutorials, 5 Cutting-Edge Research Workshops and 32 Scientific Sessions including 11 Special Research Interest Sessions that were designed dynamically at Harvard in response to the current research trends and advances. The committee was very grateful for the IEEE Plenary Keynote Lectures given by: Dr. A. Keith Dunker (Indiana), Dr. Jun Liu (Harvard), Dr. Brian Athey (Michigan), Dr. Mark Borodovsky (Georgia Tech and President of ISIBM), Dr. Hamid Arabnia (Georgia and Vice-President of ISIBM), Dr. Ruzena Bajcsy (Berkeley and Member of United States National Academy of Engineering and Member of United States Institute of Medicine of the National Academies), Dr. Mary Yang (United States National Institutes of Health and Oak Ridge, DOE), Dr. Chih-Ming Ho (UCLA and Member of United States National Academy of Engineering and Academician of Academia Sinica), Dr. Andy Baxevanis (United States National Institutes of Health), Dr. Arif Ghafoor (Purdue), Dr. John Quackenbush (Harvard), Dr. Eric Jakobsson (UIUC), Dr. Vladimir Uversky (Indiana), Dr. Laura Elnitski (United States National Institutes of Health) and other world-class scientific leaders. The Harvard meeting was a large academic event 100% full-sponsored by IEEE financially and academically. After a rigorous peer-review process, the committee selected 27 high-quality research papers from 600 submissions. The committee is grateful for contributions from keynote speakers Dr. Russ Altman (IEEE BIBM conference keynote lecturer on combining simulation and machine learning to recognize function in 4D), Dr. Mary Qu Yang (IEEE BIBM workshop keynote lecturer on new initiatives of detecting microscopic disease using machine learning and molecular biology, http://ieeexplore.ieee.org/servlet/opac?punumber=4425386) and Dr. Jack Y. Yang (IEEE BIBM workshop keynote lecturer on data mining and knowledge discovery in translational medicine) from the first IEEE Computer Society BioInformatics and BioMedicine (IEEE BIBM) international conference and workshops, November 2-4, 2007, Silicon Valley, California, USA.

Collapse