1
|
Akhter N, Kabir KL, Chennupati G, Vangara R, Alexandrov BS, Djidjev H, Shehu A. Improved Protein Decoy Selection via Non-Negative Matrix Factorization. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:1670-1682. [PMID: 33400654 DOI: 10.1109/tcbb.2020.3049088] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/12/2023]
Abstract
A central challenge in protein modeling research and protein structure prediction in particular is known as decoy selection. The problem refers to selecting biologically-active/native tertiary structures among a multitude of physically-realistic structures generated by template-free protein structure prediction methods. Research on decoy selection is active. Clustering-based methods are popular, but they fail to identify good/near-native decoys on datasets where near-native decoys are severely under-sampled by a protein structure prediction method. Reasonable progress is reported by methods that additionally take into account the internal energy of a structure and employ it to identify basins in the energy landscape organizing the multitude of decoys. These methods, however, incur significant time costs for extracting basins from the landscape. In this paper, we propose a novel decoy selection method based on non-negative matrix factorization. We demonstrate that our method outperforms energy landscape-based methods. In particular, the proposed method addresses both the time cost issue and the challenge of identifying good decoys in a sparse dataset, successfully recognizing near-native decoys for both easy and hard protein targets.
Collapse
|
2
|
Alam FF, Shehu A. Unsupervised multi-instance learning for protein structure determination. J Bioinform Comput Biol 2021; 19:2140002. [PMID: 33568002 DOI: 10.1142/s0219720021400023] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022]
Abstract
Many regions of the protein universe remain inaccessible by wet-laboratory or computational structure determination methods. A significant challenge in elucidating these dark regions in silico relates to the ability to discriminate relevant structure(s) among many structures/decoys computed for a protein of interest, a problem known as decoy selection. Clustering decoys based on geometric similarity remains popular. However, it is unclear how exactly to exploit the groups of decoys revealed via clustering to select individual structures for prediction. In this paper, we provide an intuitive formulation of the decoy selection problem as an instance of unsupervised multi-instance learning. We address the problem in three stages, first organizing given decoys of a protein molecule into bags, then identifying relevant bags, and finally drawing individual instances from these bags to offer as prediction. We propose both non-parametric and parametric algorithms for drawing individual instances. Our evaluation utilizes two datasets, one benchmark dataset of ensembles of decoys for a varied list of protein molecules, and a dataset of decoy ensembles for targets drawn from recent CASP competitions. A comparative analysis with state-of-the-art methods reveals that the proposed approach outperforms existing methods, thus warranting further investigation of multi-instance learning to advance our treatment of decoy selection.
Collapse
Affiliation(s)
- Fardina Fathmiul Alam
- Department of Computer Science, George Mason University, Fairfax, Virginia 22030, USA
| | - Amarda Shehu
- Department of Computer Science, George Mason University, Fairfax, Virginia 22030, USA
| |
Collapse
|
3
|
Tadepalli S, Akhter N, Barbara D, Shehu A. Anomaly Detection-Based Recognition of Near-Native Protein Structures. IEEE Trans Nanobioscience 2020; 19:562-570. [PMID: 32340957 DOI: 10.1109/tnb.2020.2990642] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/07/2022]
Abstract
The three-dimensional structures populated by a protein molecule determine to a great extent its biological activities. The rich information encoded by protein structure on protein function continues to motivate the development of computational approaches for determining functionally-relevant structures. The majority of structures generated in silico are not relevant. Discriminating relevant/native protein structures from non-native ones is an outstanding challenge in computational structural biology. Inherently, this is a recognition problem that can be addressed under the umbrella of machine learning. In this paper, based on the premise that near-native structures are effectively anomalies, we build on the concept of anomaly detection in machine learning. We propose methods that automatically select relevant subsets, as well as methods that select a single structure to offer as prediction. Evaluations are carried out on benchmark datasets and demonstrate that the proposed methods advance the state of the art. The presented results motivate further building on and adapting concepts and techniques from machine learning to improve recognition of near-native structures in protein structure prediction.
Collapse
|
4
|
Akhter N, Chennupati G, Kabir KL, Djidjev H, Shehu A. Unsupervised and Supervised Learning over theEnergy Landscape for Protein Decoy Selection. Biomolecules 2019; 9:E607. [PMID: 31615116 PMCID: PMC6843838 DOI: 10.3390/biom9100607] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/05/2019] [Revised: 10/03/2019] [Accepted: 10/04/2019] [Indexed: 11/17/2022] Open
Abstract
The energy landscape that organizes microstates of a molecular system and governs theunderlying molecular dynamics exposes the relationship between molecular form/structure, changesto form, and biological activity or function in the cell. However, several challenges stand in the wayof leveraging energy landscapes for relating structure and structural dynamics to function. Energylandscapes are high-dimensional, multi-modal, and often overly-rugged. Deep wells or basins inthem do not always correspond to stable structural states but are instead the result of inherentinaccuracies in semi-empirical molecular energy functions. Due to these challenges, energeticsis typically ignored in computational approaches addressing long-standing central questions incomputational biology, such as protein decoy selection. In the latter, the goal is to determine over apossibly large number of computationally-generated three-dimensional structures of a protein thosestructures that are biologically-active/native. In recent work, we have recast our attention on theprotein energy landscape and its role in helping us to advance decoy selection. Here, we summarizesome of our successes so far in this direction via unsupervised learning. More importantly, we furtheradvance the argument that the energy landscape holds valuable information to aid and advance thestate of protein decoy selection via novel machine learning methodologies that leverage supervisedlearning. Our focus in this article is on decoy selection for the purpose of a rigorous, quantitativeevaluation of how leveraging protein energy landscapes advances an important problem in proteinmodeling. However, the ideas and concepts presented here are generally useful to make discoveriesin studies aiming to relate molecular structure and structural dynamics to function.
Collapse
Affiliation(s)
- Nasrin Akhter
- Department of Computer Science, George Mason University, Fairfax, VA 22030, USA.
| | - Gopinath Chennupati
- Information Sciences (CCS-3) Group, Los Alamos National Laboratory, Los Alamos, NM 87545, USA.
| | - Kazi Lutful Kabir
- Department of Computer Science, George Mason University, Fairfax, VA 22030, USA.
| | - Hristo Djidjev
- Information Sciences (CCS-3) Group, Los Alamos National Laboratory, Los Alamos, NM 87545, USA.
| | - Amarda Shehu
- Department of Computer Science, George Mason University, Fairfax, VA 22030, USA.
- Center for Adaptive Human-Machine Partnership, George Mason University, Fairfax, VA 22030, USA.
- Department of Bioengineering, George Mason University, Fairfax, VA 22030, USA.
- School of Systems Biology, George Mason University, Fairfax, VA 22030, USA.
| |
Collapse
|
5
|
Khor S. Folding with a protein's native shortcut network. Proteins 2019; 86:924-934. [PMID: 29790602 DOI: 10.1002/prot.25524] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/30/2017] [Revised: 04/13/2018] [Accepted: 05/14/2018] [Indexed: 11/09/2022]
Abstract
A complex network approach to protein folding is proposed, wherein a protein's contact map is reconceptualized as a network of shortcut edges, and folding is steered by a structural characteristic of this network. Shortcut networks are generated by a known message passing algorithm operating on protein residue networks. It is found that the shortcut networks of native structures (SCN0s) are relevant graph objects with which to study protein folding at a formal level. The logarithm form of their contact order (SCN0_lnCO) correlates significantly with folding rate of two-state and nontwo-state proteins. The clustering coefficient of SCN0s (CSCN0 ) correlates significantly with folding rate, transition-state placement and stability of two-state folders. Reasonable folding pathways for several model proteins are produced when CSCN0 is used to combine protein segments incrementally to form the native structure. The folding bias captured by CSCN0 is detectable in non-native structures, as evidenced by Molecular Dynamics simulation generated configurations for the fast folding Villin-headpiece peptide. These results support the use of shortcut networks to investigate the role protein geometry plays in the folding of both small and large globular proteins, and have implications for the design of multibody interaction schemes in folding models. One facet of this geometry is the set of native shortcut triangles, whose attributes are found to be well-suited to identify dehydrated intraprotein areas in tight turns, or at the interface of different secondary structure elements.
Collapse
Affiliation(s)
- Susan Khor
- Department of Computer Science, Memorial University of Newfoundland, St. John's, Newfoundland and Labrador, Canada
| |
Collapse
|
6
|
Akhter N, Hassan L, Rajabi Z, Barbará D, Shehu A. Learning Organizations of Protein Energy Landscapes: An Application on Decoy Selection in Template-Free Protein Structure Prediction. Methods Mol Biol 2019; 1958:147-171. [PMID: 30945218 DOI: 10.1007/978-1-4939-9161-7_8] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 03/31/2023]
Abstract
The protein energy landscape, which lifts the protein structure space by associating energies with structures, has been useful in improving our understanding of the relationship between structure, dynamics, and function. Currently, however, it is challenging to automatically extract and utilize the underlying organization of an energy landscape to the link structural states it houses to biological activity. In this chapter, we first report on two computational approaches that extract such an organization, one that ignores energies and operates directly in the structure space and another that operates on the energy landscape associated with the structure space. We then describe two complementary approaches, one based on unsupervised learning and another based on supervised learning. Both approaches utilize the extracted organization to address the problem of decoy selection in template-free protein structure prediction. The presented results make the case that learning organizations of protein energy landscapes advances our ability to link structures to biological activity.
Collapse
Affiliation(s)
- Nasrin Akhter
- Department of Computer Science, George Mason University, Fairfax, VA, USA
| | - Liban Hassan
- Department of Computer Science, George Mason University, Fairfax, VA, USA
| | - Zahra Rajabi
- Department of Computer Science, George Mason University, Fairfax, VA, USA
| | - Daniel Barbará
- Department of Computer Science, George Mason University, Fairfax, VA, USA
| | - Amarda Shehu
- Department of Computer Science, George Mason University, Fairfax, VA, USA.
| |
Collapse
|
7
|
Gadiyaram V, Vishveshwara S, Vishveshwara S. From Quantum Chemistry to Networks in Biology: A Graph Spectral Approach to Protein Structure Analyses. J Chem Inf Model 2019; 59:1715-1727. [DOI: 10.1021/acs.jcim.9b00002] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/29/2022]
Affiliation(s)
- Vasundhara Gadiyaram
- IISc Mathematics Initiative (IMI), Indian Institute of Science, C V Raman Road, Bengaluru, Karnataka 560012, India
| | - Smitha Vishveshwara
- Department of Physics, University of Illinois at Urbana−Champaign, Urbana, Illinois 61801-3080, United States
| | - Saraswathi Vishveshwara
- Molecular Biophysics Unit, Indian Institute of Science, C V Raman Road, Bengaluru, Karnataka 560012, India
| |
Collapse
|
8
|
Pražnikar J, Tomić M, Turk D. Validation and quality assessment of macromolecular structures using complex network analysis. Sci Rep 2019; 9:1678. [PMID: 30737447 PMCID: PMC6368557 DOI: 10.1038/s41598-019-38658-9] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/20/2018] [Accepted: 01/07/2019] [Indexed: 02/06/2023] Open
Abstract
Validation of three-dimensional structures is at the core of structural determination methods. The local validation criteria, such as deviations from ideal bond length and bonding angles, Ramachandran plot outliers and clashing contacts, are a standard part of structure analysis before structure deposition, whereas the global and regional packing may not yet have been addressed. In the last two decades, three-dimensional models of macromolecules such as proteins have been successfully described by a network of nodes and edges. Amino acid residues as nodes and close contact between the residues as edges have been used to explore basic network properties, to study protein folding and stability and to predict catalytic sites. Using complex network analysis, we introduced common network parameters to distinguish between correct and incorrect three-dimensional protein structures. The analysis showed that correct structures have a higher average node degree, higher graph energy, and lower shortest path length than their incorrect counterparts. Thus, correct protein models are more densely intra-connected, and in turn, the transfer of information between nodes/amino acids is more efficient. Moreover, protein graph spectra were used to investigate model bias in protein structure.
Collapse
Affiliation(s)
- Jure Pražnikar
- Faculty of Mathematics, Natural Sciences and Information Technologies, University of Primorska, Glagoljaška 8, Koper, Slovenia.
- Department of Biochemistry, Molecular and Structural Biology, Institute Jožef Stefan, Jamova 39, Ljubljana, Slovenia.
| | - Miloš Tomić
- Faculty of Mathematics, Natural Sciences and Information Technologies, University of Primorska, Glagoljaška 8, Koper, Slovenia
| | - Dušan Turk
- Department of Biochemistry, Molecular and Structural Biology, Institute Jožef Stefan, Jamova 39, Ljubljana, Slovenia
- Center of excellence for Integrated Approaches in Chemistry and Biology of Proteins, Jamova 39, Ljubljana, Slovenia
| |
Collapse
|
9
|
An Energy Landscape Treatment of Decoy Selection in Template-Free Protein Structure Prediction. COMPUTATION 2018. [DOI: 10.3390/computation6020039] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/16/2022]
|
10
|
Ghosh S, Vishveshwara S. Ranking the quality of protein structure models using sidechain based network properties. F1000Res 2014; 3:17. [PMID: 25580218 PMCID: PMC4038323 DOI: 10.12688/f1000research.3-17.v1] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Accepted: 01/20/2014] [Indexed: 01/31/2023] Open
Abstract
Determining the correct structure of a protein given its sequence still remains an arduous task with many researchers working towards this goal. Most structure prediction methodologies result in the generation of a large number of probable candidates with the final challenge being to select the best amongst these. In this work, we have used Protein Structure Networks of native and modeled proteins in combination with Support Vector Machines to estimate the quality of a protein structure model and finally to provide ranks for these models. Model ranking is performed using regression analysis and helps in model selection from a group of many similar and good quality structures. Our results show that structures with a rank greater than 16 exhibit native protein-like properties while those below 10 are non-native like. The tool is also made available as a web-server ( http://vishgraph.mbu.iisc.ernet.in/GraProStr/native_non_native_ranking.html), where, 5 modelled structures can be evaluated at a given time.
Collapse
Affiliation(s)
- Soma Ghosh
- Molecular Biophysics Unit, Indian Institute of Science, Bangalore, 560012, India ; I.I.Sc. Mathematics Initiative, Indian Institute of Science, Bangalore, 560012, India
| | | |
Collapse
|