1
Hamamsy T, Morton JT, Blackwell R, Berenberg D, Carriero N, Gligorijevic V, Strauss CEM, Leman JK, Cho K, Bonneau R. Protein remote homology detection and structural alignment using deep learning. Nat Biotechnol 2024; 42:975-985. [PMID: 37679542 PMCID: PMC11180608 DOI: 10.1038/s41587-023-01917-2] [Citation(s) in RCA: 10] [Impact Index Per Article: 10.0] [Received: 09/07/2022] [Accepted: 07/26/2023] [Indexed: 09/09/2023]
Abstract
Exploiting sequence-structure-function relationships in biotechnology requires improved methods for aligning proteins that have low sequence similarity to previously annotated proteins. We develop two deep learning methods to address this gap, TM-Vec and DeepBLAST. TM-Vec allows searching for structure-structure similarities in large sequence databases. It is trained to accurately predict TM-scores as a metric of structural similarity directly from sequence pairs without the need for intermediate computation or solution of structures. Once structurally similar proteins have been identified, DeepBLAST can structurally align proteins using only sequence information by identifying structurally homologous regions between proteins. It outperforms traditional sequence alignment methods and performs similarly to structure-based alignment methods. We show the merits of TM-Vec and DeepBLAST on a variety of datasets, including better identification of remotely homologous proteins compared with state-of-the-art sequence alignment and structure prediction methods.
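For orientation, the TM-score that TM-Vec is trained to predict can be sketched as below. This is only the standard definition of the metric for one fixed structural superposition, not TM-Vec itself, which predicts the score from sequence pairs. The function name, the floor on `d0` for short chains, and the example distances are assumptions made for illustration; a full implementation would also search for the superposition that maximizes the score.

```python
def tm_score(distances, target_length):
    """Return the TM-score for one fixed superposition.

    distances: distances (in angstroms) between aligned residue pairs.
    target_length: residue count of the protein used for normalization.
    """
    if target_length <= 0:
        raise ValueError("target_length must be positive")
    # Length-dependent distance scale d0 from the standard definition.
    # Flooring d0 at 0.5 for short chains is an assumption of this sketch.
    d0 = max(0.5, 1.24 * max(target_length - 15, 0) ** (1.0 / 3.0) - 1.8)
    # Each aligned pair contributes between 0 and 1; unaligned residues add 0.
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / target_length


# Hypothetical example: 100-residue target, 80 aligned pairs at 2 angstroms.
print(round(tm_score([2.0] * 80, 100), 3))
```

Scores lie between 0 and 1, and values above roughly 0.5 are commonly read as indicating the same fold.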
Grants
- R35GM122515 National Science Foundation (NSF)
- IOS-1546218 National Science Foundation (NSF)
- R35 GM122515 NIGMS NIH HHS
- R01 DK103358 NIDDK NIH HHS
- CBET-1728858 National Science Foundation (NSF)
- R01 AI130945 NIAID NIH HHS
- This research was supported by NIH R01DK103358, the Simons Foundation, NSF IOS-1546218, R35GM122515, NSF CBET-1728858, and NIH R01AI130945 to T.H. This research was supported by the intramural research program of the Eunice Kennedy Shriver National Institute of Child Health and Human Development (NICHD) to J.T.M. This research was supported by the Flatiron Institute as part of the Simons Foundation to Robert Blackwell, J.K.L., and N.C. This research was supported by Los Alamos National Lab to C.S. This research was supported by the Samsung Advanced Institute of Technology (Next Generation Deep Learning: from pattern recognition to AI), Samsung Research (Improving Deep Learning using Latent Structure), and NSF Award 1922658 to K.C.
- Simons Foundation
- U.S. Department of Health & Human Services | NIH | Eunice Kennedy Shriver National Institute of Child Health and Human Development (NICHD)
Affiliation(s)
- Tymor Hamamsy
  - Center for Data Science, New York University, New York, NY, USA
- James T Morton
  - Center for Computational Biology, Flatiron Institute, Simons Foundation, New York, NY, USA
  - Biostatistics and Bioinformatics Branch, Eunice Kennedy Shriver National Institute of Child Health and Human Development, National Institutes of Health, Bethesda, MD, USA
- Robert Blackwell
  - Scientific Computing Core, Flatiron Institute, Simons Foundation, New York, NY, USA
- Daniel Berenberg
  - Department of Computer Science, Courant Institute of Mathematical Sciences, New York University, New York, NY, USA
  - Prescient Design, New York, NY, USA
- Nicholas Carriero
  - Scientific Computing Core, Flatiron Institute, Simons Foundation, New York, NY, USA
- Julia Koehler Leman
  - Center for Computational Biology, Flatiron Institute, Simons Foundation, New York, NY, USA
- Kyunghyun Cho
  - Center for Data Science, New York University, New York, NY, USA
  - Department of Computer Science, Courant Institute of Mathematical Sciences, New York University, New York, NY, USA
  - Prescient Design, New York, NY, USA
  - CIFAR, Toronto, Ontario, Canada
- Richard Bonneau
  - Center for Data Science, New York University, New York, NY, USA
  - Department of Computer Science, Courant Institute of Mathematical Sciences, New York University, New York, NY, USA
  - Prescient Design, New York, NY, USA
  - Department of Biology, New York University, New York, NY, USA
2
Kryshtafovych A, Rigden DJ. To split or not to split: CASP15 targets and their processing into tertiary structure evaluation units. Proteins 2023; 91:1558-1570. [PMID: 37254889 PMCID: PMC10687315 DOI: 10.1002/prot.26533] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/10/2023] [Revised: 05/02/2023] [Accepted: 05/18/2023] [Indexed: 06/01/2023]
Abstract
Processing of CASP15 targets into evaluation units (EUs) and assigning them to evolutionary-based prediction classes is presented in this study. The targets were first split into structural domains based on compactness and similarity to other proteins. Models were then evaluated against these domains and their combinations. The domains were joined into larger EUs if predictors' performance on the combined units was similar to that on individual domains. Alternatively, if most predictors performed better on the individual domains, then they were retained as EUs. As a result, 112 evaluation units were created from 77 tertiary structure prediction targets. The EUs were assigned to four prediction classes roughly corresponding to target difficulty categories in previous CASPs: TBM (template-based modeling, easy or hard), FM (free modeling), and the TBM/FM overlap category. More than a third of CASP15 EUs were attributed to the historically most challenging FM class, where homology or structural analogy to proteins of known fold cannot be detected.
Affiliation(s)
- Daniel J. Rigden
  - Institute of Systems, Molecular and Integrative Biology, University of Liverpool, Liverpool L69 7ZB, England
3
Abakarova M, Marquet C, Rera M, Rost B, Laine E. Alignment-based Protein Mutational Landscape Prediction: Doing More with Less. Genome Biol Evol 2023; 15:evad201. [PMID: 37936309 PMCID: PMC10653582 DOI: 10.1093/gbe/evad201] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/20/2023] [Revised: 10/27/2023] [Accepted: 11/01/2023] [Indexed: 11/09/2023]
Abstract
The wealth of genomic data has boosted the development of computational methods predicting the phenotypic outcomes of missense variants. The most accurate ones exploit multiple sequence alignments, which can be costly to generate. Recent efforts for democratizing protein structure prediction have overcome this bottleneck by leveraging the fast homology search of MMseqs2. Here, we show the usefulness of this strategy for mutational outcome prediction through a large-scale assessment of 1.5M missense variants across 72 protein families. Our study demonstrates the feasibility of producing alignment-based mutational landscape predictions that are both high-quality and compute-efficient for entire proteomes. We provide the community with the whole human proteome mutational landscape and simplified access to our predictive pipeline.
Affiliation(s)
- Marina Abakarova
  - CNRS, IBPS, Laboratory of Computational and Quantitative Biology (LCQB), Sorbonne Université, UMR 7238, Paris 75005, France
  - Université Paris Cité, INSERM UMR U1284, 75004 Paris, France
- Céline Marquet
  - Department of Informatics, Bioinformatics and Computational Biology - i12, TUM-Technical University of Munich, Boltzmannstr. 3, Garching, 85748 Munich, Germany
  - TUM Graduate School, Center of Doctoral Studies in Informatics and its Applications (CeDoSIA), Boltzmannstr. 11, 85748 Garching, Germany
- Michael Rera
  - Université Paris Cité, INSERM UMR U1284, 75004 Paris, France
- Burkhard Rost
  - Department of Informatics, Bioinformatics and Computational Biology - i12, TUM-Technical University of Munich, Boltzmannstr. 3, Garching, 85748 Munich, Germany
  - Institute for Advanced Study (TUM-IAS), Lichtenbergstr. 2a, Garching, 85748 Munich, Germany
  - TUM School of Life Sciences Weihenstephan (TUM-WZW), Alte Akademie 8, Freising, Germany
- Elodie Laine
  - CNRS, IBPS, Laboratory of Computational and Quantitative Biology (LCQB), Sorbonne Université, UMR 7238, Paris 75005, France
  - Institut universitaire de France (IUF)
4
Kagaya Y, Zhang Z, Ibtehaz N, Wang X, Nakamura T, Huang D, Kihara D. NuFold: A Novel Tertiary RNA Structure Prediction Method Using Deep Learning with Flexible Nucleobase Center Representation. bioRxiv 2023:2023.09.20.558715. [PMID: 37790488 PMCID: PMC10542152 DOI: 10.1101/2023.09.20.558715] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/05/2023]
Abstract
RNA not only plays a core role in the central dogma as mRNA between DNA and protein; many non-coding RNAs have also been discovered to have unique and diverse biological functions. As genome sequences become increasingly available and our knowledge of RNA sequences grows, the study of RNA's structure and function has become more demanding. However, experimental determination of three-dimensional RNA structures is both costly and time-consuming, resulting in a substantial disparity between RNA sequence data and structural insights. In response to this challenge, we propose a novel computational approach that harnesses NuFold, a state-of-the-art deep learning architecture, to accurately predict RNA tertiary structures. This approach aims to offer a cost-effective and efficient means of bridging the gap between RNA sequence information and structural comprehension. NuFold implements a nucleobase center representation, which allows it to reproduce all possible nucleotide conformations accurately.
Affiliation(s)
- Yuki Kagaya
  - Department of Biological Sciences, Purdue University, West Lafayette, Indiana, 47907, USA
- Zicong Zhang
  - Department of Computer Science, Purdue University, West Lafayette, Indiana, 47907, USA
- Nabil Ibtehaz
  - Department of Computer Science, Purdue University, West Lafayette, Indiana, 47907, USA
- Xiao Wang
  - Department of Computer Science, Purdue University, West Lafayette, Indiana, 47907, USA
- Tsukasa Nakamura
  - Department of Biological Sciences, Purdue University, West Lafayette, Indiana, 47907, USA
- David Huang
  - Department of Biological Sciences, Purdue University, West Lafayette, Indiana, 47907, USA
- Daisuke Kihara
  - Department of Biological Sciences, Purdue University, West Lafayette, Indiana, 47907, USA
  - Department of Computer Science, Purdue University, West Lafayette, Indiana, 47907, USA
5
de Brevern AG. An agnostic analysis of the human AlphaFold2 proteome using local protein conformations. Biochimie 2023; 207:11-19. [PMID: 36417962 DOI: 10.1016/j.biochi.2022.11.009] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 04/26/2022] [Revised: 10/14/2022] [Accepted: 11/17/2022] [Indexed: 11/21/2022]
Abstract
Knowledge of the 3D structure of proteins is a valuable asset for understanding their precise biological mechanisms. However, the cost of producing 3D structures and experimental difficulties limit their availability. Proposing 3D structural models is consequently an appealing alternative. The release of the AlphaFold deep learning approach has revolutionized the field. The recent near-complete human proteome proposal makes it possible to analyse large amounts of data and evaluate the results of the approach in greater depth. The 3D human proteome was thus analysed in light of the classic secondary structures and many less-used protein local conformations (PolyProline II helices, types of γ-turns, β-turns and β-bulges, curvature of the helices, and a structural alphabet). Without questioning the global quality of the approach, this analysis highlights certain local conformations that may be poorly predicted and could therefore be better addressed.
Affiliation(s)
- Alexandre G de Brevern
  - Université Paris Cité and Université des Antilles and Université de la Réunion, INSERM UMR_S 1134, BIGR, DSIMB Bioinformatics team, F-75014, Paris, France
6
Ibtehaz N, Sourav SMSH, Bayzid MS, Rahman MS. Align-gram: Rethinking the Skip-gram Model for Protein Sequence Analysis. Protein J 2023; 42:135-146. [PMID: 36977849 DOI: 10.1007/s10930-023-10096-7] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Accepted: 02/13/2023] [Indexed: 03/29/2023]
Abstract
The inception of next-generation sequencing technologies has exponentially increased the volume of biological sequence data. Protein sequences, often described as the 'language of life', have been analyzed for a multitude of applications and inferences. Owing to the rapid development of deep learning, in recent years there have been a number of breakthroughs in the domain of Natural Language Processing. Since these methods are capable of performing different tasks when trained with a sufficient amount of data, off-the-shelf models are used to perform various biological applications. In this study, we investigated the applicability of the popular Skip-gram model for protein sequence analysis and made an attempt to incorporate some biological insights into it. We propose a novel k-mer embedding scheme, Align-gram, which is capable of mapping similar k-mers close to each other in a vector space. Furthermore, we experiment with other sequence-based protein representations and observe that the embeddings derived from Align-gram better aid the modeling and training of deep learning models. Our experiments with a simple baseline LSTM model and the much more complex CNN model of DeepGoPlus show the potential of Align-gram in performing different types of deep learning applications for protein sequence analysis.
7
Li S, Yuan L, Ma Y, Liu Y. WG-ICRN: Protein 8-state secondary structure prediction based on Wasserstein generative adversarial networks and residual networks with Inception modules. Math Biosci Eng 2023; 20:7721-7737. [PMID: 37161169 DOI: 10.3934/mbe.2023333] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/11/2023]
Abstract
Protein secondary structure is the basis of studying the tertiary structure of proteins, drug design and development, and the 8-state protein secondary structure can provide more adequate protein information than the 3-state structure. Therefore, this paper proposes a novel method WG-ICRN for predicting protein 8-state secondary structures. First, we use the Wasserstein generative adversarial network (WGAN) to extract protein features in the position-specific scoring matrix (PSSM). The extracted features are combined with PSSM into a new feature set of WG-data, which contains richer feature information. Then, we use the residual network (ICRN) with Inception to further extract the features in WG-data and complete the prediction. Compared with the residual network, ICRN can reduce parameter calculations and increase the width of feature extraction to obtain more feature information. We evaluated the prediction performance of the model using six datasets. The experimental results show that the WGAN has excellent feature extraction capabilities, and ICRN can further improve network performance and improve prediction accuracy. Compared with four popular models, WG-ICRN achieves better prediction performance.
Affiliation(s)
- Shun Li
  - School of Computer Science and Technology, Qilu University of Technology (Shandong Academy of Sciences), Jinan 250353, China
- Lu Yuan
  - School of Computer Science and Technology, Qilu University of Technology (Shandong Academy of Sciences), Jinan 250353, China
- Yuming Ma
  - School of Computer Science and Technology, Qilu University of Technology (Shandong Academy of Sciences), Jinan 250353, China
- Yihui Liu
  - School of Computer Science and Technology, Qilu University of Technology (Shandong Academy of Sciences), Jinan 250353, China
8
Yuan L, Ma Y, Liu Y. Ensemble deep learning models for protein secondary structure prediction using bidirectional temporal convolution and bidirectional long short-term memory. Front Bioeng Biotechnol 2023; 11:1051268. [PMID: 36860882 PMCID: PMC9968878 DOI: 10.3389/fbioe.2023.1051268] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/26/2022] [Accepted: 02/03/2023] [Indexed: 02/16/2023]
Abstract
Protein secondary structure prediction (PSSP) is a challenging task in computational biology. However, existing models with deep architectures are insufficient for comprehensive long-range feature extraction from long sequences. This paper proposes a novel deep learning model to improve protein secondary structure prediction. In the model, our proposed bidirectional temporal convolutional network (BTCN) can extract the bidirectional deep local dependencies in protein sequences segmented by the sliding window technique, the bidirectional long short-term memory (BLSTM) network can extract the global interactions between residues, and our proposed multi-scale bidirectional temporal convolutional network (MSBTCN) can further capture the bidirectional multi-scale long-range features of residues while preserving the hidden layer information more comprehensively. In particular, we also propose that fusing the features of 3-state and 8-state protein secondary structure prediction can further improve the prediction accuracy. Moreover, we also propose and compare multiple novel deep models by combining BLSTM with a temporal convolutional network (TCN), reverse temporal convolutional network (RTCN), multi-scale temporal convolutional network, BTCN and MSBTCN, respectively. Furthermore, we demonstrate that the reverse prediction of secondary structure outperforms the forward prediction, suggesting that amino acids at later positions have a greater impact on secondary structure recognition. Experimental results on benchmark datasets including CASP10, CASP11, CASP12, CASP13, CASP14, and CB513 show that our methods achieve better prediction performance compared with five state-of-the-art methods.
Affiliation(s)
- Lu Yuan
  - School of Computer Science and Technology, Qilu University of Technology (Shandong Academy of Sciences), Jinan, China
- Yuming Ma
  - *Correspondence: Yuming Ma; Yihui Liu
- Yihui Liu
  - *Correspondence: Yuming Ma; Yihui Liu
9
Liu J, Zhao K, Zhang G. Improved model quality assessment using sequence and structural information by enhanced deep neural networks. Brief Bioinform 2023; 24:6865134. [PMID: 36460624 DOI: 10.1093/bib/bbac507] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/12/2022] [Revised: 10/02/2022] [Accepted: 10/24/2022] [Indexed: 12/04/2022]
Abstract
Protein model quality assessment plays an important role in protein structure prediction, protein design and drug discovery. In this work, DeepUMQA2, a substantially improved version of DeepUMQA for protein model quality assessment, is proposed. First, sequence features containing protein co-evolution information and structural features reflecting family information are extracted to complement model-dependent features. Second, a novel backbone network based on triangular multiplication update and axial attention mechanism is designed to enhance information exchange between inter-residue pairs. On the CASP13 and CASP14 datasets, the performance of DeepUMQA2 increases by 20.5% and 20.4% compared with DeepUMQA, respectively (measured by top 1 loss). Moreover, on the three-month CAMEO dataset (11 March to 04 June 2022), DeepUMQA2 outperforms DeepUMQA by 15.5% (measured by local AUC0,0.2) and ranks first among all competing server methods in the CAMEO blind test. Experimental results show that DeepUMQA2 outperforms state-of-the-art model quality assessment methods, such as ProQ3D-LDDT, ModFOLD8 and DeepAccNet, and that DeepUMQA2 can select better top models than the self-assessments provided by state-of-the-art protein structure prediction methods, such as AlphaFold2, RoseTTAFold and I-TASSER.
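The "top 1 loss" metric named in this abstract can be illustrated with a short sketch. It assumes the common definition: the true quality of the best model in a pool minus the true quality of the model that the assessment method ranks first. The function name and the example scores are hypothetical, and the abstract does not state which quality measure is used.

```python
def top1_loss(predicted_scores, true_scores):
    """Quality lost by trusting the assessor's top-ranked model.

    predicted_scores: quality estimates from the assessment method.
    true_scores: reference qualities of the same models, in the same order.
    """
    if len(predicted_scores) != len(true_scores) or not true_scores:
        raise ValueError("score lists must be non-empty and of equal length")
    # Index of the model the assessment method would select.
    selected = max(range(len(predicted_scores)), key=predicted_scores.__getitem__)
    # Zero means the method picked the truly best model.
    return max(true_scores) - true_scores[selected]


# Hypothetical pool of three models: the assessor prefers the first one,
# but the second one is actually the best.
print(top1_loss([0.9, 0.8, 0.7], [0.6, 0.8, 0.5]))
```

Lower values are better under this definition, so a reported improvement in top 1 loss corresponds to a smaller gap.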
Affiliation(s)
- Jun Liu
  - College of Information Engineering, Zhejiang University of Technology
- Kailong Zhao
  - College of Information Engineering, Zhejiang University of Technology
- Guijun Zhang
  - College of Information Engineering, Zhejiang University of Technology
10
Johnson TO, Akinsanmi AO, Ejembi SA, Adeyemi OE, Oche JR, Johnson GI, Adegboyega AE. Modern drug discovery for inflammatory bowel disease: The role of computational methods. World J Gastroenterol 2023; 29:310-331. [PMID: 36687123 PMCID: PMC9846937 DOI: 10.3748/wjg.v29.i2.310] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 09/22/2022] [Revised: 11/02/2022] [Accepted: 12/21/2022] [Indexed: 01/06/2023]
Abstract
Inflammatory bowel diseases (IBDs), comprising ulcerative colitis, Crohn’s disease and microscopic colitis, are characterized by chronic inflammation of the gastrointestinal tract. IBD has spread around the world and is becoming more prevalent at an alarming rate in developing countries whose societies have become more westernized. Cell therapy, intestinal microecology, apheresis therapy, exosome therapy and small molecules are emerging therapeutic options for IBD. Currently, it is thought that low-molecular-mass substances with good oral bioavailability and the ability to permeate the cell membrane to regulate the action of elements of the inflammatory signaling pathway are effective therapeutic options for the treatment of IBD. Several small molecule inhibitors are being developed as a promising alternative for IBD therapy. The use of highly efficient and time-saving techniques, such as computational methods, is still a viable option for the development of these small molecule drugs. The computer-aided (in silico) discovery approach is one drug development technique that has largely proven effective. Computational approaches, when combined with traditional drug development methodology, dramatically boost the likelihood of drug discovery in a sustainable and cost-effective manner. This review focuses on the modern drug discovery approaches for the design of novel IBD drugs with an emphasis on the role of computational methods. Some computational approaches to IBD genomic studies, target identification, and virtual screening for the discovery of new drugs and for the repurposing of existing drugs are discussed.
Affiliation(s)
- Jane-Rose Oche
  - Department of Biochemistry, University of Jos, Jos 930222, Plateau, Nigeria
- Grace Inioluwa Johnson
  - Faculty of Clinical Sciences, College of Health Sciences, University of Jos, Jos 930222, Plateau, Nigeria
11
Nagaraju M, Liu H. A scoring function for the prediction of protein complex interfaces based on the neighborhood preferences of amino acids. Acta Crystallogr D Struct Biol 2023; 79:31-39. [PMID: 36601805 DOI: 10.1107/s2059798322011858] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/18/2022] [Accepted: 12/13/2022] [Indexed: 12/24/2022]
Abstract
Proteins often assemble into functional complexes, the structures of which are more difficult to obtain than those of the individual protein molecules. Given the structures of the subunits, it is possible to predict plausible complex models via computational methods such as molecular docking. Assessing the quality of the predicted models is crucial to obtain correct complex structures. Here, an energy-scoring function was developed based on the interfacial residues of structures in the Protein Data Bank. The statistically derived energy function (Nepre) imitates the neighborhood preferences of amino acids, including the types and relative positions of neighboring residues. Based on the preference statistics, a program iNepre was implemented and its performance was evaluated with several benchmarking decoy data sets. The results show that iNepre scores are powerful in model ranking to select the best protein complex structures.
Affiliation(s)
- Mulpuri Nagaraju
  - Complex Systems Division, Beijing Computational Science Research Center, Beijing 100193, People's Republic of China
- Haiguang Liu
  - Complex Systems Division, Beijing Computational Science Research Center, Beijing 100193, People's Republic of China
12
Yuan L, Ma Y, Liu Y. Protein secondary structure prediction based on Wasserstein generative adversarial networks and temporal convolutional networks with convolutional block attention modules. Math Biosci Eng 2023; 20:2203-2218. [PMID: 36899529 DOI: 10.3934/mbe.2023102] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/18/2023]
Abstract
As an important task in bioinformatics, protein secondary structure prediction (PSSP) not only benefits protein function research and tertiary structure prediction but also promotes the design and development of new drugs. However, current PSSP methods cannot sufficiently extract effective features. In this study, we propose a novel deep learning model, WGACSTCN, which combines a Wasserstein generative adversarial network with gradient penalty (WGAN-GP), a convolutional block attention module (CBAM) and a temporal convolutional network (TCN) for 3-state and 8-state PSSP. In the proposed model, the mutual game of generator and discriminator in the WGAN-GP module can effectively extract protein features, our CBAM-TCN local extraction module can capture key deep local interactions in protein sequences segmented by the sliding window technique, and the CBAM-TCN long-range extraction module can further capture the key deep long-range interactions in sequences. We evaluate the performance of the proposed model on seven benchmark datasets. Experimental results show that our model exhibits better prediction performance compared with four state-of-the-art models. The proposed model has strong feature extraction ability and can extract important information more comprehensively.
Affiliation(s)
- Lu Yuan
  - School of Computer Science and Technology, Qilu University of Technology (Shandong Academy of Sciences), Jinan 250353, China
- Yuming Ma
  - School of Computer Science and Technology, Qilu University of Technology (Shandong Academy of Sciences), Jinan 250353, China
- Yihui Liu
  - School of Computer Science and Technology, Qilu University of Technology (Shandong Academy of Sciences), Jinan 250353, China
13
Avramouli A. Computational Methods for Protein Tertiary Structure Analysis. Adv Exp Med Biol 2023; 1424:61-67. [PMID: 37486480 DOI: 10.1007/978-3-031-31982-2_7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/25/2023]
Abstract
Protein folding accuracy is fundamental to all cells. In spite of this, it is difficult to maintain the fidelity of protein synthesis and folding due to the fact that the implicit genetic and biochemical systems are inherently prone to error, which leads to the constant production of a certain amount of misfolded proteins. This problem is further compounded by genetic variation and the effects of environmental stress. To that end, the prediction of protein structures for tertiary protein structure analysis and prediction might be an ideal approach for the study of mutation effects in macromolecules and their complexes. With the development and accessibility to increasingly powerful computational systems, this type of study will enable a wide variety of opportunities for the creation of better-targeted peptide-based pharmacotherapy and prospects for precision medicine in future.
14
Bartuzi D, Kaczor AA, Matosiuk D. Illuminating the "Twilight Zone": Advances in Difficult Protein Modeling. Methods Mol Biol 2023; 2627:25-40. [PMID: 36959440 DOI: 10.1007/978-1-0716-2974-1_2] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Indexed: 03/25/2023]
Abstract
Homology modeling was long considered a method of choice in tertiary protein structure prediction. However, it used to provide models of acceptable quality only when templates with appreciable sequence identity with a target could be found. The threshold value was long assumed to be around 20-30%. Below this level, the obtained sequence identity came dangerously close to values that can be obtained by chance when aligning random, unrelated sequences. In these cases, other approaches, including ab initio folding simulations or fragment assembly, were usually employed. The most recent editions of the CASP and CAMEO community-wide modeling methods assessment have brought some surprising outcomes, proving that many more clues can be inferred from protein sequence analyses than previously thought. In this chapter, we focus on recent advances in the field of difficult protein modeling, pushing the threshold deep into the "twilight zone", with particular attention devoted to improvements in applications of machine learning and model evaluation.
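The sequence identity threshold discussed in this abstract can be made concrete with a small sketch. It assumes one common convention, in which identity is the share of matching residues among aligned columns where neither sequence has a gap; other conventions use a different denominator. The function names, the gap character, and the default threshold of 30% (the upper end of the range quoted above) are illustrative assumptions.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over the gap-free columns of a pairwise alignment."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Keep only columns where both sequences have a residue.
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != gap and b != gap]
    if not pairs:
        return 0.0
    matches = sum(a == b for a, b in pairs)
    return 100.0 * matches / len(pairs)


def is_below_identity_threshold(identity, threshold=30.0):
    """True when identity falls under the assumed template-usability threshold."""
    return identity < threshold


# Hypothetical aligned pair: 4 identical residues over 5 gap-free columns.
identity = percent_identity("ACDE-G", "ACQEFG")
print(identity, is_below_identity_threshold(identity))
```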
Affiliation(s)
- Damian Bartuzi
  - Department of Synthesis and Chemical Technology of Pharmaceutical Substances with Computer Modelling Laboratory, Medical University of Lublin, Lublin, Poland
- Agnieszka A Kaczor
  - Department of Synthesis and Chemical Technology of Pharmaceutical Substances with Computer Modelling Laboratory, Medical University of Lublin, Lublin, Poland
  - University of Eastern Finland, School of Pharmacy, Kuopio, Finland
- Dariusz Matosiuk
  - Department of Synthesis and Chemical Technology of Pharmaceutical Substances with Computer Modelling Laboratory, Medical University of Lublin, Lublin, Poland
15
Newton MH, Zaman R, Mataeimoghadam F, Rahman J, Sattar A. Constraint Guided Beta-Sheet Refinement for Protein Structure Prediction. Comput Biol Chem 2022; 101:107773. [DOI: 10.1016/j.compbiolchem.2022.107773] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/08/2022] [Revised: 09/15/2022] [Accepted: 09/16/2022] [Indexed: 11/16/2022]
|
16
|
Lee S, Kim S, Lee GR, Kwon S, Woo H, Seok C, Park H. Evaluating GPCR modeling and docking strategies in the era of deep learning-based protein structure prediction. Comput Struct Biotechnol J 2022; 21:158-167. [PMID: 36544468 PMCID: PMC9747351 DOI: 10.1016/j.csbj.2022.11.057] [Citation(s) in RCA: 12] [Impact Index Per Article: 6.0] [Received: 08/15/2022] [Revised: 11/27/2022] [Accepted: 11/27/2022] [Indexed: 12/03/2022]
Abstract
While deep learning (DL) has brought a revolution in the protein structure prediction field, still an important question remains how the revolution can be transferred to advances in structure-based drug discovery. Because the lessons from the recent GPCR dock challenge were inconclusive primarily due to the size of the dataset, in this work we further elaborated on 70 diverse GPCR complexes bound to either small molecules or peptides to investigate the best-practice modeling and docking strategies for GPCR drug discovery. From our quantitative analysis, it is shown that substantial improvements in docking and virtual screening have been possible by the advance in DL-based protein structure predictions with respect to the expected results from the combination of best pre-DL tools. The success rate of docking on DL-based model structures approaches that of cross-docking on experimental structures, showing over 30% improvement from the best pre-DL protocols. This amount of performance could be achieved only when two modeling points were considered properly: 1) correct functional-state modeling of receptors and 2) receptor-flexible docking. Best-practice modeling strategies and the model confidence estimation metric suggested in this work may serve as a guideline for future computer-aided GPCR drug discovery scenarios.
Key Words
- AF, AlphaFold
- CAPRI, critical assessment of predicted interactions
- DOF, Degree-of-freedom
- DL, deep learning
- Deep learning
- Drug discovery
- GALD, Rosetta GA LigandDock
- GD3, GalaxyDock3
- GDT, global distance test
- GPCR
- Ligand docking
- MD, molecular dynamics
- Protein structure prediction
- RMSD, root-mean-squared deviation
- SBDD, Structure-based drug design
- TBM, template-based modeling or template-based model
- p-lDDT, predicted local distance difference test
Affiliation(s)
- Sumin Lee
  - Interdisciplinary Program in Artificial Intelligence, Seoul National University, Seoul 08826, Republic of Korea
- Seeun Kim
  - Department of Chemistry, Seoul National University, Seoul 08826, Republic of Korea
- Gyu Rie Lee
  - Department of Biochemistry, University of Washington, WA, USA
- Sohee Kwon
  - Department of Chemistry, Seoul National University, Seoul 08826, Republic of Korea
- Hyeonuk Woo
  - Department of Chemistry, Seoul National University, Seoul 08826, Republic of Korea
- Chaok Seok
  - Department of Chemistry, Seoul National University, Seoul 08826, Republic of Korea; corresponding author
- Hahnbeom Park
  - Brain Science Institute, Korea Institute of Science and Technology, Seoul 02792, Republic of Korea; corresponding author
|
17
|
Yuan L, Hu X, Ma Y, Liu Y. DLBLS_SS: protein secondary structure prediction using deep learning and broad learning system. RSC Adv 2022; 12:33479-33487. [PMID: 36505696 PMCID: PMC9682407 DOI: 10.1039/d2ra06433b] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/12/2022] [Accepted: 11/16/2022] [Indexed: 11/24/2022]
Abstract
Protein secondary structure prediction (PSSP) is not only beneficial to the study of protein structure and function but also to the development of drugs. As a challenging task in computational biology, experimental methods for PSSP are time-consuming and expensive. In this paper, we propose a novel PSSP model DLBLS_SS based on deep learning and broad learning system (BLS) to predict 3-state and 8-state secondary structure. We first use a bidirectional long short-term memory (BLSTM) network to extract global features in residue sequences. Then, our proposed SEBTCN based on temporal convolutional networks (TCN) and channel attention can capture bidirectional key long-range dependencies in sequences. We also use BLS to rapidly optimize fused features while further capturing local interactions between residues. We conduct extensive experiments on public test sets including CASP10, CASP11, CASP12, CASP13, CASP14 and CB513 to evaluate the performance of the model. Experimental results show that our model exhibits better 3-state and 8-state PSSP performance compared to five state-of-the-art models.
Affiliation(s)
- Lu Yuan
  - School of Computer Science and Technology, Qilu University of Technology (Shandong Academy of Sciences), Jinan 250353, China
- Xiaopei Hu
  - School of Computer Science and Technology, Qilu University of Technology (Shandong Academy of Sciences), Jinan 250353, China
- Yuming Ma
  - School of Computer Science and Technology, Qilu University of Technology (Shandong Academy of Sciences), Jinan 250353, China
- Yihui Liu
  - School of Computer Science and Technology, Qilu University of Technology (Shandong Academy of Sciences), Jinan 250353, China
|
18
|
Bhowmick S, Jing T, Wang W, Zhang EY, Zhang F, Yang Y. In Silico Protein Folding Prediction of COVID-19 Mutations and Variants. Biomolecules 2022; 12:1665. [PMID: 36359015 PMCID: PMC9688002 DOI: 10.3390/biom12111665] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 09/29/2022] [Revised: 11/08/2022] [Accepted: 11/09/2022] [Indexed: 08/27/2023]
Abstract
With its fast-paced mutagenesis, the SARS-CoV-2 Omicron variant has threatened many societies worldwide. Strategies for predicting mutagenesis such as the computational prediction of SARS-CoV-2 structural diversity and its interaction with the human receptor will greatly benefit our understanding of the virus and help develop therapeutics against it. We aim to use protein structure prediction algorithms along with molecular docking to study the effects of various mutations in the Receptor Binding Domain (RBD) of the SARS-CoV-2 and its key interactions with the angiotensin-converting enzyme 2 (ACE-2) receptor. The RBD structures of the naturally occurring variants of SARS-CoV-2 were generated from the WUHAN-Hu-1 using the trRosetta algorithm. Docking (HADDOCK) and binding analysis (PRODIGY) between the predicted RBD sequences and ACE-2 highlighted key interactions at the Receptor-Binding Motif (RBM). Further mutagenesis at conserved residues in the Original, Delta, and Omicron variants (P499S and T500R) demonstrated stronger binding and interactions with the ACE-2 receptor. The predicted T500R mutation underwent some preliminary tests in vitro for its binding and transmissibility in cells; the results correlate with the in-silico analysis. In summary, we suggest conserved residues P499 and T500 as potential mutation sites that could increase the binding affinity and yet do not exist in nature. This work demonstrates the use of the trRosetta algorithm to predict protein structure and future mutations at the RBM of SARS-CoV-2, followed by experimental testing for further efficacy verification. It is important to understand the protein structure and folding to help develop potential therapeutics.
Affiliation(s)
- Yanmin Yang
  - Department of Neurology and Neurological Sciences, School of Medicine, Stanford University, 1201 Welch Road, MSLS, P259, Stanford, CA 94305, USA
|
19
|
An J, Weng X. Collectively encoding protein properties enriches protein language models. BMC Bioinformatics 2022; 23:467. [DOI: 10.1186/s12859-022-05031-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/14/2022] [Accepted: 10/31/2022] [Indexed: 11/10/2022]
Abstract
Pre-trained natural language processing models on a large natural language corpus can naturally transfer learned knowledge to protein domains by fine-tuning specific in-domain tasks. However, few studies focused on enriching such protein language models by jointly learning protein properties from strongly-correlated protein tasks. Here we elaborately designed a multi-task learning (MTL) architecture, aiming to decipher implicit structural and evolutionary information from three sequence-level classification tasks for protein family, superfamily and fold. Considering the co-existing contextual relevance between human words and protein language, we employed BERT, pre-trained on a large natural language corpus, as our backbone to handle protein sequences. More importantly, the encoded knowledge obtained in the MTL stage can be well transferred to more fine-grained downstream tasks of TAPE. Experiments on structure- or evolution-related applications demonstrate that our approach outperforms many state-of-the-art Transformer-based protein models, especially in remote homology detection.
|
20
|
Chinnam NB, Syed A, Hura GL, Hammel M, Tainer JA, Tsutakawa SE. Combining small angle X-ray scattering (SAXS) with protein structure predictions to characterize conformations in solution. Methods Enzymol 2022; 678:351-376. [PMID: 36641214 PMCID: PMC10132260 DOI: 10.1016/bs.mie.2022.09.023] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/11/2022]
Abstract
Accurate protein structure predictions, enabled by recent advances in machine learning algorithms, provide an entry point to probing structural mechanisms and to integrating and querying many types of biochemical and biophysical results. Limitations in such protein structure predictions can be reduced and addressed through comparison to experimental Small Angle X-ray Scattering (SAXS) data that provides protein structural information in solution. SAXS data can not only validate computational predictions, but can improve conformational and assembly prediction to produce atomic models that are consistent with solution data and biologically relevant states. Here, we describe how to obtain protein structure predictions, compare them to experimental SAXS data and improve models to reflect experimental information from SAXS data. Furthermore, we consider the potential for such experimentally-validated protein structure predictions to broadly improve functional annotation in proteins identified in metagenomics and to identify functional clustering on conserved sites despite low sequence homology.
Affiliation(s)
- Naga Babu Chinnam
  - Department of Molecular and Cellular Oncology, The University of Texas MD Anderson Cancer Center, Houston, TX, United States
- Aleem Syed
  - Department of Molecular and Cellular Oncology, The University of Texas MD Anderson Cancer Center, Houston, TX, United States
- Greg L Hura
  - Molecular Biophysics and Integrated Bioimaging Division, Lawrence Berkeley National Laboratory, Berkeley, CA, United States; Department of Chemistry and Biochemistry, University of California Santa Cruz, Santa Cruz, CA, United States
- Michal Hammel
  - Molecular Biophysics and Integrated Bioimaging Division, Lawrence Berkeley National Laboratory, Berkeley, CA, United States
- John A Tainer
  - Department of Molecular and Cellular Oncology, The University of Texas MD Anderson Cancer Center, Houston, TX, United States; Department of Cancer Biology, University of Texas MD Anderson Cancer Center, Houston, TX, United States
- Susan E Tsutakawa
  - Molecular Biophysics and Integrated Bioimaging Division, Lawrence Berkeley National Laboratory, Berkeley, CA, United States
|
21
|
Geffen Y, Ofran Y, Unger R. DistilProtBert: a distilled protein language model used to distinguish between real proteins and their randomly shuffled counterparts. Bioinformatics 2022; 38:ii95-ii98. [PMID: 36124789 DOI: 10.1093/bioinformatics/btac474] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Indexed: 12/25/2022]
Abstract
SUMMARY Recently, deep learning models, initially developed in the field of natural language processing (NLP), were applied successfully to analyze protein sequences. A major drawback of these models is their size in terms of the number of parameters needed to be fitted and the amount of computational resources they require. Recently, 'distilled' models using the concept of student and teacher networks have been widely used in NLP. Here, we adapted this concept to the problem of protein sequence analysis, by developing DistilProtBert, a distilled version of the successful ProtBert model. Implementing this approach, we reduced the size of the network and the running time by 50%, and the computational resources needed for pretraining by 98% relative to the ProtBert model. Using two published tasks, we showed that the performance of the distilled model approaches that of the full model. We next tested the ability of DistilProtBert to distinguish between real and random protein sequences. The task is highly challenging if the composition is maintained on the level of singlet, doublet and triplet amino acids. Indeed, traditional machine-learning algorithms have difficulties with this task. Here, we show that DistilProtBert performs very well on singlet, doublet and even triplet-shuffled versions of the human proteome, with AUC of 0.92, 0.91 and 0.87, respectively. Finally, we suggest that by examining the small number of false-positive classifications (i.e. shuffled sequences classified as proteins by DistilProtBert), we may be able to identify de novo potential natural-like proteins based on random shuffling of amino acid sequences. AVAILABILITY AND IMPLEMENTATION https://github.com/yarongef/DistilProtBert.
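As an illustration of the simplest control described in this abstract, the sketch below produces a singlet-shuffled sequence, that is, a random permutation that preserves amino acid composition exactly. The function name `shuffle_singlets` and the example sequence are hypothetical and are not taken from the DistilProtBert repository. Doublet- and triplet-preserving shuffles require more involved algorithms and are not covered here.

```python
import random
from collections import Counter


def shuffle_singlets(sequence: str, seed: int = 0) -> str:
    """Return a random permutation of the residues in `sequence`.

    Single-residue (singlet) composition is preserved exactly;
    doublet and triplet frequencies are NOT preserved.
    """
    rng = random.Random(seed)      # local RNG so results are reproducible
    residues = list(sequence)
    rng.shuffle(residues)          # in-place Fisher-Yates shuffle
    return "".join(residues)


# Hypothetical toy sequence used only for illustration.
real = "MKTAYIAKQRQISFVKSHFSRQ"
decoy = shuffle_singlets(real, seed=42)

# The decoy has the same length and the same residue counts.
assert len(decoy) == len(real)
assert Counter(decoy) == Counter(real)
```

A classifier trained to separate `real` from `decoy` sequences cannot rely on composition alone, which is what makes the task described in the abstract non-trivial.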
Affiliation(s)
- Yaron Geffen
  - The Mina & Everard Goodman Faculty of Life Sciences, Bar-Ilan University, Ramat-Gan 5290002, Israel
- Yanay Ofran
  - The Mina & Everard Goodman Faculty of Life Sciences, Bar-Ilan University, Ramat-Gan 5290002, Israel
- Ron Unger
  - The Mina & Everard Goodman Faculty of Life Sciences, Bar-Ilan University, Ramat-Gan 5290002, Israel
|
22
|
Protein structure prediction based on particle swarm optimization and tabu search strategy. BMC Bioinformatics 2022; 23:352. [PMID: 35999491 PMCID: PMC9396775 DOI: 10.1186/s12859-022-04888-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/09/2022] [Accepted: 08/10/2022] [Indexed: 11/10/2022]
Abstract
Background The stability of protein sequence structure plays an important role in the prevention and treatment of diseases. Results In this paper, particle swarm optimization and tabu search are combined to propose a new method for protein structure prediction. The experimental results show that: for four groups of artificial protein sequences with different lengths, this method obtains the lowest potential energy value and stable structure prediction results, and the effect is obviously better than the other two comparison methods. Taking the first group of protein sequences as an example, our method improves the prediction of minimum potential energy by 127% and 7% respectively. Conclusions Therefore, the method proposed in this paper is more suitable for the prediction of protein structural stability.
|
23
|
Stevens AO, He Y. Benchmarking the Accuracy of AlphaFold 2 in Loop Structure Prediction. Biomolecules 2022; 12:985. [PMID: 35883541 PMCID: PMC9312937 DOI: 10.3390/biom12070985] [Citation(s) in RCA: 17] [Impact Index Per Article: 8.5] [Received: 05/30/2022] [Revised: 07/05/2022] [Accepted: 07/12/2022] [Indexed: 01/22/2023]
Abstract
The inhibition of protein-protein interactions is a growing strategy in drug development. In addition to structured regions, many protein loop regions are involved in protein-protein interactions and thus have been identified as potential drug targets. To effectively target such regions, protein structure is critical. Loop structure prediction is a challenging subgroup in the field of protein structure prediction because of the reduced level of conservation in protein sequences compared to the secondary structure elements. AlphaFold 2 has been suggested to be one of the greatest achievements in the field of protein structure prediction. AlphaFold 2 predicted protein structures at near X-ray resolution in the Critical Assessment of protein Structure Prediction (CASP 14) competition in 2020. The purpose of this work is to survey the performance of AlphaFold 2 in specifically predicting protein loop regions. We have constructed an independent dataset of 31,650 loop regions from 2613 proteins (deposited after AlphaFold 2 was trained) with both experimentally determined structures and AlphaFold 2 predicted structures. With extensive evaluation using our dataset, the results indicate that AlphaFold 2 is a good predictor of the structure of loop regions, especially for short loop regions. Loops less than 10 residues in length have an average Root Mean Square Deviation (RMSD) of 0.33 Å and an average Template Modeling score (TM-score) of 0.82. However, we see that as the number of residues in a given loop increases, the accuracy of AlphaFold 2's prediction decreases. Loops more than 20 residues in length have an average RMSD of 2.04 Å and an average TM-score of 0.55. Such a correlation between accuracy and length of the loop is directly linked to the increase in flexibility. Moreover, AlphaFold 2 does slightly over-predict α-helices and β-strands in proteins.
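The two accuracy measures quoted in this abstract can be sketched as follows. This is a simplified illustration, not the authors' pipeline: it assumes the predicted and experimental coordinates are already superimposed and matched residue by residue, whereas a full TM-score calculation also searches for the optimal superposition. The function names are hypothetical, and the 0.5 Å floor on the distance scale `d0` is an assumption made here so that the formula stays defined for chains of 15 residues or fewer.

```python
import math


def rmsd(coords_a, coords_b):
    """Root mean square deviation between two matched coordinate lists."""
    squared = [math.dist(p, q) ** 2 for p, q in zip(coords_a, coords_b)]
    return math.sqrt(sum(squared) / len(squared))


def tm_score(coords_a, coords_b, target_length=None):
    """TM-score for pre-superimposed, residue-matched coordinates.

    Uses the standard length-dependent scale
        d0 = 1.24 * (L - 15) ** (1/3) - 1.8
    with a floor of 0.5 (an assumption of this sketch for short chains).
    """
    length = target_length or len(coords_a)
    if length > 15:
        d0 = max(1.24 * (length - 15) ** (1.0 / 3.0) - 1.8, 0.5)
    else:
        d0 = 0.5
    total = 0.0
    for p, q in zip(coords_a, coords_b):
        d = math.dist(p, q)
        total += 1.0 / (1.0 + (d / d0) ** 2)
    return total / length


# Toy C-alpha traces: the second is the first shifted by 1 Å along x.
native = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
model = [(x + 1.0, y, z) for x, y, z in native]

print(f"RMSD = {rmsd(native, model):.2f} Å")
print(f"TM-score = {tm_score(native, model):.2f}")
```

RMSD grows without bound as individual residues deviate, while TM-score is bounded between 0 and 1 and down-weights large deviations, which is why the two measures are usually reported together.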
Affiliation(s)
- Amy O. Stevens
  - Department of Chemistry and Chemical Biology, University of New Mexico, Albuquerque, NM 87131, USA
- Yi He
  - Department of Chemistry and Chemical Biology, University of New Mexico, Albuquerque, NM 87131, USA
  - Translational Informatics Division, Department of Internal Medicine, University of New Mexico, Albuquerque, NM 87131, USA
|
24
|
Fenoy E, Edera AA, Stegmayer G. Transfer learning in proteins: evaluating novel protein learned representations for bioinformatics tasks. Brief Bioinform 2022; 23:6618242. [PMID: 35758229 DOI: 10.1093/bib/bbac232] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 01/25/2022] [Revised: 05/13/2022] [Accepted: 05/18/2022] [Indexed: 11/13/2022]
Abstract
A representation method is an algorithm that calculates numerical feature vectors for samples in a dataset. Such vectors, also known as embeddings, define a relatively low-dimensional space able to efficiently encode high-dimensional data. Very recently, many types of learned data representations based on machine learning have appeared and are being applied to several tasks in bioinformatics. In particular, protein representation learning methods integrate different types of protein information (sequence, domains, etc.), in supervised or unsupervised learning approaches, and provide embeddings of protein sequences that can be used for downstream tasks. One task that is of special interest is the automatic function prediction of the huge number of novel proteins that are being discovered nowadays and are still totally uncharacterized. However, despite its importance, to date there is no fair benchmark study of the predictive performance of existing proposals on the same large set of proteins and for very concrete and common bioinformatics tasks. This lack of benchmark studies prevents the community from using adequate predictive methods for accelerating the functional characterization of proteins. In this study, we performed a detailed comparison of protein sequence representation learning methods, explaining each approach and comparing them with an experimental benchmark on several bioinformatics tasks: (i) determining protein sequence similarity in the embedding space; (ii) inferring protein domains and (iii) predicting ontology-based protein functions. We examine the advantages and disadvantages of each representation approach over the benchmark results. We hope the results and the discussion of this study can help the community to select the most adequate machine learning-based technique for protein representation according to the bioinformatics task at hand.
Affiliation(s)
- Emilio Fenoy
  - Research Institute for Signals, Systems and Computational Intelligence sinc(i) (CONICET-UNL), Ciudad Universitaria, Santa Fe, Argentina
- Alejando A Edera
  - Research Institute for Signals, Systems and Computational Intelligence sinc(i) (CONICET-UNL), Ciudad Universitaria, Santa Fe, Argentina
- Georgina Stegmayer
  - Research Institute for Signals, Systems and Computational Intelligence sinc(i) (CONICET-UNL), Ciudad Universitaria, Santa Fe, Argentina
|
25
|
Zhang W, Meng Q, Wang J, Guo F. HDIContact: a novel predictor of residue-residue contacts on hetero-dimer interfaces via sequential information and transfer learning strategy. Brief Bioinform 2022; 23:6599074. [PMID: 35653713 DOI: 10.1093/bib/bbac169] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 01/17/2022] [Revised: 03/07/2022] [Accepted: 04/16/2022] [Indexed: 11/12/2022]
Abstract
Proteins maintain the functional order of cell in life by interacting with other proteins. Determination of protein complex structural information gives biological insights for the research of diseases and drugs. Recently, a breakthrough has been made in protein monomer structure prediction. However, due to the limited number of the known protein structure and homologous sequences of complexes, the prediction of residue-residue contacts on hetero-dimer interfaces is still a challenge. In this study, we have developed a deep learning framework for inferring inter-protein residue contacts from sequential information, called HDIContact. We utilized transfer learning strategy to produce Multiple Sequence Alignment (MSA) two-dimensional (2D) embedding based on patterns of concatenated MSA, which could reduce the influence of noise on MSA caused by mismatched sequences or less homology. For MSA 2D embedding, HDIContact took advantage of Bi-directional Long Short-Term Memory (BiLSTM) with two-channel to capture 2D context of residue pairs. Our comprehensive assessment on the Escherichia coli (E. coli) test dataset showed that HDIContact outperformed other state-of-the-art methods, with top precision of 65.96%, the Area Under the Receiver Operating Characteristic curve (AUROC) of 83.08% and the Area Under the Precision Recall curve (AUPR) of 25.02%. In addition, we analyzed the potential of HDIContact for human-virus protein-protein complexes, by achieving top five precision of 80% on O75475-P04584 related to Human Immunodeficiency Virus. All experiments indicated that our method was a valuable technical tool for predicting inter-protein residue contacts, which would be helpful for understanding protein-protein interaction mechanisms.
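The "top precision" figures in this abstract refer to the fraction of true contacts among the highest-scoring predicted residue pairs. A minimal sketch of that metric is shown below; the function name `top_k_precision` and the toy scores are hypothetical, and the exact ranking cutoff used by HDIContact is not stated in the abstract.

```python
def top_k_precision(scores, labels, k):
    """Fraction of true contacts among the k highest-scoring residue pairs.

    scores: predicted contact probabilities, one per residue pair
    labels: 1 if the pair is a true contact, 0 otherwise
    """
    if k <= 0:
        raise ValueError("k must be positive")
    # Rank pairs by predicted score, highest first.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    top = ranked[:k]
    return sum(label for _, label in top) / len(top)


# Toy example: five candidate inter-protein residue pairs.
scores = [0.95, 0.90, 0.85, 0.80, 0.75]
labels = [1, 1, 0, 1, 1]

# Four of the top five predictions are true contacts.
print(top_k_precision(scores, labels, k=5))
```

With this definition, a top five precision of 80%, as reported for the human-virus complex, corresponds to four correct contacts among the five most confident predictions.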
Affiliation(s)
- Wei Zhang
  - School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, China
- Qiaozhen Meng
  - School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, China
- Jianxin Wang
  - School of Computer Science and Engineering, Central South University, Changsha 410083, China
- Fei Guo
  - School of Computer Science and Engineering, Central South University, Changsha 410083, China
|
26
|
Paiva VDA, Gomes IDS, Monteiro CR, Mendonça MV, Martins PM, Santana CA, Gonçalves-Almeida V, Izidoro SC, Melo-Minardi RCD, Silveira SDA. Protein structural bioinformatics: An overview. Comput Biol Med 2022; 147:105695. [DOI: 10.1016/j.compbiomed.2022.105695] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/19/2021] [Revised: 06/01/2022] [Accepted: 06/02/2022] [Indexed: 11/27/2022]
|
27
|
Weissenow K, Heinzinger M, Rost B. Protein language-model embeddings for fast, accurate, and alignment-free protein structure prediction. Structure 2022; 30:1169-1177.e4. [DOI: 10.1016/j.str.2022.05.001] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 09/09/2021] [Revised: 02/25/2022] [Accepted: 04/29/2022] [Indexed: 01/27/2023]
|
28
|
Monroe L, Kihara D. Using steered molecular dynamic tension for assessing quality of computational protein structure models. J Comput Chem 2022; 43:1140-1150. [PMID: 35475517 DOI: 10.1002/jcc.26876] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/21/2021] [Revised: 02/16/2022] [Accepted: 04/15/2022] [Indexed: 11/12/2022]
Abstract
The native structures of proteins, with the notable exception of intrinsically disordered proteins, in general take their most stable conformation under physiological conditions to maintain their structural framework so that their biological function can be properly carried out. Experimentally, the stability of a protein can be measured by several means, among which the pulling experiment using the atomic force microscope (AFM) stands as a unique method. AFM directly measures the resistance to unfolding, which can be quantified from the observed force-extension profile. It has been shown that key features observed in an AFM pulling experiment can be well reproduced by computational molecular dynamics simulations. Here, we applied computational pulling for estimating the accuracy of computational protein structure models under the hypothesis that the structural stability would be positively correlated with the accuracy, i.e. the closeness to the native, of a model. We used in total 4929 structure models for 24 target proteins from the Critical Assessment of Techniques of Structure Prediction (CASP) and investigated if the magnitude of the break force, that is, the force required to rearrange the model's structure, from the force profile was sufficient information for selecting near-native models. We found that near-native models can be successfully selected by examining their break forces, suggesting that high break force indeed indicates high stability of models. On the other hand, there were also near-native models that had relatively low peak forces. The mechanisms of the stability exhibited by the break forces were explored and discussed.
Affiliation(s)
- Lyman Monroe
  - Department of Biological Sciences, Purdue University, West Lafayette, Indiana, USA
- Daisuke Kihara
  - Department of Biological Sciences, Purdue University, West Lafayette, Indiana, USA
  - Department of Computer Science, Purdue University, West Lafayette, Indiana, USA
  - Purdue Center for Cancer Research, Purdue University, West Lafayette, Indiana, USA
|
29
|
Gu J, Zhang T, Wu C, Liang Y, Shi X. Refined Contact Map Prediction of Peptides Based on GCN and ResNet. Front Genet 2022; 13:859626. [PMID: 35571037 PMCID: PMC9092020 DOI: 10.3389/fgene.2022.859626] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/21/2022] [Accepted: 03/23/2022] [Indexed: 11/13/2022]
Abstract
Predicting peptide inter-residue contact maps plays an important role in computational biology, which determines the topology of the peptide structure. However, due to the limited number of known homologous structures, there is still much room for inter-residue contact map prediction. Current models are not sufficient for capturing the high accuracy relationship between the residues, especially for those with a long-range distance. In this article, we developed a novel deep neural network framework to refine the rough contact map produced by the existing methods. The rough contact map is used to construct the residue graph that is processed by the graph convolutional neural network (GCN). GCN can better capture the global information and is therefore used to grasp the long-range contact relationship. The residual convolutional neural network is also applied in the framework for learning local information. We conducted the experiments on four different test datasets, and the inter-residue long-range contact map prediction accuracy demonstrates the effectiveness of our proposed method.
Affiliation(s)
- Jiawei Gu
  - College of Computer Science and Technology, University of Jilin, Changchun, China
- Tianhao Zhang
  - College of Computer Science and Technology, University of Jilin, Changchun, China
- Chunguo Wu
  - College of Computer Science and Technology, University of Jilin, Changchun, China
  - Key Laboratory of Symbolic Computation and Knowledge Engineering, Ministry of Education, Changchun, China
- Yanchun Liang
  - College of Computer Science and Technology, University of Jilin, Changchun, China
  - Key Laboratory of Symbolic Computation and Knowledge Engineering, Ministry of Education, Changchun, China
  - School of Computer Science, Zhuhai College of Science and Technology, Zhuhai, China
- Xiaohu Shi
  - College of Computer Science and Technology, University of Jilin, Changchun, China
  - Key Laboratory of Symbolic Computation and Knowledge Engineering, Ministry of Education, Changchun, China
  - School of Computer Science, Zhuhai College of Science and Technology, Zhuhai, China
  - *Correspondence: Xiaohu Shi,
|
30
|
Guo SS, Liu J, Zhou XG, Zhang GJ. DeepUMQA: ultrafast shape recognition-based protein model quality assessment using deep learning. Bioinformatics 2022; 38:1895-1903. [PMID: 35134108 DOI: 10.1093/bioinformatics/btac056] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 10/02/2021] [Revised: 12/26/2021] [Accepted: 01/27/2022] [Indexed: 02/03/2023]
Abstract
MOTIVATION Protein model quality assessment is a key component of protein structure prediction. In recent research, the voxelization feature was used to characterize the local structural information of residues, but it may be insufficient for describing residue-level topological information. Designing features that can further reflect residue-level topology when combined with deep learning methods is therefore crucial to improve the performance of model quality assessment. RESULTS We developed a deep-learning method, DeepUMQA, based on Ultrafast Shape Recognition (USR) for residue-level single-model quality assessment. In the framework of the deep residual neural network, the residue-level USR feature was introduced to describe the topological relationship between the residue and overall structure by calculating the first moment of a set of residue distance sets and then combined with 1D, 2D and voxelization features to assess the quality of the model. Experimental results on the CASP13, CASP14 test datasets and CAMEO blind test show that USR could supplement the voxelization features to comprehensively characterize residue structure information and significantly improve model assessment accuracy. The performance of DeepUMQA ranks near the top among state-of-the-art single-model quality assessment methods, including ProQ2, ProQ3, ProQ3D, Ornate, VoroMQA, ProteinGCN, ResNetQA, QDeep, GraphQA, ModFOLD6, ModFOLD7, ModFOLD8, QMEAN3, QMEANDisCo3 and DeepAccNet. AVAILABILITY AND IMPLEMENTATION The DeepUMQA server is freely available at http://zhanglab-bioinf.com/DeepUMQA/. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Sai-Sai Guo
- College of Information Engineering, Zhejiang University of Technology, Hangzhou 310023, China
- Jun Liu
- College of Information Engineering, Zhejiang University of Technology, Hangzhou 310023, China
- Xiao-Gen Zhou
- College of Information Engineering, Zhejiang University of Technology, Hangzhou 310023, China
- Gui-Jun Zhang
- College of Information Engineering, Zhejiang University of Technology, Hangzhou 310023, China
31
32
A Benchmark Dataset for Evaluating Practical Performance of Model Quality Assessment of Homology Models. Bioengineering (Basel) 2022; 9:bioengineering9030118. [PMID: 35324806 PMCID: PMC8945737 DOI: 10.3390/bioengineering9030118] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/09/2022] [Revised: 03/08/2022] [Accepted: 03/11/2022] [Indexed: 11/25/2022] Open
Abstract
Protein structure prediction is an important issue in structural bioinformatics. In this process, model quality assessment (MQA), which estimates the accuracy of the predicted structure, is also practically important. Currently, the most commonly used dataset to evaluate the performance of MQA is the critical assessment of the protein structure prediction (CASP) dataset. However, the CASP dataset does not contain enough targets with high-quality models, and thus cannot sufficiently evaluate the MQA performance in practical use. Additionally, most application studies employ homology modeling because of its reliability. However, the CASP dataset includes models generated by de novo methods, which may lead to the mis-estimation of MQA performance. In this study, we created new benchmark datasets, named a homology models dataset for model quality assessment (HMDM), that contain targets with high-quality models derived using homology modeling. We then benchmarked the performance of the MQA methods using the new datasets and compared their performance to that of the classical selection based on the sequence identity of the template proteins. The results showed that model selection by the latest MQA methods using deep learning is better than selection by template sequence identity and classical statistical potentials. Using HMDM, it is possible to verify the MQA performance for high-accuracy homology models.
33
Nagarajan SK, Babu S, Devaraju P, Sohn H, Madhavan T. Structure and dynamics of the somatostatin receptor 3-ligand binding in the presence of lipids examined using computational structural biology methods. Proteins 2022; 90:704-719. [PMID: 34661304 DOI: 10.1002/prot.26267] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/01/2021] [Revised: 09/29/2021] [Accepted: 10/12/2021] [Indexed: 11/08/2022]
Abstract
Over the past two decades, structural biology studies of G-protein coupled receptors (GPCRs) have been on the rise. Understanding the relation between the structure and function of GPCRs is important because they play a major role in various signaling mechanisms in eukaryotic cells. Somatostatin receptor 3 (SSTR3) is one such GPCR; it oversees different cellular processes, including cell-to-cell signaling. However, little information is available on the structural features of SSTR3 responsible for its bioactivity. In this study, we report a structural understanding of SSTR3-ligand binding that could help demystify the structural complexities related to the functioning of the receptor. We applied an integrated protocol of computational structural biology tools, including protein structure prediction via comparative modeling, binding site characterization, three-dimensional quantitative structure-activity relationship based on comparative molecular field analysis and comparative molecular similarity indices analysis, density functional theory, and molecular dynamics simulations. We report insights from the simulation of SSTR3-ligand complexes, mainly the conditions that favor the formation of the lowest bioactive state of SSTR3 ligands. In addition, we report the important physicochemical descriptors of SSTR3 ligands that could significantly influence their bioactivity. The results of the study could be helpful in developing novel SSTR3 ligands (both agonists and antagonists) with high potency and receptor selectivity.
Affiliation(s)
- Santhosh Kumar Nagarajan
- Computational Biology Lab, Department of Genetic Engineering, School of Bioengineering, SRM Institute of Science and Technology, Chennai, India
- Sathya Babu
- Computational Biology Lab, Department of Genetic Engineering, School of Bioengineering, SRM Institute of Science and Technology, Chennai, India
- Panneer Devaraju
- Division of Microbiology and Molecular Biology, Vector Control Research Centre, Indian Council of Medical Research, Pondicherry, India
- Honglae Sohn
- Department of Chemistry, Chosun University, Gwangju, South Korea
- Department of Carbon Materials, Chosun University, Gwangju, South Korea
- Thirumurthy Madhavan
- Computational Biology Lab, Department of Genetic Engineering, School of Bioengineering, SRM Institute of Science and Technology, Chennai, India
34
Brandes N, Ofer D, Peleg Y, Rappoport N, Linial M. ProteinBERT: a universal deep-learning model of protein sequence and function. Bioinformatics 2022; 38:2102-2110. [PMID: 35020807 PMCID: PMC9386727 DOI: 10.1093/bioinformatics/btac020] [Citation(s) in RCA: 136] [Impact Index Per Article: 68.0] [Received: 06/30/2021] [Revised: 12/27/2021] [Accepted: 01/07/2022] [Indexed: 02/03/2023] Open
Abstract
SUMMARY Self-supervised deep language modeling has shown unprecedented success across natural language tasks, and has recently been repurposed to biological sequences. However, existing models and pretraining methods are designed and optimized for text analysis. We introduce ProteinBERT, a deep language model specifically designed for proteins. Our pretraining scheme combines language modeling with a novel task of Gene Ontology (GO) annotation prediction. We introduce novel architectural elements that make the model highly efficient and flexible to long sequences. The architecture of ProteinBERT consists of both local and global representations, allowing end-to-end processing of these types of inputs and outputs. ProteinBERT obtains near state-of-the-art performance, and sometimes exceeds it, on multiple benchmarks covering diverse protein properties (including protein structure, post-translational modifications and biophysical attributes), despite using a far smaller and faster model than competing deep-learning methods. Overall, ProteinBERT provides an efficient framework for rapidly training protein predictors, even with limited labeled data. AVAILABILITY AND IMPLEMENTATION Code and pretrained model weights are available at https://github.com/nadavbra/protein_bert. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Yam Peleg
- Deep Trading Ltd., Haifa 3508401, Israel
- Nadav Rappoport
- Department of Software and Information Systems Engineering, Faculty of Engineering Sciences, Ben-Gurion University of the Negev, Beer Sheva 8410501, Israel
- Michal Linial
- Department of Biological Chemistry, The Alexander Silberman Institute of Life Sciences, The Hebrew University of Jerusalem, Jerusalem 9190401, Israel
35
Kaushik R, Zhang KYJ. ProFitFun: a protein tertiary structure fitness function for quantifying the accuracies of model structures. Bioinformatics 2022; 38:369-376. [PMID: 34542606 DOI: 10.1093/bioinformatics/btab666] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 04/23/2021] [Revised: 09/06/2021] [Accepted: 09/16/2021] [Indexed: 02/03/2023] Open
Abstract
MOTIVATION Accurate estimation of the quality of protein model structures is a cornerstone of protein structure prediction. Despite the recent groundbreaking success in the field of protein structure prediction, there remains room for improvement in model quality estimation at multiple stages of protein structure prediction, and thus for pushing prediction accuracy further. Here, a novel approach, named ProFitFun, for assessing the quality of protein models is proposed by harnessing the sequence and structural features of experimental protein structures in terms of the preferences of backbone dihedral angles and relative surface accessibility of their amino acid residues at the tripeptide level. The proposed approach leverages the backbone dihedral angle and surface accessibility preferences of each residue by accounting for its N-terminal and C-terminal neighbors in the protein structure. These preferences are used to evaluate protein structures through a machine learning approach and tested on an extensive dataset of diverse proteins. RESULTS The approach was extensively validated on a large test dataset (n = 25 005) of protein structures, comprising 23 661 models of 82 non-homologous proteins and 1344 non-homologous experimental structures. In addition, an external dataset of 40 000 models of 200 non-homologous proteins was also used for the validation of the proposed method. Both datasets were further used for benchmarking the proposed method against four different state-of-the-art methods for protein structure quality assessment. In the benchmarking, the proposed method outperformed some state-of-the-art methods in terms of Spearman's and Pearson's correlation coefficients, average GDT-TS loss, sum of z-scores and average absolute difference of predictions over corresponding observed values. 
The high accuracy of the proposed approach promises a potential use of the sequence and structural features in computational protein design. AVAILABILITY AND IMPLEMENTATION http://github.com/KYZ-LSB/ProTerS-FitFun. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
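The benchmarking metrics named in this abstract can be sketched in a few lines. The code below is an illustrative sketch only, not the ProFitFun evaluation code: it assumes one predicted score and one observed score (for example GDT-TS) per model of a single target, and the function names and toy numbers are made up for the example.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def top1_loss(predicted, observed):
    """Observed score of the best model minus that of the top-ranked model."""
    pick = max(range(len(predicted)), key=lambda i: predicted[i])
    return max(observed) - observed[pick]

def mean_absolute_difference(predicted, observed):
    """Average absolute difference between predicted and observed scores."""
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(predicted)

if __name__ == "__main__":
    predicted = [0.90, 0.50, 0.70]  # hypothetical predicted quality scores
    observed = [0.80, 0.40, 0.85]   # hypothetical observed GDT-TS (0 to 1)
    print(pearson(predicted, observed), spearman(predicted, observed))
    print(top1_loss(predicted, observed))
    print(mean_absolute_difference(predicted, observed))
```

In a benchmark, the loss is typically computed per target and then averaged, so a method is rewarded for ranking the best model first even when its absolute scores are miscalibrated.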
Affiliation(s)
- Rahul Kaushik
- Laboratory for Structural Bioinformatics, Center for Biosystems Dynamics Research, RIKEN, Yokohama, Kanagawa 230-0045, Japan
- Kam Y J Zhang
- Laboratory for Structural Bioinformatics, Center for Biosystems Dynamics Research, RIKEN, Yokohama, Kanagawa 230-0045, Japan
36
Hou Q, Pucci F, Pan F, Xue F, Rooman M, Feng Q. Using metagenomic data to boost protein structure prediction and discovery. Comput Struct Biotechnol J 2022; 20:434-442. [PMID: 35070166 PMCID: PMC8760478 DOI: 10.1016/j.csbj.2021.12.030] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 07/12/2021] [Revised: 12/17/2021] [Accepted: 12/21/2021] [Indexed: 11/19/2022] Open
Abstract
Over the past decade, metagenomic sequencing approaches have been providing an ever-increasing amount of protein sequence data at an astonishing rate. These constitute an invaluable source of information which has been exploited in various research fields such as the study of the role of the gut microbiota in human diseases and aging. However, only a small fraction of all metagenomic sequences collected have been functionally or structurally characterized, leaving much of them completely unexplored. Here, we review how this information has been used in protein structure prediction and protein discovery. We begin by presenting some widely used metagenomic databases and analyze in detail how metagenomic data has contributed to the impressive improvement in the accuracy of structure prediction methods in recent years. We then examine how metagenomic information can be exploited to annotate protein sequences. More specifically, we focus on the role of metagenomes in the discovery of enzymes and new CRISPR-Cas systems, and in the identification of antibiotic resistance genes. With this review, we provide an overview of how metagenomic data is currently revolutionizing our understanding of protein science.
Affiliation(s)
- Qingzhen Hou
- Department of Biostatistics, School of Public Health, Cheeloo College of Medicine, Shandong University, Shandong 250012, China
- National Institute of Health Data Science of China, Shandong University, Shandong 250002, China
- Fabrizio Pucci
- Computational Biology and Bioinformatics, Université Libre de Bruxelles, 1050 Brussels, Belgium
- Interuniversity Institute of Bioinformatics in Brussels, 1050 Brussels, Belgium
- Fengming Pan
- Department of Biostatistics, School of Public Health, Cheeloo College of Medicine, Shandong University, Shandong 250012, China
- National Institute of Health Data Science of China, Shandong University, Shandong 250002, China
- Fuzhong Xue
- Department of Biostatistics, School of Public Health, Cheeloo College of Medicine, Shandong University, Shandong 250012, China
- National Institute of Health Data Science of China, Shandong University, Shandong 250002, China
- Marianne Rooman
- Computational Biology and Bioinformatics, Université Libre de Bruxelles, 1050 Brussels, Belgium
- Interuniversity Institute of Bioinformatics in Brussels, 1050 Brussels, Belgium
- Qiang Feng
- Shandong Provincial Key Laboratory of Oral Tissue Regeneration & Shandong Engineering Laboratory for Dental Materials and Oral Tissue Regeneration, Department of Human Microbiome, School of Stomatology, Shandong University, Jinan, Shandong Province 250012, China
- State Key Laboratory of Microbial Technology, Shandong University, Qingdao, Shandong Province 266237, China
37
Present and future challenges in therapeutic designing using computational approaches. Computational Approaches for Novel Therapeutic and Diagnostic Designing to Mitigate SARS-CoV-2 Infection 2022. [PMCID: PMC9300749 DOI: 10.1016/b978-0-323-91172-6.00020-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/18/2023]
Abstract
Currently, various computational methods are being used for the purpose of therapeutic design. The advent of the Coronavirus disease-2019 (COVID-19) pandemic has created a lot of problems due to which the development of effective treatment options is urgently needed. Computational intelligence is used in the control, prevention, prediction, diagnosis, and treatment of the disease. Several important drug targets have been identified in severe acute respiratory syndrome-Coronavirus-2 using in silico methods. Computer-aided drug design includes a variety of theoretical and computational approaches that are part of modern drug discovery. Advances in machine learning methods and their applications speed up the drug discovery process. Exploration of nucleic acid-based therapeutics is playing an important role in healthcare also. But a lot of challenges have also been seen that complicate the therapeutic design. Therefore, investigation of challenges associated with therapeutic design is important, and the present chapter is aimed to cover various therapeutic design approaches and challenges associated with them. Moreover, the role of computational strategies in the exploration of potential therapeutics against COVID-19 has been investigated.
38
Fischer TLL, Bödecker MADI, Zehnacker A, Mata RA, Suhm MA. Setting up the HyDRA blind challenge for the microhydration of organic molecules. Phys Chem Chem Phys 2022; 24:11442-11454. [DOI: 10.1039/d2cp01119k] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 11/21/2022]
Abstract
The procedure leading to the first HyDRA blind challenge for the prediction of water donor stretching vibrations in monohydrates of organic molecules is described. A training set of 10 monohydrates...
39
Abstract
Summary Motivation. Predicting the native state of a protein has long been considered a gateway problem for understanding protein folding. Recent advances in structural modeling driven by deep learning have achieved unprecedented success at predicting a protein’s crystal structure, but it is not clear if these models are learning the physics of how proteins dynamically fold into their equilibrium structure or are just accurate knowledge-based predictors of the final state. Results. In this work, we compare the pathways generated by state-of-the-art protein structure prediction methods to experimental data about protein folding pathways. The methods considered were AlphaFold 2, RoseTTAFold, trRosetta, RaptorX, DMPfold, EVfold, SAINT2 and Rosetta. We find evidence that their simulated dynamics capture some information about the folding pathway, but their predictive ability is worse than a trivial classifier using sequence-agnostic features like chain length. The folding trajectories produced are also uncorrelated with experimental observables such as intermediate structures and the folding rate constant. These results suggest that recent advances in structure prediction do not yet provide an enhanced understanding of protein folding. Availability. The data underlying this article are available in GitHub at https://github.com/oxpig/structure-vs-folding/. Supplementary information. Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Carlos Outeiral
- Department of Statistics, University of Oxford, Oxford OX1 3PB, UK
- Daniel A Nissley
- Department of Statistics, University of Oxford, Oxford OX1 3PB, UK
40
Liu J, Zhao KL, He GX, Wang LJ, Zhou XG, Zhang GJ. A de novo protein structure prediction by iterative partition sampling, topology adjustment and residue-level distance deviation optimization. Bioinformatics 2021; 38:99-107. [PMID: 34459867 DOI: 10.1093/bioinformatics/btab620] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 04/25/2021] [Revised: 07/23/2021] [Accepted: 08/25/2021] [Indexed: 02/03/2023] Open
Abstract
MOTIVATION With the great progress of deep learning-based inter-residue contact/distance prediction, the discrete space formed by fragment assembly cannot satisfy the distance constraint well. Thus, the optimal solution of the continuous space may not be achieved. Designing an effective closed-loop continuous dihedral angle optimization strategy that complements the discrete fragment assembly is crucial to improve the performance of the distance-assisted fragment assembly method. RESULTS In this article, we proposed a de novo protein structure prediction method called IPTDFold based on closed-loop iterative partition sampling, topology adjustment and residue-level distance deviation optimization. First, local dihedral angle crossover and mutation operators are designed to explore the conformational space extensively and achieve information exchange between the conformations in the population. Then, the dihedral angle rotation model of loop region with partial inter-residue distance constraints is constructed, and the rotation angle satisfying the constraints is obtained by differential evolution algorithm, so as to adjust the spatial position relationship between the secondary structures. Finally, the residue distance deviation is evaluated according to the difference between the conformation and the predicted distance, and the dihedral angle of the residue is optimized with biased probability. The final model is generated by iterating the above three steps. IPTDFold is tested on 462 benchmark proteins, 24 FM targets of CASP13 and 20 FM targets of CASP14. Results show that IPTDFold is significantly superior to the distance-assisted fragment assembly method Rosetta_D (Rosetta with distance). In particular, the prediction accuracy of IPTDFold does not decrease as the length of the protein increases. 
When using the same FastRelax protocol, the prediction accuracy of IPTDFold is significantly superior to that of trRosetta without orientation constraints, and is equivalent to that of the full version of trRosetta. AVAILABILITY AND IMPLEMENTATION The source code and executable are freely available at https://github.com/iobio-zjut/IPTDFold. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
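The step in which a rotation angle satisfying a distance constraint is found by differential evolution can be illustrated on a toy problem. This is a sketch under stated assumptions, not the IPTDFold code: a single point is rotated about the z-axis, the constraint is one target distance to a fixed anchor, and the geometry, target value and parameter settings are all hypothetical.

```python
import math
import random

def distance_after_rotation(theta, point=(1.0, 0.0, 0.0), anchor=(0.0, 2.0, 0.0)):
    """Distance between `anchor` and `point` rotated by `theta` about the z-axis."""
    c, s = math.cos(theta), math.sin(theta)
    rotated = (c * point[0] - s * point[1], s * point[0] + c * point[1], point[2])
    return math.dist(rotated, anchor)

def objective(theta, target=2.0):
    """Squared deviation from the (hypothetical) target distance."""
    return (distance_after_rotation(theta) - target) ** 2

def differential_evolution(f, low, high, pop_size=20, generations=200, F=0.5, seed=0):
    """DE/rand/1 for a single variable.

    With only one dimension the crossover step is trivial, so every trial
    value is simply the mutant a + F * (b - c), clipped to the bounds.
    """
    rng = random.Random(seed)
    pop = [rng.uniform(low, high) for _ in range(pop_size)]
    cost = [f(x) for x in pop]
    for _ in range(generations):
        for i in range(pop_size):
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            trial = pop[a] + F * (pop[b] - pop[c])
            trial = min(max(trial, low), high)
            trial_cost = f(trial)
            if trial_cost <= cost[i]:  # greedy selection keeps the better value
                pop[i], cost[i] = trial, trial_cost
    best = min(range(pop_size), key=lambda i: cost[i])
    return pop[best], cost[best]

if __name__ == "__main__":
    theta, err = differential_evolution(objective, -math.pi, math.pi)
    print(theta, distance_after_rotation(theta), err)
```

In a real loop-rotation model the search variable would be a vector of dihedral angles and the objective a sum of deviations over several predicted distances; the population-based update stays the same.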
Affiliation(s)
- Jun Liu
- College of Information Engineering, Zhejiang University of Technology, Hangzhou 310023, China
- Kai-Long Zhao
- College of Information Engineering, Zhejiang University of Technology, Hangzhou 310023, China
- Guang-Xing He
- College of Information Engineering, Zhejiang University of Technology, Hangzhou 310023, China
- Liu-Jing Wang
- College of Information Engineering, Zhejiang University of Technology, Hangzhou 310023, China
- Xiao-Gen Zhou
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109-2218, USA
- Gui-Jun Zhang
- College of Information Engineering, Zhejiang University of Technology, Hangzhou 310023, China
41
Du Z, Su H, Wang W, Ye L, Wei H, Peng Z, Anishchenko I, Baker D, Yang J. The trRosetta server for fast and accurate protein structure prediction. Nat Protoc 2021; 16:5634-5651. [PMID: 34759384 DOI: 10.1038/s41596-021-00628-9] [Citation(s) in RCA: 238] [Impact Index Per Article: 79.3] [Received: 01/19/2021] [Accepted: 08/31/2021] [Indexed: 11/10/2022]
Abstract
The trRosetta (transform-restrained Rosetta) server is a web-based platform for fast and accurate protein structure prediction, powered by deep learning and Rosetta. With the input of a protein's amino acid sequence, a deep neural network is first used to predict the inter-residue geometries, including distance and orientations. The predicted geometries are then transformed as restraints to guide the structure prediction on the basis of direct energy minimization, which is implemented under the framework of Rosetta. The trRosetta server distinguishes itself from other similar structure prediction servers in terms of rapid and accurate de novo structure prediction. As an illustration, trRosetta was applied to two Pfam families with unknown structures, for which the predicted de novo models were estimated to have high accuracy. Nevertheless, to take advantage of homology modeling, homologous templates are used as additional inputs to the network automatically. In general, it takes ~1 h to predict the final structure for a typical protein with ~300 amino acids, using a maximum of 10 CPU cores in parallel in our cluster system. To enable large-scale structure modeling, a downloadable package of trRosetta with open-source codes is available as well. A detailed guidance for using the package is also available in this protocol. The server and the package are available at https://yanglab.nankai.edu.cn/trRosetta/ and https://yanglab.nankai.edu.cn/trRosetta/download/ , respectively.
Affiliation(s)
- Zongyang Du
- School of Mathematical Sciences, Nankai University, Tianjin, China
- Hong Su
- School of Mathematical Sciences, Nankai University, Tianjin, China
- Wenkai Wang
- School of Mathematical Sciences, Nankai University, Tianjin, China
- Lisha Ye
- School of Mathematical Sciences, Nankai University, Tianjin, China
- Hong Wei
- School of Mathematical Sciences, Nankai University, Tianjin, China
- Zhenling Peng
- Research Center for Mathematics and Interdisciplinary Sciences, Shandong University, Qingdao, China
- Ivan Anishchenko
- Department of Biochemistry, University of Washington, Seattle, WA, USA
- Institute for Protein Design, University of Washington, Seattle, WA, USA
- David Baker
- Department of Biochemistry, University of Washington, Seattle, WA, USA
- Institute for Protein Design, University of Washington, Seattle, WA, USA
- Howard Hughes Medical Institute, University of Washington, Seattle, WA, USA
- Jianyi Yang
- Research Center for Mathematics and Interdisciplinary Sciences, Shandong University, Qingdao, China.
42
Kryshtafovych A, Moult J, Albrecht R, Chang GA, Chao K, Fraser A, Greenfield J, Hartmann MD, Herzberg O, Josts I, Leiman PG, Linden SB, Lupas AN, Nelson DC, Rees SD, Shang X, Sokolova ML, Tidow H. Computational models in the service of X-ray and cryo-electron microscopy structure determination. Proteins 2021; 89:1633-1646. [PMID: 34449113 PMCID: PMC8616789 DOI: 10.1002/prot.26223] [Citation(s) in RCA: 30] [Impact Index Per Article: 10.0] [Received: 05/07/2021] [Revised: 08/11/2021] [Accepted: 08/17/2021] [Indexed: 01/20/2023]
Abstract
Critical assessment of structure prediction (CASP) conducts community experiments to determine the state of the art in computing protein structure from amino acid sequence. The process relies on the experimental community providing information about not yet public or about to be solved structures, for use as targets. For some targets, the experimental structure is not solved in time for use in CASP. Calculated structure accuracy improved dramatically in this round, implying that models should now be much more useful for resolving many sorts of experimental difficulties. To test this, selected models for seven unsolved targets were provided to the experimental groups. These models were from the AlphaFold2 group, who overall submitted the most accurate predictions in CASP14. Four targets were solved with the aid of the models, and, additionally, the structure of an already solved target was improved. An a posteriori analysis showed that, in some cases, models from other groups would also be effective. This paper provides accounts of the successful application of models to structure determination, including molecular replacement for X-ray crystallography, backbone tracing and sequence positioning in a cryo-electron microscopy structure, and correction of local features. The results suggest that, in future, there will be greatly increased synergy between computational and experimental approaches to structure determination.
Affiliation(s)
- John Moult
- Institute for Bioscience and Biotechnology Research, Department of Cell Biology and Molecular genetics, University of Maryland, 9600 Gudelsky Drive, Rockville, MD 20850, USA
- Reinhard Albrecht
- Department of Protein Evolution, Max Planck Institute for Developmental Biology, 72076 Tübingen, Germany
- Geoffrey A. Chang
- Skaggs School of Pharmacy and Pharmaceutical Sciences, University of California-San Diego, La Jolla, CA, 92093, USA
- Department of Pharmacology, University of California-San Diego, La Jolla, CA, 92093, USA
- Kinlin Chao
- Institute for Bioscience and Biotechnology Research, University of Maryland, Rockville, MD 20850, USA
- Alec Fraser
- Department of Biochemistry and Molecular Biology, Sealy Center for Structural Biology and Molecular Biophysics (SCSB), The University of Texas Medical Branch at Galveston, TX 77555, USA
- Julia Greenfield
- Institute for Bioscience and Biotechnology Research, University of Maryland, Rockville, MD 20850, USA
- Marcus D. Hartmann
- Department of Protein Evolution, Max Planck Institute for Developmental Biology, 72076 Tübingen, Germany
- Osnat Herzberg
- Institute for Bioscience and Biotechnology Research, University of Maryland, Rockville, MD 20850, USA
- Department of Chemistry and Biochemistry, University of Maryland, College Park, MD 20742, USA
- Inokentijs Josts
- The Hamburg Advanced Research Center for Bioorganic Chemistry (HARBOR) & Department of Chemistry, Institute for Biochemistry and Molecular Biology, University of Hamburg, Luruper Chaussee 149, 22761 Hamburg, Germany
- Petr G. Leiman
- Department of Biochemistry and Molecular Biology, Sealy Center for Structural Biology and Molecular Biophysics (SCSB), The University of Texas Medical Branch at Galveston, TX 77555, USA
- Sara B. Linden
- Institute for Bioscience and Biotechnology Research, University of Maryland, Rockville, MD 20850, USA
- Andrei N. Lupas
- Department of Protein Evolution, Max Planck Institute for Developmental Biology, 72076 Tübingen, Germany
- Daniel C. Nelson
- Institute for Bioscience and Biotechnology Research, University of Maryland, Rockville, MD 20850, USA
- Department of Veterinary Medicine, University of Maryland, College Park, MD 20742, USA
- Steven D. Rees
- Skaggs School of Pharmacy and Pharmaceutical Sciences, University of California-San Diego, La Jolla, CA, 92093, USA
- Xiaoran Shang
- Institute for Bioscience and Biotechnology Research, University of Maryland, Rockville, MD 20850, USA
- Maria L. Sokolova
- Center of Life Sciences, Skolkovo Institute of Science and Technology, Moscow, 121205, Russia
- Henning Tidow
- The Hamburg Advanced Research Center for Bioorganic Chemistry (HARBOR) & Department of Chemistry, Institute for Biochemistry and Molecular Biology, University of Hamburg, Luruper Chaussee 149, 22761 Hamburg, Germany
43
Koehler Leman J, Lyskov S, Lewis SM, Adolf-Bryfogle J, Alford RF, Barlow K, Ben-Aharon Z, Farrell D, Fell J, Hansen WA, Harmalkar A, Jeliazkov J, Kuenze G, Krys JD, Ljubetič A, Loshbaugh AL, Maguire J, Moretti R, Mulligan VK, Nance ML, Nguyen PT, Ó Conchúir S, Roy Burman SS, Samanta R, Smith ST, Teets F, Tiemann JKS, Watkins A, Woods H, Yachnin BJ, Bahl CD, Bailey-Kellogg C, Baker D, Das R, DiMaio F, Khare SD, Kortemme T, Labonte JW, Lindorff-Larsen K, Meiler J, Schief W, Schueler-Furman O, Siegel JB, Stein A, Yarov-Yarovoy V, Kuhlman B, Leaver-Fay A, Gront D, Gray JJ, Bonneau R. Ensuring scientific reproducibility in bio-macromolecular modeling via extensive, automated benchmarks. Nat Commun 2021; 12:6947. [PMID: 34845212 PMCID: PMC8630030 DOI: 10.1038/s41467-021-27222-7] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 04/08/2021] [Accepted: 11/02/2021] [Indexed: 01/14/2023] Open
Abstract
Each year vast international resources are wasted on irreproducible research. The scientific community has been slow to adopt standard software engineering practices, despite the increases in high-dimensional data, complexities of workflows, and computational environments. Here we show how scientific software applications can be created in a reproducible manner when simple design goals for reproducibility are met. We describe the implementation of a test server framework and 40 scientific benchmarks, covering numerous applications in Rosetta bio-macromolecular modeling. High performance computing cluster integration allows these benchmarks to run continuously and automatically. Detailed protocol captures are useful for developers and users of Rosetta and other macromolecular modeling tools. The framework and design concepts presented here are valuable for developers and users of any type of scientific software and for the scientific community to create reproducible methods. Specific examples highlight the utility of this framework, and the comprehensive documentation illustrates the ease of adding new tests in a matter of hours.
Affiliation(s)
- Julia Koehler Leman
- Center for Computational Biology, Flatiron Institute, Simons Foundation, New York, NY, 10010, USA.
- Department of Biology, New York University, New York, NY, 10003, USA.
- Sergey Lyskov
- Department of Chemical and Biomolecular Engineering, Johns Hopkins University, Baltimore, MD, 21218, USA
- Steven M Lewis
- Cyrus Biotechnology, 1201 Second Ave, Suite 900, Seattle, WA, 98101, USA
- Jared Adolf-Bryfogle
- Department of Immunology and Microbiology, Scripps Research, La Jolla, CA, 92037, USA
- IAVI Neutralizing Antibody Center, Scripps Research, La Jolla, CA, 92037, USA
- Rebecca F Alford
- Department of Chemical and Biomolecular Engineering, Johns Hopkins University, Baltimore, MD, 21218, USA
- Kyle Barlow
- Graduate Program in Bioinformatics, University of California San Francisco, San Francisco, CA, 94158, USA
- Ziv Ben-Aharon
- Department of Microbiology and Molecular Genetics, Hebrew University, Hadassah Medical School, POB 12272, Jerusalem, 91120, Israel
- Daniel Farrell
- Department of Biochemistry, University of Washington, Seattle, WA, 98195, USA
- Institute for Protein Design, University of Washington, Seattle, WA, 98195, USA
- Jason Fell
- Genome Center, University of California, Davis, CA, 95616, USA
- Department of Biochemistry & Molecular Medicine, University of California, Davis, CA, 95616, USA
- Department of Chemistry, University of California, Davis, CA, 95616, USA
- William A Hansen
- Department of Chemistry and Chemical Biology, Rutgers, The State University of New Jersey, Piscataway, NJ, 08904, USA
- Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, Piscataway, NJ, 08904, USA
- Ameya Harmalkar
- Department of Chemical and Biomolecular Engineering, Johns Hopkins University, Baltimore, MD, 21218, USA
- Jeliazko Jeliazkov
- Program in Molecular Biophysics, Johns Hopkins University, Baltimore, MD, 21218, USA
- Georg Kuenze
- Department of Chemistry, Vanderbilt University, Nashville, TN, 37235, USA
- Center for Structural Biology, Vanderbilt University, Nashville, TN, 37235, USA
- Institute for Drug Discovery, Medical School, Leipzig University, 04103, Leipzig, Germany
- Justyna D Krys
- Faculty of Chemistry, Biological and Chemical Research Center, University of Warsaw, Pasteura 1, 02-093, Warsaw, Poland
- Ajasja Ljubetič
- Department of Biochemistry, University of Washington, Seattle, WA, 98195, USA
- Institute for Protein Design, University of Washington, Seattle, WA, 98195, USA
- Amanda L Loshbaugh
- Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, San Francisco, CA, 94158, USA
- Biophysics Graduate Program, University of California San Francisco, San Francisco, CA, 94158, USA
- Jack Maguire
- Program in Bioinformatics and Computational Biology, University of North Carolina at Chapel Hill, Chapel Hill, NC, 27599, USA
- Rocco Moretti
- Department of Chemistry, Vanderbilt University, Nashville, TN, 37235, USA
- Center for Structural Biology, Vanderbilt University, Nashville, TN, 37235, USA
- Vikram Khipple Mulligan
- Center for Computational Biology, Flatiron Institute, Simons Foundation, New York, NY, 10010, USA
- Morgan L Nance
- Program in Molecular Biophysics, Johns Hopkins University, Baltimore, MD, 21218, USA
- Phuong T Nguyen
- Department of Physiology and Membrane Biology, School of Medicine, University of California, Davis, CA, 95616, USA
- Shane Ó Conchúir
- Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, San Francisco, CA, 94158, USA
- Shourya S Roy Burman
- Department of Chemical and Biomolecular Engineering, Johns Hopkins University, Baltimore, MD, 21218, USA
- Rituparna Samanta
- Department of Chemical and Biomolecular Engineering, Johns Hopkins University, Baltimore, MD, 21218, USA
- Shannon T Smith
- Center for Structural Biology, Vanderbilt University, Nashville, TN, 37235, USA
- Chemical and Physical Biology Program, Vanderbilt University, Nashville, TN, 37235, USA
- Frank Teets
- Department of Biochemistry and Biophysics, University of North Carolina at Chapel Hill, Chapel Hill, NC, 27516, USA
- Johanna K S Tiemann
- Linderstrøm-Lang Centre for Protein Science, Department of Biology, University of Copenhagen, DK-2200, Copenhagen N., Denmark
- Andrew Watkins
- Department of Biochemistry, Stanford University School of Medicine, Stanford, CA, 94305, USA
- Hope Woods
- Center for Structural Biology, Vanderbilt University, Nashville, TN, 37235, USA
- Chemical and Physical Biology Program, Vanderbilt University, Nashville, TN, 37235, USA
- Brahm J Yachnin
- Department of Chemistry and Chemical Biology, Rutgers, The State University of New Jersey, Piscataway, NJ, 08904, USA
- Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, Piscataway, NJ, 08904, USA
- Christopher D Bahl
- Institute for Protein Innovation, Boston, MA, 02115, USA
- Division of Hematology/Oncology, Boston Children's Hospital, Boston, MA, 02115, USA
- Department of Pediatrics, Harvard Medical School, Boston, MA, 02115, USA
- David Baker
- Department of Biochemistry, University of Washington, Seattle, WA, 98195, USA
- Institute for Protein Design, University of Washington, Seattle, WA, 98195, USA
- Rhiju Das
- Department of Biochemistry, Stanford University School of Medicine, Stanford, CA, 94305, USA
- Frank DiMaio
- Department of Biochemistry, University of Washington, Seattle, WA, 98195, USA
- Institute for Protein Design, University of Washington, Seattle, WA, 98195, USA
- Sagar D Khare
- Department of Chemistry and Chemical Biology, Rutgers, The State University of New Jersey, Piscataway, NJ, 08904, USA
- Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, Piscataway, NJ, 08904, USA
- Tanja Kortemme
- Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, San Francisco, CA, 94158, USA
- Biophysics Graduate Program, University of California San Francisco, San Francisco, CA, 94158, USA
- Jason W Labonte
- Department of Chemical and Biomolecular Engineering, Johns Hopkins University, Baltimore, MD, 21218, USA
- Kresten Lindorff-Larsen
- Linderstrøm-Lang Centre for Protein Science, Department of Biology, University of Copenhagen, DK-2200, Copenhagen N., Denmark
- Jens Meiler
- Department of Chemistry, Vanderbilt University, Nashville, TN, 37235, USA
- Center for Structural Biology, Vanderbilt University, Nashville, TN, 37235, USA
- Institute for Drug Discovery, Medical School, Leipzig University, 04103, Leipzig, Germany
- William Schief
- Department of Immunology and Microbiology, Scripps Research, La Jolla, CA, 92037, USA
- IAVI Neutralizing Antibody Center, Scripps Research, La Jolla, CA, 92037, USA
- Ora Schueler-Furman
- Department of Microbiology and Molecular Genetics, Hebrew University, Hadassah Medical School, POB 12272, Jerusalem, 91120, Israel
- Justin B Siegel
- Genome Center, University of California, Davis, CA, 95616, USA
- Department of Biochemistry & Molecular Medicine, University of California, Davis, CA, 95616, USA
- Department of Chemistry, University of California, Davis, CA, 95616, USA
- Amelie Stein
- Linderstrøm-Lang Centre for Protein Science, Department of Biology, University of Copenhagen, DK-2200, Copenhagen N., Denmark
- Vladimir Yarov-Yarovoy
- Department of Physiology and Membrane Biology, School of Medicine, University of California, Davis, CA, 95616, USA
- Brian Kuhlman
- Department of Biochemistry and Biophysics, University of North Carolina at Chapel Hill, Chapel Hill, NC, 27516, USA
- Andrew Leaver-Fay
- Department of Biochemistry and Biophysics, University of North Carolina at Chapel Hill, Chapel Hill, NC, 27516, USA
- Dominik Gront
- Faculty of Chemistry, Biological and Chemical Research Center, University of Warsaw, Pasteura 1, 02-093, Warsaw, Poland
- Jeffrey J Gray
- Department of Chemical and Biomolecular Engineering, Johns Hopkins University, Baltimore, MD, 21218, USA.
- Richard Bonneau
- Center for Computational Biology, Flatiron Institute, Simons Foundation, New York, NY, 10010, USA.
- Department of Biology, New York University, New York, NY, 10003, USA.
- Department of Computer Science, New York University, New York, NY, 10003, USA.
|
44
|
Ho CT, Huang YW, Chen TR, Lo CH, Lo WC. Discovering the Ultimate Limits of Protein Secondary Structure Prediction. Biomolecules 2021; 11:1627. [PMID: 34827624 PMCID: PMC8615938 DOI: 10.3390/biom11111627] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 09/21/2021] [Revised: 10/25/2021] [Accepted: 10/28/2021] [Indexed: 12/29/2022] Open
Abstract
Secondary structure prediction (SSP) of proteins is an important structural biology technique with many applications. There have been ~300 algorithms published in the past seven decades with fierce competition in accuracy. In the first 60 years, the accuracy of three-state SSP rose from ~56% to 81%; after that, it has long stayed at 81-86%. In the 1990s, the theoretical limit of three-state SSP accuracy had been estimated to be 88%. Thus, SSP is now generally considered not challenging or too challenging to improve. However, we found that the limit of three-state SSP might be underestimated. Besides, there is still much room for improving segment-based and eight-state SSPs, but the limits of these emerging topics have not been determined. This work performs large-scale sequence and structural analyses to estimate SSP accuracy limits and assess state-of-the-art SSP methods. The limit of three-state SSP is re-estimated to be ~92%, 4-5% higher than previously expected, indicating that SSP is still challenging. The estimated limit of eight-state SSP is 84-87%. Several proposals for improving future SSP algorithms are made based on our results. We hope that these findings will help move forward the development of SSP and all its applications.
Affiliation(s)
- Chia-Tzu Ho
- Institute of Bioinformatics and Systems Biology, National Yang Ming Chiao Tung University, Hsinchu 300, Taiwan
- Yu-Wei Huang
- Institute of Bioinformatics and Systems Biology, National Yang Ming Chiao Tung University, Hsinchu 300, Taiwan
- Teng-Ruei Chen
- Institute of Bioinformatics and Systems Biology, National Yang Ming Chiao Tung University, Hsinchu 300, Taiwan
- Chia-Hua Lo
- Institute of Bioinformatics and Systems Biology, National Yang Ming Chiao Tung University, Hsinchu 300, Taiwan
- Department of Biological Science and Technology, National Yang Ming Chiao Tung University, Hsinchu 300, Taiwan
- Wei-Cheng Lo
- Institute of Bioinformatics and Systems Biology, National Yang Ming Chiao Tung University, Hsinchu 300, Taiwan
- Department of Biological Science and Technology, National Yang Ming Chiao Tung University, Hsinchu 300, Taiwan
- The Center for Bioinformatics Research, National Yang Ming Chiao Tung University, Hsinchu 300, Taiwan
|
45
|
Tong A, Wenkel F, Macdonald K, Krishnaswamy S, Wolf G. Data-driven learning of geometric scattering modules for GNNs. IEEE International Workshop on Machine Learning for Signal Processing: [Proceedings] 2021; 2021:10.1109/mlsp52302.2021.9596169. [PMID: 36945315 PMCID: PMC10026018 DOI: 10.1109/mlsp52302.2021.9596169] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/06/2022]
Abstract
We propose a new graph neural network (GNN) module, based on relaxations of recently proposed geometric scattering transforms, which consist of a cascade of graph wavelet filters. Our learnable geometric scattering (LEGS) module enables adaptive tuning of the wavelets to encourage band-pass features to emerge in learned representations. The incorporation of our LEGS-module in GNNs enables the learning of longer-range graph relations compared to many popular GNNs, which often rely on encoding graph structure via smoothness or similarity between neighbors. Further, its wavelet priors result in simplified architectures with significantly fewer learned parameters compared to competing GNNs. We demonstrate the predictive performance of LEGS-based networks on graph classification benchmarks, as well as the descriptive quality of their learned features in biochemical graph data exploration tasks.
Affiliation(s)
- Alexander Tong
- Yale University, Dept. of Comp. Sci., New Haven, CT, USA
- Frederick Wenkel
- Université de Montréal, Dept. of Math. & Stat.; Mila - Quebec AI Institute, Montreal, QC, Canada
- Smita Krishnaswamy
- Dept. of Genetics, New Haven, CT, USA
- Yale University, Dept. of Comp. Sci., New Haven, CT, USA
- Guy Wolf
- Université de Montréal, Dept. of Math. & Stat.; Mila - Quebec AI Institute, Montreal, QC, Canada
|
46
|
Kryshtafovych A, Moult J, Billings WM, Della Corte D, Fidelis K, Kwon S, Olechnovič K, Seok C, Venclovas Č, Won J. Modeling SARS-CoV-2 proteins in the CASP-commons experiment. Proteins 2021; 89:1987-1996. [PMID: 34462960 PMCID: PMC8616790 DOI: 10.1002/prot.26231] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Received: 05/05/2021] [Revised: 08/23/2021] [Accepted: 08/26/2021] [Indexed: 01/21/2023]
Abstract
Critical Assessment of Structure Prediction (CASP) is an organization aimed at advancing the state of the art in computing protein structure from sequence. In the spring of 2020, CASP launched a community project to compute the structures of the most structurally challenging proteins coded for in the SARS-CoV-2 genome. Forty-seven research groups submitted over 3000 three-dimensional models and 700 sets of accuracy estimates on 10 proteins. The resulting models were released to the public. CASP community members also worked together to provide estimates of local and global accuracy and identify structure-based domain boundaries for some proteins. Subsequently, two of these structures (ORF3a and ORF8) have been solved experimentally, allowing assessment of both model quality and the accuracy estimates. Models from the AlphaFold2 group were found to have good agreement with the experimental structures, with main chain GDT_TS accuracy scores ranging from 63 (a correct topology) to 87 (competitive with experiment).
Affiliation(s)
- John Moult
- Department of Cell Biology and Molecular Genetics, Institute for Bioscience and Biotechnology Research, University of Maryland, Rockville, Maryland, USA
- Wendy M Billings
- Department of Physics & Astronomy, Brigham Young University, Provo, Utah, USA
- Dennis Della Corte
- Department of Physics & Astronomy, Brigham Young University, Provo, Utah, USA
- Krzysztof Fidelis
- Genome Center, University of California, Davis, Davis, California, USA
- Sohee Kwon
- Department of Chemistry, Seoul National University, Seoul, South Korea
- Kliment Olechnovič
- Institute of Biotechnology, Life Sciences Center, Vilnius University, Vilnius, Lithuania
- Chaok Seok
- Department of Chemistry, Seoul National University, Seoul, South Korea
- Česlovas Venclovas
- Institute of Biotechnology, Life Sciences Center, Vilnius University, Vilnius, Lithuania
- Jonghun Won
- Department of Chemistry, Seoul National University, Seoul, South Korea
|
47
|
Mortuza SM, Zheng W, Zhang C, Li Y, Pearce R, Zhang Y. Improving fragment-based ab initio protein structure assembly using low-accuracy contact-map predictions. Nat Commun 2021; 12:5011. [PMID: 34408149 PMCID: PMC8373938 DOI: 10.1038/s41467-021-25316-w] [Citation(s) in RCA: 40] [Impact Index Per Article: 13.3] [Received: 01/07/2021] [Accepted: 08/04/2021] [Indexed: 11/28/2022] Open
Abstract
Sequence-based contact prediction has shown considerable promise in assisting non-homologous structure modeling, but it often requires many homologous sequences and a sufficient number of correct contacts to achieve correct folds. Here, we developed a method, C-QUARK, that integrates multiple deep-learning and coevolution-based contact-maps to guide the replica-exchange Monte Carlo fragment assembly simulations. The method was tested on 247 non-redundant proteins, where C-QUARK could fold 75% of the cases with TM-scores (template-modeling scores) ≥0.5, which was 2.6 times more than that achieved by QUARK. For the 59 cases that had either low contact accuracy or few homologous sequences, C-QUARK correctly folded 6 times more proteins than other contact-based folding methods. C-QUARK was also tested on 64 free-modeling targets from the 13th CASP (critical assessment of protein structure prediction) experiment and had an average GDT_TS (global distance test) score that was 5% higher than the best CASP predictors. These data demonstrate, in a robust manner, the progress in modeling non-homologous protein structures using low-accuracy and sparse contact-map predictions.
Affiliation(s)
- S M Mortuza
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA
- Wei Zheng
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA
- Chengxin Zhang
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA
- Yang Li
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA
- Robin Pearce
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA
- Yang Zhang
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA.
- Department of Biological Chemistry, University of Michigan, Ann Arbor, MI, USA.
|
48
|
Wang L, Liu J, Xia Y, Xu J, Zhou X, Zhang G. Distance-guided protein folding based on generalized descent direction. Brief Bioinform 2021; 22:6341661. [PMID: 34355233 DOI: 10.1093/bib/bbab296] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 05/17/2021] [Revised: 06/30/2021] [Accepted: 07/12/2021] [Indexed: 12/25/2022] Open
Abstract
Advances in the prediction of inter-residue distances for a protein sequence have increased the accuracy with which correct protein folds can be predicted from distance information. Here, we propose a distance-guided protein folding algorithm based on generalized descent direction, named GDDfold, which achieves effective structural perturbation and potential minimization in two stages. In the global stage, a random-based direction is designed using evolutionary knowledge, which guides the conformation population to cross potential barriers and explore conformational space rapidly over a large range. In the local stage, the locally rugged potential landscape can be explored with the aid of a conjugate-based direction integrated into a specific search strategy, which can improve the exploitation ability. GDDfold is tested on 347 proteins of a benchmark set, 24 template-free modeling (FM) targets of CASP13 and 20 FM targets of CASP14. Results show that GDDfold correctly folds [template modeling (TM) score ≥ 0.5] 316 out of 347 proteins, where 65 proteins have TM scores that are greater than 0.8, and significantly outperforms Rosetta-dist (distance-assisted fragment assembly method) and L-BFGSfold (distance geometry optimization method). On CASP FM targets, GDDfold is comparable with five state-of-the-art full-version methods, namely, Quark, RaptorX, Rosetta, MULTICOM and trRosetta in the CASP 13 and 14 server groups.
Affiliation(s)
- Liujing Wang
- College of Information Engineering, Zhejiang University of Technology, Hangzhou 310023, China
- Jun Liu
- College of Information Engineering, Zhejiang University of Technology, Hangzhou 310023, China
- Yuhao Xia
- College of Information Engineering, Zhejiang University of Technology, Hangzhou 310023, China
- Jiakang Xu
- College of Information Engineering, Zhejiang University of Technology, Hangzhou 310023, China
- Xiaogen Zhou
- Department of Computational Medicine and Bioinformatics, University of Michigan, Michigan, USA
- Guijun Zhang
- College of Information Engineering, Zhejiang University of Technology, Hangzhou 310023, China
|
49
|
Zheng W, Li Y, Zhang C, Zhou X, Pearce R, Bell EW, Huang X, Zhang Y. Protein structure prediction using deep learning distance and hydrogen-bonding restraints in CASP14. Proteins 2021; 89:1734-1751. [PMID: 34331351 DOI: 10.1002/prot.26193] [Citation(s) in RCA: 37] [Impact Index Per Article: 12.3] [Received: 04/28/2021] [Revised: 07/06/2021] [Accepted: 07/22/2021] [Indexed: 11/10/2022]
Abstract
In this article, we report 3D structure prediction results by two of our best server groups ("Zhang-Server" and "QUARK") in CASP14. These two servers were built based on the D-I-TASSER and D-QUARK algorithms, which integrated four newly developed components into the classical protein folding pipelines, I-TASSER and QUARK, respectively. The new components include: (a) a new multiple sequence alignment (MSA) collection tool, DeepMSA2, which is extended from the DeepMSA program; (b) a contact-based domain boundary prediction algorithm, FUpred, to detect protein domain boundaries; (c) a residual convolutional neural network-based method, DeepPotential, to predict multiple spatial restraints by co-evolutionary features derived from the MSA; and (d) optimized spatial restraint energy potentials to guide the structure assembly simulations. For 37 FM targets, the average TM-scores of the first models produced by D-I-TASSER and D-QUARK were 96% and 112% higher than those constructed by I-TASSER and QUARK, respectively. The data analysis indicates noticeable improvements produced by each of the four new components, especially for the newly added spatial restraints from DeepPotential and the well-tuned force field that combines spatial restraints, threading templates, and generic knowledge-based potentials. However, challenges still exist in the current pipelines. These include difficulties in modeling multi-domain proteins due to low accuracy in inter-domain distance prediction and modeling protein domains from oligomer complexes, as the co-evolutionary analysis cannot distinguish inter-chain and intra-chain distances. Specifically tuning the deep learning-based predictors for multi-domain targets and protein complexes may be helpful to address these issues.
Affiliation(s)
- Wei Zheng
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, USA
- Yang Li
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, USA
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China
- Chengxin Zhang
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, USA
- Xiaogen Zhou
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, USA
- Robin Pearce
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, USA
- Eric W Bell
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, USA
- Xiaoqiang Huang
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, USA
- Yang Zhang
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, USA
- Department of Biological Chemistry, University of Michigan, Ann Arbor, Michigan, USA
|
50
|
Chen TR, Juan SH, Huang YW, Lin YC, Lo WC. A secondary structure-based position-specific scoring matrix applied to the improvement in protein secondary structure prediction. PLoS One 2021; 16:e0255076. [PMID: 34320027 PMCID: PMC8318245 DOI: 10.1371/journal.pone.0255076] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 05/21/2020] [Accepted: 07/11/2021] [Indexed: 11/18/2022] Open
Abstract
Protein secondary structure prediction (SSP) has a variety of applications; however, there has been relatively limited improvement in accuracy for years. With a vision of moving forward all related fields, we aimed to make a fundamental advance in SSP. There have been many admirable efforts made to improve the machine learning algorithm for SSP. This work thus took a step back by manipulating the input features. A secondary structure element-based position-specific scoring matrix (SSE-PSSM) is proposed, based on which a new set of machine learning features can be established. The feasibility of this new PSSM was evaluated by rigorous independent tests with training and testing datasets sharing <25% sequence identity. In all experiments, the proposed PSSM outperformed the traditional amino acid PSSM. This new PSSM can be easily combined with the amino acid PSSM, and the improvement in accuracy was remarkable. Preliminary tests made by combining the SSE-PSSM and well-known SSP methods showed 2.0% and 5.2% average improvements in three- and eight-state SSP accuracies, respectively. If this PSSM can be integrated into state-of-the-art SSP methods, the overall accuracy of SSP may break the current restriction and eventually bring benefit to all research and applications where secondary structure prediction plays a vital role during development. To facilitate the application and integration of the SSE-PSSM with modern SSP methods, we have established a web server and standalone programs for generating SSE-PSSM available at http://10.life.nctu.edu.tw/SSE-PSSM.
Affiliation(s)
- Teng-Ruei Chen
- Institute of Bioinformatics and Systems Biology, National Chiao Tung University, Hsinchu, Taiwan
- Institute of Bioinformatics and Systems Biology, National Yang Ming Chiao Tung University, Hsinchu, Taiwan
- Sheng-Hung Juan
- Institute of Bioinformatics and Systems Biology, National Chiao Tung University, Hsinchu, Taiwan
- Institute of Bioinformatics and Systems Biology, National Yang Ming Chiao Tung University, Hsinchu, Taiwan
- Yu-Wei Huang
- Institute of Bioinformatics and Systems Biology, National Chiao Tung University, Hsinchu, Taiwan
- Institute of Bioinformatics and Systems Biology, National Yang Ming Chiao Tung University, Hsinchu, Taiwan
- Yen-Cheng Lin
- Department of Biological Science and Technology, National Chiao Tung University, Hsinchu, Taiwan
- Department of Biological Science and Technology, National Yang Ming Chiao Tung University, Hsinchu, Taiwan
- Wei-Cheng Lo
- Institute of Bioinformatics and Systems Biology, National Chiao Tung University, Hsinchu, Taiwan
- Institute of Bioinformatics and Systems Biology, National Yang Ming Chiao Tung University, Hsinchu, Taiwan
- Department of Biological Science and Technology, National Chiao Tung University, Hsinchu, Taiwan
- Department of Biological Science and Technology, National Yang Ming Chiao Tung University, Hsinchu, Taiwan
- The Center for Bioinformatics Research, National Yang Ming Chiao Tung University, Hsinchu, Taiwan
|