1
|
Lee JW, Won JH, Jeon S, Choo Y, Yeon Y, Oh JS, Kim M, Kim S, Joung I, Jang C, Lee SJ, Kim TH, Jin KH, Song G, Kim ES, Yoo J, Paek E, Noh YK, Joo K. DeepFold: enhancing protein structure prediction through optimized loss functions, improved template features, and re-optimized energy function. Bioinformatics 2023; 39:btad712. [PMID: 37995286 PMCID: PMC10699847 DOI: 10.1093/bioinformatics/btad712] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/14/2023] [Revised: 11/17/2023] [Accepted: 11/22/2023] [Indexed: 11/25/2023] Open
Abstract
MOTIVATION Predicting protein structures with high accuracy is a critical challenge for the broad community of life sciences and industry. Despite progress made by deep neural networks like AlphaFold2, there is a need for further improvements in the quality of detailed structures, such as side-chains, along with protein backbone structures. RESULTS Building upon the successes of AlphaFold2, the modifications we made include changing the losses of side-chain torsion angles and frame aligned point error, adding loss functions for side chain confidence and secondary structure prediction, and replacing template feature generation with a new alignment method based on conditional random fields. We also performed re-optimization by conformational space annealing using a molecular mechanics energy function which integrates the potential energies obtained from distogram and side-chain prediction. In the CASP15 blind test for single protein and domain modeling (109 domains), DeepFold ranked fourth among 132 groups with improvements in the details of the structure in terms of backbone, side-chain, and Molprobity. In terms of protein backbone accuracy, DeepFold achieved a median GDT-TS score of 88.64 compared with 85.88 of AlphaFold2. For TBM-easy/hard targets, DeepFold ranked at the top based on Z-scores for GDT-TS. This shows its practical value to the structural biology community, which demands highly accurate structures. In addition, a thorough analysis of 55 domains from 39 targets with publicly available structures indicates that DeepFold shows superior side-chain accuracy and Molprobity scores among the top-performing groups. AVAILABILITY AND IMPLEMENTATION DeepFold tools are open-source software available at https://github.com/newtonjoo/deepfold.
Collapse
Affiliation(s)
- Jae-Won Lee
- Department of Computer Science, Hanyang University, Seoul 04763, Korea
- Center for Advanced Computation, Korea Institute for Advanced Study, Seoul 02455, Korea
| | - Jong-Hyun Won
- Department of Computer Science, Hanyang University, Seoul 04763, Korea
- Center for Advanced Computation, Korea Institute for Advanced Study, Seoul 02455, Korea
| | - Seonggwang Jeon
- Department of Computer Science, Hanyang University, Seoul 04763, Korea
- Center for Advanced Computation, Korea Institute for Advanced Study, Seoul 02455, Korea
| | - Yujin Choo
- Center for Advanced Computation, Korea Institute for Advanced Study, Seoul 02455, Korea
- Department of Artificial intelligence, Hanyang University, Seoul 04763, Korea
| | - Yubin Yeon
- Department of Computer Science, Hanyang University, Seoul 04763, Korea
- Center for Advanced Computation, Korea Institute for Advanced Study, Seoul 02455, Korea
| | - Jin-Seon Oh
- Center for Advanced Computation, Korea Institute for Advanced Study, Seoul 02455, Korea
- Department of Artificial intelligence, Hanyang University, Seoul 04763, Korea
| | - Minsoo Kim
- Department of Physics, Sungkyunkwan University, Suwon 16419, Korea
| | - SeonHwa Kim
- School of Electrical Engineering, Korea University, Seoul 02841, Korea
| | | | - Cheongjae Jang
- Artificial Intelligence Institute, Hanyang University, Seoul 04763, Korea
| | - Sung Jong Lee
- Basic Science Research Institute, Changwon National University, Changwon 51140, Korea
| | - Tae Hyun Kim
- Department of Computer Science, Hanyang University, Seoul 04763, Korea
| | - Kyong Hwan Jin
- School of Electrical Engineering, Korea University, Seoul 02841, Korea
| | - Giltae Song
- School of Computer Science and Engineering, Pusan National University, Busan 46241, Korea
| | - Eun-Sol Kim
- Department of Computer Science, Hanyang University, Seoul 04763, Korea
| | - Jejoong Yoo
- Department of Physics, Sungkyunkwan University, Suwon 16419, Korea
| | - Eunok Paek
- Department of Computer Science, Hanyang University, Seoul 04763, Korea
| | - Yung-Kyun Noh
- Department of Computer Science, Hanyang University, Seoul 04763, Korea
- School of Computational Sciences, Korea Institute for Advanced Study, Seoul 02455, Korea
| | - Keehyoung Joo
- Center for Advanced Computation, Korea Institute for Advanced Study, Seoul 02455, Korea
| |
Collapse
|
2
|
Protein Structure Refinement Using Multi-Objective Particle Swarm Optimization with Decomposition Strategy. Int J Mol Sci 2021; 22:ijms22094408. [PMID: 33922489 PMCID: PMC8122964 DOI: 10.3390/ijms22094408] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/30/2021] [Revised: 04/16/2021] [Accepted: 04/20/2021] [Indexed: 12/02/2022] Open
Abstract
Protein structure refinement is a crucial step for more accurate protein structure predictions. Most existing approaches treat it as an energy minimization problem to intuitively improve the quality of initial models by searching for structures with lower energy. Considering that a single energy function could not reflect the accurate energy landscape of all the proteins, our previous AIR 1.0 pipeline uses multiple energy functions to realize a multi-objectives particle swarm optimization-based model refinement. It is expected to provide a general balanced conformation search protocol guided from different energy evaluations. However, AIR 1.0 solves the multi-objective optimization problem as a whole, which could not result in good solution diversity and convergence on some targets. In this study, we report a decomposition-based method AIR 2.0, which is an updated version of AIR, for protein structure refinement. AIR 2.0 decomposes a multi-objective optimization problem into a number of subproblems and optimizes them simultaneously using particle swarm optimization algorithm. The solutions yielded by AIR 2.0 show better convergence and diversity compared to its previous version, which increases the possibilities of digging out better structure conformations. The experimental results on CASP13 refinement benchmark targets and blind tests in CASP 14 demonstrate the efficacy of AIR 2.0.
Collapse
|
3
|
Runthala A, Chowdhury S. Refined template selection and combination algorithm significantly improves template-based modeling accuracy. J Bioinform Comput Biol 2020; 17:1950006. [PMID: 31057073 DOI: 10.1142/s0219720019500069] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/16/2022]
Abstract
In contrast to ab-initio protein modeling methodologies, comparative modeling is considered as the most popular and reliable algorithm to model protein structure. However, the selection of the best set of templates is still a major challenge. An effective template-ranking algorithm is developed to efficiently select only the reliable hits for predicting the protein structures. The algorithm employs the pairwise as well as multiple sequence alignments of template hits to rank and select the best possible set of templates. It captures several key sequences and structural information of template hits and converts into scores to effectively rank them. This selected set of templates is used to model a target. Modeling accuracy of the algorithm is tested and evaluated on TBM-HA domain containing CASP8, CASP9 and CASP10 targets. On an average, this template ranking and selection algorithm improves GDT-TS, GDT-HA and TM_Score by 3.531, 4.814 and 0.022, respectively. Further, it has been shown that the inclusion of structurally similar templates with ample conformational diversity is crucial for the modeling algorithm to maximally as well as reliably span the target sequence and construct its near-native model. The optimal model sampling also holds the key to predict the best possible target structure.
Collapse
Affiliation(s)
- Ashish Runthala
- 1 Department of Biological Sciences, Birla Institute of Technology and Science, Pilani-333031, India
| | - Shibasish Chowdhury
- 1 Department of Biological Sciences, Birla Institute of Technology and Science, Pilani-333031, India
| |
Collapse
|
4
|
Shrestha R, Fajardo E, Gil N, Fidelis K, Kryshtafovych A, Monastyrskyy B, Fiser A. Assessing the accuracy of contact predictions in CASP13. Proteins 2019; 87:1058-1068. [PMID: 31587357 PMCID: PMC6851495 DOI: 10.1002/prot.25819] [Citation(s) in RCA: 48] [Impact Index Per Article: 9.6] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/25/2019] [Revised: 09/17/2019] [Accepted: 09/17/2019] [Indexed: 01/07/2023]
Abstract
The accuracy of sequence-based tertiary contact predictions was assessed in a blind prediction experiment at the CASP13 meeting. After 4 years of significant improvements in prediction accuracy, another dramatic advance has taken place since CASP12 was held 2 years ago. The precision of predicting the top L/5 contacts in the free modeling category, where L is the corresponding length of the protein in residues, has exceeded 70%. As a comparison, the best-performing group at CASP12 with a 47% precision would have finished below the top 1/3 of the CASP13 groups. Extensively trained deep neural network approaches dominate the top performing algorithms, which appear to efficiently integrate information on coevolving residues and interacting fragments or possibly utilize memories of sequence similarities and sometimes can deliver accurate results even in the absence of virtually any target specific evolutionary information. If the current performance is evaluated by F-score on L contacts, it stands around 24% right now, which, despite the tremendous impact and advance in improving its utility for structure modeling, also suggests that there is much room left for further improvement.
Collapse
Affiliation(s)
- Rojan Shrestha
- Department of Systems and Computational Biology, and Department of Biochemistry, Albert Einstein College of Medicine, 1300 Morris Park Avenue, Bronx, NY 10461, USA
| | - Eduardo Fajardo
- Department of Systems and Computational Biology, and Department of Biochemistry, Albert Einstein College of Medicine, 1300 Morris Park Avenue, Bronx, NY 10461, USA
| | - Nelson Gil
- Department of Systems and Computational Biology, and Department of Biochemistry, Albert Einstein College of Medicine, 1300 Morris Park Avenue, Bronx, NY 10461, USA
| | - Krzysztof Fidelis
- Genome Center, University of California, Davis, 451 Health Sciences Dr., Davis CA 95616-8816, USA
| | - Andriy Kryshtafovych
- Genome Center, University of California, Davis, 451 Health Sciences Dr., Davis CA 95616-8816, USA
| | - Bohdan Monastyrskyy
- Genome Center, University of California, Davis, 451 Health Sciences Dr., Davis CA 95616-8816, USA
| | - Andras Fiser
- Department of Systems and Computational Biology, and Department of Biochemistry, Albert Einstein College of Medicine, 1300 Morris Park Avenue, Bronx, NY 10461, USA
| |
Collapse
|
5
|
Heo L, Feig M. High-accuracy protein structures by combining machine-learning with physics-based refinement. Proteins 2019; 88:637-642. [PMID: 31693199 DOI: 10.1002/prot.25847] [Citation(s) in RCA: 38] [Impact Index Per Article: 7.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/02/2019] [Revised: 10/05/2019] [Accepted: 11/03/2019] [Indexed: 12/16/2022]
Abstract
Protein structure prediction has long been available as an alternative to experimental structure determination, especially via homology modeling based on templates from related sequences. Recently, models based on distance restraints from coevolutionary analysis via machine learning to have significantly expanded the ability to predict structures for sequences without templates. One such method, AlphaFold, also performs well on sequences where templates are available but without using such information directly. Here we show that combining machine-learning based models from AlphaFold with state-of-the-art physics-based refinement via molecular dynamics simulations further improves predictions to outperform any other prediction method tested during the latest round of CASP. The resulting models have highly accurate global and local structures, including high accuracy at functionally important interface residues, and they are highly suitable as initial models for crystal structure determination via molecular replacement.
Collapse
Affiliation(s)
- Lim Heo
- Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, Michigan
| | - Michael Feig
- Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, Michigan
| |
Collapse
|
6
|
Shi C, Chen J, Kang X, Zhao G, Lao X, Zheng H. Deep Learning in the Study of Protein-Related Interactions. Protein Pept Lett 2019; 27:359-369. [PMID: 31538879 DOI: 10.2174/0929866526666190723114142] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/05/2019] [Revised: 03/13/2019] [Accepted: 04/05/2019] [Indexed: 11/22/2022]
Abstract
Protein-related interaction prediction is critical to understanding life processes, biological functions, and mechanisms of drug action. Experimental methods used to determine proteinrelated interactions have always been costly and inefficient. In recent years, advances in biological and medical technology have provided us with explosive biological and physiological data, and deep learning-based algorithms have shown great promise in extracting features and learning patterns from complex data. At present, deep learning in protein research has emerged. In this review, we provide an introductory overview of the deep neural network theory and its unique properties. Mainly focused on the application of this technology in protein-related interactions prediction over the past five years, including protein-protein interactions prediction, protein-RNA\DNA, Protein- drug interactions prediction, and others. Finally, we discuss some of the challenges that deep learning currently faces.
Collapse
Affiliation(s)
- Cheng Shi
- School of Life Science and Technology, China Pharmaceutical University, Nanjing 210009, China
| | - Jiaxing Chen
- School of Life Science and Technology, China Pharmaceutical University, Nanjing 210009, China
| | - Xinyue Kang
- School of Life Science and Technology, China Pharmaceutical University, Nanjing 210009, China
| | - Guiling Zhao
- School of Life Science and Technology, China Pharmaceutical University, Nanjing 210009, China
| | - Xingzhen Lao
- School of Life Science and Technology, China Pharmaceutical University, Nanjing 210009, China
| | - Heng Zheng
- School of Life Science and Technology, China Pharmaceutical University, Nanjing 210009, China
| |
Collapse
|
7
|
Zeng H, Wang S, Zhou T, Zhao F, Li X, Wu Q, Xu J. ComplexContact: a web server for inter-protein contact prediction using deep learning. Nucleic Acids Res 2019; 46:W432-W437. [PMID: 29790960 PMCID: PMC6030867 DOI: 10.1093/nar/gky420] [Citation(s) in RCA: 78] [Impact Index Per Article: 15.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/13/2018] [Accepted: 05/20/2018] [Indexed: 12/15/2022] Open
Abstract
ComplexContact (http://raptorx2.uchicago.edu/ComplexContact/) is a web server for sequence-based interfacial residue-residue contact prediction of a putative protein complex. Interfacial residue-residue contacts are critical for understanding how proteins form complex and interact at residue level. When receiving a pair of protein sequences, ComplexContact first searches for their sequence homologs and builds two paired multiple sequence alignments (MSA), then it applies co-evolution analysis and a CASP-winning deep learning (DL) method to predict interfacial contacts from paired MSAs and visualizes the prediction as an image. The DL method was originally developed for intra-protein contact prediction and performed the best in CASP12. Our large-scale experimental test further shows that ComplexContact greatly outperforms pure co-evolution methods for inter-protein contact prediction, regardless of the species.
Collapse
Affiliation(s)
- Hong Zeng
- School of Computer Science and Technology, Hangzhou Dianzi University, China
| | - Sheng Wang
- King Abdullah University of Science and Technology (KAUST), Saudi Arabia.,Toyota Technological Institute at Chicago, USA
| | - Tianming Zhou
- Toyota Technological Institute at Chicago, USA.,Institute for Interdisciplinary Information Sciences, Tsinghua University, China
| | - Feifeng Zhao
- School of Computer Science and Technology, Hangzhou Dianzi University, China
| | - Xiufeng Li
- School of Computer Science and Technology, Hangzhou Dianzi University, China
| | - Qing Wu
- School of Computer Science and Technology, Hangzhou Dianzi University, China
| | - Jinbo Xu
- Toyota Technological Institute at Chicago, USA
| |
Collapse
|
8
|
Heo L, Arbour CF, Feig M. Driven to near-experimental accuracy by refinement via molecular dynamics simulations. Proteins 2019; 87:1263-1275. [PMID: 31197841 DOI: 10.1002/prot.25759] [Citation(s) in RCA: 27] [Impact Index Per Article: 5.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/19/2019] [Revised: 06/01/2019] [Accepted: 06/07/2019] [Indexed: 12/17/2022]
Abstract
Protein model refinement has been an essential part of successful protein structure prediction. Molecular dynamics simulation-based refinement methods have shown consistent improvement of protein models. There had been progress in the extent of refinement for a few years since the idea of ensemble averaging of sampled conformations emerged. There was little progress in CASP12 because conformational sampling was not sufficiently diverse due to harmonic restraints. During CASP13, a new refinement method was tested that achieved significant improvements over CASP12. The new method intended to address previous bottlenecks in the refinement problem by introducing new features. Flat-bottom harmonic restraints replaced harmonic restraints, sampling was performed iteratively, and a new scoring function and selection criteria were used. The new protocol expanded conformational sampling at reduced computational costs. In addition to overall improvements, some models were refined significantly to near-experimental accuracy.
Collapse
Affiliation(s)
- Lim Heo
- Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, Michigan
| | - Collin F Arbour
- Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, Michigan
| | - Michael Feig
- Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, Michigan
| |
Collapse
|
9
|
Huang YJ, Brock KP, Ishida Y, Swapna GVT, Inouye M, Marks DS, Sander C, Montelione GT. Combining Evolutionary Covariance and NMR Data for Protein Structure Determination. Methods Enzymol 2018; 614:363-392. [PMID: 30611430 PMCID: PMC6640129 DOI: 10.1016/bs.mie.2018.11.004] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/20/2022]
Abstract
Accurate protein structure determination by solution-state NMR is challenging for proteins greater than about 20kDa, for which extensive perdeuteration is generally required, providing experimental data that are incomplete (sparse) and ambiguous. However, the massive increase in evolutionary sequence information coupled with advances in methods for sequence covariance analysis can provide reliable residue-residue contact information for a protein from sequence data alone. These "evolutionary couplings (ECs)" can be combined with sparse NMR data to determine accurate 3D protein structures. This hybrid "EC-NMR" method has been developed using NMR data for several soluble proteins and validated by comparison with corresponding reference structures determined by X-ray crystallography and/or conventional NMR methods. For small proteins, only backbone resonance assignments are utilized, while for larger proteins both backbone and some sidechain methyl resonance assignments are generally required. ECs can be combined with sparse NMR data obtained on deuterated, selectively protonated protein samples to provide structures that are more accurate and complete than those obtained using such sparse NMR data alone. EC-NMR also has significant potential for analysis of protein structures from solid-state NMR data and for studies of integral membrane proteins. The requirement that ECs are consistent with NMR data recorded on a specific member of a protein family, under specific conditions, also allows identification of ECs that reflect alternative allosteric or excited states of the protein structure.
Collapse
Affiliation(s)
- Yuanpeng Janet Huang
- Center for Advanced Biotechnology and Medicine, Rutgers, The State University of New Jersey, Piscataway, NJ, United States; Department of Molecular Biology and Biochemistry, Rutgers, The State University of New Jersey, Piscataway, NJ, United States
| | - Kelly P Brock
- Department of Systems Biology, Harvard Medical School, Boston, MA, United States
| | - Yojiro Ishida
- Center for Advanced Biotechnology and Medicine, Rutgers, The State University of New Jersey, Piscataway, NJ, United States; Department of Biochemistry and Molecular Biology, Robert Wood Johnson Medical School, Rutgers, The State University of New Jersey, Piscataway, NJ, United States
| | - Gurla V T Swapna
- Center for Advanced Biotechnology and Medicine, Rutgers, The State University of New Jersey, Piscataway, NJ, United States; Department of Molecular Biology and Biochemistry, Rutgers, The State University of New Jersey, Piscataway, NJ, United States
| | - Masayori Inouye
- Center for Advanced Biotechnology and Medicine, Rutgers, The State University of New Jersey, Piscataway, NJ, United States; Department of Biochemistry and Molecular Biology, Robert Wood Johnson Medical School, Rutgers, The State University of New Jersey, Piscataway, NJ, United States
| | - Debora S Marks
- Department of Systems Biology, Harvard Medical School, Boston, MA, United States
| | - Chris Sander
- Department of Cell Biology, Harvard Medical School and cBio Center, Dana-Farber Cancer Institute, Boston, MA, United States
| | - Gaetano T Montelione
- Center for Advanced Biotechnology and Medicine, Rutgers, The State University of New Jersey, Piscataway, NJ, United States; Department of Molecular Biology and Biochemistry, Rutgers, The State University of New Jersey, Piscataway, NJ, United States; Department of Biochemistry and Molecular Biology, Robert Wood Johnson Medical School, Rutgers, The State University of New Jersey, Piscataway, NJ, United States.
| |
Collapse
|
10
|
Delarue M, Koehl P. Combined approaches from physics, statistics, and computer science for ab initio protein structure prediction: ex unitate vires (unity is strength)? F1000Res 2018; 7. [PMID: 30079234 PMCID: PMC6058471 DOI: 10.12688/f1000research.14870.1] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Accepted: 07/19/2018] [Indexed: 11/20/2022] Open
Abstract
Connecting the dots among the amino acid sequence of a protein, its structure, and its function remains a central theme in molecular biology, as it would have many applications in the treatment of illnesses related to misfolding or protein instability. As a result of high-throughput sequencing methods, biologists currently live in a protein sequence-rich world. However, our knowledge of protein structure based on experimental data remains comparatively limited. As a consequence, protein structure prediction has established itself as a very active field of research to fill in this gap. This field, once thought to be reserved for theoretical biophysicists, is constantly reinventing itself, borrowing ideas informed by an ever-increasing assembly of scientific domains, from biology, chemistry, (statistical) physics, mathematics, computer science, statistics, bioinformatics, and more recently data sciences. We review the recent progress arising from this integration of knowledge, from the development of specific computer architecture to allow for longer timescales in physics-based simulations of protein folding to the recent advances in predicting contacts in proteins based on detection of coevolution using very large data sets of aligned protein sequences.
Collapse
Affiliation(s)
- Marc Delarue
- Unité Dynamique Structurale des Macromolécules, Institut Pasteur, and UMR 3528 du CNRS, Paris, France
| | - Patrice Koehl
- Department of Computer Science, Genome Center, University of California, Davis, Davis, California, USA
| |
Collapse
|