1
|
Erckert K, Rost B. Assessing the role of evolutionary information for enhancing protein language model embeddings. Sci Rep 2024; 14:20692. [PMID: 39237735 PMCID: PMC11377704 DOI: 10.1038/s41598-024-71783-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/14/2024] [Accepted: 08/30/2024] [Indexed: 09/07/2024] Open
Abstract
Embeddings from protein Language Models (pLMs) are replacing evolutionary information from multiple sequence alignments (MSAs) as the most successful input for protein prediction. Is this because embeddings capture evolutionary information? We tested various approaches to explicitly incorporate evolutionary information into embeddings on various protein prediction tasks. While older pLMs (SeqVec, ProtBert) significantly improved through MSAs, the more recent pLM ProtT5 did not benefit. For most tasks, pLM-based outperformed MSA-based methods, and the combination of both even decreased performance for some (intrinsic disorder). We highlight the effectiveness of pLM-based methods and find limited benefits from integrating MSAs.
Collapse
Affiliation(s)
- Kyra Erckert
- TUM School of Computation, Information and Technology, Bioinformatics & Computational Biology - i12, Boltzmannstr. 3, 85748, Garching/Munich, Germany.
- TUM Graduate School, Center of Doctoral Studies in Informatics and Its Applications (CeDoSIA), Boltzmannstr. 11, 85748, Garching, Germany.
| | - Burkhard Rost
- TUM School of Computation, Information and Technology, Bioinformatics & Computational Biology - i12, Boltzmannstr. 3, 85748, Garching/Munich, Germany
- Institute for Advanced Study (TUM-IAS), Lichtenbergstr. 2a, 85748, Garching/Munich, Germany
- TUM School of Life Sciences Weihenstephan (TUM-WZW), Alte Akademie 8, Freising, Germany
| |
Collapse
|
2
|
Zhao L, Li J, Zhan W, Jiang X, Zhang B. Prediction of protein secondary structure by the improved TCN-BiLSTM-MHA model with knowledge distillation. Sci Rep 2024; 14:16488. [PMID: 39020005 PMCID: PMC11255250 DOI: 10.1038/s41598-024-67403-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/03/2024] [Accepted: 07/10/2024] [Indexed: 07/19/2024] Open
Abstract
Secondary structure prediction is a key step in understanding protein function and biological properties and is highly important in the fields of new drug development, disease treatment, bioengineering, etc. Accurately predicting the secondary structure of proteins helps to reveal how proteins are folded and how they function in cells. The application of deep learning models in protein structure prediction is particularly important because of their ability to process complex sequence information and extract meaningful patterns and features, thus significantly improving the accuracy and efficiency of prediction. In this study, a combined model integrating an improved temporal convolutional network (TCN), bidirectional long short-term memory (BiLSTM), and a multi-head attention (MHA) mechanism is proposed to enhance the accuracy of protein prediction in both eight-state and three-state structures. One-hot encoding features and word vector representations of physicochemical properties are incorporated. A significant emphasis is placed on knowledge distillation techniques utilizing the ProtT5 pretrained model, leading to performance improvements. The improved TCN, achieved through multiscale fusion and bidirectional operations, allows for better extraction of amino acid sequence features than traditional TCN models. The model demonstrated excellent prediction performance on multiple datasets. For the TS115, CB513 and PDB (2018-2020) datasets, the prediction accuracy of the eight-state structure of the six datasets in this paper reached 88.2%, 84.9%, and 95.3%, respectively, and the prediction accuracy of the three-state structure reached 91.3%, 90.3%, and 96.8%, respectively. This study not only improves the accuracy of protein secondary structure prediction but also provides an important tool for understanding protein structure and function, which is particularly applicable to resource-constrained contexts and provides a valuable tool for understanding protein structure and function.
Collapse
Affiliation(s)
- Lufei Zhao
- Agricultural Science and Engineering School, Liaocheng University, Liaocheng, 252059, China
| | - Jingyi Li
- School of Statistics and Mathematics, Zhongnan University of Economics and Law, Wuhan, 430073, China
| | - Weiqiang Zhan
- School of Statistics and Mathematics, Zhongnan University of Economics and Law, Wuhan, 430073, China
| | - Xuchu Jiang
- School of Statistics and Mathematics, Zhongnan University of Economics and Law, Wuhan, 430073, China.
- Emergency Management Research Center, Zhongnan University of Economics and Law, Wuhan, 430073, China.
| | - Biao Zhang
- School of Computer Science, Liaocheng University, Liaocheng, 252059, China
| |
Collapse
|
3
|
Planas-Iglesias J, Borko S, Swiatkowski J, Elias M, Havlasek M, Salamon O, Grakova E, Kunka A, Martinovic T, Damborsky J, Martinovic J, Bednar D. AggreProt: a web server for predicting and engineering aggregation prone regions in proteins. Nucleic Acids Res 2024; 52:W159-W169. [PMID: 38801076 PMCID: PMC11223854 DOI: 10.1093/nar/gkae420] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/10/2024] [Revised: 04/23/2024] [Accepted: 05/13/2024] [Indexed: 05/29/2024] Open
Abstract
Recombinant proteins play pivotal roles in numerous applications including industrial biocatalysts or therapeutics. Despite the recent progress in computational protein structure prediction, protein solubility and reduced aggregation propensity remain challenging attributes to design. Identification of aggregation-prone regions is essential for understanding misfolding diseases or designing efficient protein-based technologies, and as such has a great socio-economic impact. Here, we introduce AggreProt, a user-friendly webserver that automatically exploits an ensemble of deep neural networks to predict aggregation-prone regions (APRs) in protein sequences. Trained on experimentally evaluated hexapeptides, AggreProt compares to or outperforms state-of-the-art algorithms on two independent benchmark datasets. The server provides per-residue aggregation profiles along with information on solvent accessibility and transmembrane propensity within an intuitive interface with interactive sequence and structure viewers for comprehensive analysis. We demonstrate AggreProt efficacy in predicting differential aggregation behaviours in proteins on several use cases, which emphasize its potential for guiding protein engineering strategies towards decreased aggregation propensity and improved solubility. The webserver is freely available and accessible at https://loschmidt.chemi.muni.cz/aggreprot/.
Collapse
Affiliation(s)
- Joan Planas-Iglesias
- Loschmidt Laboratories, Department of Experimental Biology and RECETOX, Faculty of Science, Masaryk University, Brno, Czech Republic
- International Clinical Research Center, St. Anne's University Hospital Brno, Brno, Czech Republic
| | - Simeon Borko
- Loschmidt Laboratories, Department of Experimental Biology and RECETOX, Faculty of Science, Masaryk University, Brno, Czech Republic
- International Clinical Research Center, St. Anne's University Hospital Brno, Brno, Czech Republic
| | - Jan Swiatkowski
- IT4Innovations, VSB – Technical University of Ostrava, 17. listopadu 2172/15, 708 00 Ostrava-Poruba, Czech Republic
| | - Matej Elias
- IT4Innovations, VSB – Technical University of Ostrava, 17. listopadu 2172/15, 708 00 Ostrava-Poruba, Czech Republic
| | - Martin Havlasek
- Loschmidt Laboratories, Department of Experimental Biology and RECETOX, Faculty of Science, Masaryk University, Brno, Czech Republic
- International Clinical Research Center, St. Anne's University Hospital Brno, Brno, Czech Republic
| | - Ondrej Salamon
- IT4Innovations, VSB – Technical University of Ostrava, 17. listopadu 2172/15, 708 00 Ostrava-Poruba, Czech Republic
| | - Ekaterina Grakova
- IT4Innovations, VSB – Technical University of Ostrava, 17. listopadu 2172/15, 708 00 Ostrava-Poruba, Czech Republic
| | - Antonín Kunka
- Loschmidt Laboratories, Department of Experimental Biology and RECETOX, Faculty of Science, Masaryk University, Brno, Czech Republic
- International Clinical Research Center, St. Anne's University Hospital Brno, Brno, Czech Republic
| | - Tomas Martinovic
- IT4Innovations, VSB – Technical University of Ostrava, 17. listopadu 2172/15, 708 00 Ostrava-Poruba, Czech Republic
| | - Jiri Damborsky
- Loschmidt Laboratories, Department of Experimental Biology and RECETOX, Faculty of Science, Masaryk University, Brno, Czech Republic
- International Clinical Research Center, St. Anne's University Hospital Brno, Brno, Czech Republic
| | - Jan Martinovic
- IT4Innovations, VSB – Technical University of Ostrava, 17. listopadu 2172/15, 708 00 Ostrava-Poruba, Czech Republic
| | - David Bednar
- Loschmidt Laboratories, Department of Experimental Biology and RECETOX, Faculty of Science, Masaryk University, Brno, Czech Republic
- International Clinical Research Center, St. Anne's University Hospital Brno, Brno, Czech Republic
| |
Collapse
|
4
|
Leśniewski M, Pyrka M, Czaplewski C, Co NT, Jiang Y, Gong Z, Tang C, Liwo A. Assessment of Two Restraint Potentials for Coarse-Grained Chemical-Cross-Link-Assisted Modeling of Protein Structures. J Chem Inf Model 2024; 64:1377-1393. [PMID: 38345917 PMCID: PMC10900291 DOI: 10.1021/acs.jcim.3c01890] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/24/2023] [Revised: 01/20/2024] [Accepted: 01/22/2024] [Indexed: 02/27/2024]
Abstract
The influence of distance restraints from chemical cross-link mass spectroscopy (XL-MS) on the quality of protein structures modeled with the coarse-grained UNRES force field was assessed by using a protocol based on multiplexed replica exchange molecular dynamics, in which both simulated and experimental cross-link restraints were employed, for 23 small proteins. Six cross-links with upper distance boundaries from 4 Å to 12 Å (azido benzoic acid succinimide (ABAS), triazidotriazine (TATA), succinimidyldiazirine (SDA), disuccinimidyl adipate (DSA), disuccinimidyl glutarate (DSG), and disuccinimidyl suberate (BS3)) and two types of restraining potentials ((i) simple flat-bottom Lorentz-like potentials dependent on side chain distance (all cross-links) and (ii) distance- and orientation-dependent potentials determined based on molecular dynamics simulations of model systems (DSA, DSG, BS3, and SDA)) were considered. The Lorentz-like potentials with properly set parameters were found to produce a greater number of higher-quality models compared to unrestrained simulations than the MD-based potentials, because the latter can force too long distances between side chains. Therefore, the flat-bottom Lorentz-like potentials are recommended to represent cross-link restraints. It was also found that significant improvement of model quality upon the introduction of cross-link restraints is obtained when the sum of differences of indices of cross-linked residues exceeds 150.
Collapse
Affiliation(s)
- Mateusz Leśniewski
- Faculty
of Chemistry, University of Gdańsk, Fahrenheit Union of Universities, ul. Wita Stwosza 63, 80-308 Gdańsk, Poland
| | - Maciej Pyrka
- Faculty
of Chemistry, University of Gdańsk, Fahrenheit Union of Universities, ul. Wita Stwosza 63, 80-308 Gdańsk, Poland
- Department
of Physics and Biophysics, University of
Warmia and Mazury, ul. Oczapowskiego 4, 10-719 Olsztyn, Poland
| | - Cezary Czaplewski
- Faculty
of Chemistry, University of Gdańsk, Fahrenheit Union of Universities, ul. Wita Stwosza 63, 80-308 Gdańsk, Poland
| | - Nguyen Truong Co
- Faculty
of Chemistry, University of Gdańsk, Fahrenheit Union of Universities, ul. Wita Stwosza 63, 80-308 Gdańsk, Poland
| | - Yida Jiang
- College
of Chemistry and Molecular Engineering & Center for Quantitative
Biology & PKU-Tsinghua Center for Life Sciences & Beijing
National Laboratory for Molecular Sciences, Peking University, Beijing 100871, China
| | - Zhou Gong
- Innovation
Academy of Precision Measurement Science and Technology, Chinese Academy of Sciences, 30 W. Xiao Hong Shan, Wuhan 430071, China
| | - Chun Tang
- College
of Chemistry and Molecular Engineering & Center for Quantitative
Biology & PKU-Tsinghua Center for Life Sciences & Beijing
National Laboratory for Molecular Sciences, Peking University, Beijing 100871, China
| | - Adam Liwo
- Faculty
of Chemistry, University of Gdańsk, Fahrenheit Union of Universities, ul. Wita Stwosza 63, 80-308 Gdańsk, Poland
| |
Collapse
|
5
|
Yue T, Wang Y, Zhang L, Gu C, Xue H, Wang W, Lyu Q, Dun Y. Deep Learning for Genomics: From Early Neural Nets to Modern Large Language Models. Int J Mol Sci 2023; 24:15858. [PMID: 37958843 PMCID: PMC10649223 DOI: 10.3390/ijms242115858] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/30/2023] [Revised: 10/24/2023] [Accepted: 10/30/2023] [Indexed: 11/15/2023] Open
Abstract
The data explosion driven by advancements in genomic research, such as high-throughput sequencing techniques, is constantly challenging conventional methods used in genomics. In parallel with the urgent demand for robust algorithms, deep learning has succeeded in various fields such as vision, speech, and text processing. Yet genomics entails unique challenges to deep learning, since we expect a superhuman intelligence that explores beyond our knowledge to interpret the genome from deep learning. A powerful deep learning model should rely on the insightful utilization of task-specific knowledge. In this paper, we briefly discuss the strengths of different deep learning models from a genomic perspective so as to fit each particular task with proper deep learning-based architecture, and we remark on practical considerations of developing deep learning architectures for genomics. We also provide a concise review of deep learning applications in various aspects of genomic research and point out current challenges and potential research directions for future genomics applications. We believe the collaborative use of ever-growing diverse data and the fast iteration of deep learning models will continue to contribute to the future of genomics.
Collapse
Affiliation(s)
- Tianwei Yue
- School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA; (Y.W.); (L.Z.); (W.W.)
| | - Yuanxin Wang
- School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA; (Y.W.); (L.Z.); (W.W.)
| | - Longxiang Zhang
- School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA; (Y.W.); (L.Z.); (W.W.)
| | - Chunming Gu
- Department of Biomedical Engineering, School of Medicine, Johns Hopkins University, Baltimore, MD 21218, USA;
| | - Haoru Xue
- The Robotics Institute, Carnegie Mellon University, Pittsburgh, PA 15213, USA;
| | - Wenping Wang
- School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA; (Y.W.); (L.Z.); (W.W.)
| | - Qi Lyu
- Department of Computational Mathematics, Science, and Engineering, Michigan State University, East Lansing, MI 48824, USA;
| | - Yujie Dun
- School of Information and Communications Engineering, Xi’an Jiaotong University, Xi’an 710049, China;
| |
Collapse
|
6
|
Madeo G, Savojardo C, Manfredi M, Martelli PL, Casadio R. CoCoNat: a novel method based on deep learning for coiled-coil prediction. Bioinformatics 2023; 39:btad495. [PMID: 37540220 PMCID: PMC10425188 DOI: 10.1093/bioinformatics/btad495] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/11/2023] [Revised: 07/31/2023] [Accepted: 08/03/2023] [Indexed: 08/05/2023] Open
Abstract
MOTIVATION Coiled-coil domains (CCD) are widespread in all organisms and perform several crucial functions. Given their relevance, the computational detection of CCD is very important for protein functional annotation. State-of-the-art prediction methods include the precise identification of CCD boundaries, the annotation of the typical heptad repeat pattern along the coiled-coil helices as well as the prediction of the oligomerization state. RESULTS In this article, we describe CoCoNat, a novel method for predicting coiled-coil helix boundaries, residue-level register annotation, and oligomerization state. Our method encodes sequences with the combination of two state-of-the-art protein language models and implements a three-step deep learning procedure concatenated with a Grammatical-Restrained Hidden Conditional Random Field for CCD identification and refinement. A final neural network predicts the oligomerization state. When tested on a blind test set routinely adopted, CoCoNat obtains a performance superior to the current state-of-the-art both for residue-level and segment-level CCD. CoCoNat significantly outperforms the most recent state-of-the-art methods on register annotation and prediction of oligomerization states. AVAILABILITY AND IMPLEMENTATION CoCoNat web server is available at https://coconat.biocomp.unibo.it. Standalone version is available on GitHub at https://github.com/BolognaBiocomp/coconat.
Collapse
Affiliation(s)
- Giovanni Madeo
- Biocomputing Group, Department of Pharmacy and Biotechnology, University of Bologna, Italy
| | - Castrense Savojardo
- Biocomputing Group, Department of Pharmacy and Biotechnology, University of Bologna, Italy
| | - Matteo Manfredi
- Biocomputing Group, Department of Pharmacy and Biotechnology, University of Bologna, Italy
| | - Pier Luigi Martelli
- Biocomputing Group, Department of Pharmacy and Biotechnology, University of Bologna, Italy
| | - Rita Casadio
- Biocomputing Group, Department of Pharmacy and Biotechnology, University of Bologna, Italy
| |
Collapse
|
7
|
Gao T, Zhao Y, Zhang L, Wang H. Secondary and Topological Structural Merge Prediction of Alpha-Helical Transmembrane Proteins Using a Hybrid Model Based on Hidden Markov and Long Short-Term Memory Neural Networks. Int J Mol Sci 2023; 24:ijms24065720. [PMID: 36982795 PMCID: PMC10057634 DOI: 10.3390/ijms24065720] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/07/2023] [Revised: 03/11/2023] [Accepted: 03/13/2023] [Indexed: 03/19/2023] Open
Abstract
Alpha-helical transmembrane proteins (αTMPs) play essential roles in drug targeting and disease treatments. Due to the challenges of using experimental methods to determine their structure, αTMPs have far fewer known structures than soluble proteins. The topology of transmembrane proteins (TMPs) can determine the spatial conformation relative to the membrane, while the secondary structure helps to identify their functional domain. They are highly correlated on αTMPs sequences, and achieving a merge prediction is instructive for further understanding the structure and function of αTMPs. In this study, we implemented a hybrid model combining Deep Learning Neural Networks (DNNs) with a Class Hidden Markov Model (CHMM), namely HDNNtopss. DNNs extract rich contextual features through stacked attention-enhanced Bidirectional Long Short-Term Memory (BiLSTM) networks and Convolutional Neural Networks (CNNs), and CHMM captures state-associative temporal features. The hybrid model not only reasonably considers the probability of the state path but also has a fitting and feature-extraction capability for deep learning, which enables flexible prediction and makes the resulting sequence more biologically meaningful. It outperforms current advanced merge-prediction methods with a Q4 of 0.779 and an MCC of 0.673 on the independent test dataset, which have practical, solid significance. In comparison to advanced prediction methods for topological and secondary structures, it achieves the highest topology prediction with a Q2 of 0.884, which has a strong comprehensive performance. At the same time, we implemented a joint training method, Co-HDNNtopss, and achieved a good performance to provide an important reference for similar hybrid-model training.
Collapse
Affiliation(s)
- Ting Gao
- School of Information Science and Technology, Institute of Computational Biology, Northeast Normal University, Changchun 130117, China; (T.G.); (Y.Z.)
| | - Yutong Zhao
- School of Information Science and Technology, Institute of Computational Biology, Northeast Normal University, Changchun 130117, China; (T.G.); (Y.Z.)
| | - Li Zhang
- School of Computer Science and Engineering, Changchun University of Technology, Changchun 130012, China;
| | - Han Wang
- School of Information Science and Technology, Institute of Computational Biology, Northeast Normal University, Changchun 130117, China; (T.G.); (Y.Z.)
- Correspondence:
| |
Collapse
|
8
|
Understanding How Cells Probe the World: A Preliminary Step towards Modeling Cell Behavior? Int J Mol Sci 2023; 24:ijms24032266. [PMID: 36768586 PMCID: PMC9916635 DOI: 10.3390/ijms24032266] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/19/2022] [Revised: 01/16/2023] [Accepted: 01/20/2023] [Indexed: 01/26/2023] Open
Abstract
Cell biologists have long aimed at quantitatively modeling cell function. Recently, the outstanding progress of high-throughput measurement methods and data processing tools has made this a realistic goal. The aim of this paper is twofold: First, to suggest that, while much progress has been done in modeling cell states and transitions, current accounts of environmental cues driving these transitions remain insufficient. There is a need to provide an integrated view of the biochemical, topographical and mechanical information processed by cells to take decisions. It might be rewarding in the near future to try to connect cell environmental cues to physiologically relevant outcomes rather than modeling relationships between these cues and internal signaling networks. The second aim of this paper is to review exogenous signals that are sensed by living cells and significantly influence fate decisions. Indeed, in addition to the composition of the surrounding medium, cells are highly sensitive to the properties of neighboring surfaces, including the spatial organization of anchored molecules and substrate mechanical and topographical properties. These properties should thus be included in models of cell behavior. It is also suggested that attempts at cell modeling could strongly benefit from two research lines: (i) trying to decipher the way cells encode the information they retrieve from environment analysis, and (ii) developing more standardized means of assessing the quality of proposed models, as was done in other research domains such as protein structure prediction.
Collapse
|
9
|
Yuan L, Ma Y, Liu Y. Protein secondary structure prediction based on Wasserstein generative adversarial networks and temporal convolutional networks with convolutional block attention modules. MATHEMATICAL BIOSCIENCES AND ENGINEERING : MBE 2023; 20:2203-2218. [PMID: 36899529 DOI: 10.3934/mbe.2023102] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/18/2023]
Abstract
As an important task in bioinformatics, protein secondary structure prediction (PSSP) is not only beneficial to protein function research and tertiary structure prediction, but also to promote the design and development of new drugs. However, current PSSP methods cannot sufficiently extract effective features. In this study, we propose a novel deep learning model WGACSTCN, which combines Wasserstein generative adversarial network with gradient penalty (WGAN-GP), convolutional block attention module (CBAM) and temporal convolutional network (TCN) for 3-state and 8-state PSSP. In the proposed model, the mutual game of generator and discriminator in WGAN-GP module can effectively extract protein features, and our CBAM-TCN local extraction module can capture key deep local interactions in protein sequences segmented by sliding window technique, and the CBAM-TCN long-range extraction module can further capture the key deep long-range interactions in sequences. We evaluate the performance of the proposed model on seven benchmark datasets. Experimental results show that our model exhibits better prediction performance compared to the four state-of-the-art models. The proposed model has strong feature extraction ability, which can extract important information more comprehensively.
Collapse
Affiliation(s)
- Lu Yuan
- School of Computer Science and Technology, Qilu University of Technology (Shandong Academy of Sciences), Jinan 250353, China
| | - Yuming Ma
- School of Computer Science and Technology, Qilu University of Technology (Shandong Academy of Sciences), Jinan 250353, China
| | - Yihui Liu
- School of Computer Science and Technology, Qilu University of Technology (Shandong Academy of Sciences), Jinan 250353, China
| |
Collapse
|
10
|
Zhao C, Liu T, Wang Z. Predicting residue-specific qualities of individual protein models using residual neural networks and graph neural networks. Proteins 2022; 90:2091-2102. [PMID: 35842895 PMCID: PMC9796650 DOI: 10.1002/prot.26400] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/16/2022] [Revised: 06/24/2022] [Accepted: 07/08/2022] [Indexed: 01/02/2023]
Abstract
The estimation of protein model accuracy (EMA) or model quality assessment (QA) is important for protein structure prediction. An accurate EMA algorithm can guide the refinement of models or pick the best model or best parts of models from a pool of predicted tertiary structures. We developed two novel methods: MASS2 and LAW, for predicting residue-specific or local qualities of individual models, which incorporate residual neural networks and graph neural networks, respectively. These two methods use similar features extracted from protein models but different architectures of neural networks to predict the local accuracies of single models. MASS2 and LAW participated in the QA category of CASP14, and according to our evaluations based on CASP14 official criteria, MASS2 and LAW are the best and second-best methods based on the Z-scores of ASE/100, AUC, and ULR-1.F1. We also evaluated MASS2, LAW, and the residue-specific predicted deviations (between model and native structure) generated by AlphaFold2 on CASP14 AlphaFold2 tertiary structure (TS) models. LAW achieved comparable or better performances compared to the predicted deviations generated by AlphaFold2 on AlphaFold2 TS models, even though LAW was not trained on any AlphaFold2 TS models. Specifically, LAW performed better on AUC and ULR scores, and AlphaFold2 performed better on ASE scores. This means that AlphaFold2 is better at predicting deviations, but LAW is better at classifying accurate and inaccurate residues and detecting unreliable local regions. MASS2 and LAW can be freely accessed from http://dna.cs.miami.edu/MASS2-CASP14/ and http://dna.cs.miami.edu/LAW-CASP14/, respectively.
Collapse
Affiliation(s)
- Chenguang Zhao
- Department of Computer ScienceUniversity of MiamiCoral GablesFloridaUSA
| | - Tong Liu
- Department of Computer ScienceUniversity of MiamiCoral GablesFloridaUSA
| | - Zheng Wang
- Department of Computer ScienceUniversity of MiamiCoral GablesFloridaUSA
| |
Collapse
|
11
|
Yuan L, Hu X, Ma Y, Liu Y. DLBLS_SS: protein secondary structure prediction using deep learning and broad learning system. RSC Adv 2022; 12:33479-33487. [PMID: 36505696 PMCID: PMC9682407 DOI: 10.1039/d2ra06433b] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/12/2022] [Accepted: 11/16/2022] [Indexed: 11/24/2022] Open
Abstract
Protein secondary structure prediction (PSSP) is not only beneficial to the study of protein structure and function but also to the development of drugs. As a challenging task in computational biology, experimental methods for PSSP are time-consuming and expensive. In this paper, we propose a novel PSSP model DLBLS_SS based on deep learning and broad learning system (BLS) to predict 3-state and 8-state secondary structure. We first use a bidirectional long short-term memory (BLSTM) network to extract global features in residue sequences. Then, our proposed SEBTCN based on temporal convolutional networks (TCN) and channel attention can capture bidirectional key long-range dependencies in sequences. We also use BLS to rapidly optimize fused features while further capturing local interactions between residues. We conduct extensive experiments on public test sets including CASP10, CASP11, CASP12, CASP13, CASP14 and CB513 to evaluate the performance of the model. Experimental results show that our model exhibits better 3-state and 8-state PSSP performance compared to five state-of-the-art models.
Collapse
Affiliation(s)
- Lu Yuan
- School of Computer Science and Technology, Qilu University of Technology (Shandong Academy of Sciences) Jinan 250353 China
| | - Xiaopei Hu
- School of Computer Science and Technology, Qilu University of Technology (Shandong Academy of Sciences) Jinan 250353 China
| | - Yuming Ma
- School of Computer Science and Technology, Qilu University of Technology (Shandong Academy of Sciences) Jinan 250353 China
| | - Yihui Liu
- School of Computer Science and Technology, Qilu University of Technology (Shandong Academy of Sciences) Jinan 250353 China
| |
Collapse
|
12
|
Ismi DP, Pulungan R, Afiahayati. Deep learning for protein secondary structure prediction: Pre and post-AlphaFold. Comput Struct Biotechnol J 2022; 20:6271-6286. [PMID: 36420164 PMCID: PMC9678802 DOI: 10.1016/j.csbj.2022.11.012] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/06/2022] [Revised: 11/05/2022] [Accepted: 11/05/2022] [Indexed: 11/13/2022] Open
Abstract
This paper aims to provide a comprehensive review of the trends and challenges of deep neural networks for protein secondary structure prediction (PSSP). In recent years, deep neural networks have become the primary method for protein secondary structure prediction. Previous studies showed that deep neural networks had uplifted the accuracy of three-state secondary structure prediction to more than 80%. Favored deep learning methods, such as convolutional neural networks, recurrent neural networks, inception networks, and graph neural networks, have been implemented in protein secondary structure prediction. Methods adapted from natural language processing (NLP) and computer vision are also employed, including attention mechanism, ResNet, and U-shape networks. In the post-AlphaFold era, PSSP studies focus on different objectives, such as enhancing the quality of evolutionary information and exploiting protein language models as the PSSP input. The recent trend to utilize pre-trained language models as input features for secondary structure prediction provides a new direction for PSSP studies. Moreover, the state-of-the-art accuracy achieved by previous PSSP models is still below its theoretical limit. There are still rooms for improvement to be made in the field.
Collapse
Affiliation(s)
- Dewi Pramudi Ismi
- Department of Computer Science and Electronics, Faculty of Mathematics and Natural Sciences, Universitas Gadjah Mada, Yogyakarta, Indonesia
- Department of Infomatics, Faculty of Industrial Technology, Universitas Ahmad Dahlan, Yogyakarta, Indonesia
| | - Reza Pulungan
- Department of Computer Science and Electronics, Faculty of Mathematics and Natural Sciences, Universitas Gadjah Mada, Yogyakarta, Indonesia
| | - Afiahayati
- Department of Computer Science and Electronics, Faculty of Mathematics and Natural Sciences, Universitas Gadjah Mada, Yogyakarta, Indonesia
| |
Collapse
|
13
|
Stapor K, Kotowski K, Smolarczyk T, Roterman I. Lightweight ProteinUnet2 network for protein secondary structure prediction: a step towards proper evaluation. BMC Bioinformatics 2022; 23:100. [PMID: 35317722 PMCID: PMC8939211 DOI: 10.1186/s12859-022-04623-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/13/2021] [Accepted: 02/28/2022] [Indexed: 11/10/2022] Open
Abstract
Background The prediction of protein secondary structures is a crucial and significant step for ab initio tertiary structure prediction which delivers the information about proteins activity and functions. As the experimental methods are expensive and sometimes impossible, many SS predictors, mainly based on different machine learning methods have been proposed for many years. Currently, most of the top methods use evolutionary-based input features produced by PSSM and HHblits software, although quite recently the embeddings—the new description of protein sequences generated by language models (LM) have appeared that could be leveraged as input features. Apart from input features calculation, the top models usually need extensive computational resources for training and prediction and are barely possible to run on a regular PC. SS prediction as the imbalanced classification problem should not be judged by the commonly used Q3/Q8 metrics. Moreover, as the benchmark datasets are not random samples, the classical statistical null hypothesis testing based on the Neyman–Pearson approach is not appropriate. Results We present a lightweight deep network ProteinUnet2 for SS prediction which is based on U-Net convolutional architecture and evolutionary-based input features (from PSSM and HHblits) as well as SPOT-Contact features. Through an extensive evaluation study, we report the performance of ProteinUnet2 in comparison with top SS prediction methods based on evolutionary information (SAINT and SPOT-1D). We also propose a new statistical methodology for prediction performance assessment based on the significance from Fisher–Pitman permutation tests accompanied by practical significance measured by Cohen’s effect size. Conclusions Our results suggest that ProteinUnet2 architecture has much shorter training and inference times while maintaining results similar to SAINT and SPOT-1D predictors. Taking into account the relatively long times of calculating evolutionary-based features (from PSSM in particular), it would be worth conducting the predictive ability tests on embeddings as input features in the future. We strongly believe that our proposed here statistical methodology for the evaluation of SS prediction results will be adopted and used (and even expanded) by the research community. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-022-04623-z.
Collapse
Affiliation(s)
- Katarzyna Stapor
- Department of Applied Informatics, Silesian University of Technology, Akademicka 16, 44-100, Gliwice, Poland.
| | - Krzysztof Kotowski
- Department of Applied Informatics, Silesian University of Technology, Akademicka 16, 44-100, Gliwice, Poland
| | - Tomasz Smolarczyk
- Department of Applied Informatics, Silesian University of Technology, Akademicka 16, 44-100, Gliwice, Poland
| | - Irena Roterman
- Department of Bioinformatics and Telemedicine, Jagiellonian University Medical College, Medyczna 7, 30-688, Kraków, Poland
| |
Collapse
|
14
|
Feng SH, Xia CQ, Shen HB. CoCoPRED: coiled-coil protein structural feature prediction from amino acid sequence using deep neural networks. Bioinformatics 2022; 38:720-729. [PMID: 34718416 DOI: 10.1093/bioinformatics/btab744] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/01/2021] [Revised: 10/08/2021] [Accepted: 10/27/2021] [Indexed: 02/03/2023] Open
Abstract
MOTIVATION Coiled-coil is composed of two or more helices that are wound around each other. It widely exists in proteins and has been discovered to play a variety of critical roles in biology processes. Generally, there are three types of structural features in coiled-coil: coiled-coil domain (CCD), oligomeric state and register. However, most of the existing computational tools only focus on one of them. RESULTS Here, we describe a new deep learning model, CoCoPRED, which is based on convolutional layers, bidirectional long short-term memory, and attention mechanism. It has three networks, i.e. CCD network, oligomeric state network, and register network, corresponding to the three types of structural features in coiled-coil. This means CoCoPRED has the ability of fulfilling comprehensive prediction for coiled-coil proteins. Through the 5-fold cross-validation experiment, we demonstrate that CoCoPRED can achieve better performance than the state-of-the-art models on both CCD prediction and oligomeric state prediction. Further analysis suggests the CCD prediction may be a performance indicator of the oligomeric state prediction in CoCoPRED. The attention heads in CoCoPRED indicate that registers a, b and e are more crucial for the oligomeric state prediction. AVAILABILITY AND IMPLEMENTATION CoCoPRED is available at http://www.csbio.sjtu.edu.cn/bioinf/CoCoPRED. The datasets used in this research can also be downloaded from the website. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Shi-Hao Feng
- Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai 200240, China
| | - Chun-Qiu Xia
- Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai 200240, China
| | - Hong-Bin Shen
- Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai 200240, China.,Department of Computer Science, Shanghai Jiao Tong University, Key Laboratory of Shanghai Education Commission for Intelligent Interaction and Cognitive Engineering, Shanghai 200240, China
| |
Collapse
|
15
|
Ho CT, Huang YW, Chen TR, Lo CH, Lo WC. Discovering the Ultimate Limits of Protein Secondary Structure Prediction. Biomolecules 2021; 11:1627. [PMID: 34827624 PMCID: PMC8615938 DOI: 10.3390/biom11111627] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/21/2021] [Revised: 10/25/2021] [Accepted: 10/28/2021] [Indexed: 12/29/2022] Open
Abstract
Secondary structure prediction (SSP) of proteins is an important structural biology technique with many applications. There have been ~300 algorithms published in the past seven decades with fierce competition in accuracy. In the first 60 years, the accuracy of three-state SSP rose from ~56% to 81%; after that, it has long stayed at 81-86%. In the 1990s, the theoretical limit of three-state SSP accuracy had been estimated to be 88%. Thus, SSP is now generally considered not challenging or too challenging to improve. However, we found that the limit of three-state SSP might be underestimated. Besides, there is still much room for improving segment-based and eight-state SSPs, but the limits of these emerging topics have not been determined. This work performs large-scale sequence and structural analyses to estimate SSP accuracy limits and assess state-of-the-art SSP methods. The limit of three-state SSP is re-estimated to be ~92%, 4-5% higher than previously expected, indicating that SSP is still challenging. The estimated limit of eight-state SSP is 84-87%. Several proposals for improving future SSP algorithms are made based on our results. We hope that these findings will help move forward the development of SSP and all its applications.
Collapse
Affiliation(s)
- Chia-Tzu Ho
- Institute of Bioinformatics and Systems Biology, National Yang Ming Chiao Tung University, Hsinchu 300, Taiwan; (C.-T.H.); (Y.-W.H.); (T.-R.C.); (C.-H.L.)
| | - Yu-Wei Huang
- Institute of Bioinformatics and Systems Biology, National Yang Ming Chiao Tung University, Hsinchu 300, Taiwan; (C.-T.H.); (Y.-W.H.); (T.-R.C.); (C.-H.L.)
| | - Teng-Ruei Chen
- Institute of Bioinformatics and Systems Biology, National Yang Ming Chiao Tung University, Hsinchu 300, Taiwan; (C.-T.H.); (Y.-W.H.); (T.-R.C.); (C.-H.L.)
| | - Chia-Hua Lo
- Institute of Bioinformatics and Systems Biology, National Yang Ming Chiao Tung University, Hsinchu 300, Taiwan; (C.-T.H.); (Y.-W.H.); (T.-R.C.); (C.-H.L.)
- Department of Biological Science and Technology, National Yang Ming Chiao Tung University, Hsinchu 300, Taiwan
| | - Wei-Cheng Lo
- Institute of Bioinformatics and Systems Biology, National Yang Ming Chiao Tung University, Hsinchu 300, Taiwan; (C.-T.H.); (Y.-W.H.); (T.-R.C.); (C.-H.L.)
- Department of Biological Science and Technology, National Yang Ming Chiao Tung University, Hsinchu 300, Taiwan
- The Center for Bioinformatics Research, National Yang Ming Chiao Tung University, Hsinchu 300, Taiwan
| |
Collapse
|
16
|
Chen TR, Juan SH, Huang YW, Lin YC, Lo WC. A secondary structure-based position-specific scoring matrix applied to the improvement in protein secondary structure prediction. PLoS One 2021; 16:e0255076. [PMID: 34320027 PMCID: PMC8318245 DOI: 10.1371/journal.pone.0255076] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/21/2020] [Accepted: 07/11/2021] [Indexed: 11/18/2022] Open
Abstract
Protein secondary structure prediction (SSP) has a variety of applications; however, there has been relatively limited improvement in accuracy for years. With a vision of moving forward all related fields, we aimed to make a fundamental advance in SSP. There have been many admirable efforts made to improve the machine learning algorithm for SSP. This work thus took a step back by manipulating the input features. A secondary structure element-based position-specific scoring matrix (SSE-PSSM) is proposed, based on which a new set of machine learning features can be established. The feasibility of this new PSSM was evaluated by rigid independent tests with training and testing datasets sharing <25% sequence identities. In all experiments, the proposed PSSM outperformed the traditional amino acid PSSM. This new PSSM can be easily combined with the amino acid PSSM, and the improvement in accuracy was remarkable. Preliminary tests made by combining the SSE-PSSM and well-known SSP methods showed 2.0% and 5.2% average improvements in three- and eight-state SSP accuracies, respectively. If this PSSM can be integrated into state-of-the-art SSP methods, the overall accuracy of SSP may break the current restriction and eventually bring benefit to all research and applications where secondary structure prediction plays a vital role during development. To facilitate the application and integration of the SSE-PSSM with modern SSP methods, we have established a web server and standalone programs for generating SSE-PSSM available at http://10.life.nctu.edu.tw/SSE-PSSM.
Collapse
Affiliation(s)
- Teng-Ruei Chen
- Institute of Bioinformatics and Systems Biology, National Chiao Tung University, Hsinchu, Taiwan
- Institute of Bioinformatics and Systems Biology, National Yang Ming Chiao Tung University, Hsinchu, Taiwan
| | - Sheng-Hung Juan
- Institute of Bioinformatics and Systems Biology, National Chiao Tung University, Hsinchu, Taiwan
- Institute of Bioinformatics and Systems Biology, National Yang Ming Chiao Tung University, Hsinchu, Taiwan
| | - Yu-Wei Huang
- Institute of Bioinformatics and Systems Biology, National Chiao Tung University, Hsinchu, Taiwan
- Institute of Bioinformatics and Systems Biology, National Yang Ming Chiao Tung University, Hsinchu, Taiwan
| | - Yen-Cheng Lin
- Department of Biological Science and Technology, National Chiao Tung University, Hsinchu, Taiwan
- Department of Biological Science and Technology, National Yang Ming Chiao Tung University, Hsinchu, Taiwan
| | - Wei-Cheng Lo
- Institute of Bioinformatics and Systems Biology, National Chiao Tung University, Hsinchu, Taiwan
- Institute of Bioinformatics and Systems Biology, National Yang Ming Chiao Tung University, Hsinchu, Taiwan
- Department of Biological Science and Technology, National Chiao Tung University, Hsinchu, Taiwan
- Department of Biological Science and Technology, National Yang Ming Chiao Tung University, Hsinchu, Taiwan
- The Center for Bioinformatics Research, National Yang Ming Chiao Tung University, Hsinchu, Taiwan
- * E-mail:
| |
Collapse
|
17
|
Chen TR, Lo CH, Juan SH, Lo WC. The influence of dataset homology and a rigorous evaluation strategy on protein secondary structure prediction. PLoS One 2021; 16:e0254555. [PMID: 34260641 PMCID: PMC8279362 DOI: 10.1371/journal.pone.0254555] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/16/2020] [Accepted: 06/29/2021] [Indexed: 11/28/2022] Open
Abstract
The secondary structure prediction (SSP) of proteins has long been an essential structural biology technique with various applications. Despite its vital role in many research and industrial fields, in recent years, as the accuracy of state-of-the-art secondary structure predictors approaches the theoretical upper limit, SSP has been considered no longer challenging or too challenging to make advances. With the belief that the substantial improvement of SSP will move forward many fields depending on it, we conducted this study, which focused on three issues that have not been noticed or thoroughly examined yet but may have affected the reliability of the evaluation of previous SSP algorithms. These issues are all about the sequence homology between or within the developmental and evaluation datasets. We thus designed many different homology layouts of datasets to train and evaluate SSP prediction models. Multiple repeats were performed in each experiment by random sampling. The conclusions obtained with small experimental datasets were verified with large-scale datasets using state-of-the-art SSP algorithms. Very different from the long-established assumption, we discover that the sequence homology between query datasets for training, testing, and independent tests exerts little influence on SSP accuracy. Besides, the sequence homology redundancy between or within most datasets would make the accuracy of an SSP algorithm overestimated, while the redundancy within the reference dataset for extracting predictive features would make the accuracy underestimated. Since the overestimating effects are more significant than the underestimating effect, the accuracy of some SSP methods might have been overestimated. Based on the discoveries, we propose a rigorous procedure for developing SSP algorithms and making reliable evaluations, hoping to bring substantial improvements to future SSP methods and benefit all research and application fields relying on accurate prediction of protein secondary structures.
Collapse
Affiliation(s)
- Teng-Ruei Chen
- Institute of Bioinformatics and Systems Biology, National Chiao Tung University, Hsinchu, Taiwan
- Institute of Bioinformatics and Systems Biology, National Yang Ming Chiao Tung University, Hsinchu, Taiwan
| | - Chia-Hua Lo
- Department of Biological Science and Technology, National Chiao Tung University, Hsinchu, Taiwan
| | - Sheng-Hung Juan
- Institute of Bioinformatics and Systems Biology, National Chiao Tung University, Hsinchu, Taiwan
| | - Wei-Cheng Lo
- Institute of Bioinformatics and Systems Biology, National Chiao Tung University, Hsinchu, Taiwan
- Institute of Bioinformatics and Systems Biology, National Yang Ming Chiao Tung University, Hsinchu, Taiwan
- Department of Biological Science and Technology, National Chiao Tung University, Hsinchu, Taiwan
- Department of Biological Science and Technology, National Yang Ming Chiao Tung University, Hsinchu, Taiwan
- The Center for Bioinformatics Research, National Yang Ming Chiao Tung University, Hsinchu, Taiwan
| |
Collapse
|
18
|
Li B, Mendenhall J, Capra JA, Meiler J. A Multitask Deep-Learning Method for Predicting Membrane Associations and Secondary Structures of Proteins. J Proteome Res 2021; 20:4089-4100. [PMID: 34236204 PMCID: PMC8650144 DOI: 10.1021/acs.jproteome.1c00410] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/29/2022]
Abstract
Prediction of residue-level structural attributes and protein-level structural classes helps model protein tertiary structures and understand protein functions. Existing methods are either specialized on only one class of proteins or developed to predict only a specific type of residue-level attribute. In this work, we develop a new deep-learning method, named Membrane Association and Secondary Structure Predictor (MASSP), for accurately predicting both residue-level structural attributes (secondary structure, location, orientation, and topology) and protein-level structural classes (bitopic, α-helical, β-barrel, and soluble). MASSP integrates a multilayer two-dimensional convolutional neural network (2D-CNN) with a long short-term memory (LSTM) neural network into a multitasking framework. Our comparison shows that MASSP performs equally well or better than the state-of-the-art methods in predicting residue-level secondary structures, boundaries of transmembrane segments, and topology. Furthermore, it achieves outstanding accuracy in predicting protein-level structural classes. MASSP automatically distinguishes the structural classes of input sequences and identifies transmembrane segments and topologies if present, making it broadly applicable to different classes of proteins. In summary, MASSP's good performance and broad applicability make it well suited for annotating residue-level attributes and protein-level structural classes at the proteome scale.
Collapse
Affiliation(s)
- Bian Li
- Department of Biological Sciences, Vanderbilt University, Nashville, Tennessee 37203, United States.,Center for Structural Biology, Vanderbilt University, Nashville, Tennessee 37203, United States
| | - Jeffrey Mendenhall
- Center for Structural Biology, Vanderbilt University, Nashville, Tennessee 37203, United States.,Department of Chemistry, Vanderbilt University, Nashville, Tennessee 37203, United States
| | - John A Capra
- Bakar Computational Health Sciences Institute and Department of Epidemiology and Biostatistics, University of California, San Francisco, California 94143, United States
| | - Jens Meiler
- Center for Structural Biology, Vanderbilt University, Nashville, Tennessee 37203, United States.,Department of Chemistry, Vanderbilt University, Nashville, Tennessee 37203, United States.,Institute for Drug Discovery, University Leipzig Medical School, Leipzig 04109, Germany
| |
Collapse
|
19
|
Prabakaran R, Rawat P, Kumar S, Gromiha MM. Evaluation of in silico tools for the prediction of protein and peptide aggregation on diverse datasets. Brief Bioinform 2021; 22:6309925. [PMID: 34181000 DOI: 10.1093/bib/bbab240] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/08/2021] [Revised: 05/18/2021] [Accepted: 06/02/2021] [Indexed: 01/09/2023] Open
Abstract
Several prediction algorithms and tools have been developed in the last two decades to predict protein and peptide aggregation. These in silico tools aid to predict the aggregation propensity and amyloidogenicity as well as the identification of aggregation-prone regions. Despite the immense interest in the field, it is of prime importance to systematically compare these algorithms for their performance. In this review, we have provided a rigorous performance analysis of nine prediction tools using a variety of assessments. The assessments were carried out on several non-redundant datasets ranging from hexapeptides to protein sequences as well as amyloidogenic antibody light chains to soluble protein sequences. Our analysis reveals the robustness of the current prediction tools and the scope for improvement in their predictive performances. Insights gained from this work provide critical guidance to the scientific community on advantages and limitations of different aggregation prediction methods and make informed decisions about their research needs.
Collapse
Affiliation(s)
| | | | - Sandeep Kumar
- Department of Biotherapeutics Discovery in Boehringer-Ingelheim Pharmaceutical Inc., Ridgefield, CT, USA
| | | |
Collapse
|
20
|
Simm D, Hatje K, Waack S, Kollmar M. Critical assessment of coiled-coil predictions based on protein structure data. Sci Rep 2021; 11:12439. [PMID: 34127723 PMCID: PMC8203680 DOI: 10.1038/s41598-021-91886-w] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/03/2021] [Accepted: 05/28/2021] [Indexed: 02/05/2023] Open
Abstract
Coiled-coil regions were among the first protein motifs described structurally and theoretically. The simplicity of the motif promises that coiled-coil regions can be detected with reasonable accuracy and precision in any protein sequence. Here, we re-evaluated the most commonly used coiled-coil prediction tools with respect to the most comprehensive reference data set available, the entire Protein Data Bank, down to each amino acid and its secondary structure. Apart from the 30-fold difference in minimum and maximum number of coiled coils predicted the tools strongly vary in where they predict coiled-coil regions. Accordingly, there is a high number of false predictions and missed, true coiled-coil regions. The evaluation of the binary classification metrics in comparison with naïve coin-flip models and the calculation of the Matthews correlation coefficient, the most reliable performance metric for imbalanced data sets, suggests that the tested tools' performance is close to random. This implicates that the tools' predictions have only limited informative value. Coiled-coil predictions are often used to interpret biochemical data and are part of in-silico functional genome annotation. Our results indicate that these predictions should be treated very cautiously and need to be supported and validated by experimental evidence.
Collapse
Affiliation(s)
- Dominic Simm
- grid.418140.80000 0001 2104 4211Group Systems Biology of Motor Proteins, Department of NMR-Based Structural Biology, Max-Planck-Institute for Biophysical Chemistry, Göttingen, Germany ,grid.7450.60000 0001 2364 4210Theoretical Computer Science and Algorithmic Methods, Institute of Computer Science, Georg-August-University Göttingen, Göttingen, Germany
| | - Klas Hatje
- grid.418140.80000 0001 2104 4211Group Systems Biology of Motor Proteins, Department of NMR-Based Structural Biology, Max-Planck-Institute for Biophysical Chemistry, Göttingen, Germany ,grid.417570.00000 0004 0374 1269Present Address: Roche Pharmaceutical Research and Early Development, Pharmaceutical Sciences, Roche Innovation Center Basel, F. Hoffmann-La Roche Ltd., Basel, Switzerland
| | - Stephan Waack
- grid.7450.60000 0001 2364 4210Theoretical Computer Science and Algorithmic Methods, Institute of Computer Science, Georg-August-University Göttingen, Göttingen, Germany
| | - Martin Kollmar
- grid.418140.80000 0001 2104 4211Group Systems Biology of Motor Proteins, Department of NMR-Based Structural Biology, Max-Planck-Institute for Biophysical Chemistry, Göttingen, Germany ,grid.7450.60000 0001 2364 4210Theoretical Computer Science and Algorithmic Methods, Institute of Computer Science, Georg-August-University Göttingen, Göttingen, Germany
| |
Collapse
|
21
|
Görmez Y, Sabzekar M, Aydın Z. IGPRED: Combination of convolutional neural and graph convolutional networks for protein secondary structure prediction. Proteins 2021; 89:1277-1288. [PMID: 33993559 DOI: 10.1002/prot.26149] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/07/2021] [Revised: 04/21/2021] [Accepted: 05/11/2021] [Indexed: 11/10/2022]
Abstract
There is a close relationship between the tertiary structure and the function of a protein. One of the important steps to determine the tertiary structure is protein secondary structure prediction (PSSP). For this reason, predicting secondary structure with higher accuracy will give valuable information about the tertiary structure. Recently, deep learning techniques have obtained promising improvements in several machine learning applications including PSSP. In this article, a novel deep learning model, based on convolutional neural network and graph convolutional network is proposed. PSIBLAST PSSM, HHMAKE PSSM, physico-chemical properties of amino acids are combined with structural profiles to generate a rich feature set. Furthermore, the hyper-parameters of the proposed network are optimized using Bayesian optimization. The proposed model IGPRED obtained 89.19%, 86.34%, 87.87%, 85.76%, and 86.54% Q3 accuracies for CullPDB, EVAset, CASP10, CASP11, and CASP12 datasets, respectively.
Collapse
Affiliation(s)
- Yasin Görmez
- Faculty of Economics and Administrative Sciences, Management Information Systems, Sivas Cumhuriyet University, Sivas, Turkey
| | - Mostafa Sabzekar
- Department of Computer Engineering, Birjand University of Technology, Birjand, Iran
| | - Zafer Aydın
- Engineering Faculty, Computer Engineering Department, Abdullah Gül University, Kayseri, Turkey
| |
Collapse
|
22
|
Prabakaran R, Rawat P, Thangakani AM, Kumar S, Gromiha MM. Protein aggregation: in silico algorithms and applications. Biophys Rev 2021; 13:71-89. [PMID: 33747245 PMCID: PMC7930180 DOI: 10.1007/s12551-021-00778-w] [Citation(s) in RCA: 36] [Impact Index Per Article: 12.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/06/2020] [Accepted: 01/01/2021] [Indexed: 01/08/2023] Open
Abstract
Protein aggregation is a topic of immense interest to the scientific community due to its role in several neurodegenerative diseases/disorders and industrial importance. Several in silico techniques, tools, and algorithms have been developed to predict aggregation in proteins and understand the aggregation mechanisms. This review attempts to provide an essence of the vast developments in in silico approaches, resources available, and future perspectives. It reviews aggregation-related databases, mechanistic models (aggregation-prone region and aggregation propensity prediction), kinetic models (aggregation rate prediction), and molecular dynamics studies related to aggregation. With a multitude of prediction models related to aggregation already available to the scientific community, the field of protein aggregation is rapidly maturing to tackle new applications.
Collapse
Affiliation(s)
- R. Prabakaran
- Department of Biotechnology, Indian Institute of Technology Madras, Chennai, Tamil Nadu India
| | - Puneet Rawat
- Department of Biotechnology, Indian Institute of Technology Madras, Chennai, Tamil Nadu India
| | - A. Mary Thangakani
- Department of Biotechnology, Indian Institute of Technology Madras, Chennai, Tamil Nadu India
| | - Sandeep Kumar
- Biotherapeutics Discovery, Boehringer Ingelheim Pharmaceutical Inc., Ridgefield, CT USA
| | - M. Michael Gromiha
- Department of Biotechnology, Indian Institute of Technology Madras, Chennai, Tamil Nadu India
- School of Computing, Institute of Innovative Research, Tokyo Institute of Technology, Yokohama, Kanagawa Japan
| |
Collapse
|
23
|
Madeo G, Savojardo C, Martelli PL, Casadio R. BetAware-Deep: An Accurate Web Server for Discrimination and Topology Prediction of Prokaryotic Transmembrane β-barrel Proteins. J Mol Biol 2020; 433:166729. [PMID: 33972021 DOI: 10.1016/j.jmb.2020.166729] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/30/2020] [Revised: 11/27/2020] [Accepted: 11/30/2020] [Indexed: 11/25/2022]
Abstract
TransMembrane β-Barrel (TMBB) proteins located in the outer membranes of Gram-negative bacteria are crucial for many important biological processes and primary candidates as drug targets. Structure determination of TMBB proteins is challenging and hence computational methods devised for the analysis of TMBB proteins are important for complementing experimental approaches. Here, we present a novel web server called BetAware-Deep that is able to accurately identify the topology of TMBB proteins (i.e. the number and orientation of membrane-spanning segments along the protein sequence) and to discriminate them from other protein types. The method in BetAware-Deep defines new features by exploiting a non-canonical computation of the hydrophobic moment and by adopting sequence-profile weighting of the White&Wimley hydrophobicity scale. These features are processed using a two-step approach based on deep learning and probabilistic graphical models. BetAware-Deep has been trained on a dataset comprising 58 TMBBs and benchmarked on a novel set of 15 TMBB proteins. Results showed that BetAware-Deep outperforms two recently released state-of-the-art methods for topology prediction, predicting correct topologies of 10 out of 15 proteins. TMBB detection was also assessed on a larger dataset comprising 1009 TMBB proteins and 7571 non-TMBB proteins. Even in this benchmark, BetAware-Deep scored at the level of top-performing methods. A web server has been developed allowing users to analyze input protein sequences and providing topology prediction together with a rich set of information including a graphical representation of the residue-level annotations and prediction probabilities. BetAware-Deep is available at https://busca.biocomp.unibo.it/betaware2.
Collapse
Affiliation(s)
- Giovanni Madeo
- Biocomputing Group, Department of Pharmacy and Biotechnology, University of Bologna, Italy
| | - Castrense Savojardo
- Biocomputing Group, Department of Pharmacy and Biotechnology, University of Bologna, Italy
| | - Pier Luigi Martelli
- Biocomputing Group, Department of Pharmacy and Biotechnology, University of Bologna, Italy.
| | - Rita Casadio
- Biocomputing Group, Department of Pharmacy and Biotechnology, University of Bologna, Italy; Institute of Biomembranes, Bioenergetics and Molecular Biotechnologies, National Research Council (CNR), Bari, Italy
| |
Collapse
|
24
|
Prabakaran R, Rawat P, Kumar S, Michael Gromiha M. ANuPP: A Versatile Tool to Predict Aggregation Nucleating Regions in Peptides and Proteins. J Mol Biol 2020; 433:166707. [PMID: 33972019 DOI: 10.1016/j.jmb.2020.11.006] [Citation(s) in RCA: 23] [Impact Index Per Article: 5.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/19/2020] [Revised: 10/28/2020] [Accepted: 11/05/2020] [Indexed: 12/22/2022]
Abstract
Short aggregation prone sequence motifs can trigger aggregation in peptide and protein sequences. Most algorithms developed so far to identify potential aggregation prone regions (APRs) use amino acid residue composition and/or sequence pattern features. In this work, we have investigated the importance of atomic-level characteristics rather than residue level to understand the initiation of aggregation in proteins and peptides. Using atomic-level features an ensemble-classifier, ANuPP has been developed to predict the aggregation-nucleating regions in peptides and proteins. In a dataset of 1279 hexapeptides, ANuPP achieved an area under the curve (AUC) of 0.831 with 77% accuracy on 10-fold cross-validation and an AUC of 0.883 with 83% accuracy in a blind test dataset of 142 hexapeptides. Further, it showed an average SOV of 48.7% on identifying APR regions in 37 proteins. The performance of ANuPP is better than other methods reported in the literature on both amyloidogenic hexapeptide prediction and APR identification. We have developed a web server for ANuPP and it is available at https://web.iitm.ac.in/bioinfo2/ANuPP/. Insights gained from this work demonstrate the importance of atomic and functional group characteristics towards diversity of atomic level origins as well as mechanisms of protein aggregation.
Collapse
Affiliation(s)
- R Prabakaran
- Protein Bioinformatics Lab, Department of Biotechnology, Indian Institute of Technology Madras, Chennai, Tamil Nadu, India
| | - Puneet Rawat
- Protein Bioinformatics Lab, Department of Biotechnology, Indian Institute of Technology Madras, Chennai, Tamil Nadu, India
| | - Sandeep Kumar
- Biotherapeutics Discovery, Boehringer Ingelheim Pharmaceutical Inc., Ridgefield, CT, USA.
| | - M Michael Gromiha
- Protein Bioinformatics Lab, Department of Biotechnology, Indian Institute of Technology Madras, Chennai, Tamil Nadu, India; School of Computing, Institute of Innovative Research, Tokyo Institute of Technology, Yokohama, Kanagawa, Japan.
| |
Collapse
|
25
|
Zhang C, Zheng W, Mortuza SM, Li Y, Zhang Y. DeepMSA: constructing deep multiple sequence alignment to improve contact prediction and fold-recognition for distant-homology proteins. Bioinformatics 2020; 36:2105-2112. [PMID: 31738385 DOI: 10.1093/bioinformatics/btz863] [Citation(s) in RCA: 110] [Impact Index Per Article: 27.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/19/2019] [Revised: 10/17/2019] [Accepted: 11/15/2019] [Indexed: 12/23/2022] Open
Abstract
MOTIVATION The success of genome sequencing techniques has resulted in rapid explosion of protein sequences. Collections of multiple homologous sequences can provide critical information to the modeling of structure and function of unknown proteins. There are however no standard and efficient pipeline available for sensitive multiple sequence alignment (MSA) collection. This is particularly challenging when large whole-genome and metagenome databases are involved. RESULTS We developed DeepMSA, a new open-source method for sensitive MSA construction, which has homologous sequences and alignments created from multi-sources of whole-genome and metagenome databases through complementary hidden Markov model algorithms. The practical usefulness of the pipeline was examined in three large-scale benchmark experiments based on 614 non-redundant proteins. First, DeepMSA was utilized to generate MSAs for residue-level contact prediction by six coevolution and deep learning-based programs, which resulted in an accuracy increase in long-range contacts by up to 24.4% compared to the default programs. Next, multiple threading programs are performed for homologous structure identification, where the average TM-score of the template alignments has over 7.5% increases with the use of the new DeepMSA profiles. Finally, DeepMSA was used for secondary structure prediction and resulted in statistically significant improvements in the Q3 accuracy. It is noted that all these improvements were achieved without re-training the parameters and neural-network models, demonstrating the robustness and general usefulness of the DeepMSA in protein structural bioinformatics applications, especially for targets without homologous templates in the PDB library. AVAILABILITY AND IMPLEMENTATION https://zhanglab.ccmb.med.umich.edu/DeepMSA/. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Chengxin Zhang
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA
| | - Wei Zheng
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA
| | - S M Mortuza
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA
| | - Yang Li
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA.,School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
| | - Yang Zhang
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA.,Department of Biological Chemistry, University of Michigan, Ann Arbor, MI 48109, USA
| |
Collapse
|
26
|
Guo Z, Hou J, Cheng J. DNSS2: Improved ab initio protein secondary structure prediction using advanced deep learning architectures. Proteins 2020; 89:207-217. [PMID: 32893403 DOI: 10.1002/prot.26007] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/05/2020] [Revised: 07/07/2020] [Accepted: 09/02/2020] [Indexed: 12/27/2022]
Abstract
Accurate prediction of protein secondary structure (alpha-helix, beta-strand and coil) is a crucial step for protein inter-residue contact prediction and ab initio tertiary structure prediction. In a previous study, we developed a deep belief network-based protein secondary structure method (DNSS1) and successfully advanced the prediction accuracy beyond 80%. In this work, we developed multiple advanced deep learning architectures (DNSS2) to further improve secondary structure prediction. The major improvements over the DNSS1 method include (a) designing and integrating six advanced one-dimensional deep convolutional/recurrent/residual/memory/fractal/inception networks to predict 3-state and 8-state secondary structure, and (b) using more sensitive profile features inferred from Hidden Markov model (HMM) and multiple sequence alignment (MSA). Most of the deep learning architectures are novel for protein secondary structure prediction. DNSS2 was systematically benchmarked on independent test data sets with eight state-of-art tools and consistently ranked as one of the best methods. Particularly, DNSS2 was tested on the protein targets of 2018 CASP13 experiment and achieved the Q3 score of 81.62%, SOV score of 72.19%, and Q8 score of 73.28%. DNSS2 is freely available at: https://github.com/multicom-toolbox/DNSS2.
Collapse
Affiliation(s)
- Zhiye Guo
- Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, Missouri, USA
| | - Jie Hou
- Department of Computer Science, Saint Louis University, St. Louis, Missouri, USA
| | - Jianlin Cheng
- Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, Missouri, USA
| |
Collapse
|
27
|
Kashani-Amin E, Tabatabaei-Malazy O, Sakhteman A, Larijani B, Ebrahim-Habibi A. A Systematic Review on Popularity, Application and Characteristics of Protein Secondary Structure Prediction Tools. Curr Drug Discov Technol 2020; 16:159-172. [PMID: 29493456 DOI: 10.2174/1570163815666180227162157] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/14/2017] [Revised: 02/15/2018] [Accepted: 02/22/2018] [Indexed: 01/22/2023]
Abstract
BACKGROUND Prediction of proteins' secondary structure is one of the major steps in the generation of homology models. These models provide structural information which is used to design suitable ligands for potential medicinal targets. However, selecting a proper tool between multiple Secondary Structure Prediction (SSP) options is challenging. The current study is an insight into currently favored methods and tools, within various contexts. OBJECTIVE A systematic review was performed for a comprehensive access to recent (2013-2016) studies which used or recommended protein SSP tools. METHODS Three databases, Web of Science, PubMed and Scopus were systematically searched and 99 out of the 209 studies were finally found eligible to extract data. RESULTS Four categories of applications for 59 retrieved SSP tools were: (I) prediction of structural features of a given sequence, (II) evaluation of a method, (III) providing input for a new SSP method and (IV) integrating an SSP tool as a component for a program. PSIPRED was found to be the most popular tool in all four categories. JPred and tools utilizing PHD (Profile network from HeiDelberg) method occupied second and third places of popularity in categories I and II. JPred was only found in the two first categories, while PHD was present in three fields. CONCLUSION This study provides a comprehensive insight into the recent usage of SSP tools which could be helpful for selecting a proper tool.
Collapse
Affiliation(s)
- Elaheh Kashani-Amin
- Biosensor Research Center, Endocrinology and Metabolism Molecular-Cellular Sciences Institute, Tehran University of Medical Sciences, Tehran, Iran
| | - Ozra Tabatabaei-Malazy
- Non-Communicable Diseases Research Center, Endocrinology and Metabolism Population Sciences Institute, Tehran University of Medical Sciences, Tehran, Iran.,Endocrinology and Metabolism Research Center, Endocrinology and Metabolism Clinical Sciences Institute, Tehran University of Medical Sciences, Tehran, Iran
| | - Amirhossein Sakhteman
- Department of Medicinal Chemistry, School of Pharmacy, Shiraz University of Medical Sciences, Shiraz, Iran.,Medicinal Chemistry and Natural Products Research Center, Shiraz University of Medical Sciences, Shiraz, Iran
| | - Bagher Larijani
- Endocrinology and Metabolism Research Center, Endocrinology and Metabolism Clinical Sciences Institute, Tehran University of Medical Sciences, Tehran, Iran
| | - Azadeh Ebrahim-Habibi
- Biosensor Research Center, Endocrinology and Metabolism Molecular-Cellular Sciences Institute, Tehran University of Medical Sciences, Tehran, Iran
| |
Collapse
|
28
|
Tamposis IA, Tsirigos KD, Theodoropoulou MC, Kontou PI, Tsaousis GN, Sarantopoulou D, Litou ZI, Bagos PG. JUCHMME: a Java Utility for Class Hidden Markov Models and Extensions for biological sequence analysis. Bioinformatics 2020; 35:5309-5312. [PMID: 31250907 DOI: 10.1093/bioinformatics/btz533] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/04/2019] [Revised: 05/04/2019] [Accepted: 06/25/2019] [Indexed: 11/14/2022] Open
Abstract
SUMMARY JUCHMME is an open-source software package designed to fit arbitrary custom Hidden Markov Models (HMMs) with a discrete alphabet of symbols. We incorporate a large collection of standard algorithms for HMMs as well as a number of extensions and evaluate the software on various biological problems. Importantly, the JUCHMME toolkit includes several additional features that allow for easy building and evaluation of custom HMMs, which could be a useful resource for the research community. AVAILABILITY AND IMPLEMENTATION http://www.compgen.org/tools/juchmme, https://github.com/pbagos/juchmme. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Ioannis A Tamposis
- Department of Computer Science and Biomedical Informatics, University of Thessaly, Lamia, Greece
| | - Konstantinos D Tsirigos
- Department of Health Technology, Section for Bioinformatics, Technical University of Denmark, Kgs Lyngby, Denmark
| | | | - Panagiota I Kontou
- Department of Computer Science and Biomedical Informatics, University of Thessaly, Lamia, Greece
| | | | - Dimitra Sarantopoulou
- Institute of Translational Medicine and Therapeutics, University of Pennsylvania, Philadelphia, PA, USA
| | - Zoi I Litou
- Section of Cell Biology and Biophysics, Department of Biology, School of Science, National and Kapodistrian University of Athens, Athens, Greece
| | - Pantelis G Bagos
- Department of Computer Science and Biomedical Informatics, University of Thessaly, Lamia, Greece
| |
Collapse
|
29
|
Liu T, Wang Z. MASS: predict the global qualities of individual protein models using random forests and novel statistical potentials. BMC Bioinformatics 2020; 21:246. [PMID: 32631256 PMCID: PMC7336608 DOI: 10.1186/s12859-020-3383-3] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/21/2020] [Accepted: 01/22/2020] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Protein model quality assessment (QA) is an essential procedure in protein structure prediction. QA methods can predict the qualities of protein models and identify good models from decoys. Clustering-based methods need a certain number of models as input. However, if a pool of models are not available, methods that only need a single model as input are indispensable. RESULTS We developed MASS, a QA method to predict the global qualities of individual protein models using random forests and various novel energy functions. We designed six novel energy functions or statistical potentials that can capture the structural characteristics of a protein model, which can also be used in other protein-related bioinformatics research. MASS potentials demonstrated higher importance than the energy functions of RWplus, GOAP, DFIRE and Rosetta when the scores they generated are used as machine learning features. MASS outperforms almost all of the four CASP11 top-performing single-model methods for global quality assessment in terms of all of the four evaluation criteria officially used by CASP, which measure the abilities to assign relative and absolute scores, identify the best model from decoys, and distinguish between good and bad models. MASS has also achieved comparable performances with the leading QA methods in CASP12 and CASP13. CONCLUSIONS MASS and the source code for all MASS potentials are publicly available at http://dna.cs.miami.edu/MASS/ .
Collapse
Affiliation(s)
- Tong Liu
- Department of Computer Science, University of Miami, 1365 Memorial Drive, P.O. Box 248154, Coral Gables, FL, 33124, USA
| | - Zheng Wang
- Department of Computer Science, University of Miami, 1365 Memorial Drive, P.O. Box 248154, Coral Gables, FL, 33124, USA.
| |
Collapse
|
30
|
Juan SH, Chen TR, Lo WC. A simple strategy to enhance the speed of protein secondary structure prediction without sacrificing accuracy. PLoS One 2020; 15:e0235153. [PMID: 32603341 PMCID: PMC7326220 DOI: 10.1371/journal.pone.0235153] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/17/2019] [Accepted: 06/09/2020] [Indexed: 01/06/2023] Open
Abstract
The secondary structure prediction of proteins is a classic topic of computational structural biology with a variety of applications. During the past decade, the accuracy of prediction achieved by state-of-the-art algorithms has been >80%; meanwhile, the time cost of prediction increased rapidly because of the exponential growth of fundamental protein sequence data. Based on literature studies and preliminary observations on the relationships between the size/homology of the fundamental protein dataset and the speed/accuracy of predictions, we raised two hypotheses that might be helpful to determine the main influence factors of the efficiency of secondary structure prediction. Experimental results of size and homology reductions of the fundamental protein dataset supported those hypotheses. They revealed that shrinking the size of the dataset could substantially cut down the time cost of prediction with a slight decrease of accuracy, which could be increased on the contrary by homology reduction of the dataset. Moreover, the Shannon information entropy could be applied to explain how accuracy was influenced by the size and homology of the dataset. Based on these findings, we proposed that a proper combination of size and homology reductions of the protein dataset could speed up the secondary structure prediction while preserving the high accuracy of state-of-the-art algorithms. Testing the proposed strategy with the fundamental protein dataset of the year 2018 provided by the Universal Protein Resource, the speed of prediction was enhanced over 20 folds while all accuracy measures remained equivalently high. These findings are supposed helpful for improving the efficiency of researches and applications depending on the secondary structure prediction of proteins. To make future implementations of the proposed strategy easy, we have established a database of size and homology reduced protein datasets at http://10.life.nctu.edu.tw/UniRefNR.
Collapse
Affiliation(s)
- Sheng-Hung Juan
- Institute of Bioinformatics and Systems Biology, National Chiao Tung University, Hsinchu, Taiwan
| | - Teng-Ruei Chen
- Institute of Bioinformatics and Systems Biology, National Chiao Tung University, Hsinchu, Taiwan
| | - Wei-Cheng Lo
- Institute of Bioinformatics and Systems Biology, National Chiao Tung University, Hsinchu, Taiwan
- Department of Biological Science and Technology, National Chiao Tung University, Hsinchu, Taiwan
- The Center for Bioinformatics Research, National Chiao Tung University, Hsinchu, Taiwan
| |
Collapse
|
31
|
Horvath A, Miskei M, Ambrus V, Vendruscolo M, Fuxreiter M. Sequence-based prediction of protein binding mode landscapes. PLoS Comput Biol 2020; 16:e1007864. [PMID: 32453748 PMCID: PMC7304629 DOI: 10.1371/journal.pcbi.1007864] [Citation(s) in RCA: 28] [Impact Index Per Article: 7.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/13/2020] [Revised: 06/19/2020] [Accepted: 04/09/2020] [Indexed: 02/04/2023] Open
Abstract
Interactions between disordered proteins involve a wide range of changes in the structure and dynamics of the partners involved. These changes can be classified in terms of binding modes, which include disorder-to-order (DO) transitions, when proteins fold upon binding, as well as disorder-to-disorder (DD) transitions, when the conformational heterogeneity is maintained in the bound states. Furthermore, systematic studies of these interactions are revealing that proteins may exhibit different binding modes with different partners. Proteins that exhibit this context-dependent binding can be referred to as fuzzy proteins. Here we investigate amino acid code for fuzzy binding in terms of the entropy of the probability distribution of transitions towards decreasing order. We implement these entropy calculations into the FuzPred (http://protdyn-fuzpred.org) algorithm to predict the range of context-dependent binding modes of proteins from their amino acid sequences. As we illustrate through a variety of examples, this method identifies those binding sites that are sensitive to the cellular context or post-translational modifications, and may serve as regulatory points of cellular pathways. Great advances have been made in the last several decades in deciphering how the behavior of proteins is encoded in their amino acid sequences. A variety of sequence-based prediction methods have been developed to estimate a wide range of properties of proteins, including secondary structure propensity, native state structures, preference for being disordered and tendency to aggregate. Much less is known, however, about the rules that regulate the conformational changes of proteins upon binding. In particular, many proteins change their binding modes upon interacting with different partners, or as a consequence of post-translational modifications or changes in the cellular milieu. Here we address the problem of how amino acid sequences can encode different binding modes depending on their binding partners, and describe the FuzPred method of predicting context-dependent binding modes.
Collapse
Affiliation(s)
- Attila Horvath
- MTA-DE Laboratory of Protein Dynamics, Department of Biochemistry and Molecular Biology, University of Debrecen, Debrecen, Hungary
- The John Curtin School of Medical Research, The Australian National University, Canberra, Australia
| | - Marton Miskei
- MTA-DE Laboratory of Protein Dynamics, Department of Biochemistry and Molecular Biology, University of Debrecen, Debrecen, Hungary
| | - Viktor Ambrus
- MTA-DE Laboratory of Protein Dynamics, Department of Biochemistry and Molecular Biology, University of Debrecen, Debrecen, Hungary
| | - Michele Vendruscolo
- Centre for Misfolding Diseases, Department of Chemistry, University of Cambridge, Cambridge, United Kingdom
- * E-mail: (MV); (MF)
| | - Monika Fuxreiter
- MTA-DE Laboratory of Protein Dynamics, Department of Biochemistry and Molecular Biology, University of Debrecen, Debrecen, Hungary
- * E-mail: (MV); (MF)
| |
Collapse
|
32
|
Matsuo K, Gekko K. Circular-Dichroism and Synchrotron-Radiation Circular-Dichroism Spectroscopy as Tools to Monitor Protein Structure in a Lipid Environment. Methods Mol Biol 2020; 2003:253-279. [PMID: 31218622 DOI: 10.1007/978-1-4939-9512-7_12] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 03/31/2023]
Abstract
Circular-dichroism (CD) spectroscopy is a powerful tool for the secondary-structure analysis of proteins. The structural information obtained by CD does not have atomic-level resolution (unlike X-ray crystallography and NMR spectroscopy), but it has the great advantage of being applicable to both nonnative and native proteins in a wide range of solution conditions containing lipids and detergents. The development of synchrotron-radiation CD (SRCD) instruments has greatly expanded the utility of this method by extending the spectra to the vacuum-ultraviolet region below 190 nm and producing information that is unobtainable by conventional CD instruments. Combining SRCD data with bioinformatics provides new insight into the conformational changes of proteins in a membrane environment.
Collapse
Affiliation(s)
- Koichi Matsuo
- Hiroshima Synchrotron Radiation Center, Hiroshima University, Higashi-Hiroshima, Japan
| | - Kunihiko Gekko
- Hiroshima Synchrotron Radiation Center, Hiroshima University, Higashi-Hiroshima, Japan.
| |
Collapse
|
33
|
Smolarczyk T, Roterman-Konieczna I, Stapor K. Protein Secondary Structure Prediction: A Review of Progress and Directions. Curr Bioinform 2020. [DOI: 10.2174/1574893614666191017104639] [Citation(s) in RCA: 28] [Impact Index Per Article: 7.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/22/2022]
Abstract
Background:
Over the last few decades, a search for the theory of protein folding has
grown into a full-fledged research field at the intersection of biology, chemistry and informatics.
Despite enormous effort, there are still open questions and challenges, like understanding the rules
by which amino acid sequence determines protein secondary structure.
Objective:
In this review, we depict the progress of the prediction methods over the years and
identify sources of improvement.
Methods:
The protein secondary structure prediction problem is described followed by the discussion
on theoretical limitations, description of the commonly used data sets, features and a review
of three generations of methods with the focus on the most recent advances. Additionally, methods
with available online servers are assessed on the independent data set.
Results:
The state-of-the-art methods are currently reaching almost 88% for 3-class prediction and
76.5% for an 8-class prediction.
Conclusion:
This review summarizes recent advances and outlines further research directions.
Collapse
Affiliation(s)
- Tomasz Smolarczyk
- Institute of Informatics, Silesian University of Technology, Gliwice, Poland
| | - Irena Roterman-Konieczna
- Department of Bioinformatics and Telemedicine, Jagiellonian University Medical College, Krakow, Poland
| | - Katarzyna Stapor
- Institute of Informatics, Silesian University of Technology, Gliwice, Poland
| |
Collapse
|
34
|
Sequence-Based Prediction of Fuzzy Protein Interactions. J Mol Biol 2020; 432:2289-2303. [DOI: 10.1016/j.jmb.2020.02.017] [Citation(s) in RCA: 64] [Impact Index Per Article: 16.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/06/2019] [Revised: 01/24/2020] [Accepted: 02/14/2020] [Indexed: 12/31/2022]
|
35
|
Long S, Tian P. Protein secondary structure prediction with context convolutional neural network. RSC Adv 2019; 9:38391-38396. [PMID: 35540205 PMCID: PMC9075825 DOI: 10.1039/c9ra05218f] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/09/2019] [Accepted: 11/18/2019] [Indexed: 11/21/2022] Open
Abstract
Protein secondary structure (SS) prediction is important for studying protein structure and function. Both traditional machine learning methods and deep learning neural networks have been utilized and great progress has been achieved in approaching the theoretical limit. Convolutional and recurrent neural networks are two major types of deep learning architectures with comparable prediction accuracy but different training procedures to achieve optimal performance. We are interested in seeking a novel architectural style with competitive performance and in understanding the performance of different architectures with similar training procedures. We constructed a context convolutional neural network (Contextnet) and compared its performance with popular models (e.g. convolutional neural network, recurrent neural network, conditional neural fields…) under similar training procedures on a Jpred dataset. The Contextnet was proven to be highly competitive. Additionally, we retrained the network with the Cullpdb dataset and compared with Jpred, ReportX, Spider3 server and MUFold-SS method, the Contextnet was found to be more Q3 accurate on a CASP13 dataset. Training procedures were found to have significant impact on the accuracy of the Contextnet. Protein secondary structure prediction using context convolutional neural network.![]()
Collapse
Affiliation(s)
| | - Pu Tian
- School of Life Science, School of Artificial Intelligence, Jilin University 2699 Qian-jin Street Changchun China 130012
| |
Collapse
|
36
|
Ziȩba K, Ślusarz M, Ślusarz R, Liwo A, Czaplewski C, Sieradzan AK. Extension of the UNRES Coarse-Grained Force Field to Membrane Proteins in the Lipid Bilayer. J Phys Chem B 2019; 123:7829-7839. [PMID: 31454484 DOI: 10.1021/acs.jpcb.9b06700] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/28/2022]
Abstract
The physics-based UNRES coarse-grained force field for the simulations of protein structure and dynamics has been extended to treat membrane proteins. The lipid bilayer has been modeled by introducing a continuous nonpolar phase with the water-interface region of appropriate thickness. The potentials for average electrostatic and correlation interactions of the peptide groups have been rescaled to account for the reduction of the dielectric permittivity compared to the water phase and new potentials for protein side-chain-side-chain interactions inside and across the lipid phase have been introduced. The model was implemented in the UNRES package for coarse-grained simulations of proteins, and the package with the new functionality was tested for total energy conservation and thermostat behavior in microcanonical and canonical molecular dynamics simulations runs, respectively. The method was validated by running unrestricted ab initio blind-prediction tests of 10 short α-helical membrane proteins, all runs started from the extended structures. The modified UNRES force field was able to predict correctly the overall folds of the membrane proteins studied.
Collapse
Affiliation(s)
- Karolina Ziȩba
- Faculty of Chemistry , University of Gdańsk , Wita Stwosza 63 , 80-308 Gdańsk , Poland
| | - Magdalena Ślusarz
- Faculty of Chemistry , University of Gdańsk , Wita Stwosza 63 , 80-308 Gdańsk , Poland
| | - Rafał Ślusarz
- Faculty of Chemistry , University of Gdańsk , Wita Stwosza 63 , 80-308 Gdańsk , Poland
| | - Adam Liwo
- Faculty of Chemistry , University of Gdańsk , Wita Stwosza 63 , 80-308 Gdańsk , Poland
| | - Cezary Czaplewski
- Faculty of Chemistry , University of Gdańsk , Wita Stwosza 63 , 80-308 Gdańsk , Poland
| | - Adam K Sieradzan
- Faculty of Chemistry , University of Gdańsk , Wita Stwosza 63 , 80-308 Gdańsk , Poland
| |
Collapse
|
37
|
A Bi-LSTM Based Ensemble Algorithm for Prediction of Protein Secondary Structure. APPLIED SCIENCES-BASEL 2019. [DOI: 10.3390/app9173538] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/17/2022]
Abstract
The prediction of protein secondary structure continues to be an active area of research in bioinformatics. In this paper, a Bi-LSTM based ensemble model is developed for the prediction of protein secondary structure. The ensemble model with dual loss function consists of five sub-models, which are finally joined by a Bi-LSTM layer. In contrast to existing ensemble methods, which generally train each sub-model and then join them as a whole, this ensemble model and sub-models can be trained simultaneously and the performance of each model can be observed and compared during the training process. Three independent test sets (e.g., data1199, 513 protein Cuff & Barton set (CB513) and 203 proteins from Critical Appraisals Skills Programme (CASP203)) are employed to test the method. On average, the ensemble model achieved 84.3% in Q 3 accuracy and 81.9% in segment overlap measure ( SOV ) score by using 10-fold cross validation. There is an improvement of up to 1% over some state-of-the-art prediction methods of protein secondary structure.
Collapse
|
38
|
Torrisi M, Kaleel M, Pollastri G. Deeper Profiles and Cascaded Recurrent and Convolutional Neural Networks for state-of-the-art Protein Secondary Structure Prediction. Sci Rep 2019; 9:12374. [PMID: 31451723 PMCID: PMC6710256 DOI: 10.1038/s41598-019-48786-x] [Citation(s) in RCA: 51] [Impact Index Per Article: 10.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/26/2019] [Accepted: 08/12/2019] [Indexed: 01/10/2023] Open
Abstract
Protein Secondary Structure prediction has been a central topic of research in Bioinformatics for decades. In spite of this, even the most sophisticated ab initio SS predictors are not able to reach the theoretical limit of three-state prediction accuracy (88–90%), while only a few predict more than the 3 traditional Helix, Strand and Coil classes. In this study we present tests on different models trained both on single sequence and evolutionary profile-based inputs and develop a new state-of-the-art system with Porter 5. Porter 5 is composed of ensembles of cascaded Bidirectional Recurrent Neural Networks and Convolutional Neural Networks, incorporates new input encoding techniques and is trained on a large set of protein structures. Porter 5 achieves 84% accuracy (81% SOV) when tested on 3 classes and 73% accuracy (70% SOV) on 8 classes on a large independent set. In our tests Porter 5 is 2% more accurate than its previous version and outperforms or matches the most recent predictors of secondary structure we tested. When Porter 5 is retrained on SCOPe based sets that eliminate homology between training/testing samples we obtain similar results. Porter is available as a web server and standalone program at http://distilldeep.ucd.ie/porter/ alongside all the datasets and alignments.
Collapse
Affiliation(s)
- Mirko Torrisi
- School of Computer Science, University College Dublin, Belfield, Dublin 4, Ireland
| | - Manaz Kaleel
- School of Computer Science, University College Dublin, Belfield, Dublin 4, Ireland
| | - Gianluca Pollastri
- School of Computer Science, University College Dublin, Belfield, Dublin 4, Ireland.
| |
Collapse
|
39
|
Ludwiczak J, Winski A, Szczepaniak K, Alva V, Dunin-Horkawicz S. DeepCoil—a fast and accurate prediction of coiled-coil domains in protein sequences. Bioinformatics 2019; 35:2790-2795. [DOI: 10.1093/bioinformatics/bty1062] [Citation(s) in RCA: 86] [Impact Index Per Article: 17.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/29/2018] [Revised: 12/19/2018] [Accepted: 12/27/2018] [Indexed: 01/30/2023] Open
Abstract
Abstract
Motivation
Coiled coils are protein structural domains that mediate a plethora of biological interactions, and thus their reliable annotation is crucial for studies of protein structure and function.
Results
Here, we report DeepCoil, a new neural network-based tool for the detection of coiled-coil domains in protein sequences. In our benchmarks, DeepCoil significantly outperformed current state-of-the-art tools, such as PCOILS and Marcoil, both in the prediction of canonical and non-canonical coiled coils. Furthermore, in a scan of the human genome with DeepCoil, we detected many coiled-coil domains that remained undetected by other methods. This higher sensitivity of DeepCoil should make it a method of choice for accurate genome-wide detection of coiled-coil domains.
Availability and implementation
DeepCoil is written in Python and utilizes the Keras machine learning library. A web server is freely available at https://toolkit.tuebingen.mpg.de/#/tools/deepcoil and a standalone version can be downloaded at https://github.com/labstructbioinf/DeepCoil.
Supplementary information
Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Jan Ludwiczak
- Laboratory of Structural Bioinformatics, Centre of New Technologies, University of Warsaw, Warsaw, Poland
- Laboratory of Bioinformatics, Nencki Institute of Experimental Biology, Warsaw, Poland
| | - Aleksander Winski
- Laboratory of Structural Bioinformatics, Centre of New Technologies, University of Warsaw, Warsaw, Poland
| | - Krzysztof Szczepaniak
- Laboratory of Structural Bioinformatics, Centre of New Technologies, University of Warsaw, Warsaw, Poland
| | - Vikram Alva
- Department of Protein Evolution, Max Planck Institute for Developmental Biology, Tübingen, Germany
| | - Stanislaw Dunin-Horkawicz
- Laboratory of Structural Bioinformatics, Centre of New Technologies, University of Warsaw, Warsaw, Poland
| |
Collapse
|
40
|
Hanson J, Paliwal K, Litfin T, Yang Y, Zhou Y. Improving prediction of protein secondary structure, backbone angles, solvent accessibility and contact numbers by using predicted contact maps and an ensemble of recurrent and residual convolutional neural networks. Bioinformatics 2018; 35:2403-2410. [DOI: 10.1093/bioinformatics/bty1006] [Citation(s) in RCA: 115] [Impact Index Per Article: 19.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/13/2018] [Revised: 11/02/2018] [Accepted: 12/06/2018] [Indexed: 11/14/2022] Open
Abstract
Abstract
Motivation
Sequence-based prediction of one dimensional structural properties of proteins has been a long-standing subproblem of protein structure prediction. Recently, prediction accuracy has been significantly improved due to the rapid expansion of protein sequence and structure libraries and advances in deep learning techniques, such as residual convolutional networks (ResNets) and Long-Short-Term Memory Cells in Bidirectional Recurrent Neural Networks (LSTM-BRNNs). Here we leverage an ensemble of LSTM-BRNN and ResNet models, together with predicted residue-residue contact maps, to continue the push towards the attainable limit of prediction for 3- and 8-state secondary structure, backbone angles (θ, τ, ϕ and ψ), half-sphere exposure, contact numbers and solvent accessible surface area (ASA).
Results
The new method, named SPOT-1D, achieves similar, high performance on a large validation set and test set (≈1000 proteins in each set), suggesting robust performance for unseen data. For the large test set, it achieves 87% and 77% in 3- and 8-state secondary structure prediction and 0.82 and 0.86 in correlation coefficients between predicted and measured ASA and contact numbers, respectively. Comparison to current state-of-the-art techniques reveals substantial improvement in secondary structure and backbone angle prediction. In particular, 44% of 40-residue fragment structures constructed from predicted backbone Cα-based θ and τ angles are less than 6 Å root-mean-squared-distance from their native conformations, nearly 20% better than the next best. The method is expected to be useful for advancing protein structure and function prediction.
Availability and implementation
SPOT-1D and its data is available at: http://sparks-lab.org/.
Supplementary information
Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Jack Hanson
- Signal Processing Laboratory, Griffith University, Brisbane, QLD, Australia
| | - Kuldip Paliwal
- Signal Processing Laboratory, Griffith University, Brisbane, QLD, Australia
| | - Thomas Litfin
- School of Information and Communication Technology, Griffith University, Gold Coast, QLD, Australia
| | - Yuedong Yang
- School of Data and Computer Science, Sun-Yat Sen University, Guangzhou, Guangdong, China
| | - Yaoqi Zhou
- School of Information and Communication Technology, Griffith University, Gold Coast, QLD, Australia
- Institute for Glycomics, Griffith University, Gold Coast, QLD, Australia
| |
Collapse
|
41
|
Tamposis IA, Theodoropoulou MC, Tsirigos KD, Bagos PG. Extending hidden Markov models to allow conditioning on previous observations. J Bioinform Comput Biol 2018; 16:1850019. [DOI: 10.1142/s0219720018500191] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022]
Abstract
Hidden Markov Models (HMMs) are probabilistic models widely used in computational molecular biology. However, the Markovian assumption regarding transition probabilities which dictates that the observed symbol depends only on the current state may not be sufficient for some biological problems. In order to overcome the limitations of the first order HMM, a number of extensions have been proposed in the literature to incorporate past information in HMMs conditioning either on the hidden states, or on the observations, or both. Here, we implement a simple extension of the standard HMM in which the current observed symbol (amino acid residue) depends both on the current state and on a series of observed previous symbols. The major advantage of the method is the simplicity in the implementation, which is achieved by properly transforming the observation sequence, using an extended alphabet. Thus, it can utilize all the available algorithms for the training and decoding of HMMs. We investigated the use of several encoding schemes and performed tests in a number of important biological problems previously studied by our team (prediction of transmembrane proteins and prediction of signal peptides). The evaluation shows that, when enough data are available, the performance increased by 1.8%–8.2% and the existing prediction methods may improve using this approach. The methods, for which the improvement was significant (PRED-TMBB2, PRED-TAT and HMM-TM), are available as web-servers freely accessible to academic users at www.compgen.org/tools/ .
Collapse
Affiliation(s)
- Ioannis A. Tamposis
- Department of Computer Science and Biomedical Informatics, University of Thessaly, Papasiopoulou 2-4, 35100 Lamia, Greece
| | - Margarita C. Theodoropoulou
- Department of Computer Science and Biomedical Informatics, University of Thessaly, Papasiopoulou 2-4, 35100 Lamia, Greece
| | - Konstantinos D. Tsirigos
- Department of Computer Science and Biomedical Informatics, University of Thessaly, Papasiopoulou 2-4, 35100 Lamia, Greece
| | - Pantelis G. Bagos
- Department of Computer Science and Biomedical Informatics, University of Thessaly, Papasiopoulou 2-4, 35100 Lamia, Greece
| |
Collapse
|
42
|
Aydin Z, Kaynar O, Görmez Y. Dimensionality reduction for protein secondary structure and solvent accesibility prediction. J Bioinform Comput Biol 2018; 16:1850020. [PMID: 30353781 DOI: 10.1142/s0219720018500208] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022]
Abstract
Secondary structure and solvent accessibility prediction provide valuable information for estimating the three dimensional structure of a protein. As new feature extraction methods are developed the dimensionality of the input feature space increases steadily. Reducing the number of dimensions provides several advantages such as faster model training, faster prediction and noise elimination. In this work, several dimensionality reduction techniques have been employed including various feature selection methods, autoencoders and PCA for protein secondary structure and solvent accessibility prediction. The reduced feature set is used to train a support vector machine at the second stage of a hybrid classifier. Cross-validation experiments on two difficult benchmarks demonstrate that the dimension of the input space can be reduced substantially while maintaining the prediction accuracy. This will enable the incorporation of additional informative features derived for predicting the structural properties of proteins without reducing the accuracy due to overfitting.
Collapse
Affiliation(s)
- Zafer Aydin
- * Department of Computer Engineering, Abdullah Gul University, Kayseri 38080, Turkey
| | - Oğuz Kaynar
- † Department of Management Information Systems, Cumhuriyet University, Sivas 58000, Turkey
| | - Yasin Görmez
- † Department of Management Information Systems, Cumhuriyet University, Sivas 58000, Turkey
| |
Collapse
|
43
|
Deléage G. ALIGNSEC: viewing protein secondary structure predictions within large multiple sequence alignments. Bioinformatics 2018; 33:3991-3992. [PMID: 28961944 DOI: 10.1093/bioinformatics/btx521] [Citation(s) in RCA: 42] [Impact Index Per Article: 7.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/12/2017] [Accepted: 08/14/2017] [Indexed: 11/12/2022] Open
Abstract
Motivation ALIGNSEC is a module within ANTHEPROT designed for the interactive display, edition and printing of large-scale multiple alignments integrating secondary structure predictions. Availability and implementation The ALIGNSEC module is part of the ANTHEPROT package (http://antheprot-pbil.ibcp.fr) which can be used freely for academic users. It is running on Windows Operating systems. For commercial use, please contact the author. Contact gilbert.deleage@ibcp.fr.
Collapse
Affiliation(s)
- Gilbert Deléage
- Lyon University-CNRS, LBTI-UMR5305, Institute of Biology and Chemistry of Proteins, 7 passage du Vercors, 69367 Lyon cedex 07, France
| |
Collapse
|
44
|
Zhang B, Li J, Lü Q. Prediction of 8-state protein secondary structures by a novel deep learning architecture. BMC Bioinformatics 2018; 19:293. [PMID: 30075707 PMCID: PMC6090794 DOI: 10.1186/s12859-018-2280-5] [Citation(s) in RCA: 60] [Impact Index Per Article: 10.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/19/2018] [Accepted: 07/09/2018] [Indexed: 11/16/2022] Open
Abstract
Background Protein secondary structure can be regarded as an information bridge that links the primary sequence and tertiary structure. Accurate 8-state secondary structure prediction can significantly give more precise and high resolution on structure-based properties analysis. Results We present a novel deep learning architecture which exploits an integrative synergy of prediction by a convolutional neural network, residual network, and bidirectional recurrent neural network to improve the performance of protein secondary structure prediction. A local block comprised of convolutional filters and original input is designed for capturing local sequence features. The subsequent bidirectional recurrent neural network consisting of gated recurrent units can capture global context features. Furthermore, the residual network can improve the information flow between the hidden layers and the cascaded recurrent neural network. Our proposed deep network achieved 71.4% accuracy on the benchmark CB513 dataset for the 8-state prediction; and the ensemble learning by our model achieved 74% accuracy. Our model generalization capability is also evaluated on other three independent datasets CASP10, CASP11 and CASP12 for both 8- and 3-state prediction. These prediction performances are superior to the state-of-the-art methods. Conclusion Our experiment demonstrates that it is a valuable method for predicting protein secondary structure, and capturing local and global features concurrently is very useful in deep learning. Electronic supplementary material The online version of this article (10.1186/s12859-018-2280-5) contains supplementary material, which is available to authorized users.
Collapse
Affiliation(s)
- Buzhong Zhang
- School of Computer Science and Technology, Soochow University, Suzhou, China.,School of Computer and Information, and the University Key Laboratory of Intelligent Perception and Computing of Anhui Province, Anqing Normal University, Anqing, 246011, China
| | - Jinyan Li
- Advanced Analytics Institute, Faculty of Engineering and IT, University of Technology Sydney, Broadway, NSW 2007, Sydney, PO Box 123, Australia
| | - Qiang Lü
- School of Computer Science and Technology, Soochow University, Suzhou, China.
| |
Collapse
|
45
|
Liu T, Wang Z. SOV_refine: A further refined definition of segment overlap score and its significance for protein structure similarity. SOURCE CODE FOR BIOLOGY AND MEDICINE 2018; 13:1. [PMID: 29713370 PMCID: PMC5909207 DOI: 10.1186/s13029-018-0068-7] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 08/30/2016] [Accepted: 04/02/2018] [Indexed: 11/22/2022]
Abstract
Background The segment overlap score (SOV) has been used to evaluate the predicted protein secondary structures, a sequence composed of helix (H), strand (E), and coil (C), by comparing it with the native or reference secondary structures, another sequence of H, E, and C. SOV’s advantage is that it can consider the size of continuous overlapping segments and assign extra allowance to longer continuous overlapping segments instead of only judging from the percentage of overlapping individual positions as Q3 score does. However, we have found a drawback from its previous definition, that is, it cannot ensure increasing allowance assignment when more residues in a segment are further predicted accurately. Results A new way of assigning allowance has been designed, which keeps all the advantages of the previous SOV score definitions and ensures that the amount of allowance assigned is incremental when more elements in a segment are predicted accurately. Furthermore, our improved SOV has achieved a higher correlation with the quality of protein models measured by GDT-TS score and TM-score, indicating its better abilities to evaluate tertiary structure quality at the secondary structure level. We analyzed the statistical significance of SOV scores and found the threshold values for distinguishing two protein structures (SOV_refine > 0.19) and indicating whether two proteins are under the same CATH fold (SOV_refine > 0.94 and > 0.90 for three- and eight-state secondary structures respectively). We provided another two example applications, which are when used as a machine learning feature for protein model quality assessment and comparing different definitions of topologically associating domains. We proved that our newly defined SOV score resulted in better performance. Conclusions The SOV score can be widely used in bioinformatics research and other fields that need to compare two sequences of letters in which continuous segments have important meanings. We also generalized the previous SOV definitions so that it can work for sequences composed of more than three states (e.g., it can work for the eight-state definition of protein secondary structures). A standalone software package has been implemented in Perl with source code released. The software can be downloaded from http://dna.cs.miami.edu/SOV/.
Collapse
Affiliation(s)
- Tong Liu
- Department of Computer Science, University of Miami, 1365 Memorial Drive, Coral Gables, FL 33124 USA
| | - Zheng Wang
- Department of Computer Science, University of Miami, 1365 Memorial Drive, Coral Gables, FL 33124 USA
| |
Collapse
|
46
|
Predicting Alpha Helical Transmembrane Proteins Using HMMs. Methods Mol Biol 2018; 1552:63-82. [PMID: 28224491 DOI: 10.1007/978-1-4939-6753-7_5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/12/2023]
Abstract
Alpha helical transmembrane (TM) proteins constitute an important structural class of membrane proteins involved in a wide variety of cellular functions. The prediction of their transmembrane topology, as well as their discrimination in newly sequenced genomes, is of great importance for the elucidation of their structure and function. Several methods have been applied for the prediction of the transmembrane segments and the topology of alpha helical transmembrane proteins utilizing different algorithmic techniques. Hidden Markov Models (HMMs) have been efficiently used in the development of several computational methods used for this task. In this chapter we give a brief review of different available prediction methods for alpha helical transmembrane proteins pointing out sequence and structural features that should be incorporated in a prediction method. We then describe the procedure of the design and development of a Hidden Markov Model capable of predicting the transmembrane alpha helices in proteins and discriminating them from globular proteins.
Collapse
|
47
|
Xie S, Li Z, Hu H. Protein secondary structure prediction based on the fuzzy support vector machine with the hyperplane optimization. Gene 2017; 642:74-83. [PMID: 29104167 DOI: 10.1016/j.gene.2017.11.005] [Citation(s) in RCA: 22] [Impact Index Per Article: 3.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/08/2017] [Revised: 10/29/2017] [Accepted: 11/02/2017] [Indexed: 11/30/2022]
Abstract
The prediction of the protein secondary structure is a crucial point in bioinformatics and related fields. In the last years, machine learning methods have become a valuable tool, achieving satisfactory results. However, the prediction accuracy needs to be further ameliorated. This paper proposes a new method based on an improved fuzzy support vector machine (FSVM) for the prediction of the secondary structure of proteins. Unlike traditional methods to set the membership function, it firstly constructs an approximate optimal separating hyperplane by iterating the class centers in the feature space. Then sample points close to this hyperplane are assigned with large membership values, while outliers with small membership values according to the K-nearest neighbor. And some sample points with low membership values are removed, reducing the training time and improving the prediction accuracy. To optimize the prediction results, our method also exploits information on sequence-based structural similarity. We used three databases (e.g. RS126, CB513 and data1199) to test this method, showing the achievement of 94.2%, 93.1%, 96.7% Q3 accuracy and 91.7%, 89.7%, 94.1% SOV values for the three datasets, respectively. Overall, our method results are comparable to or often better than commonly used methods (Magnan & Baldi, 2014; Sheng et al., 2016) for secondary structure prediction.
Collapse
Affiliation(s)
- Shangxin Xie
- School of Science, Zhejiang Sci-Tech University, Hangzhou, Zhejiang, 310018, China
| | - Zhong Li
- School of Science, Zhejiang Sci-Tech University, Hangzhou, Zhejiang, 310018, China.
| | - Hailong Hu
- School of Science, Zhejiang Sci-Tech University, Hangzhou, Zhejiang, 310018, China; School of Science, Zhejiang A&F University, Lin'an, Zhejiang 311300, China
| |
Collapse
|
48
|
Sieradzan AK, Lipska AG, Lubecka EA. Shielding effect in protein folding. J Mol Graph Model 2017; 79:118-132. [PMID: 29161634 DOI: 10.1016/j.jmgm.2017.10.018] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/30/2017] [Revised: 10/19/2017] [Accepted: 10/20/2017] [Indexed: 01/01/2023]
Abstract
One of the most important interactions responsible for protein folding and stability are hydrogen bonds between peptide groups. There is a constant competition between the water molecules and peptide groups in a hydrogen bond formation. Also side-chains take part in this process by reducing hydration of peptide group (shielding effect) that promotes the protein folding. In this paper, a new approach to take into account a shielding effect is presented. A modification of the energy function is derived and incorporated into the UNited RESidue (UNRES) force field. Canonical Molecular Dynamics and Replica Exchange Molecular Dynamics with UNRES force field is applied to study the influence of this effect on protein structure, folding kinetics and free energy landscapes. The results of test calculations suggest that even small contribution of this effect into energy function changes force field behavior as well as speeds up the folding process significantly.
Collapse
Affiliation(s)
- Adam K Sieradzan
- Faculty of Chemistry, University of Gdańsk, Wita Stwosza 63, 80-308 Gdańsk, Poland.
| | - Agnieszka G Lipska
- Faculty of Chemistry, University of Gdańsk, Wita Stwosza 63, 80-308 Gdańsk, Poland
| | - Emilia A Lubecka
- Faculty of Chemistry, University of Gdańsk, Wita Stwosza 63, 80-308 Gdańsk, Poland; Institute of Informatics, University of Gdańsk, Wita Stwosza 57, 80-308 Gdańsk, Poland
| |
Collapse
|
49
|
Li Z, Wang J, Zhang S, Zhang Q, Wu W. A new hybrid coding for protein secondary structure prediction based on primary structure similarity. Gene 2017; 618:8-13. [DOI: 10.1016/j.gene.2017.03.011] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/09/2016] [Revised: 02/02/2017] [Accepted: 03/15/2017] [Indexed: 11/16/2022]
|
50
|
Venko K, Roy Choudhury A, Novič M. Computational Approaches for Revealing the Structure of Membrane Transporters: Case Study on Bilitranslocase. Comput Struct Biotechnol J 2017; 15:232-242. [PMID: 28228927 PMCID: PMC5312651 DOI: 10.1016/j.csbj.2017.01.008] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/02/2016] [Revised: 01/19/2017] [Accepted: 01/20/2017] [Indexed: 11/23/2022] Open
Abstract
The structural and functional details of transmembrane proteins are vastly underexplored, mostly due to experimental difficulties regarding their solubility and stability. Currently, the majority of transmembrane protein structures are still unknown and this present a huge experimental and computational challenge. Nowadays, thanks to X-ray crystallography or NMR spectroscopy over 3000 structures of membrane proteins have been solved, among them only a few hundred unique ones. Due to the vast biological and pharmaceutical interest in the elucidation of the structure and the functional mechanisms of transmembrane proteins, several computational methods have been developed to overcome the experimental gap. If combined with experimental data the computational information enables rapid, low cost and successful predictions of the molecular structure of unsolved proteins. The reliability of the predictions depends on the availability and accuracy of experimental data associated with structural information. In this review, the following methods are proposed for in silico structure elucidation: sequence-dependent predictions of transmembrane regions, predictions of transmembrane helix–helix interactions, helix arrangements in membrane models, and testing their stability with molecular dynamics simulations. We also demonstrate the usage of the computational methods listed above by proposing a model for the molecular structure of the transmembrane protein bilitranslocase. Bilitranslocase is bilirubin membrane transporter, which shares similar tissue distribution and functional properties with some of the members of the Organic Anion Transporter family and is the only member classified in the Bilirubin Transporter Family. Regarding its unique properties, bilitranslocase is a potentially interesting drug target.
Collapse
Affiliation(s)
- Katja Venko
- Department of Cheminformatics, National Institute of Chemistry, Ljubljana, Slovenia
| | - A Roy Choudhury
- Department of Cheminformatics, National Institute of Chemistry, Ljubljana, Slovenia
| | - Marjana Novič
- Department of Cheminformatics, National Institute of Chemistry, Ljubljana, Slovenia
| |
Collapse
|