51
|
Enhancing protein backbone angle prediction by using simpler models of deep neural networks. Sci Rep 2020; 10:19430. [PMID: 33173130 PMCID: PMC7655839 DOI: 10.1038/s41598-020-76317-6] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/06/2020] [Accepted: 10/23/2020] [Indexed: 11/09/2022] Open
Abstract
Protein structure prediction is a grand challenge. Prediction of protein structures via the representations using backbone dihedral angles has recently achieved significant progress along with the on-going surge of deep neural network (DNN) research in general. However, we observe that in the protein backbone angle prediction research, there is an overall trend to employ more and more complex neural networks and then to throw more and more features to the neural networks. While more features might add more predictive power to the neural network, we argue that redundant features could rather clutter the scenario and more complex neural networks then just could counterbalance the noise. From artificial intelligence and machine learning perspectives, problem representations and solution approaches do mutually interact and thus affect performance. We also argue that comparatively simpler predictors can more easily be reconstructed than the more complex ones. With these arguments in mind, we present a deep learning method named Simpler Angle Predictor (SAP) to train simpler DNN models that enhance protein backbone angle prediction. We then empirically show that SAP significantly outperforms existing state-of-the-art methods on well-known benchmark datasets: for some types of angles, the differences are above 3 in mean absolute error (MAE). The SAP program along with its data is available from the website https://gitlab.com/mahnewton/sap.
Collapse
|
52
|
Kotowski K, Smolarczyk T, Roterman-Konieczna I, Stapor K. ProteinUnet-An efficient alternative to SPIDER3-single for sequence-based prediction of protein secondary structures. J Comput Chem 2020; 42:50-59. [PMID: 33058261 PMCID: PMC7756333 DOI: 10.1002/jcc.26432] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/20/2020] [Revised: 09/21/2020] [Accepted: 09/23/2020] [Indexed: 12/16/2022]
Abstract
Predicting protein function and structure from sequence remains an unsolved problem in bioinformatics. The best performing methods rely heavily on evolutionary information from multiple sequence alignments, which means their accuracy deteriorates for sequences with a few homologs, and given the increasing sequence database sizes requires long computation times. Here, a single‐sequence‐based prediction method is presented, called ProteinUnet, leveraging an U‐Net convolutional network architecture. It is compared to SPIDER3‐Single model, based on long short‐term memory‐bidirectional recurrent neural networks architecture. Both methods achieve similar results for prediction of secondary structures (both three‐ and eight‐state), half‐sphere exposure, and contact number, but ProteinUnet has two times fewer parameters, 17 times shorter inference time, and can be trained 11 times faster. Moreover, ProteinUnet tends to be better for short sequences and residues with a low number of local contacts. Additionally, the method of loss weighting is presented as an effective way of increasing accuracy for rare secondary structures.
Collapse
Affiliation(s)
- Krzysztof Kotowski
- Department of Applied Informatics, Silesian University of Technology, Gliwice, Poland
| | - Tomasz Smolarczyk
- Department of Applied Informatics, Silesian University of Technology, Gliwice, Poland
| | - Irena Roterman-Konieczna
- Department of Bioinformatics and Telemedicine, Jagiellonian University Medical College, Kraków, Poland
| | - Katarzyna Stapor
- Department of Applied Informatics, Silesian University of Technology, Gliwice, Poland
| |
Collapse
|
53
|
Urban G, Torrisi M, Magnan CN, Pollastri G, Baldi P. Protein profiles: Biases and protocols. Comput Struct Biotechnol J 2020; 18:2281-2289. [PMID: 32994887 PMCID: PMC7486441 DOI: 10.1016/j.csbj.2020.08.015] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/14/2020] [Revised: 08/14/2020] [Accepted: 08/15/2020] [Indexed: 11/13/2022] Open
Abstract
The use of evolutionary profiles to predict protein secondary structure, as well as other protein structural features, has been standard practice since the 1990s. Using profiles in the input of such predictors, in place or in addition to the sequence itself, leads to significantly more accurate predictions. While profiles can enhance structural signals, their role remains somewhat surprising as proteins do not use profiles when folding in vivo. Furthermore, the same sequence-based redundancy reduction protocols initially derived to train and evaluate sequence-based predictors, have been applied to train and evaluate profile-based predictors. This can lead to unfair comparisons since profiles may facilitate the bleeding of information between training and test sets. Here we use the extensively studied problem of secondary structure prediction to better evaluate the role of profiles and show that: (1) high levels of profile similarity between training and test proteins are observed when using standard sequence-based redundancy protocols; (2) the gain in accuracy for profile-based predictors, over sequence-based predictors, strongly relies on these high levels of profile similarity between training and test proteins; and (3) the overall accuracy of a profile-based predictor on a given protein dataset provides a biased measure when trying to estimate the actual accuracy of the predictor, or when comparing it to other predictors. We show, however, that this bias can be mitigated by implementing a new protocol (EVALpro) which evaluates the accuracy of profile-based predictors as a function of the profile similarity between training and test proteins. Such a protocol not only allows for a fair comparison of the predictors on equally hard or easy examples, but also reduces the impact of choosing a given similarity cutoff when selecting test proteins. The EVALpro program is available in the SCRATCH suite ( www.scratch.proteomics.ics.uci.edu) and can be downloaded at: www.download.igb.uci.edu/#evalpro.
Collapse
Affiliation(s)
- Gregor Urban
- Department of Computer Science & Institute for Genomics and Bioinformatics, University of California, Irvine, CA 92697, USA
| | - Mirko Torrisi
- UCD Institute for Discovery, University College Dublin, Dublin, 4, Ireland
| | - Christophe N Magnan
- Department of Computer Science & Institute for Genomics and Bioinformatics, University of California, Irvine, CA 92697, USA
| | - Gianluca Pollastri
- UCD Institute for Discovery, University College Dublin, Dublin, 4, Ireland
| | - Pierre Baldi
- Department of Computer Science & Institute for Genomics and Bioinformatics, University of California, Irvine, CA 92697, USA
| |
Collapse
|
54
|
Jakob T, Rauber MM, Perez-Riverol A, Spillner E, Blank S. The Honeybee Venom Major Allergen Api m 10 (Icarapin) and Its Role in Diagnostics and Treatment of Hymenoptera Venom Allergy. Curr Allergy Asthma Rep 2020; 20:48. [PMID: 32548726 PMCID: PMC7297703 DOI: 10.1007/s11882-020-00943-3] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/12/2022]
Abstract
PURPOSE OF REVIEW In Hymenoptera venom allergy, the research focus has moved from whole venoms to individual allergenic molecules. Api m 10 (icarapin) has been described as a major allergen of honeybee venom (HBV) with potentially high relevance for diagnostics and therapy of venom allergy. Here, we review recent studies on Api m 10 characteristics as well as its role in component-resolved diagnostics and potential implications for venom-specific immunotherapy (VIT). RECENT FINDINGS Api m 10 is a major allergen of low abundance in HBV. It is an obviously unstable protein of unknown function that exhibits homologs in other insect species. Despite its low abundance in HBV, 35 to 72% of HBV-allergic patients show relevant sensitization to this allergen. Api m 10 is a marker allergen for HBV sensitization, which in many cases can help to identify primary sensitization to HBV and, hence, to discriminate between genuine sensitization and cross-reactivity. Moreover, Api m 10 might support personalized risk stratification in VIT, as dominant sensitization to Api m 10 has been identified as risk factor for treatment failure. This might be of particular importance since Api m 10 is strongly underrepresented in some therapeutic preparations commonly used for VIT. Although the role of Api m 10 in HBV allergy and tolerance induction during VIT is not fully understood, it certainly is a useful tool to unravel primary sensitization and individual sensitization profiles in component-resolved diagnostics (CRD). Moreover, a potential of Api m 10 to contribute to personalized treatment strategies in HBV allergy is emerging.
Collapse
Affiliation(s)
- Thilo Jakob
- Department of Dermatology and Allergology, Experimental Dermatology and Allergy Research Group, Justus-Liebig-University Gießen, Gießen, Germany.
| | - Michèle Myriam Rauber
- Department of Dermatology and Allergology, Experimental Dermatology and Allergy Research Group, Justus-Liebig-University Gießen, Gießen, Germany
| | - Amilcar Perez-Riverol
- Department of Dermatology and Allergology, Experimental Dermatology and Allergy Research Group, Justus-Liebig-University Gießen, Gießen, Germany
| | - Edzard Spillner
- Department of Engineering Immunological Biotechnology, Aarhus University, Aarhus, Denmark
| | - Simon Blank
- Center of Allergy and Environment (ZAUM), Technical University of Munich, School of Medicine and Helmholtz Center Munich, German Research Center for Environmental Health, Member of the German Center of Lung Research (DZL), Munich, Germany
| |
Collapse
|
55
|
Getting to Know Your Neighbor: Protein Structure Prediction Comes of Age with Contextual Machine Learning. J Comput Biol 2020; 27:796-814. [DOI: 10.1089/cmb.2019.0193] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/12/2022] Open
|
56
|
Chen MC, Li Y, Zhu YH, Ge F, Yu DJ. SSCpred: Single-Sequence-Based Protein Contact Prediction Using Deep Fully Convolutional Network. J Chem Inf Model 2020; 60:3295-3303. [DOI: 10.1021/acs.jcim.9b01207] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/29/2022]
Affiliation(s)
- Ming-Cai Chen
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Xiaolingwei 200, Nanjing 210094, P. R. China
| | - Yang Li
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Xiaolingwei 200, Nanjing 210094, P. R. China
- Department of Computational Medicine and Bioinformatics, University of Michigan, Washtenaw 100, Ann Arbor, Michigan 48109-2218, United States
| | - Yi-Heng Zhu
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Xiaolingwei 200, Nanjing 210094, P. R. China
| | - Fang Ge
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Xiaolingwei 200, Nanjing 210094, P. R. China
| | - Dong-Jun Yu
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Xiaolingwei 200, Nanjing 210094, P. R. China
| |
Collapse
|
57
|
Lv X, Chen J, Lu Y, Chen Z, Xiao N, Yang Y. Accurately Predicting Mutation-Caused Stability Changes from Protein Sequences Using Extreme Gradient Boosting. J Chem Inf Model 2020; 60:2388-2395. [PMID: 32203653 DOI: 10.1021/acs.jcim.0c00064] [Citation(s) in RCA: 14] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/14/2022]
Abstract
Accurately predicting the impact of point mutation on protein stability has crucial roles in protein design and engineering. In this study, we proposed a novel method (BoostDDG) to predict stability changes upon point mutations from protein sequences based on the extreme gradient boosting. We extracted features comprehensively from evolutional information and predicted structures and performed feature selection by a strategy of sequential forward selection. The features and parameters were optimized by homologue-based cross-validation to avoid overfitting. Finally, we found that 14 features from six groups led to the highest Pearson correlation coefficient (PCC) of 0.535, which is consistent with the 0.540 on an independent test. Our method was indicated to consistently outperform other sequence-based methods on three precompiled test sets, and 7363 variants on two proteins (PTEN and TPMT). These results highlighted that BoostDDG is a powerful tool for predicting stability changes upon point mutations from protein sequences.
Collapse
Affiliation(s)
- Xuan Lv
- State Key Laboratory of High-Performance Computing, School of Computer Science, National University of Defense Technology, Changsha, Hunan 410073, China
| | - Jianwen Chen
- School of Data and Computer Science, Sun Yat-sen University, Guangzhou, Guangdong 510275, China
| | - Yutong Lu
- School of Data and Computer Science, Sun Yat-sen University, Guangzhou, Guangdong 510275, China
| | - Zhiguang Chen
- School of Data and Computer Science, Sun Yat-sen University, Guangzhou, Guangdong 510275, China
| | - Nong Xiao
- State Key Laboratory of High-Performance Computing, School of Computer Science, National University of Defense Technology, Changsha, Hunan 410073, China.,School of Data and Computer Science, Sun Yat-sen University, Guangzhou, Guangdong 510275, China
| | - Yuedong Yang
- School of Data and Computer Science, Sun Yat-sen University, Guangzhou, Guangdong 510275, China.,Key Laboratory of Machine Intelligence and Advanced Computing, Sun Yat-sen University, Ministry of Education, Guangzhou, Guangdong 510275, China
| |
Collapse
|
58
|
Shi Q, Chen W, Huang S, Wang Y, Xue Z. Deep learning for mining protein data. Brief Bioinform 2019; 22:194-218. [PMID: 31867611 DOI: 10.1093/bib/bbz156] [Citation(s) in RCA: 31] [Impact Index Per Article: 6.2] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/16/2019] [Revised: 10/21/2019] [Accepted: 11/07/2019] [Indexed: 01/16/2023] Open
Abstract
The recent emergence of deep learning to characterize complex patterns of protein big data reveals its potential to address the classic challenges in the field of protein data mining. Much research has revealed the promise of deep learning as a powerful tool to transform protein big data into valuable knowledge, leading to scientific discoveries and practical solutions. In this review, we summarize recent publications on deep learning predictive approaches in the field of mining protein data. The application architectures of these methods include multilayer perceptrons, stacked autoencoders, deep belief networks, two- or three-dimensional convolutional neural networks, recurrent neural networks, graph neural networks, and complex neural networks and are described from five perspectives: residue-level prediction, sequence-level prediction, three-dimensional structural analysis, interaction prediction, and mass spectrometry data mining. The advantages and deficiencies of these architectures are presented in relation to various tasks in protein data mining. Additionally, some practical issues and their future directions are discussed, such as robust deep learning for protein noisy data, architecture optimization for specific tasks, efficient deep learning for limited protein data, multimodal deep learning for heterogeneous protein data, and interpretable deep learning for protein understanding. This review provides comprehensive perspectives on general deep learning techniques for protein data analysis.
Collapse
Affiliation(s)
- Qiang Shi
- School of Software Engineering, Huazhong University of Science and Technology. His main interests cover machine learning especially deep learning, protein data analysis, and big data mining
| | - Weiya Chen
- School of Software Engineering, Huazhong University of Science & Technology, Wuhan, China. His research interests cover bioinformatics, virtual reality, and data visualization
| | - Siqi Huang
- Software Engineering at Huazhong University of science and technology, focusing on Machine learning and data mining
| | - Yan Wang
- School of life, University of Science & Technology; her main interests cover protein structure and function prediction and big data mining
| | - Zhidong Xue
- School of Software Engineering, Huazhong University of Science & Technology, Wuhan, China. His research interests cover bioinformatics, machine learning, and image processing
| |
Collapse
|
59
|
Gattani S, Mishra A, Hoque MT. StackCBPred: A stacking based prediction of protein-carbohydrate binding sites from sequence. Carbohydr Res 2019; 486:107857. [DOI: 10.1016/j.carres.2019.107857] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/25/2019] [Revised: 10/05/2019] [Accepted: 10/23/2019] [Indexed: 11/26/2022]
|
60
|
Singh J, Hanson J, Paliwal K, Zhou Y. RNA secondary structure prediction using an ensemble of two-dimensional deep neural networks and transfer learning. Nat Commun 2019; 10:5407. [PMID: 31776342 PMCID: PMC6881452 DOI: 10.1038/s41467-019-13395-9] [Citation(s) in RCA: 152] [Impact Index Per Article: 30.4] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/12/2019] [Accepted: 11/01/2019] [Indexed: 01/03/2023] Open
Abstract
The majority of our human genome transcribes into noncoding RNAs with unknown structures and functions. Obtaining functional clues for noncoding RNAs requires accurate base-pairing or secondary-structure prediction. However, the performance of such predictions by current folding-based algorithms has been stagnated for more than a decade. Here, we propose the use of deep contextual learning for base-pair prediction including those noncanonical and non-nested (pseudoknot) base pairs stabilized by tertiary interactions. Since only [Formula: see text]250 nonredundant, high-resolution RNA structures are available for model training, we utilize transfer learning from a model initially trained with a recent high-quality bpRNA dataset of [Formula: see text]10,000 nonredundant RNAs made available through comparative analysis. The resulting method achieves large, statistically significant improvement in predicting all base pairs, noncanonical and non-nested base pairs in particular. The proposed method (SPOT-RNA), with a freely available server and standalone software, should be useful for improving RNA structure modeling, sequence alignment, and functional annotations.
Collapse
Affiliation(s)
- Jaswinder Singh
- Signal Processing Laboratory, School of Engineering and Built Environment, Griffith University, Brisbane, QLD, 4111, Australia
| | - Jack Hanson
- Signal Processing Laboratory, School of Engineering and Built Environment, Griffith University, Brisbane, QLD, 4111, Australia
| | - Kuldip Paliwal
- Signal Processing Laboratory, School of Engineering and Built Environment, Griffith University, Brisbane, QLD, 4111, Australia.
| | - Yaoqi Zhou
- Institute for Glycomics and School of Information and Communication Technology, Griffith University, Parklands Dr., Southport, QLD, 4222, Australia.
| |
Collapse
|
61
|
Hanson J, Paliwal K, Litfin T, Yang Y, Zhou Y. Improving prediction of protein secondary structure, backbone angles, solvent accessibility and contact numbers by using predicted contact maps and an ensemble of recurrent and residual convolutional neural networks. Bioinformatics 2018; 35:2403-2410. [DOI: 10.1093/bioinformatics/bty1006] [Citation(s) in RCA: 115] [Impact Index Per Article: 19.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/13/2018] [Revised: 11/02/2018] [Accepted: 12/06/2018] [Indexed: 11/14/2022] Open
Abstract
Abstract
Motivation
Sequence-based prediction of one dimensional structural properties of proteins has been a long-standing subproblem of protein structure prediction. Recently, prediction accuracy has been significantly improved due to the rapid expansion of protein sequence and structure libraries and advances in deep learning techniques, such as residual convolutional networks (ResNets) and Long-Short-Term Memory Cells in Bidirectional Recurrent Neural Networks (LSTM-BRNNs). Here we leverage an ensemble of LSTM-BRNN and ResNet models, together with predicted residue-residue contact maps, to continue the push towards the attainable limit of prediction for 3- and 8-state secondary structure, backbone angles (θ, τ, ϕ and ψ), half-sphere exposure, contact numbers and solvent accessible surface area (ASA).
Results
The new method, named SPOT-1D, achieves similar, high performance on a large validation set and test set (≈1000 proteins in each set), suggesting robust performance for unseen data. For the large test set, it achieves 87% and 77% in 3- and 8-state secondary structure prediction and 0.82 and 0.86 in correlation coefficients between predicted and measured ASA and contact numbers, respectively. Comparison to current state-of-the-art techniques reveals substantial improvement in secondary structure and backbone angle prediction. In particular, 44% of 40-residue fragment structures constructed from predicted backbone Cα-based θ and τ angles are less than 6 Å root-mean-squared-distance from their native conformations, nearly 20% better than the next best. The method is expected to be useful for advancing protein structure and function prediction.
Availability and implementation
SPOT-1D and its data is available at: http://sparks-lab.org/.
Supplementary information
Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Jack Hanson
- Signal Processing Laboratory, Griffith University, Brisbane, QLD, Australia
| | - Kuldip Paliwal
- Signal Processing Laboratory, Griffith University, Brisbane, QLD, Australia
| | - Thomas Litfin
- School of Information and Communication Technology, Griffith University, Gold Coast, QLD, Australia
| | - Yuedong Yang
- School of Data and Computer Science, Sun-Yat Sen University, Guangzhou, Guangdong, China
| | - Yaoqi Zhou
- School of Information and Communication Technology, Griffith University, Gold Coast, QLD, Australia
- Institute for Glycomics, Griffith University, Gold Coast, QLD, Australia
| |
Collapse
|
62
|
Hanson J, Paliwal K, Zhou Y. Accurate Single-Sequence Prediction of Protein Intrinsic Disorder by an Ensemble of Deep Recurrent and Convolutional Architectures. J Chem Inf Model 2018; 58:2369-2376. [DOI: 10.1021/acs.jcim.8b00636] [Citation(s) in RCA: 50] [Impact Index Per Article: 8.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/31/2023]
Affiliation(s)
- Jack Hanson
- Signal Processing Laboratory, Griffith University, Brisbane, Queensland 4122, Australia
| | - Kuldip Paliwal
- Signal Processing Laboratory, Griffith University, Brisbane, Queensland 4122, Australia
| | - Yaoqi Zhou
- Institute for Glycomics and School of Information and Communication Technology, Griffith University, Southport, Queensland 4222, Australia
| |
Collapse
|