1
|
Kim Y, Yoon T, Park WB, Na S. Predicting mechanical properties of silk from its amino acid sequences via machine learning. J Mech Behav Biomed Mater 2023; 140:105739. [PMID: 36871478 DOI: 10.1016/j.jmbbm.2023.105739] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 11/20/2022] [Revised: 02/12/2023] [Accepted: 02/21/2023] [Indexed: 02/25/2023]
Abstract
The silk fiber is increasingly being sought for its superior mechanical properties, biocompatibility, and eco-friendliness, making it promising as a base material for various applications. One of the characteristics of protein fibers, such as silk, is that their mechanical properties are significantly dependent on the amino acid sequence. Numerous studies have been conducted to determine the specific relationship between the amino acid sequence of silk and its mechanical properties. Still, the relationship between the amino acid sequence of silk and its mechanical properties is yet to be clarified. Other fields have adopted machine learning (ML) to establish a relationship between the inputs, such as the ratio of different input material compositions and the resulting mechanical properties. We have proposed a method to convert the amino acid sequence into numerical values for input and succeeded in predicting the mechanical properties of silk from its amino acid sequences. Our study sheds light on predicting mechanical properties of silk fiber from respective amino acid sequences.
|
2
|
Gu J, Xu Y, Nie Y. Role of distal sites in enzyme engineering. Biotechnol Adv 2023; 63:108094. [PMID: 36621725 DOI: 10.1016/j.biotechadv.2023.108094] [Citation(s) in RCA: 36] [Impact Index Per Article: 18.0] [Received: 08/02/2022] [Revised: 11/15/2022] [Accepted: 01/01/2023] [Indexed: 01/06/2023]
Abstract
The limitations associated with natural enzyme catalysis have triggered the rise of the field of protein engineering. Traditional rational design was based on the analysis of protein structural information and catalytic mechanisms to identify key active sites or ligand binding sites to reshape the substrate pocket. The role and significance of functional sites in the active center have been studied extensively. With a deeper understanding of the structure-catalysis relationship map, the entire protein molecule can be filled with residues that play a substantial role in its structure and function. However, the catalytic mechanism underlying distal mutations remains unclear. The aim of this review was to highlight the criticality of the distal site in enzyme engineering based on the following three aspects: What can distal mutations exert on function from mutability landscape? How do distal sites influence enzyme function? How to predict and design distal mutations? This review provides insights into the catalytic mechanism of enzymes from the global interaction network, knowledge from sequence-structure-dynamics-function relationships, and strategies for distal mutation-based protein engineering.
Affiliation(s)
- Jie Gu
- Lab of Brewing Microbiology and Applied Enzymology, School of Biotechnology and Key laboratory of Industrial Biotechnology of Ministry of Education, Jiangnan University, Wuxi 214122, China
- Yan Xu
- Lab of Brewing Microbiology and Applied Enzymology, School of Biotechnology and Key laboratory of Industrial Biotechnology of Ministry of Education, Jiangnan University, Wuxi 214122, China; State Key Laboratory of Food Science and Technology, Jiangnan University, Wuxi 214122, China
- Yao Nie
- Lab of Brewing Microbiology and Applied Enzymology, School of Biotechnology and Key laboratory of Industrial Biotechnology of Ministry of Education, Jiangnan University, Wuxi 214122, China; Suqian Industrial Technology Research Institute of Jiangnan University, Suqian 223814, China.
|
3
|
Yang W, Liu Y, Xiao C. Deep metric learning for accurate protein secondary structure prediction. Knowl Based Syst 2022. [DOI: 10.1016/j.knosys.2022.108356] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/19/2022]
|
4
|
Mohammadi A, Zahiri J, Mohammadi S, Khodarahmi M, Arab SS. PSSMCOOL: A Comprehensive R Package for Generating Evolutionary-based Descriptors of Protein Sequences from PSSM Profiles. Biology Methods and Protocols 2022; 7:bpac008. [PMID: 35388370 PMCID: PMC8977839 DOI: 10.1093/biomethods/bpac008] [Citation(s) in RCA: 16] [Impact Index Per Article: 5.3] [Received: 03/03/2021] [Revised: 01/21/2022] [Indexed: 11/14/2022]
Abstract
Position-specific scoring matrix (PSSM), also called profile, is broadly used for representing the evolutionary history of a given protein sequence. Several investigations reported that the PSSM-based feature descriptors can improve the prediction of various protein attributes such as interaction, function, subcellular localization, secondary structure, disorder regions, and accessible surface area. While plenty of algorithms have been suggested for extracting evolutionary features from PSSM in recent years, there is not any integrated standalone tool for providing these descriptors. Here, we introduce PSSMCOOL, a flexible comprehensive R package that generates 38 PSSM-based feature vectors. To our best knowledge, PSSMCOOL is the first PSSM-based feature extraction tool implemented in R. With the growing demand for exploiting machine-learning algorithms in computational biology, this package would be a practical tool for machine-learning predictions.
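As a rough illustration of what a PSSM-based descriptor can look like, the sketch below averages each PSSM column over all sequence positions, turning an L x 20 profile into one fixed-length vector. This is not PSSMCOOL's API (the package is written in R); the function name `pssm_column_means` and the toy matrix are assumptions made for illustration only.

```python
from typing import List


def pssm_column_means(pssm: List[List[float]]) -> List[float]:
    """Average each PSSM column over all sequence positions.

    `pssm` has one row per residue position and one column per
    amino acid type (20 columns in a real profile).
    """
    if not pssm:
        raise ValueError("PSSM must have at least one row")
    n_cols = len(pssm[0])
    if any(len(row) != n_cols for row in pssm):
        raise ValueError("all PSSM rows must have the same length")
    n_rows = len(pssm)
    return [sum(row[j] for row in pssm) / n_rows for j in range(n_cols)]


# Toy profile: 3 positions and only 4 columns (hypothetical values).
toy_pssm = [
    [1.0, -2.0, 0.0, 3.0],
    [3.0, 0.0, -1.0, 1.0],
    [2.0, 2.0, 1.0, -1.0],
]
features = pssm_column_means(toy_pssm)
```

The resulting vector has the same length for every protein, which is what lets profiles of different lengths feed a standard machine-learning model.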
Affiliation(s)
- Alireza Mohammadi
- Bioinformatics and Computational Omics Lab (BioCOOL), Department of Biophysics, Faculty of Biological Sciences, Tarbiat Modares University, Tehran, Iran
- Javad Zahiri
- Department of Neuroscience, University of California San Diego, California, USA
- Department of Pediatrics, University of California San Diego, La Jolla, CA, USA
- Saber Mohammadi
- Bioinformatics and Computational Omics Lab (BioCOOL), Department of Biophysics, Faculty of Biological Sciences, Tarbiat Modares University, Tehran, Iran
- Mohsen Khodarahmi
- Department of Radiology, Shahid Madani Hospital, Karaj, Iran
- Bahar Medical Imaging Center, Karaj, Iran
- Dr. Khodarahmi Medical Imaging Center, Karaj, Iran
- Seyed Shahriar Arab
- Department of Biophysics, Faculty of Biological Sciences, Tarbiat Modares University, Tehran, Iran
|
5
|
Protein secondary structure prediction using a lightweight convolutional network and label distribution aware margin loss. Knowl Based Syst 2022. [DOI: 10.1016/j.knosys.2021.107771] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 01/14/2023]
|
6
|
Active instance selection via parametric equation and instance overlap aware scheme. Appl Intell 2022. [DOI: 10.1007/s10489-021-02395-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/25/2022]
|
7
|
Chen TR, Juan SH, Huang YW, Lin YC, Lo WC. A secondary structure-based position-specific scoring matrix applied to the improvement in protein secondary structure prediction. PLoS One 2021; 16:e0255076. [PMID: 34320027 PMCID: PMC8318245 DOI: 10.1371/journal.pone.0255076] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Received: 05/21/2020] [Accepted: 07/11/2021] [Indexed: 11/18/2022]
Abstract
Protein secondary structure prediction (SSP) has a variety of applications; however, there has been relatively limited improvement in accuracy for years. With a vision of moving forward all related fields, we aimed to make a fundamental advance in SSP. There have been many admirable efforts made to improve the machine learning algorithm for SSP. This work thus took a step back by manipulating the input features. A secondary structure element-based position-specific scoring matrix (SSE-PSSM) is proposed, based on which a new set of machine learning features can be established. The feasibility of this new PSSM was evaluated by rigid independent tests with training and testing datasets sharing <25% sequence identities. In all experiments, the proposed PSSM outperformed the traditional amino acid PSSM. This new PSSM can be easily combined with the amino acid PSSM, and the improvement in accuracy was remarkable. Preliminary tests made by combining the SSE-PSSM and well-known SSP methods showed 2.0% and 5.2% average improvements in three- and eight-state SSP accuracies, respectively. If this PSSM can be integrated into state-of-the-art SSP methods, the overall accuracy of SSP may break the current restriction and eventually bring benefit to all research and applications where secondary structure prediction plays a vital role during development. To facilitate the application and integration of the SSE-PSSM with modern SSP methods, we have established a web server and standalone programs for generating SSE-PSSM available at http://10.life.nctu.edu.tw/SSE-PSSM.
Affiliation(s)
- Teng-Ruei Chen
- Institute of Bioinformatics and Systems Biology, National Chiao Tung University, Hsinchu, Taiwan
- Institute of Bioinformatics and Systems Biology, National Yang Ming Chiao Tung University, Hsinchu, Taiwan
- Sheng-Hung Juan
- Institute of Bioinformatics and Systems Biology, National Chiao Tung University, Hsinchu, Taiwan
- Institute of Bioinformatics and Systems Biology, National Yang Ming Chiao Tung University, Hsinchu, Taiwan
- Yu-Wei Huang
- Institute of Bioinformatics and Systems Biology, National Chiao Tung University, Hsinchu, Taiwan
- Institute of Bioinformatics and Systems Biology, National Yang Ming Chiao Tung University, Hsinchu, Taiwan
- Yen-Cheng Lin
- Department of Biological Science and Technology, National Chiao Tung University, Hsinchu, Taiwan
- Department of Biological Science and Technology, National Yang Ming Chiao Tung University, Hsinchu, Taiwan
- Wei-Cheng Lo
- Institute of Bioinformatics and Systems Biology, National Chiao Tung University, Hsinchu, Taiwan
- Institute of Bioinformatics and Systems Biology, National Yang Ming Chiao Tung University, Hsinchu, Taiwan
- Department of Biological Science and Technology, National Chiao Tung University, Hsinchu, Taiwan
- Department of Biological Science and Technology, National Yang Ming Chiao Tung University, Hsinchu, Taiwan
- The Center for Bioinformatics Research, National Yang Ming Chiao Tung University, Hsinchu, Taiwan
|
8
|
Uddin MR, Mahbub S, Rahman MS, Bayzid MS. SAINT: self-attention augmented inception-inside-inception network improves protein secondary structure prediction. Bioinformatics 2021; 36:4599-4608. [PMID: 32437517 DOI: 10.1093/bioinformatics/btaa531] [Citation(s) in RCA: 39] [Impact Index Per Article: 9.8] [Received: 12/07/2019] [Revised: 05/10/2020] [Accepted: 05/16/2020] [Indexed: 11/12/2022]
Abstract
MOTIVATION: Protein structures provide basic insight into how proteins interact with other proteins, and into their functions and biological roles in an organism. Experimental methods (e.g. X-ray crystallography and nuclear magnetic resonance spectroscopy) for determining the secondary structure (SS) of proteins are very expensive and time consuming. Therefore, developing efficient computational approaches for predicting the SS of proteins is of utmost importance. Advances in developing highly accurate SS prediction methods have mostly focused on 3-class (Q3) structure prediction. However, 8-class (Q8) resolution of SS contains more useful information and is much more challenging than Q3 prediction. RESULTS: We present SAINT, a highly accurate method for Q8 structure prediction, which incorporates a self-attention mechanism (a concept from natural language processing) with the Deep Inception-Inside-Inception network in order to effectively capture both the short- and long-range interactions among the amino acid residues. SAINT offers a more interpretable framework than the typical black-box deep neural network methods. Through an extensive evaluation study, we report the performance of SAINT in comparison with the existing best methods on a collection of benchmark datasets, namely, TEST2016, TEST2018, CASP12 and CASP13. Our results suggest that the self-attention mechanism improves the prediction accuracy and that SAINT outperforms the best existing alternative methods. SAINT is the first of its kind and offers the best known Q8 accuracy. Thus, we believe SAINT represents a major step toward the accurate and reliable prediction of SSs of proteins. AVAILABILITY AND IMPLEMENTATION: SAINT is freely available as an open-source project at https://github.com/SAINTProtein/SAINT.
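To make the self-attention idea concrete, here is a minimal sketch of generic scaled dot-product self-attention over a sequence of per-residue feature vectors. It is not SAINT's implementation: the learned query, key and value projections are replaced by the identity (an assumption made to keep the example short), and the names `softmax` and `self_attention` are illustrative.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(xs: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x: Matrix) -> Matrix:
    """Scaled dot-product self-attention with Q = K = V = x.

    Each output row is a weighted average of all input rows, so every
    residue can draw on information from any other position, near or far.
    """
    d = len(x[0])
    out = []
    for q in x:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in x]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, x)) for j in range(d)])
    return out


# Two hypothetical residues with 2-dimensional features.
attended = self_attention([[1.0, 0.0], [0.0, 1.0]])
```

Because the attention weights form a distribution over all positions, the distance between two residues in the sequence does not limit how strongly they can interact.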
Affiliation(s)
- Mostofa Rafid Uddin
- Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka 1205, Bangladesh; Department of Computer Science and Engineering, East West University, Dhaka 1212, Bangladesh
- Sazan Mahbub
- Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka 1205, Bangladesh
- M Saifur Rahman
- Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka 1205, Bangladesh
- Md Shamsuzzoha Bayzid
- Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka 1205, Bangladesh
|
9
|
Wang L, Yang L, Feng YL, Zhang H. Evolutionary insights into the active-site structures of the metallo-β-lactamase superfamily from a classification study with support vector machine. J Biol Inorg Chem 2020; 25:1023-1034. [PMID: 32945939 DOI: 10.1007/s00775-020-01822-y] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 05/25/2020] [Accepted: 09/05/2020] [Indexed: 12/01/2022]
Abstract
The metallo-β-lactamase (MβL) superfamily, which is intriguing due to its enzyme promiscuity, is a good model enzyme superfamily for studies of catalytic function evolution. Our previous study traced the evolution of the phosphotriesterase activity of the MβL superfamily and found that MβLs go through three typical active-site structures in the development of phosphotriesterase activity. In the present study, taking the three typical active-site structures as class labels, the classification and prediction models, which were established by support vector machine and amino acid composition, classified the MβL members into three classes. The indispensable amino acid compositions showed a surprising performance that was remarkably better than the performance of the dispensable amino acid compositions and even equal to the performance of the 20 native amino acids. We further traced the origin of the classification error and found that there was one subclass adopting a type of active-site structure that was the evolutionary transition between these classes. After that, our classification and prediction models were successfully used to predict several MβL active-site structures that lost the dinuclear structures during crystallization. In summary, our studies established a classification and prediction system for active-site structures that well compensated for experimental methods that recognize protein structure details and suggest that the indispensable amino acids contain much more protein structure information than the dispensable amino acids.
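The amino acid composition features mentioned above can be sketched as follows: each sequence becomes a vector of residue frequencies, which could then be passed to a support vector machine. The function name, the `alphabet` parameter and the example sequence are assumptions; the abstract does not give the authors' exact feature definition or which residues they treat as indispensable.

```python
from collections import Counter
from typing import List

# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence: str, alphabet: str = AMINO_ACIDS) -> List[float]:
    """Return the fraction of each residue type in `alphabet`.

    Passing a shorter alphabet restricts the features to a chosen
    subset of residue types; characters outside it are ignored.
    """
    counts = Counter(ch for ch in sequence.upper() if ch in alphabet)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no residues from the alphabet")
    return [counts[aa] / total for aa in alphabet]


# Hypothetical 4-residue peptide: two alanines, one cysteine, one aspartate.
composition = amino_acid_composition("AACD")
```

Restricting `alphabet` to a subset of residues gives a shorter feature vector, which is one way to compare how much structural information different residue groups carry.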
Affiliation(s)
- Lili Wang
- College of Physics and Electronic Engineering, Northwest Normal University, Lanzhou, 730070, People's Republic of China
- Ling Yang
- MIIT Key Laboratory of Critical Materials Technology for New Energy Conversion and Storage, Institute of Theoretical and Simulation Chemistry, School of Chemistry and Chemical Engineering, Harbin Institute of Technology, Harbin, 150080, People's Republic of China
- Yu-Lan Feng
- Biomedical Research Center, College of Life Science and Engineering, Northwest Minzu University, Lanzhou, 730030, People's Republic of China
- Hao Zhang
- Biomedical Research Center, College of Life Science and Engineering, Northwest Minzu University, Lanzhou, 730030, People's Republic of China.
|
10
|
Van Messem A. Support vector machines: A robust prediction method with applications in bioinformatics. Handbook of Statistics 2020. [DOI: 10.1016/bs.host.2019.08.003] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Indexed: 02/02/2023]
|
11
|
Sample Reduction Strategies for Protein Secondary Structure Prediction. Applied Sciences-Basel 2019. [DOI: 10.3390/app9204429] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Indexed: 11/16/2022]
Abstract
Predicting the secondary structure from protein sequence plays a crucial role in estimating the 3D structure, which has applications in drug design and in understanding the function of proteins. As new genes and proteins are discovered, the large size of the protein databases and datasets that can be used for training prediction models grows considerably. A two-stage hybrid classifier, which employs dynamic Bayesian networks and a support vector machine (SVM) has been shown to provide state-of-the-art prediction accuracy for protein secondary structure prediction. However, SVM is not efficient for large datasets due to the quadratic optimization involved in model training. In this paper, two techniques are implemented on CB513 benchmark for reducing the number of samples in the train set of the SVM. The first method randomly selects a fraction of data samples from the train set using a stratified selection strategy. This approach can remove approximately 50% of the data samples from the train set and reduce the model training time by 73.38% on average without decreasing the prediction accuracy significantly. The second method clusters the data samples by a hierarchical clustering algorithm and replaces the train set samples with nearest neighbors of the cluster centers in order to improve the training time. To cluster the feature vectors, the hierarchical clustering method is implemented, for which the number of clusters and the number of nearest neighbors are optimized as hyper-parameters by computing the prediction accuracy on validation sets. It is found that clustering can reduce the size of the train set by 26% without reducing the prediction accuracy. Among the clustering techniques Ward’s method provided the best accuracy on test data.
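The first reduction strategy, stratified random selection, can be sketched as below: the same fraction of samples is drawn from every class, so the class balance of the training set is preserved. The function name, the `seed` parameter and the toy labels are assumptions for illustration; the paper's exact procedure may differ.

```python
import random
from collections import defaultdict
from typing import List, Sequence, Tuple


def stratified_subsample(
    samples: Sequence, labels: Sequence, fraction: float, seed: int = 0
) -> Tuple[List, List]:
    """Keep roughly `fraction` of the samples from each class."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    keep = []
    for idxs in by_class.values():
        # Keep at least one sample so that no class disappears entirely.
        k = max(1, round(len(idxs) * fraction))
        keep.extend(rng.sample(idxs, k))
    keep.sort()  # preserve the original sample order
    return [samples[i] for i in keep], [labels[i] for i in keep]


# Toy train set: 6 helix (H), 4 strand (E) and 10 coil (C) samples.
toy_labels = ["H"] * 6 + ["E"] * 4 + ["C"] * 10
toy_samples = list(range(len(toy_labels)))
kept_samples, kept_labels = stratified_subsample(toy_samples, toy_labels, 0.5)
```

With a fraction of 0.5 the toy set shrinks from 20 to 10 samples while the H:E:C ratio stays at 3:2:5.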
|
12
|
Toussi CA, Haddadnia J. Improving protein secondary structure prediction: the evolutionary optimized classification algorithms. Struct Chem 2019. [DOI: 10.1007/s11224-018-1271-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/27/2022]
|
13
|
Chen Y, Yuan X, Cang X. Population-based incremental learning for the prediction of Homo sapiens’ protein secondary structure. Int J Biomath 2019. [DOI: 10.1142/s1793524519500177] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/18/2022]
Abstract
Protein structure prediction is the prediction of the 3D structure of a protein based on its amino acid sequence. It is a key component in disciplines such as medicine, biology, and biochemistry. The prediction of the protein secondary structure of Homo sapiens is one of the more important domains. Many methods have used feed-forward neural networks or SVMs combined with a sliding window. The mechanisms of these methods are too complex to yield clear and straightforward physical meanings. This paper explores population-based incremental learning (PBIL), a method that combines the mechanisms of a generational genetic algorithm with simple competitive learning. The results show that its accuracies are particularly associated with the Homo species. This new perspective reveals a number of different possibilities for performance improvement.
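For readers unfamiliar with PBIL, the sketch below shows the basic loop on bit strings: sample a population from a probability vector, pick the fittest individual, and nudge the probabilities toward it. It is a generic textbook-style version, not the authors' secondary structure predictor; the bit-counting fitness and all parameter values are assumptions chosen for illustration.

```python
import random
from typing import Callable, List, Tuple


def pbil(
    fitness: Callable[[List[int]], float],
    n_bits: int,
    pop_size: int = 20,
    learning_rate: float = 0.1,
    generations: int = 50,
    seed: int = 0,
) -> Tuple[List[int], List[float]]:
    """Population-based incremental learning over bit strings."""
    rng = random.Random(seed)
    probs = [0.5] * n_bits  # start with no preference for 0 or 1
    best, best_fit = None, float("-inf")
    for _ in range(generations):
        # Genetic-algorithm part: sample a fresh population each generation.
        population = [
            [1 if rng.random() < p else 0 for p in probs] for _ in range(pop_size)
        ]
        winner = max(population, key=fitness)
        winner_fit = fitness(winner)
        if winner_fit > best_fit:
            best, best_fit = winner, winner_fit
        # Competitive-learning part: move probabilities toward the winner.
        probs = [
            (1 - learning_rate) * p + learning_rate * bit
            for p, bit in zip(probs, winner)
        ]
    return best, probs


# Toy fitness: the number of 1 bits in the string.
best_bits, final_probs = pbil(sum, n_bits=8)
```

The probability vector is the whole model, which is part of why PBIL is easier to inspect than a neural network or an SVM.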
Affiliation(s)
- Ye Chen
- School of Information and Control Engineering, China University of Mining and Technology, Xuzhou, Jiangsu 221008, P. R. China
- Xiaoping Yuan
- School of Information and Control Engineering, China University of Mining and Technology, Xuzhou, Jiangsu 221008, P. R. China
- Xiaohui Cang
- Institute of Genetics, Zhejiang University School of Medicine, Hangzhou, Zhejiang 310058, P. R. China
|
14
|
Schaumann R, Dallacker-Losensky K, Rosenkranz C, Genzel GH, Stîngu CS, Schellenberger W, Schulz-Stübner S, Rodloff AC, Eschrich K. Discrimination of Human Pathogen Clostridium Species Especially of the Heterogeneous C. sporogenes and C. botulinum by MALDI-TOF Mass Spectrometry. Curr Microbiol 2018; 75:1506-1515. [PMID: 30120528 DOI: 10.1007/s00284-018-1552-7] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Received: 04/09/2018] [Accepted: 08/07/2018] [Indexed: 10/28/2022]
Abstract
Clostridium species cause several local and systemic diseases. Conventional identification of these microorganisms is in part laborious, time consuming, and not always reliable, and it does not always distinguish between different species, such as C. botulinum and C. sporogenes. All in all, there is great interest in finding a reliable, powerful and rapid method to identify Clostridium spp. not only at the genus level but also at the species level. The aim of the present study was to identify Clostridium spp. strains and also to differentiate the metabolic groups of C. botulinum by Matrix-Assisted Laser Desorption/Ionization Time of Flight Mass Spectrometry (MALDI-TOF MS). A total of 123 strains of Clostridium spp. (C. botulinum, n = 40, C. difficile, n = 11, C. tetani, n = 11, C. sordellii, n = 20, C. sporogenes, n = 18, C. innocuum, n = 10, C. perfringens, n = 13) were analyzed by MALDI-TOF MS in combination with methods of multivariate statistical analysis. This combined approach was able to discriminate between the different tested Clostridium spp., even between species which are closely related and difficult to differentiate by traditional methods, such as C. sporogenes and C. botulinum. Furthermore, the method was able to separate the different metabolic groups of C. botulinum. In particular, E gene-positive C. botulinum strains are clearly distinguishable not only from the other species but also from strains producing other toxin types. Thus, MALDI-TOF MS represents a reliable and above all quick method for identification of cultivated Clostridium species.
Affiliation(s)
- Reiner Schaumann
- Institute for Medical Microbiology and Epidemiology of Infectious Diseases, University Hospital of Leipzig, Leipzig, Germany
- Kevin Dallacker-Losensky
- Department of Trauma Surgery and Orthopedics, Reconstructive and Septic Surgery, and Sports Traumatology, German Armed Forces Hospital Ulm, Ulm, Germany.
- Christiane Rosenkranz
- Institute for Medical Microbiology and Epidemiology of Infectious Diseases, University Hospital of Leipzig, Leipzig, Germany
- Catalina S Stîngu
- Institute for Medical Microbiology and Epidemiology of Infectious Diseases, University Hospital of Leipzig, Leipzig, Germany
- Arne C Rodloff
- Institute for Medical Microbiology and Epidemiology of Infectious Diseases, University Hospital of Leipzig, Leipzig, Germany
- Klaus Eschrich
- Institute of Biochemistry, University Hospital of Leipzig, Leipzig, Germany
|
15
|
Zhang B, Li J, Lü Q. Prediction of 8-state protein secondary structures by a novel deep learning architecture. BMC Bioinformatics 2018; 19:293. [PMID: 30075707 PMCID: PMC6090794 DOI: 10.1186/s12859-018-2280-5] [Citation(s) in RCA: 66] [Impact Index Per Article: 9.4] [Received: 04/19/2018] [Accepted: 07/09/2018] [Indexed: 11/16/2022]
Abstract
Background: Protein secondary structure can be regarded as an information bridge that links the primary sequence and tertiary structure. Accurate 8-state secondary structure prediction can provide more precise, higher-resolution analysis of structure-based properties. Results: We present a novel deep learning architecture which exploits an integrative synergy of prediction by a convolutional neural network, residual network, and bidirectional recurrent neural network to improve the performance of protein secondary structure prediction. A local block composed of convolutional filters and the original input is designed for capturing local sequence features. The subsequent bidirectional recurrent neural network consisting of gated recurrent units can capture global context features. Furthermore, the residual network can improve the information flow between the hidden layers and the cascaded recurrent neural network. Our proposed deep network achieved 71.4% accuracy on the benchmark CB513 dataset for 8-state prediction, and ensemble learning with our model achieved 74% accuracy. The generalization capability of our model was also evaluated on three other independent datasets, CASP10, CASP11 and CASP12, for both 8- and 3-state prediction. These prediction performances are superior to those of the state-of-the-art methods. Conclusion: Our experiments demonstrate that this is a valuable method for predicting protein secondary structure, and that capturing local and global features concurrently is very useful in deep learning.
Affiliation(s)
- Buzhong Zhang
- School of Computer Science and Technology, Soochow University, Suzhou, China; School of Computer and Information, and the University Key Laboratory of Intelligent Perception and Computing of Anhui Province, Anqing Normal University, Anqing, 246011, China
- Jinyan Li
- Advanced Analytics Institute, Faculty of Engineering and IT, University of Technology Sydney, Broadway, NSW 2007, Sydney, PO Box 123, Australia
- Qiang Lü
- School of Computer Science and Technology, Soochow University, Suzhou, China.
|
16
|
Song W, Liu H, Wang J, Kong Y, Yin X, Zang W. MATHT: A web server for comprehensive transcriptome data analysis. J Theor Biol 2018; 455:140-146. [PMID: 30040963 DOI: 10.1016/j.jtbi.2018.07.021] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 09/01/2016] [Revised: 07/17/2018] [Accepted: 07/19/2018] [Indexed: 12/15/2022]
Abstract
The current software/algorithms for high-throughput sequence data analysis are not user-friendly. We developed MATHT, the Multifaceted Analysis Tool for Human Transcriptome, which is a free web server available at www.biocloudservice.com, to provide more comprehensive and reliable analysis of transcriptome data. The web server provides modules for data preprocessing, differential expression analysis, dataset integration, functional analysis, and network analysis. The sequence and structure analysis module is specially designed for RNA-seq data. MATHT is a user-friendly web server that provides comprehensive analysis of transcriptome data, especially integration analysis using special standardization across different platforms.
Affiliation(s)
- Wei Song
- Eryun (ShangHai) Information Technology Co., Ltd., No. 951 Jianchuan Road, Minhang District, Shanghai 201109, PR China
- Huaping Liu
- Eryun (ShangHai) Information Technology Co., Ltd., No. 951 Jianchuan Road, Minhang District, Shanghai 201109, PR China
- Jiajia Wang
- Eryun (ShangHai) Information Technology Co., Ltd., No. 951 Jianchuan Road, Minhang District, Shanghai 201109, PR China
- Yan Kong
- Eryun (ShangHai) Information Technology Co., Ltd., No. 951 Jianchuan Road, Minhang District, Shanghai 201109, PR China
- Xia Yin
- Eryun (ShangHai) Information Technology Co., Ltd., No. 951 Jianchuan Road, Minhang District, Shanghai 201109, PR China
- Weidong Zang
- Eryun (ShangHai) Information Technology Co., Ltd., No. 951 Jianchuan Road, Minhang District, Shanghai 201109, PR China.
|
17
|
Protein Secondary Structure Prediction Based on Data Partition and Semi-Random Subspace Method. Sci Rep 2018; 8:9856. [PMID: 29959372 PMCID: PMC6026213 DOI: 10.1038/s41598-018-28084-8] [Citation(s) in RCA: 28] [Impact Index Per Article: 4.0] [Received: 03/15/2018] [Accepted: 06/12/2018] [Indexed: 11/20/2022]
Abstract
Protein secondary structure prediction is one of the most important and challenging problems in bioinformatics. Machine learning techniques have been applied to solve the problem and have gained substantial success in this research area. However there is still room for improvement toward the theoretical limit. In this paper, we present a novel method for protein secondary structure prediction based on a data partition and semi-random subspace method (PSRSM). Data partitioning is an important strategy for our method. First, the protein training dataset was partitioned into several subsets based on the length of the protein sequence. Then we trained base classifiers on the subspace data generated by the semi-random subspace method, and combined base classifiers by majority vote rule into ensemble classifiers on each subset. Multiple classifiers were trained on different subsets. These different classifiers were used to predict the secondary structures of different proteins according to the protein sequence length. Experiments are performed on 25PDB, CB513, CASP10, CASP11, CASP12, and T100 datasets, and the good performance of 86.38%, 84.53%, 85.51%, 85.89%, 85.55%, and 85.09% is achieved respectively. Experimental results showed that our method outperforms other state-of-the-art methods.
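The two-level structure described above (route a protein to an ensemble by sequence length, then combine base classifiers by majority vote) can be sketched as follows. The semi-random subspace step that generates the base classifiers is not reproduced here; the length boundary of 100 residues, the function names and the constant toy classifiers are all assumptions for illustration.

```python
from collections import Counter
from typing import Callable, List, Sequence

Classifier = Callable[[Sequence[float]], str]


def majority_vote(votes: Sequence[str]) -> str:
    """Return the most frequent label among the base classifiers' votes."""
    return Counter(votes).most_common(1)[0][0]


def length_bin(seq_len: int, upper_bounds: Sequence[int]) -> int:
    """Index of the length partition a protein falls into.

    `upper_bounds` lists inclusive upper limits in ascending order;
    anything longer than the last bound goes to the final partition.
    """
    for i, upper in enumerate(upper_bounds):
        if seq_len <= upper:
            return i
    return len(upper_bounds)


def predict_state(
    features: Sequence[float],
    seq_len: int,
    ensembles: List[List[Classifier]],
    upper_bounds: Sequence[int],
) -> str:
    """Predict one residue's state using the ensemble for its length bin."""
    ensemble = ensembles[length_bin(seq_len, upper_bounds)]
    return majority_vote([clf(features) for clf in ensemble])


# Toy ensembles whose base classifiers return fixed labels.
short_ensemble = [lambda f: "H", lambda f: "H", lambda f: "C"]
long_ensemble = [lambda f: "E", lambda f: "C", lambda f: "E"]
ensembles = [short_ensemble, long_ensemble]
upper_bounds = [100]  # hypothetical boundary between short and long proteins

short_pred = predict_state([0.1, 0.2], 80, ensembles, upper_bounds)
long_pred = predict_state([0.1, 0.2], 250, ensembles, upper_bounds)
```

In a real system each base classifier would be trained on a different feature subspace of its length partition rather than returning a constant label.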
|
18
|
Zhou J, Wang H, Zhao Z, Xu R, Lu Q. CNNH_PSS: protein 8-class secondary structure prediction by convolutional neural network with highway. BMC Bioinformatics 2018; 19:60. [PMID: 29745837 PMCID: PMC5998876 DOI: 10.1186/s12859-018-2067-8] [Citation(s) in RCA: 32] [Impact Index Per Article: 4.6] [Indexed: 11/24/2022]
Abstract
BACKGROUND: Protein secondary structure is the three-dimensional form of local segments of proteins, and its prediction is an important problem in protein tertiary structure prediction. Developing computational approaches for protein secondary structure prediction is becoming increasingly urgent. RESULTS: We present a novel deep learning based model, referred to as CNNH_PSS, that uses a multi-scale CNN with highway. In CNNH_PSS, any two neighboring convolutional layers have a highway to deliver information from the current layer to the output of the next one to keep local contexts. As lower layers extract local context while higher layers extract long-range interdependencies, the highways between neighboring layers allow CNNH_PSS to extract both local contexts and long-range interdependencies. We evaluate CNNH_PSS on two commonly used datasets: CB6133 and CB513. CNNH_PSS outperforms the multi-scale CNN without highway by at least 0.010 Q8 accuracy and also performs better than CNF, DeepCNF and SSpro8, which cannot extract long-range interdependencies, by at least 0.020 Q8 accuracy, demonstrating that both local contexts and long-range interdependencies are indeed useful for prediction. Furthermore, CNNH_PSS also performs better than GSM and DCRNN, which need an extra complex model to extract long-range interdependencies. This demonstrates that CNNH_PSS not only costs fewer computing resources but also achieves better prediction performance. CONCLUSION: CNNH_PSS is able to extract both local contexts and long-range interdependencies by combining a multi-scale CNN and a highway network. The evaluations on common datasets and comparisons with state-of-the-art methods indicate that CNNH_PSS is a useful and efficient tool for protein secondary structure prediction.
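A highway connection of the kind described above can be sketched as a gated mix of a layer's input and its transformed output. The formulation below follows the generic highway-network form y = t * h + (1 - t) * x with a sigmoid gate t; whether CNNH_PSS uses exactly this gating is an assumption, since the abstract does not specify it, and the names are illustrative.

```python
import math
from typing import List


def sigmoid(z: float) -> float:
    """Logistic function mapping a real number to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def highway_combine(
    x: List[float], h: List[float], gate_logits: List[float]
) -> List[float]:
    """Mix the layer input `x` with the transformed output `h`.

    A gate near 1 passes the transformed features; a gate near 0
    carries the input through unchanged, preserving local context.
    """
    if not (len(x) == len(h) == len(gate_logits)):
        raise ValueError("x, h and gate_logits must have the same length")
    gates = [sigmoid(g) for g in gate_logits]
    return [t * hi + (1.0 - t) * xi for xi, hi, t in zip(x, h, gates)]


# With zero logits the gate is 0.5, so the output is the plain average.
mixed = highway_combine([1.0, 2.0], [3.0, 4.0], [0.0, 0.0])
```

In a trained network the gate logits would themselves be computed from the input by learned weights, so the model decides per feature how much local context to carry forward.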
Affiliation(s)
- Jiyun Zhou
  - School Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, HIT Campus Shenzhen University Town, Xili, Shenzhen, Guangdong 518055 China
  - Department of Computing, the Hong Kong Polytechnic University, Hung Hom, Hong Kong
- Hongpeng Wang
  - School Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, HIT Campus Shenzhen University Town, Xili, Shenzhen, Guangdong 518055 China
- Zhishan Zhao
  - School Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, HIT Campus Shenzhen University Town, Xili, Shenzhen, Guangdong 518055 China
- Ruifeng Xu
  - School Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, HIT Campus Shenzhen University Town, Xili, Shenzhen, Guangdong 518055 China
- Qin Lu
  - Department of Computing, the Hong Kong Polytechnic University, Hung Hom, Hong Kong
|
19
|
Li S, Zou R, Tu Y, Wu J, Landry MP. Cholesterol-directed nanoparticle assemblies based on single amino acid peptide mutations activate cellular uptake and decrease tumor volume. Chem Sci 2017; 8:7552-7559. [PMID: 29163910 PMCID: PMC5676250 DOI: 10.1039/c7sc02616a] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.1] [Received: 06/12/2017] [Accepted: 09/07/2017] [Indexed: 01/10/2023]
Abstract
Peptide drugs have been difficult to translate into effective therapies due to their low in vivo stability. Here, we report a strategy to develop peptide-based therapeutic nanoparticles by screening a peptide library differing by single-site amino acid mutations of lysine-modified cholesterol. Certain cholesterol-modified peptides are found to promote and stabilize peptide α-helix formation, resulting in selectively cell-permeable peptides. One cholesterol-modified peptide self-assembles into stable nanoparticles with considerable α-helix propensity stabilized by intermolecular van der Waals interactions between inter-peptide cholesterol molecules, and shows 68.3% stability after incubation with serum for 16 h. The nanoparticles in turn interact with cell membrane cholesterols that are disproportionately present in cancer cell membranes, inducing lipid raft-mediated endocytosis and cancer cell death. Our results introduce a strategy to identify peptide nanoparticles that can effectively reduce tumor volumes when administered to in vivo mouse models. Our results also provide a simple platform for developing peptide-based anticancer drugs.
Affiliation(s)
- Shang Li
  - Key Laboratory for Advanced Materials & Institute of Fine Chemicals, School of Chemistry and Molecular Engineering, East China University of Science and Technology, Shanghai 200237, China
- Rongfeng Zou
  - Key Laboratory for Advanced Materials & Institute of Fine Chemicals, School of Chemistry and Molecular Engineering, East China University of Science and Technology, Shanghai 200237, China
  - Division of Theoretical Chemistry and Biology, School of Biotechnology, KTH Royal Institute of Technology, SE-10691 Stockholm, Sweden
- Yaoquan Tu
  - Division of Theoretical Chemistry and Biology, School of Biotechnology, KTH Royal Institute of Technology, SE-10691 Stockholm, Sweden
- Junchen Wu
  - Key Laboratory for Advanced Materials & Institute of Fine Chemicals, School of Chemistry and Molecular Engineering, East China University of Science and Technology, Shanghai 200237, China
  - Department of Chemical and Bio-molecular Engineering, University of California Berkeley, 476 Stanley Hall, Berkeley, California 94720, USA
- Markita P Landry
  - Department of Chemical and Bio-molecular Engineering, University of California Berkeley, 476 Stanley Hall, Berkeley, California 94720, USA
  - California Institute for Quantitative Biosciences (qb3), University of California-Berkeley, Berkeley, CA 94720, USA
|
20
|
Castillo-Garit JA, Casañola-Martin GM, Barigye SJ, Pham-The H, Torrens F, Torreblanca A. Machine learning-based models to predict modes of toxic action of phenols to Tetrahymena pyriformis. SAR AND QSAR IN ENVIRONMENTAL RESEARCH 2017; 28:735-747. [PMID: 29022372 DOI: 10.1080/1062936x.2017.1376705] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.4] [Received: 07/26/2017] [Accepted: 09/01/2017] [Indexed: 06/07/2023]
Abstract
The phenols are structurally heterogeneous pollutants and they present a variety of modes of toxic action (MOA), including polar narcotics, weak acid respiratory uncouplers, pro-electrophiles, and soft electrophiles. Because it is often difficult to determine correctly the mechanism of action of a compound, quantitative structure-activity relationship (QSAR) methods, which have proved their interest in toxicity prediction, can be used. In this work, several QSAR models for the prediction of MOA of 221 phenols to the ciliated protozoan Tetrahymena pyriformis, using Chemistry Development Kit descriptors, are reported. Four machine learning techniques (ML), k-nearest neighbours, support vector machine, classification trees, and artificial neural networks, have been used to develop several models with higher accuracies and predictive capabilities for distinguishing between four MOAs. They showed global accuracy values between 95.9% and 97.7% and area under Receiver Operator Curve values between 0.978 and 0.998; additionally, false alarm rate values were below 8.2% for training set. In order to validate our models, cross-validation (10-folds-out) and external test-set were performed with good behaviour in all cases. These models, obtained with ML techniques, were compared with others previously reported by other researchers, and the improvement was significant.
Affiliation(s)
- J A Castillo-Garit
  - Unidad de Toxicología Experimental, Universidad de Ciencias Médicas de Villa Clara, Santa Clara, Villa Clara, Cuba
  - Departament de Biología Funcional i Antropología Física, Universitat de València, Burjassot, Spain
- G M Casañola-Martin
  - Departamento de Química Física, Facultad de Farmacia, Unidad de Investigación de Diseño de Fármacos y Conectividad Molecular, Universitat de València, Spain
- S J Barigye
  - Department of Chemistry, McGill University, Montréal, Québec, Canada
- H Pham-The
  - Hanoi University of Pharmacy, Hoan Kiem, Hanoi, Vietnam
- F Torrens
  - Institut Universitari de Ciència Molecular, Universitat de València, Edifici d'Instituts de Paterna, Valencia, Spain
- A Torreblanca
  - Departament de Biología Funcional i Antropología Física, Universitat de València, Burjassot, Spain
|
21
|
Hidden Markov model and Chapman Kolmogrov for protein structures prediction from images. Comput Biol Chem 2017; 68:231-244. [DOI: 10.1016/j.compbiolchem.2017.04.003] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.8] [Received: 01/18/2017] [Revised: 03/11/2017] [Accepted: 04/11/2017] [Indexed: 11/20/2022]
|
22
|
Mabrouk MS, Naeem SM, Eldosoky MA. DIFFERENT GENOMIC SIGNAL PROCESSING METHODS FOR EUKARYOTIC GENE PREDICTION: A SYSTEMATIC REVIEW. BIOMEDICAL ENGINEERING-APPLICATIONS BASIS COMMUNICATIONS 2017. [DOI: 10.4015/s1016237217300012] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Indexed: 11/07/2022]
Abstract
Bioinformatics has now firmly established itself as a discipline in molecular biology and incorporates a wide range of subject areas, from structural biology and genomics to gene expression studies. Bioinformatics is the application of computer technology to the management of biological information. Genomic signal processing (GSP) techniques have been applied widely in bioinformatics and will continue to play an essential role in the investigation of biomedical problems. GSP refers to the use of digital signal processing (DSP) methods for the analysis of genomic data (e.g. DNA sequences). Recently, applications of GSP in bioinformatics have received considerable attention, such as identification of DNA protein coding regions, identification of reading frames, cancer detection and others. Cancer, known medically as malignant neoplasm, is one of the most dangerous diseases that the world faces and has raised the death rate in recent years, so detecting it at an early stage is a promising way to identify this risk and act to treat it. GSP is a method that can be used to detect cancerous cells, which are often caused by genetic abnormality. This systematic review discusses some of the GSP applications in bioinformatics in general. The GSP techniques used for cancer detection in particular are presented to collect recent results and summarize what has been achieved so far as a new subject of research.
Affiliation(s)
- Mai S. Mabrouk
  - Biomedical Engineering Department, Faculty of Engineering, Misr University for Science and Technology (MUST University), Cairo, Egypt
- Safaa M. Naeem
  - Biomedical Engineering Department, Faculty of Engineering, Helwan University, Cairo, Egypt
- Mohamed A. Eldosoky
  - Biomedical Engineering Department, Faculty of Engineering, Helwan University, Cairo, Egypt
|
23
|
Xu Y, Li L, Ding J, Wu LY, Mai G, Zhou F. Gly-PseAAC: Identifying protein lysine glycation through sequences. Gene 2017; 602:1-7. [DOI: 10.1016/j.gene.2016.11.021] [Citation(s) in RCA: 32] [Impact Index Per Article: 4.0] [Received: 05/19/2016] [Revised: 08/29/2016] [Accepted: 11/10/2016] [Indexed: 11/29/2022]
|
24
|
Arana-Daniel N, Gallegos AA, López-Franco C, Alanís AY, Morales J, López-Franco A. Support Vector Machines Trained with Evolutionary Algorithms Employing Kernel Adatron for Large Scale Classification of Protein Structures. Evol Bioinform Online 2016; 12:285-302. [PMID: 27980384 PMCID: PMC5140013 DOI: 10.4137/ebo.s40912] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 09/01/2016] [Revised: 10/19/2016] [Accepted: 10/20/2016] [Indexed: 11/05/2022]
Abstract
With the increasing power of computers, the amount of data that can be processed in small periods of time has grown exponentially, as has the importance of classifying large-scale data efficiently. Support vector machines have shown good results classifying large amounts of high-dimensional data, such as data generated by protein structure prediction, spam recognition, medical diagnosis, optical character recognition and text classification, etc. Most state of the art approaches for large-scale learning use traditional optimization methods, such as quadratic programming or gradient descent, which makes the use of evolutionary algorithms for training support vector machines an area to be explored. The present paper proposes an approach that is simple to implement based on evolutionary algorithms and Kernel-Adatron for solving large-scale classification problems, focusing on protein structure prediction. The functional properties of proteins depend upon their three-dimensional structures. Knowing the structures of proteins is crucial for biology and can lead to improvements in areas such as medicine, agriculture and biofuels.
Affiliation(s)
- Nancy Arana-Daniel
  - Centro Universitario de Ciencias Exactas e Ingenieras, Universidad de Guadalajara, Guadalajara, Jalisco, México
- Alberto A Gallegos
  - Centro Universitario de Ciencias Exactas e Ingenieras, Universidad de Guadalajara, Guadalajara, Jalisco, México
- Carlos López-Franco
  - Centro Universitario de Ciencias Exactas e Ingenieras, Universidad de Guadalajara, Guadalajara, Jalisco, México
- Alma Y Alanís
  - Centro Universitario de Ciencias Exactas e Ingenieras, Universidad de Guadalajara, Guadalajara, Jalisco, México
- Jacob Morales
  - Centro Universitario de Ciencias Exactas e Ingenieras, Universidad de Guadalajara, Guadalajara, Jalisco, México
- Adriana López-Franco
  - Centro Universitario de Ciencias Exactas e Ingenieras, Universidad de Guadalajara, Guadalajara, Jalisco, México
|
25
|
|
26
|
Protein Secondary Structure Prediction Using Deep Convolutional Neural Fields. Sci Rep 2016; 6:18962. [PMID: 26752681 PMCID: PMC4707437 DOI: 10.1038/srep18962] [Citation(s) in RCA: 273] [Impact Index Per Article: 30.3] [Received: 06/28/2015] [Accepted: 11/26/2015] [Indexed: 12/29/2022]
Abstract
Protein secondary structure (SS) prediction is important for studying protein structure and function. When only the sequence (profile) information is used as input feature, currently the best predictors can obtain ~80% Q3 accuracy, which has not been improved in the past decade. Here we present DeepCNF (Deep Convolutional Neural Fields) for protein SS prediction. DeepCNF is a Deep Learning extension of Conditional Neural Fields (CNF), which is an integration of Conditional Random Fields (CRF) and shallow neural networks. DeepCNF can model not only complex sequence-structure relationship by a deep hierarchical architecture, but also interdependency between adjacent SS labels, so it is much more powerful than CNF. Experimental results show that DeepCNF can obtain ~84% Q3 accuracy, ~85% SOV score, and ~72% Q8 accuracy, respectively, on the CASP and CAMEO test proteins, greatly outperforming currently popular predictors. As a general framework, DeepCNF can be used to predict other protein structure properties such as contact number, disorder regions, and solvent accessibility.
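The Q3 and Q8 figures quoted in this and neighbouring abstracts are per-residue accuracies over three or eight secondary structure classes. A minimal sketch follows, assuming DSSP-style one-letter labels and one common 8-to-3 reduction (H, G, I to helix; E, B to strand; everything else to coil). Individual papers may use a different reduction, and the function names here are hypothetical.

```python
# Assumed mapping from 8 DSSP-style states to 3 states; "L" stands for
# loop/irregular here. Other reductions are also used in the literature.
DSSP8_TO_3 = {
    "H": "H", "G": "H", "I": "H",
    "E": "E", "B": "E",
    "T": "C", "S": "C", "L": "C",
}


def q_accuracy(pred, true):
    """Fraction of residues whose predicted label equals the observed one."""
    if len(pred) != len(true) or not true:
        raise ValueError("label strings must be non-empty and equal in length")
    return sum(p == t for p, t in zip(pred, true)) / len(true)


def to_three_state(ss8):
    """Collapse an 8-state label string to 3 states (helix, strand, coil)."""
    return "".join(DSSP8_TO_3[s] for s in ss8)


true8 = "HHHGGTTEEL"
pred8 = "HHHHHTSEEL"
q8 = q_accuracy(pred8, true8)                                  # 0.7
q3 = q_accuracy(to_three_state(pred8), to_three_state(true8))  # 1.0
```

The example shows why Q8 is the harder target: confusing G with H or S with T counts as an error for Q8 but disappears after reduction to three states.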
|
27
|
Prediction of sumoylation sites in proteins using linear discriminant analysis. Gene 2016; 576:99-104. [DOI: 10.1016/j.gene.2015.09.072] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.7] [Received: 04/09/2015] [Revised: 08/24/2015] [Accepted: 09/28/2015] [Indexed: 01/05/2023]
|
28
|
Nasrul Islam M, Iqbal S, Katebi AR, Tamjidul Hoque M. A balanced secondary structure predictor. J Theor Biol 2016; 389:60-71. [DOI: 10.1016/j.jtbi.2015.10.015] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.1] [Received: 06/12/2015] [Revised: 10/14/2015] [Accepted: 10/22/2015] [Indexed: 11/30/2022]
|
29
|
Spencer M, Eickholt J, Cheng J. A Deep Learning Network Approach to ab initio Protein Secondary Structure Prediction. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2015; 12:103-12. [PMID: 25750595 PMCID: PMC4348072 DOI: 10.1109/tcbb.2014.2343960] [Citation(s) in RCA: 138] [Impact Index Per Article: 13.8] [Indexed: 05/12/2023]
Abstract
Ab initio protein secondary structure (SS) predictions are utilized to generate tertiary structure predictions, which are increasingly demanded due to the rapid discovery of proteins. Although recent developments have slightly exceeded previous methods of SS prediction, accuracy has stagnated around 80 percent and many wonder if prediction cannot be advanced beyond this ceiling. Disciplines that have traditionally employed neural networks are experimenting with novel deep learning techniques in attempts to stimulate progress. Since neural networks have historically played an important role in SS prediction, we wanted to determine whether deep learning could contribute to the advancement of this field as well. We developed an SS predictor that makes use of the position-specific scoring matrix generated by PSI-BLAST and deep learning network architectures, which we call DNSS. Graphical processing units and CUDA software optimize the deep network architecture and efficiently train the deep networks. Optimal parameters for the training process were determined, and a workflow comprising three separately trained deep networks was constructed in order to make refined predictions. This deep learning network approach was used to predict SS for a fully independent test dataset of 198 proteins, achieving a Q3 accuracy of 80.7 percent and a Sov accuracy of 74.2 percent.
Affiliation(s)
- Matt Spencer
  - Informatics Institute, University of Missouri, Columbia, MO 65211.
- Jesse Eickholt
  - Department of Computer Science, Central Michigan University, Mount Pleasant, MI 48859.
- Jianlin Cheng
  - Department of Computer Science, University of Missouri, Columbia, MO 65211.
|
30
|
Li Q, Dahl DB, Vannucci M, Hyun Joo, Tsai JW. Bayesian model of protein primary sequence for secondary structure prediction. PLoS One 2014; 9:e109832. [PMID: 25314659 PMCID: PMC4196994 DOI: 10.1371/journal.pone.0109832] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.1] [Received: 07/02/2014] [Accepted: 09/02/2014] [Indexed: 01/26/2023]
Abstract
Determining the primary structure (i.e., amino acid sequence) of a protein has become cheaper, faster, and more accurate. Higher order protein structure provides insight into a protein's function in the cell. Understanding a protein's secondary structure is a first step towards this goal. Therefore, a number of computational prediction methods have been developed to predict secondary structure from just the primary amino acid sequence. The most successful methods use machine learning approaches that are quite accurate, but do not directly incorporate structural information. As a step towards improving secondary structure prediction given the primary structure, we propose a Bayesian model based on the knob-socket model of protein packing in secondary structure. The method considers the packing influence of residues on the secondary structure determination, including those packed close in space but distant in sequence. By performing an assessment of our method on 2 test sets we show how incorporation of multiple sequence alignment data, similarly to PSIPRED, provides balance and improves the accuracy of the predictions. Software implementing the methods is provided as a web application and a stand-alone implementation.
Affiliation(s)
- Qiwei Li
  - Department of Statistics, Rice University, Houston, Texas, United States of America
- David B. Dahl
  - Department of Statistics, Brigham Young University, Provo, Utah, United States of America
- Marina Vannucci
  - Department of Statistics, Rice University, Houston, Texas, United States of America
- Hyun Joo
  - Department of Chemistry, University of the Pacific, Stockton, California, United States of America
- Jerry W. Tsai
  - Department of Chemistry, University of the Pacific, Stockton, California, United States of America
|
31
|
Feng Y, Luo L. Using long-range contact number information for protein secondary structure prediction. INT J BIOMATH 2014. [DOI: 10.1142/s1793524514500521] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/18/2022]
Abstract
In this paper, we first combine tetra-peptide structural words with contact number for protein secondary structure prediction. We used the method of increment of diversity combined with quadratic discriminant analysis to predict the structure of the central residue of a sequence fragment. The method uses tetra-peptide structural words and long-range contact number as information sources. The Q3 accuracy is over 83% on 194 proteins. The accuracies of the predicted secondary structures for the 20 amino acid residues range from 81% to 88%. Moreover, we have introduced the residue long-range contact, which directly indicates the separation of contacting residues in terms of their position in the sequence, and examined the negative influence of long-range residue interactions on predicting secondary structure in a protein. The method is also compared with existing prediction methods. The results show that our method is more effective in protein secondary structure prediction.
Affiliation(s)
- Yonge Feng
  - College of Science, Inner Mongolia Agriculture University, Hohhot 010018, P. R. China
- Liaofu Luo
  - School of Physical Science and Technology, Inner Mongolia University, Hohhot 010021, P. R. China
|
32
|
Joseph AP, de Brevern AG. From local structure to a global framework: recognition of protein folds. J R Soc Interface 2014; 11:20131147. [PMID: 24740960 DOI: 10.1098/rsif.2013.1147] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Indexed: 12/21/2022]
Abstract
Protein folding has been a major area of research for many years. Nonetheless, the mechanisms leading to the formation of an active biological fold are still not fully apprehended. The huge amount of available sequence and structural information provides hints to identify the putative fold for a given sequence. Indeed, protein structures prefer a limited number of local backbone conformations, some being characterized by preferences for certain amino acids. These preferences largely depend on the local structural environment. The prediction of local backbone conformations has become an important factor to correctly identifying the global protein fold. Here, we review the developments in the field of local structure prediction and especially their implication in protein fold recognition.
Affiliation(s)
- Agnel Praveen Joseph
  - Science and Technology Facilities Council, Rutherford Appleton Laboratory, Harwell Oxford, Didcot OX11 0QX, UK
|
33
|
Abstract
The ATP binding proteins exist as a hybrid of proteins with Walker A motif and universal stress proteins (USPs) having an alternative motif for binding ATP. There is an urgent need to find a reliable and comprehensive hybrid predictor for ATP binding proteins using whole sequence information. In this paper the open source LIBSVM toolbox was used to build a classifier at 10-fold cross-validation. The best hybrid model was the combination of amino acid and dipeptide composition with an accuracy of 84.57% and Matthews correlation coefficient (MCC) value of 0.693. This classifier proves to be better than many classical ATP binding protein predictors. The general trend observed is that combinations of descriptors performed better and improved the overall performances of individual descriptors, particularly when combined with amino acid composition. The work developed a comprehensive model for predicting ATP binding proteins irrespective of their functional motifs. This model provides a high probability of success for molecular biologists in predicting and selecting diverse groups of ATP binding proteins irrespective of their functional motifs.
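The two descriptor families and the evaluation metric named in this abstract can be sketched in plain Python. This is an illustration under stated assumptions rather than the authors' pipeline: the 20-letter alphabet, the way non-standard residues are handled, and all function names are hypothetical, and the resulting 420-dimensional vector would still have to be passed to a classifier such as an SVM.

```python
from itertools import product
from math import sqrt

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues (assumed alphabet)
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 pairs


def aa_composition(seq):
    """Fraction of each standard amino acid in the sequence (20 values)."""
    n = len(seq)
    return [seq.count(a) / n for a in AMINO_ACIDS]


def dipeptide_composition(seq):
    """Fraction of each ordered residue pair among adjacent pairs (400 values)."""
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    total = len(pairs)
    return [pairs.count(d) / total for d in DIPEPTIDES]


def hybrid_features(seq):
    """Concatenate both descriptors into one 420-dimensional vector.

    Non-standard residues (e.g. X) are not counted but still contribute to
    the denominators; this is a simplifying assumption of the sketch.
    """
    return aa_composition(seq) + dipeptide_composition(seq)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

Unlike plain accuracy, MCC stays informative when positive and negative classes are unbalanced, which is presumably why the abstract reports both.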
|
34
|
Bhattacharjee N, Biswas P. Helical ambivalency induced by point mutations. BMC STRUCTURAL BIOLOGY 2013; 13:9. [PMID: 23675772 PMCID: PMC3683331 DOI: 10.1186/1472-6807-13-9] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Received: 09/29/2012] [Accepted: 05/02/2013] [Indexed: 01/15/2023]
Abstract
Background Mutation of amino acid sequences in a protein may have diverse effects on its structure and function. Point mutations of even a single amino acid residue in the helices of the non-redundant database may lead to sequentially identical peptides which adopt different secondary structures in different proteins. However, the various physico-chemical factors which govern the formation of these ambivalent helices generated by point mutations of a sequence are not clearly known. Results Sequences generated by point mutations of helices are mapped on to their non-helical counterparts in the SCOP database. The results show that short helices are prone to transform into non-helical conformations upon point mutations. Mutation of amino acid residues by helix breakers preferentially yields non-helical conformations, while mutation with residues of intermediate helix propensity displays the least preference for non-helical conformations. Differences in the solvent accessibility of the mutating/mutated residues are found to be a major criterion for these sequences to conform to non-helical conformations. Even with minimal differences in the amino acid distributions of the sequences flanking the helical and non-helical conformations, helix-flanking sequences are found to be more solvent accessible. Conclusions All types of mutations from helical to non-helical conformations are investigated. The primary factors contributing to such changes in conformation can be: i) type/propensity of the mutating and mutant residues ii) solvent accessibility of the residue at the mutation site iii) context/environment dependence of the flanking sequences. The results from the present study may be used to design de novo proteins via point mutations.
|
35
|
Extracting physicochemical features to predict protein secondary structure. ScientificWorldJournal 2013; 2013:347106. [PMID: 23766688 PMCID: PMC3666292 DOI: 10.1155/2013/347106] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Received: 01/29/2013] [Accepted: 04/23/2013] [Indexed: 11/29/2022]
Abstract
We propose a protein secondary structure prediction method based on position-specific scoring matrix (PSSM) profiles and four physicochemical features including conformation parameters, net charges, hydrophobic, and side chain mass. First, the SVM with the optimal window size and the optimal parameters of the kernel function is found. Then, we train the SVM using the PSSM profiles generated from PSI-BLAST and the physicochemical features extracted from the CB513 data set. Finally, we use the filter to refine the predicted results from the trained SVM. For all the performance measures of our method, Q3 reaches 79.52, SOV94 reaches 86.10, and SOV99 reaches 74.60; all the measures are higher than those of the SVMpsi method and the SVMfreq method. This validates that considering these physicochemical features in predicting protein secondary structure would exhibit better performances.
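Window-based predictors such as the one described here typically turn a per-residue profile into one fixed-length feature vector per residue by sliding a window along the sequence. The NumPy sketch below shows that step only. The default window size of 13, the zero padding at the sequence ends, and the function name are assumptions for illustration; the abstract does not state the optimal window that was found.

```python
import numpy as np


def window_features(pssm, window=13):
    """Build one feature row per residue from a (L, 20) PSSM profile.

    Each row concatenates the profile rows of the residues inside a window
    centred on the target residue. Positions beyond either end of the
    sequence are filled with zeros (an assumed padding scheme).
    Returns an array of shape (L, window * 20).
    """
    if window % 2 == 0:
        raise ValueError("window must be odd so it has a central residue")
    half = window // 2
    padded = np.pad(pssm, ((half, half), (0, 0)))  # zero rows at both ends
    rows = [padded[i:i + window].ravel() for i in range(pssm.shape[0])]
    return np.stack(rows)
```

Additional per-residue descriptors, such as the physicochemical features mentioned in the abstract, could be appended as extra columns of the profile before windowing.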
|
36
|
Brito-Sánchez Y, Castillo-Garit JA, Le-Thi-Thu H, González-Madariaga Y, Torrens F, Marrero-Ponce Y, Rodríguez-Borges JE. Comparative study to predict toxic modes of action of phenols from molecular structures. SAR AND QSAR IN ENVIRONMENTAL RESEARCH 2013; 24:235-251. [PMID: 23437773 DOI: 10.1080/1062936x.2013.766260] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Indexed: 06/01/2023]
Abstract
Quantitative structure-activity relationship models for the prediction of mode of toxic action (MOA) of 221 phenols to the ciliated protozoan Tetrahymena pyriformis using atom-based quadratic indices are reported. The phenols represent a variety of MOAs including polar narcotics, weak acid respiratory uncouplers, pro-electrophiles and soft electrophiles. Linear discriminant analysis (LDA), and four machine learning techniques (ML), namely k-nearest neighbours (k-NN), support vector machine (SVM), classification trees (CTs) and artificial neural networks (ANNs), have been used to develop several models with higher accuracies and predictive capabilities for distinguishing between four MOAs. Most of them showed global accuracy of over 90%, and false alarm rate values were below 2.9% for the training set. Cross-validation, complementary subsets and external test set were performed, with good behaviour in all cases. Our models compare favourably with other previously published models, and in general the models obtained with ML techniques show better results than those developed with linear techniques. We developed unsupervised and supervised consensus, and these results were better than our ML models, the results of rule-based approach and other ensemble models previously published. This investigation highlights the merits of ML-based techniques as an alternative to other more traditional methods for modelling MOA.
Affiliation(s)
- Y Brito-Sánchez
  - Unit of Computer-Aided Molecular Biosilico Discovery and Bioinformatic Research, Faculty of Chemistry-Pharmacy, Universidad Central Marta Abreu de Las Villas, Santa Clara, Cuba
|
37
|
Predicting β-turns in protein using kernel logistic regression. BIOMED RESEARCH INTERNATIONAL 2013; 2013:870372. [PMID: 23509793 PMCID: PMC3590576 DOI: 10.1155/2013/870372] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 09/15/2012] [Accepted: 12/22/2012] [Indexed: 11/18/2022]
Abstract
A β-turn is a secondary protein structure type that plays a significant role in protein configuration and function. On average 25% of amino acids in protein structures are located in β-turns. It is very important to develop an accurate and efficient method for β-turns prediction. Most of the current successful β-turns prediction methods use support vector machines (SVMs) or neural networks (NNs). The kernel logistic regression (KLR) is a powerful classification technique that has been applied successfully in many classification problems. However, it is often not found in β-turns classification, mainly because it is computationally expensive. In this paper, we used KLR to obtain sparse β-turns prediction in short evolution time. Secondary structure information and position-specific scoring matrices (PSSMs) are utilized as input features. We achieved Qtotal of 80.7% and MCC of 50% on BT426 dataset. These results show that KLR method with the right algorithm can yield performance equivalent to or even better than NNs and SVMs in β-turns prediction. In addition, KLR yields probabilistic outcome and has a well-defined extension to multiclass case.
|
38
|
Lei JB, Yin JB, Shen HB. GFO: A data driven approach for optimizing the Gaussian function based similarity metric in computational biology. Neurocomputing 2013. [DOI: 10.1016/j.neucom.2012.07.003] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Indexed: 11/30/2022]
|
39
|
Chatterjee S, Ghosh S, Vishveshwara S. Network properties of decoys and CASP predicted models: a comparison with native protein structures. MOLECULAR BIOSYSTEMS 2013; 9:1774-88. [DOI: 10.1039/c3mb70157c] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.3] [Indexed: 12/28/2022]
|
40
|
Zangooei MH, Jalili S. Protein secondary structure prediction using DWKF based on SVR-NSGAII. Neurocomputing 2012. [DOI: 10.1016/j.neucom.2012.04.015] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Indexed: 10/28/2022]
|
41
|
Lai YH, Li ZC, Chen LL, Dai Z, Zou XY. Identification of potential host proteins for influenza A virus based on topological and biological characteristics by proteome-wide network approach. J Proteomics 2012; 75:2500-13. [DOI: 10.1016/j.jprot.2012.02.034] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Received: 11/07/2011] [Revised: 02/21/2012] [Accepted: 02/26/2012] [Indexed: 12/31/2022]
|
42
|
|
43
|
Abstract
This work is a first attempt to characterise the conformational preference of structurally ambivalent helices in terms of their backbone conformational entropy. Ambivalent sequences adopt two different secondary structures (helix-sheet, helix-random coil, sheet-random coil, etc.) in two different proteins. For variable ambivalent helices, the helical conformations are found to possess less conformational entropy than their non-helical counterparts when the ϕ-ψ dihedral angle range of the entire peptide segment is used to calculate the backbone conformational entropy. The favourable number of native contacts is a primary stabilising factor for these helical conformations. However, the opposite trend is observed when the ϕ-ψ angles of the individual amino acids are used to calculate the backbone conformational entropy. The results show that these peptide segments are rather reluctant to form helices, but are driven to do so by the favourable number of native contacts and the optimum range of ϕ-ψ angles of the segments. Both procedures are validated by applying them to conserved helices in the non-redundant database and their corresponding counterparts in the Structural Classification of Proteins database. Although context is a major determinant of the conformations of ambivalent sequences, no significant difference is observed in this study between the conformational entropy of sequences flanking ambivalent helical sequences in their helical and non-helical forms. The results may be useful in understanding the structural context and environmental factors which lead to the formation of ambivalent helices and in designing de novo proteins.
|
44
|
Hassan R, Othman RM, Saad P, Kasim S. A compact hybrid feature vector for an accurate secondary structure prediction. Inf Sci (N Y) 2011. [DOI: 10.1016/j.ins.2011.07.019] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Indexed: 11/30/2022]
|
45
|
Gubbi J, Lai DTH, Palaniswami M, Parker M. Protein secondary structure prediction using support vector machines and a new feature representation. International Journal of Computational Intelligence and Applications 2011. [DOI: 10.1142/s1469026806002076] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Indexed: 11/18/2022]
Abstract
Knowledge of the secondary structure and solvent accessibility of a protein plays a vital role in the prediction of the fold, and eventually the tertiary structure, of the protein. The challenging problem of predicting protein secondary structure from sequence alone is addressed. Support vector machines (SVMs) are employed for the classification, and the SVM outputs are converted to posterior probabilities for multi-class classification. The effect of using Chou–Fasman parameters and physico-chemical parameters along with evolutionary information in the form of a position-specific scoring matrix (PSSM) is analyzed. The proposed methods are tested on the RS126 and CB513 datasets. A new dataset (PSS504) is curated using a recent release of CATH. On the CB513 dataset, a sevenfold cross-validation accuracy of 77.9% was obtained using the proposed encoding method. A new method of calculating the reliability index based on the number of votes and the SVM decision value is also proposed. A blind test on the EVA dataset gives an average Q3 accuracy of 74.5% and ranks among the top five protein structure prediction methods. Supplementary material including datasets are available on .
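The abstract does not say how the SVM outputs are converted to posterior probabilities. One common choice, assumed here purely for illustration, is a Platt-style sigmoid applied to each one-vs-rest decision value, followed by normalisation across the three states. The parameters `a` and `b`, the function names, and the example decision values are all hypothetical; in practice `a` and `b` would be fitted on held-out data.

```python
import math


def sigmoid_posterior(decision_value, a=-1.0, b=0.0):
    """Platt-style mapping of a single SVM decision value to the interval (0, 1).

    a and b are placeholder values (an assumption), not fitted parameters.
    """
    return 1.0 / (1.0 + math.exp(a * decision_value + b))


def class_posteriors(decision_values):
    """Normalise one-vs-rest sigmoid outputs so that they sum to 1 over the classes."""
    raw = {label: sigmoid_posterior(v) for label, v in decision_values.items()}
    total = sum(raw.values())
    return {label: p / total for label, p in raw.items()}


# Hypothetical decision values for one residue: helix (H), strand (E), coil (C).
scores = {"H": 1.2, "E": -0.4, "C": 0.1}
probs = class_posteriors(scores)
predicted = max(probs, key=probs.get)
print(probs, predicted)
```

Simple normalisation is only one way to combine the per-class estimates; pairwise coupling schemes are another, and the paper may use something different.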
Affiliation(s)
- Jayavardhana Gubbi
- Department of Electrical and Electronic Engineering, The University of Melbourne, Victoria 3010, Australia
- Daniel T. H. Lai
- Department of Electrical and Electronic Engineering, The University of Melbourne, Victoria 3010, Australia
- Marimuthu Palaniswami
- Department of Electrical and Electronic Engineering, The University of Melbourne, Victoria 3010, Australia
- Michael Parker
- St. Vincent's Institute of Medical Research, 9 Princes Street, Fitzroy, Victoria 3065, Australia
|
46
|
Bouziane H, Messabih B, Chouarfia A. Profiles and majority voting-based ensemble method for protein secondary structure prediction. Evol Bioinform Online 2011; 7:171-89. [PMID: 22058650 PMCID: PMC3204938 DOI: 10.4137/ebo.s7931] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.7] [Indexed: 11/05/2022]
Abstract
Machine learning techniques have been widely applied to the problem of predicting protein secondary structure from the amino acid sequence, and they have achieved substantial success in this research area. Many methods have been used, including k-Nearest Neighbors (k-NNs), Hidden Markov Models (HMMs), Artificial Neural Networks (ANNs) and Support Vector Machines (SVMs), which have attracted attention recently. Today, the main goal remains to improve the prediction quality of the secondary structure elements. Prediction accuracy has improved continuously over the years, especially through hybrid or ensemble methods and through the incorporation of evolutionary information in the form of profiles extracted from alignments of multiple homologous sequences. In this paper, we investigate how best to combine k-NNs, ANNs and Multi-class SVMs (M-SVMs) to improve secondary structure prediction of globular proteins. An ensemble method which combines the outputs of two feed-forward ANNs, one k-NN and three M-SVM classifiers has been applied. Ensemble members are combined using two variants of the majority voting rule. A heuristic-based filter has also been applied to refine the prediction. To investigate how much improvement the general ensemble method can give over the individual classifiers that make up the ensemble, we have evaluated the proposed system on the two widely used benchmark datasets RS126 and CB513 using cross-validation tests, with PSI-BLAST position-specific scoring matrix (PSSM) profiles as inputs. The experimental results reveal that the proposed system yields significant performance gains when compared with the best individual classifier.
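The majority voting step described in this abstract can be sketched as a per-residue plurality vote over the three-state predictions of the ensemble members. The abstract does not specify the two voting variants or how ties are handled, so the tie rule below (fall back to coil, `C`) and the function name are assumptions made for illustration.

```python
from collections import Counter


def majority_vote(predictions, tie_label="C"):
    """Combine per-residue H/E/C predictions from several classifiers.

    predictions: list of equal-length strings, one string per classifier.
    tie_label:   state emitted when no single state has the most votes
                 (an assumed rule, not the one used in the paper).
    """
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all predictions must have the same length")
    consensus = []
    for i in range(length):
        counts = Counter(p[i] for p in predictions).most_common()
        top_count = counts[0][1]
        winners = [label for label, c in counts if c == top_count]
        consensus.append(winners[0] if len(winners) == 1 else tie_label)
    return "".join(consensus)


# Hypothetical outputs of three ensemble members for a five-residue segment.
members = ["HHHEC", "HHCEC", "HECEE"]
print(majority_vote(members))
```

A weighted variant would replace the raw counts with per-classifier weights, for example each member's cross-validated accuracy.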
Affiliation(s)
- Hafida Bouziane
- Department of Computer Science, USTO-MB University, BP 1505 El Mnaouer, Oran, Algeria
|
47
|
Bhattacharjee N, Biswas P. Local order and mobility of water molecules around ambivalent helices. J Phys Chem B 2011; 115:12257-65. [PMID: 21916474 DOI: 10.1021/jp2066106] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.6] [Indexed: 11/30/2022]
Abstract
Water on a protein surface plays a key role in determining the structure and dynamics of proteins. Compared to the properties of bulk water, many aspects of the structure and dynamics of the water surrounding the proteins are less understood. It is interesting therefore to explore how the properties of the water within the solvation shell around the peptide molecule depend on its specific secondary structure. In this work we investigate the orientational order and residence times of the water molecules to characterize the structure, energetics, and dynamics of the hydration shell water around ambivalent peptides. Ambivalent sequences are identical sequences which display multiple secondary structures in different proteins. Molecular dynamics simulations of representative proteins containing variable helix, variable nonhelix, and conserved helix are also used to explore the local structure and mobility of water molecules in their vicinity. The results, for the first time, depict a different water distribution pattern around the conserved and variable helices. The water molecules surrounding the helical segments in variable helices are found to possess a less locally ordered structure compared to those around their corresponding nonhelical counterparts and conserved helices. The long conserved helices exhibit extremely high local residence times compared to the helical conformations of the variable helices, whereas the residence times of the nonhelical conformations of the variable helices are comparable to those of the short conserved helices. This differential pattern of the structure and dynamics of water molecules in the vicinity of conserved/variable helices may lend valuable insights for understanding the role of solvent effects in determining sequence ambivalency and help in improving the accuracy of water models used in the simulations of proteins.
|
48
|
Wang Z, Zhao F, Peng J, Xu J. Protein 8-class secondary structure prediction using conditional neural fields. Proteomics 2011; 11:3786-92. [PMID: 21805636 DOI: 10.1002/pmic.201100196] [Citation(s) in RCA: 62] [Impact Index Per Article: 4.4] [Received: 04/15/2011] [Revised: 06/16/2011] [Accepted: 07/01/2011] [Indexed: 11/10/2022]
Abstract
Compared with protein 3-class secondary structure (SS) prediction, 8-class prediction has received less attention and is also much more challenging, especially for proteins with few sequence homologs. This paper presents a new probabilistic method for 8-class SS prediction using conditional neural fields (CNFs), a recently invented probabilistic graphical model. This CNF method not only models the complex relationship between sequence features and SS, but also exploits the interdependency among the SS types of adjacent residues. In addition to sequence profiles, our method also makes use of non-evolutionary information for SS prediction. Tested on the CB513 and RS126 data sets, our method achieves Q8 accuracies of 64.9% and 64.7%, respectively, which are much higher than those of the SSpro8 web server (51.0% and 48.0%, respectively). Our method can also be used to predict other structural properties (e.g. solvent accessibility) of a protein or the SS of RNA.
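Q8 accuracy, the metric reported above, is the percentage of residues whose 8-class state is predicted correctly; the same calculation over three states gives Q3. The following is a minimal sketch with a hypothetical function name and made-up example strings, not data from the paper.

```python
def q_accuracy(true_states, predicted_states):
    """Per-residue accuracy in percent; gives Q8 or Q3 depending on the alphabet used."""
    if len(true_states) != len(predicted_states):
        raise ValueError("sequences must have the same length")
    if not true_states:
        raise ValueError("sequences must not be empty")
    correct = sum(t == p for t, p in zip(true_states, predicted_states))
    return 100.0 * correct / len(true_states)


# Hypothetical 8-state strings using DSSP-style letters (illustration only).
true_ss = "HHHHGGTTEEEE"
pred_ss = "HHHHHHTTEEEB"
print(f"Q8 = {q_accuracy(true_ss, pred_ss):.1f}%")
```

Because 8-class labels distinguish states that a 3-class alphabet merges (here `G` versus `H`, and `B` versus `E`), the same prediction generally scores lower on Q8 than on Q3.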
Affiliation(s)
- Zhiyong Wang
- Toyota Technological Institute at Chicago, 6045 S Kenwood, Chicago, IL 60637, USA
|
49
|
PSP_MCSVM: brainstorming consensus prediction of protein secondary structures using two-stage multiclass support vector machines. J Mol Model 2011; 17:2191-201. [PMID: 21594694 PMCID: PMC3168739 DOI: 10.1007/s00894-011-1102-8] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.0] [Received: 09/10/2010] [Accepted: 04/19/2011] [Indexed: 11/24/2022]
Abstract
Secondary structure prediction is a crucial task for understanding the variety of protein structures and the biological functions they perform. Prediction of secondary structures for new proteins from their amino acid sequences is of fundamental importance in bioinformatics. We propose a novel technique to predict protein secondary structures based on position-specific scoring matrices (PSSMs) and physico-chemical properties of amino acids. It is a two-stage approach involving multiclass support vector machines (SVMs) as classifiers for three structural conformations, viz. helix, sheet and coil. In the first stage, PSSMs obtained from PSI-BLAST and five specially selected physico-chemical properties of amino acids are fed into SVMs as features for sequence-to-structure prediction. Confidence values for forming helix, sheet and coil obtained from the first-stage SVM are then used in the second-stage SVM for structure-to-structure prediction. The two-stage cascaded classifiers (PSP_MCSVM) are trained with proteins from the RS126 dataset. The classifiers are finally tested on target proteins of the Critical Assessment of protein Structure Prediction experiment 9 (CASP9). PSP_MCSVM with the brainstorming consensus procedure performs better than prediction servers such as Predator, DSC and SIMPA96 on randomly selected proteins from the CASP9 targets. The overall performance is found to be comparable with the current state of the art. PSP_MCSVM source code, train-test datasets and supplementary files are freely available in the public domain at: http://sysbio.icm.edu.pl/secstruct and http://code.google.com/p/cmater-bioinfo/
|
50
|
Ao S, Palade V. Ensemble of Elman neural networks and support vector machines for reverse engineering of gene regulatory networks. Appl Soft Comput 2011. [DOI: 10.1016/j.asoc.2010.05.014] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.6] [Indexed: 12/01/2022]
|