1
Zhu L, Wang L, Yang Z, Xu P, Yang S. PPSNO: A Feature-Rich SNO Sites Predictor by Stacking Ensemble Strategy from Protein Sequence-Derived Information. Interdiscip Sci 2024; 16:192-217. [PMID: 38206557 DOI: 10.1007/s12539-023-00595-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/01/2023] [Revised: 11/20/2023] [Accepted: 11/21/2023] [Indexed: 01/12/2024]
Abstract
The protein S-nitrosylation (SNO) is a significant post-translational modification that affects the stability, activity, cellular localization, and function of proteins. Therefore, highly accurate prediction of SNO sites aids in grasping biological function mechanisms. In this document, we have constructed a predictor, named PPSNO, forecasting protein SNO sites using stacked integrated learning. PPSNO integrates multiple machine learning techniques into an ensemble model, enhancing its predictive accuracy. First, we established benchmark datasets by collecting SNO sites from various sources, including literature, databases, and other predictors. Second, various techniques for feature extraction are applied to derive characteristics from protein sequences, which are subsequently amalgamated into the PPSNO predictor for training. Five-fold cross-validation experiments show that PPSNO outperformed existing predictors, such as PSNO, PreSNO, pCysMod, DeepNitro, RecSNO, and Mul-SNO. The PPSNO predictor achieved an impressive accuracy of 92.8%, an area under the curve (AUC) of 96.1%, a Matthews correlation coefficient (MCC) of 81.3%, an F1-score of 85.6%, an SN of 79.3%, an SP of 97.7%, and an average precision (AP) of 92.2%. We also employed ROC curves, PR curves, and radar plots to show the superior performance of PPSNO. Our study shows that fused protein sequence features and two-layer stacked ensemble models can improve the accuracy of predicting SNO sites, which can aid in comprehending cellular processes and disease mechanisms. The codes and data are available at https://github.com/serendipity-wly/PPSNO .
Affiliation(s)
- Lun Zhu
  - School of Computer Science and Artificial Intelligence, Aliyun School of Big Data, School of Software, Changzhou University, Changzhou, 213164, China
- Liuyang Wang
  - School of Computer Science and Artificial Intelligence, Aliyun School of Big Data, School of Software, Changzhou University, Changzhou, 213164, China
- Zexi Yang
  - School of Computer Science and Artificial Intelligence, Aliyun School of Big Data, School of Software, Changzhou University, Changzhou, 213164, China
- Piao Xu
  - College of Economics and Management, Nanjing Forestry University, Nanjing, 210037, China
- Sen Yang
  - School of Computer Science and Artificial Intelligence, Aliyun School of Big Data, School of Software, Changzhou University, Changzhou, 213164, China
  - The Affiliated Changzhou No. 2 People's Hospital of Nanjing Medical University, Changzhou, 213164, China
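Several abstracts in this list, starting with the PPSNO entry above, report accuracy, sensitivity (SN), specificity (SP), F1-score, and the Matthews correlation coefficient (MCC). As a reading aid, the sketch below shows how these values follow from binary confusion-matrix counts using the standard definitions. The function name `binary_metrics` and the example counts are hypothetical and are not taken from any of the cited papers.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Standard binary-classification metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    acc = (tp + tn) / total
    sn = tp / (tp + fn) if (tp + fn) else 0.0          # sensitivity (recall)
    sp = tn / (tn + fp) if (tn + fp) else 0.0          # specificity
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (2 * precision * sn / (precision + sn)) if (precision + sn) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"ACC": acc, "SN": sn, "SP": sp, "F1": f1, "MCC": mcc}


# Hypothetical counts, chosen only to illustrate the call.
example = binary_metrics(tp=80, fp=5, tn=195, fn=20)
for name, value in example.items():
    print(f"{name}: {value:.3f}")
```

MCC ranges from -1 to 1, so a value reported as a percentage (for example 81.3%) corresponds to 0.813 on that scale.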
2
Yue T, Wang Y, Zhang L, Gu C, Xue H, Wang W, Lyu Q, Dun Y. Deep Learning for Genomics: From Early Neural Nets to Modern Large Language Models. Int J Mol Sci 2023; 24:15858. [PMID: 37958843 PMCID: PMC10649223 DOI: 10.3390/ijms242115858] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/30/2023] [Revised: 10/24/2023] [Accepted: 10/30/2023] [Indexed: 11/15/2023] Open
Abstract
The data explosion driven by advancements in genomic research, such as high-throughput sequencing techniques, is constantly challenging conventional methods used in genomics. In parallel with the urgent demand for robust algorithms, deep learning has succeeded in various fields such as vision, speech, and text processing. Yet genomics entails unique challenges to deep learning, since we expect a superhuman intelligence that explores beyond our knowledge to interpret the genome from deep learning. A powerful deep learning model should rely on the insightful utilization of task-specific knowledge. In this paper, we briefly discuss the strengths of different deep learning models from a genomic perspective so as to fit each particular task with proper deep learning-based architecture, and we remark on practical considerations of developing deep learning architectures for genomics. We also provide a concise review of deep learning applications in various aspects of genomic research and point out current challenges and potential research directions for future genomics applications. We believe the collaborative use of ever-growing diverse data and the fast iteration of deep learning models will continue to contribute to the future of genomics.
Affiliation(s)
- Tianwei Yue
  - School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
- Yuanxin Wang
  - School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
- Longxiang Zhang
  - School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
- Chunming Gu
  - Department of Biomedical Engineering, School of Medicine, Johns Hopkins University, Baltimore, MD 21218, USA
- Haoru Xue
  - The Robotics Institute, Carnegie Mellon University, Pittsburgh, PA 15213, USA
- Wenping Wang
  - School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
- Qi Lyu
  - Department of Computational Mathematics, Science, and Engineering, Michigan State University, East Lansing, MI 48824, USA
- Yujie Dun
  - School of Information and Communications Engineering, Xi’an Jiaotong University, Xi’an 710049, China
3
Muazzam Ali Shah S, Ou YY. Disto-TRP: An approach for identifying transient receptor potential (TRP) channels using structural information generated by AlphaFold. Gene 2023; 871:147435. [PMID: 37075925 DOI: 10.1016/j.gene.2023.147435] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/11/2022] [Revised: 03/13/2023] [Accepted: 04/13/2023] [Indexed: 04/21/2023]
Abstract
The ability to predict 3D protein structures computationally has significantly advanced biological research. The AlphaFold protein structure database, developed by DeepMind, has provided a wealth of predicted protein structures and has the potential to bring about revolutionary changes in the field of life sciences. However, directly determining the function of proteins from their structures remains a challenging task. The Distogram from AlphaFold is used in this study as a novel feature set to identify transient receptor potential (TRP) channels. Distograms feature vectors and pre-trained language model (BERT) features were combined to improve prediction performance for transient receptor potential (TRP) channels. The method proposed in this study demonstrated promising performance on many evaluation metrics. For five-fold cross-validation, the method achieved a Sensitivity (SN) of 87.00%, Specificity (SP) of 93.61%, Accuracy (ACC) of 93.39%, and a Matthews correlation coefficient (MCC) of 0.52. Additionally, on an independent dataset, the method obtained 100.00% SN, 95.54% SP, 95.73% ACC, and an MCC of 0.69. The results demonstrate the potential for using structural information to predict protein function. In the future, it is hoped that such structural information will be incorporated into artificial intelligence networks to explore more useful and valuable functional information in the biological field.
Affiliation(s)
- Syed Muazzam Ali Shah
  - Department of Computer Science & Engineering, Yuan Ze University, Chungli 32003, Taiwan
  - National University of Computer and Emerging Sciences, Karachi 75050, Pakistan
- Yu-Yen Ou
  - Department of Computer Science & Engineering, Yuan Ze University, Chungli 32003, Taiwan
  - Graduate Program in Biomedical Informatics, Yuan Ze University, Chungli 32003, Taiwan
4
Yuan L, Ma Y, Liu Y. Ensemble deep learning models for protein secondary structure prediction using bidirectional temporal convolution and bidirectional long short-term memory. Front Bioeng Biotechnol 2023; 11:1051268. [PMID: 36860882 PMCID: PMC9968878 DOI: 10.3389/fbioe.2023.1051268] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/26/2022] [Accepted: 02/03/2023] [Indexed: 02/16/2023] Open
Abstract
Protein secondary structure prediction (PSSP) is a challenging task in computational biology. However, existing models with deep architectures are not sufficient and comprehensive for deep long-range feature extraction of long sequences. This paper proposes a novel deep learning model to improve protein secondary structure prediction. In the model, our proposed bidirectional temporal convolutional network (BTCN) can extract the bidirectional deep local dependencies in protein sequences segmented by the sliding window technique, the bidirectional long short-term memory (BLSTM) network can extract the global interactions between residues, and our proposed multi-scale bidirectional temporal convolutional network (MSBTCN) can further capture the bidirectional multi-scale long-range features of residues while preserving the hidden layer information more comprehensively. In particular, we also propose that fusing the features of 3-state and 8-state protein secondary structure prediction can further improve the prediction accuracy. Moreover, we also propose and compare multiple novel deep models by combining bidirectional long short-term memory with temporal convolutional network (TCN), reverse temporal convolutional network (RTCN), multi-scale temporal convolutional network, bidirectional temporal convolutional network and multi-scale bidirectional temporal convolutional network, respectively. Furthermore, we demonstrate that the reverse prediction of secondary structure outperforms the forward prediction, suggesting that amino acids at later positions have a greater impact on secondary structure recognition. Experimental results on benchmark datasets including CASP10, CASP11, CASP12, CASP13, CASP14, and CB513 show that our methods achieve better prediction performance compared to five state-of-the-art methods.
Affiliation(s)
- Lu Yuan
  - School of Computer Science and Technology, Qilu University of Technology (Shandong Academy of Sciences), Jinan, China
- Yuming Ma
  - Correspondence: Yuming Ma; Yihui Liu
- Yihui Liu
  - Correspondence: Yuming Ma; Yihui Liu
5
Yuan L, Hu X, Ma Y, Liu Y. DLBLS_SS: protein secondary structure prediction using deep learning and broad learning system. RSC Adv 2022; 12:33479-33487. [PMID: 36505696 PMCID: PMC9682407 DOI: 10.1039/d2ra06433b] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/12/2022] [Accepted: 11/16/2022] [Indexed: 11/24/2022] Open
Abstract
Protein secondary structure prediction (PSSP) is not only beneficial to the study of protein structure and function but also to the development of drugs. As a challenging task in computational biology, experimental methods for PSSP are time-consuming and expensive. In this paper, we propose a novel PSSP model DLBLS_SS based on deep learning and broad learning system (BLS) to predict 3-state and 8-state secondary structure. We first use a bidirectional long short-term memory (BLSTM) network to extract global features in residue sequences. Then, our proposed SEBTCN based on temporal convolutional networks (TCN) and channel attention can capture bidirectional key long-range dependencies in sequences. We also use BLS to rapidly optimize fused features while further capturing local interactions between residues. We conduct extensive experiments on public test sets including CASP10, CASP11, CASP12, CASP13, CASP14 and CB513 to evaluate the performance of the model. Experimental results show that our model exhibits better 3-state and 8-state PSSP performance compared to five state-of-the-art models.
Affiliation(s)
- Lu Yuan
  - School of Computer Science and Technology, Qilu University of Technology (Shandong Academy of Sciences), Jinan 250353, China
- Xiaopei Hu
  - School of Computer Science and Technology, Qilu University of Technology (Shandong Academy of Sciences), Jinan 250353, China
- Yuming Ma
  - School of Computer Science and Technology, Qilu University of Technology (Shandong Academy of Sciences), Jinan 250353, China
- Yihui Liu
  - School of Computer Science and Technology, Qilu University of Technology (Shandong Academy of Sciences), Jinan 250353, China
6
Machine learning–based sensor array: full and reduced fluorescence data for versatile analyte detection based on gold nanocluster as a single probe. Anal Bioanal Chem 2022; 414:8365-8378. [DOI: 10.1007/s00216-022-04372-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/21/2022] [Revised: 09/25/2022] [Accepted: 10/06/2022] [Indexed: 11/01/2022]
7
Bongirwar V, Mokhade AS. Different methods, techniques and their limitations in protein structure prediction: A review. Progress in Biophysics and Molecular Biology 2022; 173:72-82. [PMID: 35588858 DOI: 10.1016/j.pbiomolbio.2022.05.002] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 08/23/2021] [Revised: 04/16/2022] [Accepted: 05/11/2022] [Indexed: 11/17/2022]
Abstract
Because of the increase in different types of diseases in human habitats, demand for designing various types of drugs is also increasing. Proteins and their structures play a very important role in drug design. Therefore, researchers from different areas such as mathematics, medicine, and computer science are teaming up to find better solutions in this field. In this paper, we have discussed different methods of secondary and tertiary protein structure prediction (PSP), along with the limitations of different approaches. Different types of datasets used in PSP are also discussed here. This paper also describes different performance measures used to evaluate the prediction accuracy of PSP methods. Various software packages and servers are available for download and can be used to find the protein structure for an input protein sequence. These tools also help to compare the performance of any new algorithm with other available methods. Details of these tools are also given in this paper.
Affiliation(s)
- A S Mokhade
  - Visvesvaraya National Institute of Technology, Nagpur, India
8
Lin YF, Liu JJ, Chang YJ, Yu CS, Yi W, Lane HY, Lu CH. Predicting Anticancer Drug Resistance Mediated by Mutations. Pharmaceuticals (Basel) 2022; 15:ph15020136. [PMID: 35215249 PMCID: PMC8878306 DOI: 10.3390/ph15020136] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 12/30/2021] [Revised: 01/16/2022] [Accepted: 01/21/2022] [Indexed: 02/01/2023] Open
Abstract
Cancer drug resistance presents a challenge for precision medicine. Drug-resistant mutations are always emerging. In this study, we explored the relationship between drug-resistant mutations and drug resistance from the perspective of protein structure. By combining data from previously identified drug-resistant mutations and information of protein structure and function, we used machine learning-based methods to build models to predict cancer drug resistance mutations. The performance of our combined model achieved an accuracy of 86%, a Matthews correlation coefficient score of 0.57, and an F1 score of 0.66. We have constructed a fast, reliable method that predicts and investigates cancer drug resistance in a protein structure. Nonetheless, more information is needed concerning drug resistance and, in particular, clarification is needed about the relationships between the drug and the drug resistance mutations in proteins. Highly accurate predictions regarding drug resistance mutations can be helpful for developing new strategies with personalized cancer treatments. Our novel concept, which combines protein structure information, has the potential to elucidate physiological mechanisms of cancer drug resistance.
Affiliation(s)
- Yu-Feng Lin
  - Department of Medical Laboratory Science and Biotechnology, Asia University, Taichung 41354, Taiwan
- Jia-Jun Liu
  - The Ph.D. Program of Biotechnology and Biomedical Industry, China Medical University, Taichung 40402, Taiwan
- Yu-Jen Chang
  - The Ph.D. Program of Biotechnology and Biomedical Industry, China Medical University, Taichung 40402, Taiwan
- Chin-Sheng Yu
  - Department of Information Engineering and Computer Science, Feng Chia University, Taichung 40201, Taiwan
- Wei Yi
  - Department of Medical Laboratory Science and Biotechnology, Asia University, Taichung 41354, Taiwan
- Hsien-Yuan Lane
  - Graduate Institute of Biomedical Sciences, China Medical University, Taichung 40402, Taiwan
  - Department of Psychiatry, China Medical University Hospital, Taichung 40402, Taiwan
  - Brain Disease Research Center, China Medical University Hospital, Taichung 40402, Taiwan
- Chih-Hao Lu
  - The Ph.D. Program of Biotechnology and Biomedical Industry, China Medical University, Taichung 40402, Taiwan
  - Graduate Institute of Biomedical Sciences, China Medical University, Taichung 40402, Taiwan
  - Department of Medical Laboratory Science and Biotechnology, China Medical University, Taichung 40402, Taiwan
9
Zhong W, Gu F. Predicting Local Protein 3D Structures Using Clustering Deep Recurrent Neural Network. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2022; 19:593-604. [PMID: 32750880 DOI: 10.1109/tcbb.2020.3005972] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 06/11/2023]
Abstract
Since protein 3D structure prediction is very important for biochemical study and drug design, researchers have developed many machine learning algorithms to predict protein 3D structures using the sequence information only. Understanding the sequence-to-structure relationship is key for the successful structure prediction. Previous approaches including the single shallow learning model, the single deep learning model and clustering algorithms all have disadvantages to understand precise sequence-to-structure relationship. In order to further improve the performance of the local protein structure prediction, a novel deep learning model called Clustering Recurrent Neural Network (CRNN) is proposed. In this model, the whole protein dataset is divided into multiple cluster subtrees. A RNN is trained for each cluster in the subtrees so that each RNN can be used to learn the computationally simpler local sequence-to-structure relationship instead of attempting to capture the global sequence-to-structure relationship. After learning the local sequence-to-structure relationship using RNN, CRNN is designed to predict distance matrices, torsion angles and secondary structures for backbone α-carbon atoms of protein sequence segments. Our experimental analysis indicates that 3D structure prediction accuracy is comparable or better than other state-of-art approaches.
10
de Oliveira GB, Pedrini H, Dias Z. Ensemble of Template-Free and Template-Based Classifiers for Protein Secondary Structure Prediction. Int J Mol Sci 2021; 22:11449. [PMID: 34768880 PMCID: PMC8583764 DOI: 10.3390/ijms222111449] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/14/2021] [Revised: 10/18/2021] [Accepted: 10/20/2021] [Indexed: 11/16/2022] Open
Abstract
Protein secondary structures are important in many biological processes and applications. Due to advances in sequencing methods, there are many proteins sequenced, but fewer proteins with secondary structures defined by laboratory methods. With the development of computer technology, computational methods have (started to) become the most important methodologies for predicting secondary structures. We evaluated two different approaches to this problem, driven by the recent results obtained by computational methods in this task: (i) template-free classifiers, based on machine learning techniques; and (ii) template-based classifiers, based on searching tools. Both approaches are formed by different sub-classifiers (six for template-free and two for template-based), each with a specific view of the protein. Our results show that these ensembles improve the results of each approach individually.
11
Sharma AK, Srivastava R. Variable Length Character N-Gram Embedding of Protein Sequences for Secondary Structure Prediction. Protein Pept Lett 2021; 28:501-507. [PMID: 33143605 DOI: 10.2174/0929866527666201103145635] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/08/2020] [Revised: 09/23/2020] [Accepted: 09/26/2020] [Indexed: 11/22/2022]
Abstract
BACKGROUND The prediction of a protein's secondary structure from its amino acid sequence is an essential step towards predicting its 3-D structure. The prediction performance improves by incorporating homologous multiple sequence alignment information. However, homologous details are not available for all proteins, so it is necessary to predict the protein secondary structure from single sequences. OBJECTIVE AND METHODS Protein secondary structure is predicted from primary sequences using n-gram word embedding and a deep recurrent neural network. Protein secondary structure depends on local and long-range neighbor residues in primary sequences. In the proposed work, the local contextual information of amino acid residues is captured by variable-length character n-gram words. An embedding vector represents these variable-length character n-gram words. Further, the bidirectional long short-term memory (Bi-LSTM) model is used to capture the long-range contexts by extracting the past and future residue information in primary sequences. RESULTS The proposed model is evaluated on three public datasets: ss.txt, RS126, and CASP9. The model shows Q3 accuracies of 92.57%, 86.48%, and 89.66% for ss.txt, RS126, and CASP9, respectively. CONCLUSION The performance of the proposed model is compared with state-of-the-art methods available in the literature. After a comparative analysis, it was observed that the proposed model performs better than state-of-the-art methods.
Affiliation(s)
- Ashish Kumar Sharma
  - Department of Computer Science and Engineering, Indian Institute of Technology (BHU), Varanasi, Uttar Pradesh, India
- Rajeev Srivastava
  - Department of Computer Science and Engineering, Indian Institute of Technology (BHU), Varanasi, Uttar Pradesh, India
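The Sharma and Srivastava abstract above describes representing the local context of residues with variable-length character n-gram words that are then mapped to embedding vectors. The sketch below illustrates one plausible reading of that step: enumerating overlapping n-grams over a range of lengths and assigning each an integer index that an embedding layer could consume. The overlapping enumeration, the function names `char_ngrams` and `build_vocab`, and the toy sequence are assumptions for illustration, not the authors' published tokenization.

```python
def char_ngrams(sequence, min_n=1, max_n=3):
    """List all overlapping character n-grams with min_n <= n <= max_n."""
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(sequence) - n + 1):
            grams.append(sequence[i:i + n])
    return grams


def build_vocab(sequences, min_n=1, max_n=3):
    """Map each distinct n-gram to an integer index, in order of first appearance."""
    vocab = {}
    for seq in sequences:
        for gram in char_ngrams(seq, min_n, max_n):
            vocab.setdefault(gram, len(vocab))
    return vocab


# Toy amino acid sequence, used only to show the tokenization.
tokens = char_ngrams("MKVL", min_n=1, max_n=2)
vocab = build_vocab(["MKVL"], min_n=1, max_n=2)
print(tokens)
print([vocab[t] for t in tokens])
```

The integer indices would typically be looked up in a trainable embedding table before being passed to a recurrent model such as a Bi-LSTM.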
12
The structure-based cancer-related single amino acid variation prediction. Sci Rep 2021; 11:13599. [PMID: 34193921 PMCID: PMC8245468 DOI: 10.1038/s41598-021-92793-w] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 03/18/2021] [Accepted: 06/16/2021] [Indexed: 11/09/2022] Open
Abstract
Single amino acid variation (SAV) is an amino acid substitution of the protein sequence that can potentially influence the entire protein structure or function, as well as its binding affinity. Protein destabilization is related to diseases, including several cancers, although using traditional experiments to clarify the relationship between SAVs and cancer uses much time and resources. Some SAV prediction methods use computational approaches, with most predicting SAV-induced changes in protein stability. In this investigation, all SAV characteristics generated from protein sequences, structures and the microenvironment were converted into feature vectors and fed into an integrated predicting system using a support vector machine and genetic algorithm. Critical features were used to estimate the relationship between their properties and cancers caused by SAVs. We describe how we developed a prediction system based on protein sequences and structure that is capable of distinguishing if the SAV is related to cancer or not. The five-fold cross-validation performance of our system is 89.73% for the accuracy, 0.74 for the Matthews correlation coefficient, and 0.81 for the F1 score. We have built an online prediction server, CanSavPre ( http://bioinfo.cmu.edu.tw/CanSavPre/ ), which is expected to become a useful, practical tool for cancer research and precision medicine.
13
ActTRANS: Functional classification in active transport proteins based on transfer learning and contextual representations. Comput Biol Chem 2021; 93:107537. [PMID: 34217007 DOI: 10.1016/j.compbiolchem.2021.107537] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 07/14/2020] [Revised: 05/09/2021] [Accepted: 06/26/2021] [Indexed: 01/08/2023]
Abstract
MOTIVATION Primary and secondary active transport are two types of active transport that involve using energy to move the substances. Active transport mechanisms do use proteins to assist in transport and play essential roles to regulate the traffic of ions or small molecules across a cell membrane against the concentration gradient. In this study, the two main types of proteins involved in such transport are classified from transmembrane transport proteins. We propose a Support Vector Machine (SVM) with contextualized word embeddings from Bidirectional Encoder Representations from Transformers (BERT) to represent protein sequences. BERT is a powerful model in transfer learning, a deep learning language representation model developed by Google and one of the highest performing pre-trained model for Natural Language Processing (NLP) tasks. The idea of transfer learning with pre-trained model from BERT is applied to extract fixed feature vectors from the hidden layers and learn contextual relations between amino acids in the protein sequence. Therefore, the contextualized word representations of proteins are introduced to effectively model complex structures of amino acids in the sequence and the variations of these amino acids in the context. By generating context information, we capture multiple meanings for the same amino acid to reveal the importance of specific residues in the protein sequence. RESULTS The performance of the proposed method is evaluated using five-fold cross-validation and independent test. The proposed method achieves an accuracy of 85.44 %, 88.74 % and 92.84 % for Class-1, Class-2, and Class-3, respectively. Experimental results show that this approach can outperform from other feature extraction methods using context information, effectively classify two types of active transport and improve the overall performance.
14
Afify HM, Abdelhalim MB, Mabrouk MS, Sayed AY. Protein secondary structure prediction (PSSP) using different machine algorithms. Egyptian Journal of Medical Human Genetics 2021. [DOI: 10.1186/s43042-021-00173-w] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/10/2022] Open
Abstract
Background
The computational biology approach has advanced exponentially in protein secondary structure prediction (PSSP), which is vital for the pharmaceutical industry. Extracting protein structure from the laboratory has insufficient information for PSSP that is used in bioinformatics studies. In this paper, the support vector machine (SVM) model and decision tree are presented on the RS126 dataset to address the problem of PSSP. A decision tree is applied for the SVM outcome to obtain the relevant guidelines possible for PSSP. Furthermore, the number of produced rules was fairly small, and they show a greater degree of comprehensibility compared to other rules. Several of the proposed principles have compelling and relevant biological clarification.
Results
The results confirmed that the existence of a particular amino acid in a protein sequence increases the stability for the forecast of protein secondary structure. The suggested algorithm achieved 85% accuracy for the E|~E classifier.
Conclusions
The proposed rules can be very important in managing wet laboratory experiments intended at determining protein secondary structure. Lastly, future work will focus mainly on large protein datasets without overfitting and expand the amount of extracted regulations for PSSP.
15
Görmez Y, Sabzekar M, Aydın Z. IGPRED: Combination of convolutional neural and graph convolutional networks for protein secondary structure prediction. Proteins 2021; 89:1277-1288. [PMID: 33993559 DOI: 10.1002/prot.26149] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 01/07/2021] [Revised: 04/21/2021] [Accepted: 05/11/2021] [Indexed: 11/10/2022]
Abstract
There is a close relationship between the tertiary structure and the function of a protein. One of the important steps to determine the tertiary structure is protein secondary structure prediction (PSSP). For this reason, predicting secondary structure with higher accuracy will give valuable information about the tertiary structure. Recently, deep learning techniques have obtained promising improvements in several machine learning applications including PSSP. In this article, a novel deep learning model, based on convolutional neural network and graph convolutional network is proposed. PSIBLAST PSSM, HHMAKE PSSM, physico-chemical properties of amino acids are combined with structural profiles to generate a rich feature set. Furthermore, the hyper-parameters of the proposed network are optimized using Bayesian optimization. The proposed model IGPRED obtained 89.19%, 86.34%, 87.87%, 85.76%, and 86.54% Q3 accuracies for CullPDB, EVAset, CASP10, CASP11, and CASP12 datasets, respectively.
Affiliation(s)
- Yasin Görmez
  - Faculty of Economics and Administrative Sciences, Management Information Systems, Sivas Cumhuriyet University, Sivas, Turkey
- Mostafa Sabzekar
  - Department of Computer Engineering, Birjand University of Technology, Birjand, Iran
- Zafer Aydın
  - Engineering Faculty, Computer Engineering Department, Abdullah Gül University, Kayseri, Turkey
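Many of the secondary structure entries in this list, including the IGPRED abstract above, report Q3 accuracy: the percentage of residues whose 3-state class (helix H, strand E, coil C) is predicted correctly. The sketch below computes Q3 and also shows one commonly used reduction from 8-state DSSP labels to 3 states. The mapping shown (H, G, I to H; E, B to E; everything else to C) is an assumption, since individual papers may use a different reduction, and the function names are hypothetical.

```python
# One common 8-state to 3-state reduction; individual papers may differ.
DSSP8_TO_3 = {
    "H": "H", "G": "H", "I": "H",   # helices
    "E": "E", "B": "E",             # strands and bridges
    "T": "C", "S": "C", "-": "C",   # turns, bends, and coil
}


def reduce_to_3_state(ss8):
    """Convert an 8-state label string to 3 states, defaulting to coil."""
    return "".join(DSSP8_TO_3.get(label, "C") for label in ss8)


def q3_accuracy(true_ss, pred_ss):
    """Percentage of positions where the predicted 3-state label is correct."""
    if len(true_ss) != len(pred_ss):
        raise ValueError("label strings must have the same length")
    if not true_ss:
        raise ValueError("label strings must not be empty")
    correct = sum(t == p for t, p in zip(true_ss, pred_ss))
    return 100.0 * correct / len(true_ss)


# Toy labels, used only to illustrate the calls.
true_labels = reduce_to_3_state("HGIEBTS-")
print(true_labels)
print(q3_accuracy(true_labels, "HHHEECCH"))
```

The same per-residue comparison applied to unreduced 8-state labels gives the Q8 accuracy mentioned in the 8-state prediction entries.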
16
Sharma AK, Srivastava R. Protein Secondary Structure Prediction Using Character Bi-gram Embedding and Bi-LSTM. Curr Bioinform 2021. [DOI: 10.2174/1574893615999200601122840] [Citation(s) in RCA: 14] [Impact Index Per Article: 4.7] [Indexed: 11/22/2022]
Abstract
Background:
Protein secondary structure is vital to predicting the tertiary structure, which is essential in determining protein function and in drug design. Therefore, there is a strong need for computational methods that predict secondary structure from the primary sequence. Protein primary sequences are represented as linear strings of twenty amino acid characters and contain the contextual information for secondary structure prediction.
Objective and Methods:
Protein secondary structure is predicted from the primary sequence using a deep recurrent neural network. Protein secondary structure depends on local and long-range residues in primary sequences. In the proposed work, the local contextual information of amino acid residues is captured with character n-grams, and a dense embedding vector represents this local contextual information. Furthermore, the bidirectional long short-term memory (Bi-LSTM) model is used to capture the long-range contexts by extracting information from past and future residues in primary sequences.
Results:
The proposed deep recurrent architecture is evaluated on three datasets, namely ss.txt, RS126, and CASP9. The model achieves Q3 accuracies of 88.45%, 83.48%, and 86.69% on ss.txt, RS126, and CASP9, respectively. The performance of the proposed model is also compared with other state-of-the-art methods available in the literature.
Conclusion:
The comparative analysis shows that the proposed model performs better than state-of-the-art methods.
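The character n-gram step described in this abstract can be illustrated with a minimal sketch. The function names `char_ngrams` and `build_vocab`, the overlapping stride-1 windows, and the toy sequence are illustrative assumptions, not the authors' implementation; in the described method the resulting indices would feed a dense embedding layer followed by a Bi-LSTM.

```python
def char_ngrams(sequence, n=2):
    """Return overlapping character n-grams of a protein sequence (stride 1)."""
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]


def build_vocab(ngrams):
    """Map each distinct n-gram to an integer index, in order of first appearance."""
    vocab = {}
    for gram in ngrams:
        vocab.setdefault(gram, len(vocab))
    return vocab


seq = "MKVLAAGMKV"                  # hypothetical toy sequence
grams = char_ngrams(seq, n=2)       # bi-grams: local context of adjacent residues
vocab = build_vocab(grams)
indices = [vocab[g] for g in grams]  # integer ids an embedding layer could consume
```

Repeated bi-grams such as `MK` share one index, so the embedding learned for a local residue pair is reused wherever that pair occurs.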
Affiliation(s)
- Ashish Kumar Sharma
- Department of Computer Science and Engineering, Indian Institute of Technology (BHU), Varanasi, Uttar Pradesh, India
| | - Rajeev Srivastava
- Department of Computer Science and Engineering, Indian Institute of Technology (BHU), Varanasi, Uttar Pradesh, India
| |
|
17
|
Kruglikov A, Rakesh M, Wei Y, Xia X. Applications of Protein Secondary Structure Algorithms in SARS-CoV-2 Research. J Proteome Res 2021; 20:1457-1463. [PMID: 33617253 PMCID: PMC7927282 DOI: 10.1021/acs.jproteome.0c00734] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 09/20/2020] [Indexed: 01/25/2023]
Abstract
Since the outset of COVID-19, the pandemic has prompted immediate global efforts to sequence SARS-CoV-2, and over 450 000 complete genomes have been publicly deposited over the course of 12 months. Despite this, comparative nucleotide and amino acid sequence analyses often fall short in answering key questions in vaccine design. For example, the binding affinity between different ACE2 receptors and the SARS-CoV-2 spike protein cannot be fully explained by amino acid similarity at ACE2 contact sites because protein structure similarities are not fully reflected by amino acid sequence similarities. To comprehensively compare protein homology, secondary structure (SS) analysis is required. While protein structure is slow and difficult to obtain, SS predictions can be made rapidly, and a well-predicted SS may serve as a viable proxy to gain biological insight. Here we review algorithms and information used in predicting protein SS to highlight its potential application in pandemics research. We also show examples of how SS predictions can be used to compare ACE2 proteins and to evaluate the zoonotic origins of viruses. As computational tools are much faster than wet-lab experiments, these applications can be important for research, especially in times when quickly obtained biological insights can help speed up the response to pandemics.
Affiliation(s)
- Alibek Kruglikov
- Department of Biology, University of Ottawa, Ottawa, Ontario K1N 6N5, Canada
| | - Mohan Rakesh
- Department of Biology, University of Ottawa, Ottawa, Ontario K1N 6N5, Canada
| | - Yulong Wei
- Department of Biology, University of Ottawa, Ottawa, Ontario K1N 6N5, Canada
| | - Xuhua Xia
- Department of Biology, University of Ottawa, Ottawa, Ontario K1N 6N5, Canada
- Ottawa Institute of Systems Biology, University of Ottawa, Ottawa, Ontario K1N 6N5, Canada
| |
|
18
|
Uddin MR, Mahbub S, Rahman MS, Bayzid MS. SAINT: self-attention augmented inception-inside-inception network improves protein secondary structure prediction. Bioinformatics 2021; 36:4599-4608. [PMID: 32437517 DOI: 10.1093/bioinformatics/btaa531] [Citation(s) in RCA: 30] [Impact Index Per Article: 10.0] [Received: 12/07/2019] [Revised: 05/10/2020] [Accepted: 05/16/2020] [Indexed: 11/12/2022] Open
Abstract
MOTIVATION Protein structures provide basic insight into how they can interact with other proteins, their functions and biological roles in an organism. Experimental methods (e.g. X-ray crystallography and nuclear magnetic resonance spectroscopy) for predicting the secondary structure (SS) of proteins are very expensive and time consuming. Therefore, developing efficient computational approaches for predicting the SS of protein is of utmost importance. Advances in developing highly accurate SS prediction methods have mostly been focused on 3-class (Q3) structure prediction. However, 8-class (Q8) resolution of SS contains more useful information and is much more challenging than the Q3 prediction. RESULTS We present SAINT, a highly accurate method for Q8 structure prediction, which incorporates self-attention mechanism (a concept from natural language processing) with the Deep Inception-Inside-Inception network in order to effectively capture both the short- and long-range interactions among the amino acid residues. SAINT offers a more interpretable framework than the typical black-box deep neural network methods. Through an extensive evaluation study, we report the performance of SAINT in comparison with the existing best methods on a collection of benchmark datasets, namely, TEST2016, TEST2018, CASP12 and CASP13. Our results suggest that self-attention mechanism improves the prediction accuracy and outperforms the existing best alternate methods. SAINT is the first of its kind and offers the best known Q8 accuracy. Thus, we believe SAINT represents a major step toward the accurate and reliable prediction of SSs of proteins. AVAILABILITY AND IMPLEMENTATION SAINT is freely available as an open-source project at https://github.com/SAINTProtein/SAINT.
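The difference between 8-class (Q8) and 3-class (Q3) evaluation mentioned in this abstract can be made concrete with a short sketch. The mapping below is one commonly used reduction of DSSP states and is an assumption here (other reduction schemes exist, and the abstract does not say which one SAINT uses); the label strings are hypothetical.

```python
# One common DSSP 8-state -> 3-state reduction (assumed; other schemes exist).
Q8_TO_Q3 = {
    "H": "H", "G": "H", "I": "H",   # helix types collapse to helix
    "E": "E", "B": "E",             # strand and bridge collapse to strand
    "T": "C", "S": "C", "C": "C",   # turn, bend and coil collapse to coil
}


def reduce_q8_to_q3(labels):
    """Translate a string of 8-state labels into 3-state labels."""
    return "".join(Q8_TO_Q3[state] for state in labels)


def accuracy(predicted, observed):
    """Fraction of residues whose predicted state equals the observed state."""
    if len(predicted) != len(observed):
        raise ValueError("label strings must have equal length")
    if not observed:
        raise ValueError("label strings must not be empty")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(observed)


observed_q8 = "HHHGGTEEBSCC"    # hypothetical observed labels
predicted_q8 = "HHHHHTEEESCC"   # hypothetical predicted labels
q8 = accuracy(predicted_q8, observed_q8)
q3 = accuracy(reduce_q8_to_q3(predicted_q8), reduce_q8_to_q3(observed_q8))
```

In this toy case the three Q8 errors (G predicted as H twice, B predicted as E) vanish after reduction, which illustrates why the same predictor scores lower on Q8 than on Q3.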
Affiliation(s)
- Mostofa Rafid Uddin
- Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka 1205, Bangladesh; Department of Computer Science and Engineering, East West University, Dhaka 1212, Bangladesh
| | - Sazan Mahbub
- Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka 1205, Bangladesh
| | - M Saifur Rahman
- Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka 1205, Bangladesh
| | - Md Shamsuzzoha Bayzid
- Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka 1205, Bangladesh
| |
|
19
|
Narmadha D, Pravin A. An intelligent computer-aided approach for target protein prediction in infectious diseases. Soft comput 2020. [DOI: 10.1007/s00500-020-04815-w] [Citation(s) in RCA: 19] [Impact Index Per Article: 4.8] [Indexed: 11/29/2022]
|
20
|
Qiao C, Yu X, Song X, Zhao T, Xu X, Zhao S, Gubbins KE. Enhancing Gas Solubility in Nanopores: A Combined Study Using Classical Density Functional Theory and Machine Learning. LANGMUIR : THE ACS JOURNAL OF SURFACES AND COLLOIDS 2020; 36:8527-8536. [PMID: 32623896 DOI: 10.1021/acs.langmuir.0c01160] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Indexed: 06/11/2023]
Abstract
Geometrical confinement has a large impact on gas solubilities in nanoscale pores. This phenomenon is closely associated with heterogeneous catalysis, shale gas extraction, phase separation, etc. Whereas several experimental and theoretical studies have been conducted that provide meaningful insights into the over-solubility and under-solubility of different gases in confined solvents, the microscopic mechanism for regulating the gas solubility remains unclear. Here, we report a hybrid theoretical study for unraveling the regulation mechanism by combining classical density functional theory (CDFT) with machine learning (ML). Specifically, CDFT is employed to predict the solubility of argon in various solvents confined in nanopores of different types and pore widths, and these case studies then supply a valid training set to ML for further investigation. Finally, the dominant parameters that affect the gas solubility are identified, and a criterion is obtained to determine whether a confined gas-solvent system is enhance-beneficial or reduce-beneficial. Our findings provide theoretical guidance for predicting and regulating gas solubilities in nanopores. In addition, the hybrid method proposed in this work sets up a feasible platform for investigating complex interfacial systems with multiple controlling parameters.
Affiliation(s)
- Chongzhi Qiao
- State Key Laboratory of Chemical Engineering and School of Chemical Engineering, East China University of Science and Technology, Shanghai 200237, China
| | - Xiaochen Yu
- State Key Laboratory of Chemical Engineering and School of Chemical Engineering, East China University of Science and Technology, Shanghai 200237, China
| | - Xianyu Song
- State Key Laboratory of Chemical Engineering and School of Chemical Engineering, East China University of Science and Technology, Shanghai 200237, China
| | - Teng Zhao
- State Key Laboratory of Chemical Engineering and School of Chemical Engineering, East China University of Science and Technology, Shanghai 200237, China
| | - Xiaofei Xu
- State Key Laboratory of Chemical Engineering and School of Chemical Engineering, East China University of Science and Technology, Shanghai 200237, China
| | - Shuangliang Zhao
- State Key Laboratory of Chemical Engineering and School of Chemical Engineering, East China University of Science and Technology, Shanghai 200237, China
- Guangxi Key Laboratory of Petrochemical Resource Processing and Process Intensification Technology and School of Chemistry and Chemical Engineering, Guangxi University, Nanning 530004, China
| | - Keith E Gubbins
- Department of Chemical & Biomolecular Engineering, North Carolina State University, Raleigh, North Carolina 27695-7905, United States
| |
|
21
|
Xu Z, Wang Z, Liu M, Yan B, Ren X, Gao Z. Machine learning assisted dual-channel carbon quantum dots-based fluorescence sensor array for detection of tetracyclines. SPECTROCHIMICA ACTA. PART A, MOLECULAR AND BIOMOLECULAR SPECTROSCOPY 2020; 232:118147. [PMID: 32092680 DOI: 10.1016/j.saa.2020.118147] [Citation(s) in RCA: 28] [Impact Index Per Article: 7.0] [Received: 12/17/2019] [Revised: 02/06/2020] [Accepted: 02/09/2020] [Indexed: 06/10/2023]
Abstract
The detection and differentiation of tetracyclines (TCs) has received increasing attention due to the severe threat they pose to human health and the ecological balance. A dual-channel fluorescence sensor array based on two carbon quantum dots (CDs) was fabricated to distinguish between four TCs, including tetracycline (TC), oxytetracycline (OTC), doxycycline (DOX), and metacycline (MTC). A distinct fluorescence variation pattern (I/I0) was produced when CDs interacted with the four TCs. This pattern was analyzed by LDA and SVM. This was the first time that SVM was used for data processing of fluorescence sensor arrays. LDA and SVM showed that the array has the capacity for parallel and accurate determination of TCs at concentrations between 1.0 μM and 150 μM. In addition, the interference experiment using metal ions and antibiotics as possible coexisting interference substances proves that the sensor array has excellent selectivity and anti-interference ability. The array was also used for the accurate detection and identification of TCs in binary mixtures, and furthermore, the four TCs were successfully identified in river water and milk samples. Besides, the sensor array successfully identified the four TCs in 72 unknown samples with a 100% accuracy. The results proved that SVM can achieve the same accurate classification and prediction as LDA, and considering its additional advantages, it can be used as an optional supplementary method for data processing, thereby expanding the data processing field.
Affiliation(s)
- Zijun Xu
- College of Resources and Environmental Sciences, China Agricultural University, Beijing 100193, PR China
| | - Zhaokun Wang
- College of Resources and Environmental Sciences, China Agricultural University, Beijing 100193, PR China
| | - Mingyang Liu
- College of Resources and Environmental Sciences, China Agricultural University, Beijing 100193, PR China
| | - Binwei Yan
- College of Resources and Environmental Sciences, China Agricultural University, Beijing 100193, PR China
| | - Xueqin Ren
- College of Resources and Environmental Sciences, China Agricultural University, Beijing 100193, PR China; Beijing Key Laboratory of Farmland Soil Pollution Prevention and Remediation, China Agricultural University, Beijing 100193, PR China
| | - Zideng Gao
- College of Resources and Environmental Sciences, China Agricultural University, Beijing 100193, PR China
| |
|
22
|
Smolarczyk T, Roterman-Konieczna I, Stapor K. Protein Secondary Structure Prediction: A Review of Progress and Directions. Curr Bioinform 2020. [DOI: 10.2174/1574893614666191017104639] [Citation(s) in RCA: 28] [Impact Index Per Article: 7.0] [Indexed: 11/22/2022]
Abstract
Background:
Over the last few decades, the search for a theory of protein folding has grown into a full-fledged research field at the intersection of biology, chemistry and informatics. Despite enormous effort, there are still open questions and challenges, such as understanding the rules by which the amino acid sequence determines protein secondary structure.
Objective:
In this review, we depict the progress of the prediction methods over the years and identify sources of improvement.
Methods:
The protein secondary structure prediction problem is described, followed by a discussion of theoretical limitations, a description of the commonly used data sets and features, and a review of three generations of methods with a focus on the most recent advances. Additionally, methods with available online servers are assessed on an independent data set.
Results:
The state-of-the-art methods currently reach almost 88% for 3-class prediction and 76.5% for 8-class prediction.
Conclusion:
This review summarizes recent advances and outlines further research directions.
Affiliation(s)
- Tomasz Smolarczyk
- Institute of Informatics, Silesian University of Technology, Gliwice, Poland
| | - Irena Roterman-Konieczna
- Department of Bioinformatics and Telemedicine, Jagiellonian University Medical College, Krakow, Poland
| | - Katarzyna Stapor
- Institute of Informatics, Silesian University of Technology, Gliwice, Poland
| |
|
23
|
Iyer SS, Negi A, Srivastava A. Interpretation of Phase Boundary Fluctuation Spectra in Biological Membranes with Nanoscale Organization. J Chem Theory Comput 2020; 16:2736-2750. [DOI: 10.1021/acs.jctc.9b00929] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Indexed: 01/31/2023]
Affiliation(s)
- Sahithya S. Iyer
- Molecular Biophysics Unit, Indian Institute of Science, Bangalore 560012, India
| | - Archit Negi
- Department of Physics, Indian Institute of Technology, Bombay, Mumbai 400076, India
| | - Anand Srivastava
- Molecular Biophysics Unit, Indian Institute of Science, Bangalore 560012, India
| |
|
24
|
Yu D, Xu Z, Wang X. Bibliometric analysis of support vector machines research trend: a case study in China. INT J MACH LEARN CYB 2019. [DOI: 10.1007/s13042-019-01028-y] [Citation(s) in RCA: 33] [Impact Index Per Article: 6.6] [Indexed: 10/25/2022]
|
25
|
Sample Reduction Strategies for Protein Secondary Structure Prediction. APPLIED SCIENCES-BASEL 2019. [DOI: 10.3390/app9204429] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Indexed: 11/16/2022]
Abstract
Predicting the secondary structure from protein sequence plays a crucial role in estimating the 3D structure, which has applications in drug design and in understanding the function of proteins. As new genes and proteins are discovered, the large size of the protein databases and datasets that can be used for training prediction models grows considerably. A two-stage hybrid classifier, which employs dynamic Bayesian networks and a support vector machine (SVM) has been shown to provide state-of-the-art prediction accuracy for protein secondary structure prediction. However, SVM is not efficient for large datasets due to the quadratic optimization involved in model training. In this paper, two techniques are implemented on CB513 benchmark for reducing the number of samples in the train set of the SVM. The first method randomly selects a fraction of data samples from the train set using a stratified selection strategy. This approach can remove approximately 50% of the data samples from the train set and reduce the model training time by 73.38% on average without decreasing the prediction accuracy significantly. The second method clusters the data samples by a hierarchical clustering algorithm and replaces the train set samples with nearest neighbors of the cluster centers in order to improve the training time. To cluster the feature vectors, the hierarchical clustering method is implemented, for which the number of clusters and the number of nearest neighbors are optimized as hyper-parameters by computing the prediction accuracy on validation sets. It is found that clustering can reduce the size of the train set by 26% without reducing the prediction accuracy. Among the clustering techniques Ward’s method provided the best accuracy on test data.
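The first reduction technique in this abstract, stratified random selection of training samples, can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `stratified_subsample`, the rounding rule, and the toy class counts are hypothetical and are not taken from the paper.

```python
import random
from collections import defaultdict


def stratified_subsample(samples, labels, fraction, seed=0):
    """Keep roughly `fraction` of the samples from every class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    kept = []
    for indices in by_class.values():
        # Keep at least one sample per class so no class disappears.
        n_keep = max(1, round(len(indices) * fraction))
        kept.extend(rng.sample(indices, n_keep))
    kept.sort()  # preserve the original sample order
    return [samples[i] for i in kept], [labels[i] for i in kept]


# Hypothetical training set: 40 helix, 20 strand and 40 coil residues.
labels = ["H"] * 40 + ["E"] * 20 + ["C"] * 40
samples = list(range(len(labels)))
sub_x, sub_y = stratified_subsample(samples, labels, fraction=0.5)
```

Because sampling is done per class, the reduced set keeps the original class proportions, which is the point of stratifying before training the SVM on fewer samples.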
|
26
|
Guo Y, Wang B, Li W, Yang B. Protein secondary structure prediction improved by recurrent neural networks integrated with two-dimensional convolutional neural networks. J Bioinform Comput Biol 2019; 16:1850021. [PMID: 30419785 DOI: 10.1142/s021972001850021x] [Citation(s) in RCA: 39] [Impact Index Per Article: 7.8] [Indexed: 11/18/2022]
Abstract
Protein secondary structure prediction (PSSP) is an important research field in bioinformatics. The representation of protein sequence features could be treated as a matrix, which includes the amino-acid residue (time-step) dimension and the feature vector dimension. Common approaches to predict secondary structures only focus on the amino-acid residue dimension. However, the feature vector dimension may also contain useful information for PSSP. To integrate the information on both dimensions of the matrix, we propose a hybrid deep learning framework, two-dimensional convolutional bidirectional recurrent neural network (2C-BRNN), for improving the accuracy of 8-class secondary structure prediction. The proposed hybrid framework is to extract the discriminative local interactions between amino-acid residues by two-dimensional convolutional neural networks (2DCNNs), and then further capture long-range interactions between amino-acid residues by bidirectional gated recurrent units (BGRUs) or bidirectional long short-term memory (BLSTM). Specifically, our proposed 2C-BRNNs framework consists of four models: 2DConv-BGRUs, 2DCNN-BGRUs, 2DConv-BLSTM and 2DCNN-BLSTM. Among these four models, the 2DConv- models only contain two-dimensional (2D) convolution operations. Moreover, the 2DCNN- models contain 2D convolutional and pooling operations. Experiments are conducted on four public datasets. The experimental results show that our proposed 2DConv-BLSTM model performs significantly better than the benchmark models. Furthermore, the experiments also demonstrate that the proposed models can extract more meaningful features from the matrix of proteins, and the feature vector dimension is also useful for PSSP. The codes and datasets of our proposed methods are available at https://github.com/guoyanb/JBCB2018/ .
Affiliation(s)
- Yanbu Guo
- School of Information Science and Engineering, Yunnan University, No. 2 North Cuihu Road, Kunming 650091, P. R. China
| | - Bingyi Wang
- The Research Institute of Resource Insects, Chinese Academy of Forestry, Bailongsi, Kunming 650224, P. R. China
| | - Weihua Li
- School of Information Science and Engineering, Yunnan University, No. 2 North Cuihu Road, Kunming 650091, P. R. China
| | - Bei Yang
- MD. Cardiology Department, The Second People's Hospital of Yunnan Province, No. 176 Qingnian Road, Kunming 650021, P. R. China
| |
|
27
|
Toussi CA, Haddadnia J. Improving protein secondary structure prediction: the evolutionary optimized classification algorithms. Struct Chem 2019. [DOI: 10.1007/s11224-018-1271-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/27/2022]
|
28
|
Guo Y, Li W, Wang B, Liu H, Zhou D. DeepACLSTM: deep asymmetric convolutional long short-term memory neural models for protein secondary structure prediction. BMC Bioinformatics 2019; 20:341. [PMID: 31208331 PMCID: PMC6580607 DOI: 10.1186/s12859-019-2940-0] [Citation(s) in RCA: 36] [Impact Index Per Article: 7.2] [Received: 09/04/2018] [Accepted: 06/07/2019] [Indexed: 01/01/2023] Open
Abstract
BACKGROUND Protein secondary structure (PSS) is critical to further predict the tertiary structure, understand protein function and design drugs. However, experimental techniques for determining PSS are time consuming and expensive, and thus it is urgent to develop efficient computational approaches for predicting PSS based on sequence information alone. Moreover, the feature matrix of a protein contains two dimensions: the amino-acid residue dimension and the feature vector dimension. Existing deep learning based methods have achieved remarkable performance in PSS prediction, but these methods often utilize only the features from the amino-acid dimension. Thus, there is still room to improve computational methods of PSS prediction. RESULTS We propose a novel deep neural network method, called DeepACLSTM, to predict 8-category PSS from protein sequence features and profile features. Our method efficiently applies asymmetric convolutional neural networks (ACNNs) combined with bidirectional long short-term memory (BLSTM) neural networks to predict PSS, leveraging the feature vector dimension of the protein feature matrix. In DeepACLSTM, the ACNNs extract the complex local contexts of amino-acids; the BLSTM neural networks capture the long-distance interdependencies between amino-acids. Furthermore, the prediction module predicts the category of each amino-acid residue based on both local contexts and long-distance interdependencies. To evaluate the performance of DeepACLSTM, we conduct experiments on three publicly available datasets: CB513, CASP10 and CASP12. Results indicate that the performance of our method is superior to the state-of-the-art baselines on the three public datasets. CONCLUSIONS Experiments demonstrate that DeepACLSTM is an efficient prediction method for 8-category PSS and has the ability to extract more complex sequence-structure relationships between amino-acid residues. Moreover, experiments also indicate that the feature vector dimension contains useful information for improving PSS prediction.
Affiliation(s)
- Yanbu Guo
- School of Information Science and Engineering, Yunnan University, Kunming, 650091, China
| | - Weihua Li
- School of Information Science and Engineering, Yunnan University, Kunming, 650091, China.
| | - Bingyi Wang
- Research Institute of Resource Insects, Chinese Academy of Forestry, Kunming, 650224, China.
| | - Huiqing Liu
- School of Information Science and Engineering, Yunnan University, Kunming, 650091, China
| | - Dongming Zhou
- School of Information Science and Engineering, Yunnan University, Kunming, 650091, China
| |
|
29
|
Chen Y, Yuan X, Cang X. Population-based incremental learning for the prediction of Homo sapiens’ protein secondary structure. INT J BIOMATH 2019. [DOI: 10.1142/s1793524519500177] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/18/2022]
Abstract
Protein structure prediction is the prediction of the 3D structure of a protein based on its amino acid sequence. It is a key component in disciplines such as medicine, biology, and biochemistry. The prediction of the protein secondary structure of Homo sapiens is one of the more important domains. Many methods have used feed-forward neural networks or SVMs combined with a sliding window. The mechanisms of these methods are too complex to yield clear and straightforward physical meanings. This paper explores population-based incremental learning (PBIL), a method that combines the mechanisms of a generational genetic algorithm with simple competitive learning. The results show that its accuracies are particularly associated with the Homo species. This new perspective reveals a number of different possibilities for performance improvements.
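The PBIL idea named in this abstract, a probability vector that is sampled like a genetic population and nudged toward the best sample by a competitive-learning rule, can be sketched generically. This is a textbook-style illustration on a toy bit-counting objective, not the paper's secondary structure predictor; the function name `pbil`, all parameter values, and the objective are assumptions.

```python
import random


def pbil(fitness, n_bits, pop_size=30, generations=60, learning_rate=0.1, seed=0):
    """Minimal PBIL: evolve a probability vector over bit strings."""
    rng = random.Random(seed)
    probs = [0.5] * n_bits                  # no initial preference for any bit
    best, best_fit = None, float("-inf")
    for _ in range(generations):
        # Sample a population of bit strings from the probability vector.
        population = [
            [1 if rng.random() < p else 0 for p in probs]
            for _ in range(pop_size)
        ]
        winner = max(population, key=fitness)
        if fitness(winner) > best_fit:
            best, best_fit = winner, fitness(winner)
        # Competitive-learning update: shift each probability toward the winner's bit.
        probs = [(1 - learning_rate) * p + learning_rate * bit
                 for p, bit in zip(probs, winner)]
    return best, best_fit, probs


# Toy objective (assumed for illustration): maximise the number of 1 bits.
best, best_fit, probs = pbil(fitness=sum, n_bits=16)
```

Unlike a classical genetic algorithm, no population is carried between generations; the probability vector is the only state, which is what keeps the mechanism simple to inspect.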
Affiliation(s)
- Ye Chen
- School of Information and Control Engineering, China University of Mining and Technology, Xuzhou, Jiangsu 221008, P. R. China
| | - Xiaoping Yuan
- School of Information and Control Engineering, China University of Mining and Technology, Xuzhou, Jiangsu 221008, P. R. China
| | - Xiaohui Cang
- Institute of Genetics, Zhejiang University School of Medicine, Hangzhou, Zhejiang 310058, P. R. China
| |
|
30
|
Aydin Z, Kaynar O, Görmez Y. Dimensionality reduction for protein secondary structure and solvent accesibility prediction. J Bioinform Comput Biol 2018; 16:1850020. [PMID: 30353781 DOI: 10.1142/s0219720018500208] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Indexed: 11/18/2022]
Abstract
Secondary structure and solvent accessibility prediction provide valuable information for estimating the three-dimensional structure of a protein. As new feature extraction methods are developed, the dimensionality of the input feature space increases steadily. Reducing the number of dimensions provides several advantages, such as faster model training, faster prediction and noise elimination. In this work, several dimensionality reduction techniques, including various feature selection methods, autoencoders and PCA, have been employed for protein secondary structure and solvent accessibility prediction. The reduced feature set is used to train a support vector machine at the second stage of a hybrid classifier. Cross-validation experiments on two difficult benchmarks demonstrate that the dimension of the input space can be reduced substantially while maintaining the prediction accuracy. This will enable the incorporation of additional informative features derived for predicting the structural properties of proteins without reducing the accuracy due to overfitting.
Affiliation(s)
- Zafer Aydin
- Department of Computer Engineering, Abdullah Gul University, Kayseri 38080, Turkey
| | - Oğuz Kaynar
- Department of Management Information Systems, Cumhuriyet University, Sivas 58000, Turkey
| | - Yasin Görmez
- Department of Management Information Systems, Cumhuriyet University, Sivas 58000, Turkey
| |
|
31
|
Role of solvent accessibility for aggregation-prone patches in protein folding. Sci Rep 2018; 8:12896. [PMID: 30150761 PMCID: PMC6110721 DOI: 10.1038/s41598-018-31289-6] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Received: 02/05/2018] [Accepted: 08/15/2018] [Indexed: 11/21/2022] Open
Abstract
The arrangement of amino acids in a protein sequence encodes its native folding. However, the same arrangement in aggregation-prone regions may cause misfolding as a result of local environmental stress. Under normal physiological conditions, such regions congregate in the protein’s interior to avoid aggregation and attain the native fold. We have used solvent accessibility of aggregation patches (SAAPp) to determine the packing of aggregation-prone residues. Our results showed that SAAPp has low values for native crystal structures, consistent with protein folding as a mechanism to minimize the solvent accessibility of aggregation-prone residues. SAAPp also shows an average correlation of 0.76 with the global distance test (GDT) score on CASP12 template-based protein models. Using SAAPp scores and five structural features, a random forest machine learning quality assessment tool, SAAP-QA, showed 2.32 average GDT loss between best model predicted and actual best based on GDT score on independent CASP test data, with the ability to discriminate native-like folds having an AUC of 0.94. Overall, the Pearson correlation coefficient (PCC) between true and predicted GDT scores on independent CASP data was 0.86 while on the external CAMEO dataset, comprising high quality protein structures, PCC and average GDT loss were 0.71 and 4.46 respectively. SAAP-QA can be used to detect the quality of models and iteratively improve them to native or near-native structures.
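The two evaluation quantities quoted in this abstract can be illustrated with a small sketch. It assumes that "GDT loss" for a target means the true GDT of the best available model minus the true GDT of the model ranked first by the predicted score; that reading, the function names, and the numbers are assumptions for illustration, not values or code from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def gdt_loss(true_gdt, predicted_score):
    """True GDT of the best model minus true GDT of the top-ranked model."""
    top_ranked = max(range(len(true_gdt)), key=lambda i: predicted_score[i])
    return max(true_gdt) - true_gdt[top_ranked]


# Hypothetical models for a single target.
true_gdt = [62.0, 71.5, 55.0, 68.0]
predicted = [60.0, 66.0, 58.0, 69.0]
loss = gdt_loss(true_gdt, predicted)   # the predictor picks the 68.0 model, not the 71.5 one
pcc = pearson(true_gdt, predicted)
```

A loss of zero means the quality assessment tool ranked the truly best model first; averaging the per-target loss over a test set gives a figure comparable in kind to the averages reported above.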
|
32
|
Zhang B, Li J, Lü Q. Prediction of 8-state protein secondary structures by a novel deep learning architecture. BMC Bioinformatics 2018; 19:293. [PMID: 30075707 PMCID: PMC6090794 DOI: 10.1186/s12859-018-2280-5] [Citation(s) in RCA: 60] [Impact Index Per Article: 10.0] [Received: 04/19/2018] [Accepted: 07/09/2018] [Indexed: 11/16/2022] Open
Abstract
Background Protein secondary structure can be regarded as an information bridge that links the primary sequence and tertiary structure. Accurate 8-state secondary structure prediction can give more precise, higher-resolution information for structure-based property analysis. Results We present a novel deep learning architecture which exploits an integrative synergy of prediction by a convolutional neural network, residual network, and bidirectional recurrent neural network to improve the performance of protein secondary structure prediction. A local block comprised of convolutional filters and original input is designed for capturing local sequence features. The subsequent bidirectional recurrent neural network consisting of gated recurrent units can capture global context features. Furthermore, the residual network can improve the information flow between the hidden layers and the cascaded recurrent neural network. Our proposed deep network achieved 71.4% accuracy on the benchmark CB513 dataset for the 8-state prediction, and the ensemble learning by our model achieved 74% accuracy. Our model's generalization capability is also evaluated on three other independent datasets, CASP10, CASP11 and CASP12, for both 8- and 3-state prediction. These prediction performances are superior to the state-of-the-art methods. Conclusion Our experiment demonstrates that it is a valuable method for predicting protein secondary structure, and that capturing local and global features concurrently is very useful in deep learning.
Affiliation(s)
- Buzhong Zhang
- School of Computer Science and Technology, Soochow University, Suzhou, China; School of Computer and Information, and the University Key Laboratory of Intelligent Perception and Computing of Anhui Province, Anqing Normal University, Anqing, 246011, China
| | - Jinyan Li
- Advanced Analytics Institute, Faculty of Engineering and IT, University of Technology Sydney, Broadway, NSW 2007, Sydney, PO Box 123, Australia
| | - Qiang Lü
- School of Computer Science and Technology, Soochow University, Suzhou, China.
|
33
|
Protein Secondary Structure Prediction Based on Data Partition and Semi-Random Subspace Method. Sci Rep 2018; 8:9856. [PMID: 29959372 PMCID: PMC6026213 DOI: 10.1038/s41598-018-28084-8] [Citation(s) in RCA: 25] [Impact Index Per Article: 4.2] [Received: 03/15/2018] [Accepted: 06/12/2018] [Indexed: 11/20/2022]
Abstract
Protein secondary structure prediction is one of the most important and challenging problems in bioinformatics. Machine learning techniques have been applied to the problem with substantial success. However, there is still room for improvement toward the theoretical limit. In this paper, we present a novel method for protein secondary structure prediction based on a data partition and semi-random subspace method (PSRSM). Data partitioning is an important strategy for our method. First, the protein training dataset was partitioned into several subsets based on the length of the protein sequence. Then we trained base classifiers on the subspace data generated by the semi-random subspace method, and combined the base classifiers by the majority vote rule into ensemble classifiers on each subset. Multiple classifiers were thus trained on different subsets, and they were used to predict the secondary structures of different proteins according to protein sequence length. Experiments on the 25PDB, CB513, CASP10, CASP11, CASP12, and T100 datasets achieved 86.38%, 84.53%, 85.51%, 85.89%, 85.55%, and 85.09%, respectively. The experimental results showed that our method outperforms other state-of-the-art methods.
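The two generic ingredients named in this abstract, routing a protein to a length-based subset and combining base classifiers by majority vote, can be illustrated with a minimal Python sketch. This is not the PSRSM implementation: the length cut-offs in `LENGTH_BINS`, the function names, and the tie-breaking rule are assumptions made purely for illustration, and the semi-random subspace step itself is not shown.

```python
from collections import Counter

# Hypothetical length cut-offs; the abstract does not state the actual ones.
LENGTH_BINS = [100, 200, 400]


def length_bin(sequence: str) -> int:
    """Return the index of the length-based subset a sequence falls into."""
    for index, upper in enumerate(LENGTH_BINS):
        if len(sequence) <= upper:
            return index
    return len(LENGTH_BINS)  # sequences longer than every cut-off


def majority_vote(predictions: list[str]) -> str:
    """Combine per-residue predictions from several base classifiers.

    Each prediction is a string of state labels (e.g. H, E, C) of equal length.
    Ties go to the label that appears first among the classifiers (an assumption).
    """
    if not predictions or len({len(p) for p in predictions}) != 1:
        raise ValueError("predictions must be non-empty and of equal length")
    voted = []
    for labels in zip(*predictions):
        voted.append(Counter(labels).most_common(1)[0][0])
    return "".join(voted)


if __name__ == "__main__":
    print(length_bin("A" * 150))                    # subset index for a 150-residue protein
    print(majority_vote(["HHEC", "HEEC", "CHEC"]))  # per-position vote over three classifiers
```

In a full pipeline, `length_bin` would select which ensemble to apply, and `majority_vote` would merge the outputs of that ensemble's base classifiers.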
|
34
|
Yu B, Li S, Qiu W, Wang M, Du J, Zhang Y, Chen X. Prediction of subcellular location of apoptosis proteins by incorporating PsePSSM and DCCA coefficient based on LFDA dimensionality reduction. BMC Genomics 2018; 19:478. [PMID: 29914358 PMCID: PMC6006758 DOI: 10.1186/s12864-018-4849-9] [Citation(s) in RCA: 29] [Impact Index Per Article: 4.8] [Received: 12/15/2017] [Accepted: 06/01/2018] [Indexed: 01/05/2023]
Abstract
BACKGROUND: Apoptosis is associated with several human diseases, including cancer, autoimmune disease, neurodegenerative disease and ischemic damage. Information on the subcellular localization of apoptosis proteins is very important for understanding the mechanism of programmed cell death and for drug development, yet predicting this localization remains a challenging task. RESULTS: In this paper, we propose a novel method for predicting apoptosis protein subcellular localization, called PsePSSM-DCCA-LFDA. First, features are extracted from the protein sequences by combining the pseudo-position specific scoring matrix (PsePSSM) and the detrended cross-correlation analysis coefficient (DCCA coefficient); the dimensionality of the extracted feature information is then reduced by LFDA (local Fisher discriminant analysis). Finally, the optimal feature vectors are input to an SVM classifier to predict the subcellular location of the apoptosis proteins. Overall prediction accuracies of 99.7, 99.6 and 100% are achieved on the three benchmark datasets, respectively, by the most rigorous jackknife test, which is better than other state-of-the-art methods. CONCLUSION: The experimental results indicate that our method can significantly improve the prediction accuracy of subcellular localization of apoptosis proteins, and that this accuracy is high enough for the method to become a promising tool for further proteomics studies. The source code and all datasets are available at https://github.com/QUST-BSBRC/PsePSSM-DCCA-LFDA/ .
Affiliation(s)
- Bin Yu
- College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, 266061, China; Artificial Intelligence and Biomedical Big Data Research Center, Qingdao University of Science and Technology, Qingdao, 266061, China; School of Life Sciences, University of Science and Technology of China, Hefei, 230027, China
| | - Shan Li
- College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, 266061, China; Artificial Intelligence and Biomedical Big Data Research Center, Qingdao University of Science and Technology, Qingdao, 266061, China
| | - Wenying Qiu
- College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, 266061, China; Artificial Intelligence and Biomedical Big Data Research Center, Qingdao University of Science and Technology, Qingdao, 266061, China
| | - Minghui Wang
- College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, 266061, China; Artificial Intelligence and Biomedical Big Data Research Center, Qingdao University of Science and Technology, Qingdao, 266061, China
| | - Junwei Du
- College of Information Science and Technology, Qingdao University of Science and Technology, Qingdao, 266061, China
| | - Yusen Zhang
- School of Mathematics and Statistics, Shandong University at Weihai, Weihai, 264209, China
| | - Xing Chen
- School of Information and Control Engineering, China University of Mining and Technology, Xuzhou, 21116, China
|
35
|
Zhou J, Wang H, Zhao Z, Xu R, Lu Q. CNNH_PSS: protein 8-class secondary structure prediction by convolutional neural network with highway. BMC Bioinformatics 2018; 19:60. [PMID: 29745837 PMCID: PMC5998876 DOI: 10.1186/s12859-018-2067-8] [Citation(s) in RCA: 32] [Impact Index Per Article: 5.3] [Indexed: 11/24/2022]
Abstract
BACKGROUND: Protein secondary structure is the three-dimensional form of local segments of proteins, and its prediction is an important problem in protein tertiary structure prediction. Developing computational approaches for protein secondary structure prediction is becoming increasingly urgent. RESULTS: We present a novel deep learning based model, referred to as CNNH_PSS, that uses a multi-scale CNN with highway. In CNNH_PSS, any two neighboring convolutional layers have a highway that delivers information from the current layer to the output of the next one to keep local contexts. As lower layers extract local context while higher layers extract long-range interdependencies, the highways between neighboring layers give CNNH_PSS the ability to extract both local contexts and long-range interdependencies. We evaluate CNNH_PSS on two commonly used datasets: CB6133 and CB513. CNNH_PSS outperforms the multi-scale CNN without highway by at least 0.010 Q8 accuracy and also performs better than CNF, DeepCNF and SSpro8, which cannot extract long-range interdependencies, by at least 0.020 Q8 accuracy, demonstrating that both local contexts and long-range interdependencies are indeed useful for prediction. Furthermore, CNNH_PSS also performs better than GSM and DCRNN, which need an extra complex model to extract long-range interdependencies. This demonstrates that CNNH_PSS not only costs less computing resources but also achieves better predictive performance. CONCLUSION: CNNH_PSS is able to extract both local contexts and long-range interdependencies by combining a multi-scale CNN and a highway network. The evaluations on common datasets and comparisons with state-of-the-art methods indicate that CNNH_PSS is a useful and efficient tool for protein secondary structure prediction.
Affiliation(s)
- Jiyun Zhou
- School Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, HIT Campus Shenzhen University Town, Xili, Shenzhen, Guangdong 518055 China
- Department of Computing, the Hong Kong Polytechnic University, Hung Hom, Hong Kong
| | - Hongpeng Wang
- School Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, HIT Campus Shenzhen University Town, Xili, Shenzhen, Guangdong 518055 China
| | - Zhishan Zhao
- School Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, HIT Campus Shenzhen University Town, Xili, Shenzhen, Guangdong 518055 China
| | - Ruifeng Xu
- School Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, HIT Campus Shenzhen University Town, Xili, Shenzhen, Guangdong 518055 China
| | - Qin Lu
- Department of Computing, the Hong Kong Polytechnic University, Hung Hom, Hong Kong
|
36
|
Yang Y, Gao J, Wang J, Heffernan R, Hanson J, Paliwal K, Zhou Y. Sixty-five years of the long march in protein secondary structure prediction: the final stretch? Brief Bioinform 2018; 19:482-494. [PMID: 28040746 PMCID: PMC5952956 DOI: 10.1093/bib/bbw129] [Citation(s) in RCA: 84] [Impact Index Per Article: 14.0] [Received: 10/10/2016] [Revised: 11/15/2016] [Indexed: 11/13/2022]
Abstract
Protein secondary structure prediction began in 1951, when Pauling and Corey predicted helical and sheet conformations for the protein polypeptide backbone even before the first protein structure was determined. Sixty-five years later, powerful new methods breathe new life into this field. The highest three-state accuracy without relying on structure templates is now at 82-84%, a number unthinkable just a few years ago. These improvements came from increasingly larger databases of protein sequences and structures for training, the use of template secondary structure information, and more powerful deep learning techniques. As we approach the theoretical limit of three-state prediction (88-90%), alternatives to secondary structure prediction (prediction of backbone torsion angles and of Cα-atom-based angles and torsion angles) not only have more room for further improvement but also allow direct prediction of three-dimensional fragment structures with constantly improving accuracy. About 20% of all 40-residue fragments in a database of 1199 non-redundant proteins have <6 Å root-mean-squared distance from the native conformations by SPIDER2. More powerful deep learning methods with improved capability of capturing long-range interactions are beginning to emerge as the next generation of techniques for secondary structure prediction. The time has come to finish off the final stretch of the long march towards protein secondary structure prediction.
Affiliation(s)
- Yuedong Yang
- Institute for Glycomics and School of Information and Communication Technology, Griffith University, Parklands Drive, Southport, QLD, Australia
| | - Jianzhao Gao
- School of Mathematical Sciences and LPMC, Nankai University, Tianjin, China
| | - Jihua Wang
- Shandong Provincial Key Laboratory of Biophysics, Institute of Biophysics, Dezhou University, Dezhou, China
| | - Rhys Heffernan
- Signal Processing Laboratory, Griffith University, Brisbane, Australia
| | - Jack Hanson
- Signal Processing Laboratory, Griffith University, Brisbane, Australia
| | - Kuldip Paliwal
- Signal Processing Laboratory, Griffith University, Brisbane, Australia
| | - Yaoqi Zhou
- Institute for Glycomics and School of Information and Communication Technology, Griffith University, Parklands Drive, Southport, QLD, Australia
- Shandong Provincial Key Laboratory of Biophysics, Institute of Biophysics, Dezhou University, Dezhou, China
|
37
|
Liu T, Wang Z. SOV_refine: A further refined definition of segment overlap score and its significance for protein structure similarity. Source Code for Biology and Medicine 2018; 13:1. [PMID: 29713370 PMCID: PMC5909207 DOI: 10.1186/s13029-018-0068-7] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.8] [Received: 08/30/2016] [Accepted: 04/02/2018] [Indexed: 11/22/2022]
Abstract
Background: The segment overlap score (SOV) has been used to evaluate predicted protein secondary structures, a sequence composed of helix (H), strand (E), and coil (C), by comparing it with the native or reference secondary structures, another sequence of H, E, and C. SOV's advantage is that it considers the size of continuous overlapping segments and assigns extra allowance to longer continuous overlapping segments, instead of judging only from the percentage of overlapping individual positions as the Q3 score does. However, we have found a drawback in its previous definition: it cannot ensure that the allowance assigned increases when more residues in a segment are predicted accurately. Results: A new way of assigning allowance has been designed, which keeps all the advantages of the previous SOV score definitions and ensures that the amount of allowance assigned is incremental when more elements in a segment are predicted accurately. Furthermore, our improved SOV achieved a higher correlation with the quality of protein models measured by GDT-TS score and TM-score, indicating a better ability to evaluate tertiary structure quality at the secondary structure level. We analyzed the statistical significance of SOV scores and found the threshold values for distinguishing two protein structures (SOV_refine > 0.19) and for indicating whether two proteins are under the same CATH fold (SOV_refine > 0.94 and > 0.90 for three- and eight-state secondary structures, respectively). We provide two further example applications: use as a machine learning feature for protein model quality assessment, and comparison of different definitions of topologically associating domains. We showed that our newly defined SOV score resulted in better performance. Conclusions: The SOV score can be widely used in bioinformatics research and in other fields that need to compare two sequences of letters in which continuous segments have important meanings. We also generalized the previous SOV definitions so that the score works for sequences composed of more than three states (e.g., the eight-state definition of protein secondary structures). A standalone software package has been implemented in Perl with source code released. The software can be downloaded from http://dna.cs.miami.edu/SOV/.
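The contrast the abstract draws between a per-position score such as Q3 and a segment-aware score can be made concrete with a small Python sketch. It computes only Q3 (the fraction of matching positions) and lists the continuous segments of a state string; it does not implement SOV or SOV_refine, whose allowance rules are not given in the abstract. The function names and the example strings are illustrative assumptions.

```python
from itertools import groupby


def q3(reference: str, predicted: str) -> float:
    """Fraction of positions where the predicted state equals the reference state."""
    if not reference or len(reference) != len(predicted):
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(r == p for r, p in zip(reference, predicted))
    return matches / len(reference)


def segments(states: str) -> list[tuple[str, int, int]]:
    """Split a state string into (state, start, end) runs with inclusive indices."""
    result = []
    start = 0
    for state, run in groupby(states):
        length = len(list(run))
        result.append((state, start, start + length - 1))
        start += length
    return result


if __name__ == "__main__":
    reference = "HHHHHHCCCC"
    boundary_error = "HHHHHCCCCC"  # one wrong residue at the helix/coil boundary
    broken_helix = "HHCHHHCCCC"    # one wrong residue that splits the helix in two
    # Both predictions have the same per-position score ...
    print(q3(reference, boundary_error), q3(reference, broken_helix))
    # ... but very different segment structure, which Q3 cannot see.
    print(segments(boundary_error))
    print(segments(broken_helix))
```

Both predictions differ from the reference at exactly one position, so a per-position score treats them as equally good, even though one of them breaks a helix into two fragments. That difference is the kind of information a segment-based score is designed to capture.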
Affiliation(s)
- Tong Liu
- Department of Computer Science, University of Miami, 1365 Memorial Drive, Coral Gables, FL 33124 USA
| | - Zheng Wang
- Department of Computer Science, University of Miami, 1365 Memorial Drive, Coral Gables, FL 33124 USA
|
38
|
Labjar H, Cherif W, Nadir S, Digua K, Sallek B, Chaair H. Support vector machines for modelling phosphocalcic hydroxyapatite by precipitation from a calcium carbonate solution and phosphoric acid solution. Journal of Taibah University for Science 2018. [DOI: 10.1016/j.jtusci.2015.09.008] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Indexed: 11/27/2022]
Affiliation(s)
- Houda Labjar
- Laboratoire des génies des procédés et environnement, Faculté des sciences et techniques, Université Hassan II-Casablanca, B.P.: 146, Mohammedia, Morocco
| | - Walid Cherif
- Laboratoire d’informatique et de mathématiques et leurs applications, Faculté des sciences, Université Chouaib Doukkali, B.P.: 20, El Jadida, 24000, Morocco
| | - Salah Nadir
- Laboratoire de Chimie-Physique des Matériaux, Ecole Hassania des Travaux Publics, B.P.: 8108, Casablanca, Morocco
| | - Khalid Digua
- Laboratoire des génies des procédés et environnement, Faculté des sciences et techniques, Université Hassan II-Casablanca, B.P.: 146, Mohammedia, Morocco
| | - Brahim Sallek
- Laboratoire d’Agroressources et Génie des Procédés, Faculté des Sciences, Université Ibn Tofail, B.P.: 133, Kénitra, Morocco
| | - Hassan Chaair
- Laboratoire des génies des procédés et environnement, Faculté des sciences et techniques, Université Hassan II-Casablanca, B.P.: 146, Mohammedia, Morocco
|
39
|
Safdari R, Rezaei-Hachesu P, GhaziSaeedi M, Samad-Soltani T, Zolnoori M. Evaluation of Classification Algorithms vs Knowledge-Based Methods for Differential Diagnosis of Asthma in Iranian Patients. International Journal of Information Systems in the Service Sector 2018. [DOI: 10.4018/ijisss.2018040102] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Indexed: 11/09/2022]
Abstract
Medical data mining intends to solve real-world problems in the diagnosis and treatment of diseases. This process applies various techniques and algorithms which have different levels of accuracy and precision. The purpose of this article is to apply data mining techniques to the diagnosis of asthma. Sensitivity, specificity and accuracy of K-nearest neighbor, Support Vector Machine, naive Bayes, Artificial Neural Network, classification tree, CN2 algorithms, and related similar studies were evaluated. ROC curves were plotted to show the performance of the authors' approach. Support vector machine (SVM) algorithms achieved the highest accuracy at 98.59% with a sensitivity of 98.59% and a specificity of 98.61% for class 1. Other algorithms had a range of accuracy greater than 87%. The results show that the authors can accurately diagnose asthma approximately 98% of the time based on demographics and clinical data. The study also has a higher sensitivity when compared to expert and knowledge-based systems.
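Sensitivity, specificity and accuracy, the three measures reported in this abstract, follow directly from the four confusion-matrix counts of a binary classifier. The Python sketch below shows the standard definitions; the function names and the toy labels are assumptions for illustration and are unrelated to the study's data.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, tn, fn) for binary labels, treating `positive` as the positive class."""
    tp = fp = tn = fn = 0
    for truth, guess in zip(y_true, y_pred):
        if guess == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def sensitivity(tp, fn):
    """True positive rate: share of actual positives that were detected."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """True negative rate: share of actual negatives that were correctly rejected."""
    return tn / (tn + fp)


def accuracy(tp, fp, tn, fn):
    """Share of all cases that were classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)


if __name__ == "__main__":
    # Toy labels only (1 = disease present, 0 = absent).
    y_true = [1, 1, 1, 0, 0, 0, 0, 1]
    y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    print(sensitivity(tp, fn), specificity(tn, fp), accuracy(tp, fp, tn, fn))
```

Reporting sensitivity and specificity alongside accuracy matters in diagnostic settings because accuracy alone can look high when one class dominates the dataset.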
Affiliation(s)
- Reza Safdari
- Department of Health Information Technology, Tehran University of Medical Sciences, Tehran, Iran
| | - Peyman Rezaei-Hachesu
- Department of Health Information Technology, Tabriz University of Medical Sciences, Tabriz, Iran
| | - Marjan GhaziSaeedi
- Department of Health Information Technology, Tehran University of Medical Sciences, Tehran, Iran
| | - Taha Samad-Soltani
- Department of Health Information Technology, Tabriz University of Medical Sciences, Tabriz, Iran
|
40
|
Manikandan P, Ramyachitra D. PATSIM: Prediction and analysis of protein sequences using hybrid Knuth-Morris Pratt (KMP) and Boyer-Moore (BM) algorithm. Gene 2018; 657:50-59. [PMID: 29501620 DOI: 10.1016/j.gene.2018.02.069] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Received: 10/25/2017] [Revised: 02/26/2018] [Accepted: 02/27/2018] [Indexed: 10/17/2022]
Abstract
In phylogenomic profiling, genomic context based methods rest on the observation that two or more proteins with the same pattern of presence or absence in many diverse genomes most likely have a functional link. In this research work, a tool (PATSIM) has been developed to predict protein patterns based on the SOPM tool. In this tool, the secondary structure of CATH database protein sequences, predicted by the SOPM (Self Optimized Prediction Method) server, is passed as input to fulfill the following objectives: (i) predict the amino acid pattern using the proposed hybrid KMP and BM algorithm; (ii) predict physicochemical properties of the protein sequence, such as hydrophobic non-polar alkyl, hydrophobic non-polar aromatic, hydrophilic polar neutral, hydrophilic polar acidic and hydrophilic polar basic amino acid groups; (iii) predict the secondary structure of a protein whose structure is unknown; and (iv) analyze the similarity of a protein sequence (structure unknown) with the CATH database. From the results, it is inferred that this tool effectively predicts the similarity between the sequences and also identifies the protein patterns for four secondary structural classes, namely Alpha Helix (h), Beta Sheet (e), Turn (t) and Coil (c). Based on the experimental results, it is inferred that this tool identifies the physicochemical properties of the protein sequence effectively. The source code and documentation for the PATSIM tool are freely available in the GitHub public repository (https://github.com/manimkn89/Protein-Sequence-Analysis).
Affiliation(s)
- P Manikandan
- Department of Computer Science, Bharathiar University, Coimbatore 641 046, India
| | - D Ramyachitra
- Department of Computer Science, Bharathiar University, Coimbatore 641 046, India.
|
41
|
Risk-Predicting Model for Incident of Essential Hypertension Based on Environmental and Genetic Factors with Support Vector Machine. Interdiscip Sci 2018; 10:126-130. [DOI: 10.1007/s12539-017-0271-2] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Received: 03/01/2017] [Revised: 09/02/2017] [Accepted: 11/01/2017] [Indexed: 10/18/2022]
|
42
|
Srivastava A, Kumar M. Prediction of zinc binding sites in proteins using sequence derived information. J Biomol Struct Dyn 2018; 36:4413-4423. [PMID: 29241411 DOI: 10.1080/07391102.2017.1417910] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Indexed: 01/30/2023]
Abstract
Zinc is one of the most abundant catalytic cofactors and also an important structural component of a large number of metalloproteins. Hence, prediction of zinc-binding sites in proteins can be a significant step in annotating the molecular function of a large number of proteins. The majority of existing methods for zinc-binding site prediction are based on a dataset of proteins that was compiled nearly a decade ago. Hence, there is a need to develop a zinc-binding site prediction system using current, updated data that include recently added proteins. Herein, we propose a support vector machine-based method, named ZincBinder, for prediction of zinc-binding sites in a protein using sequence profile information. The predictor was trained using a fivefold cross-validation approach and achieved 85.37% sensitivity with 86.20% specificity during training. Benchmarking on an independent non-redundant dataset, which was not used during training, showed better performance of ZincBinder vis-à-vis existing methods. Executable versions, source code, sample datasets, and usage instructions are available at http://proteininformatics.org/mkumar/znbinder/.
Affiliation(s)
- Abhishikha Srivastava
- Department of Biophysics, University of Delhi South Campus, Benito Juarez Road, New Delhi 110021, India
| | - Manish Kumar
- Department of Biophysics, University of Delhi South Campus, Benito Juarez Road, New Delhi 110021, India
|
43
|
Tixier E, Raphel F, Lombardi D, Gerbeau JF. Composite Biomarkers Derived from Micro-Electrode Array Measurements and Computer Simulations Improve the Classification of Drug-Induced Channel Block. Front Physiol 2018; 8:1096. [PMID: 29354067 PMCID: PMC5762138 DOI: 10.3389/fphys.2017.01096] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Received: 07/31/2017] [Accepted: 12/13/2017] [Indexed: 12/19/2022]
Abstract
The Micro-Electrode Array (MEA) device enables high-throughput electrophysiology measurements that are less labor-intensive than patch-clamp based techniques. Combined with human induced pluripotent stem cell-derived cardiomyocytes (hiPSC-CM), it represents a new and promising paradigm for automated and accurate in vitro drug safety evaluation. In this article, the following question is addressed: which features of the MEA signals should be measured to better classify the effects of drugs? A framework for the classification of drugs using MEA measurements is proposed. The classification is based on the ion channel blockades induced by the drugs. It relies on an in silico electrophysiology model of the MEA, a feature selection algorithm and automatic classification tools. An in silico model of the MEA is developed and used to generate synthetic measurements. An algorithm is described that extracts features of MEA measurements designed to perform well in a classification context. These features are called composite biomarkers. A state-of-the-art machine learning program is used to carry out the classification of drugs using experimental MEA measurements. The experiments are carried out using five different drugs: mexiletine, flecainide, diltiazem, moxifloxacin, and dofetilide. We show that the composite biomarkers outperform the classical ones in different classification scenarios. We also show that using both synthetic and experimental MEA measurements improves the robustness of the composite biomarkers and increases the classification scores.
Affiliation(s)
- Eliott Tixier
- Inria Paris, Paris, France; Sorbonne Universités, Université Pierre et Marie Curie-Paris 6, UMR 7598 LJLL, Paris, France
| | - Fabien Raphel
- Inria Paris, Paris, France; Sorbonne Universités, Université Pierre et Marie Curie-Paris 6, UMR 7598 LJLL, Paris, France
| | - Damiano Lombardi
- Inria Paris, Paris, France; Sorbonne Universités, Université Pierre et Marie Curie-Paris 6, UMR 7598 LJLL, Paris, France
| | - Jean-Frédéric Gerbeau
- Inria Paris, Paris, France; Sorbonne Universités, Université Pierre et Marie Curie-Paris 6, UMR 7598 LJLL, Paris, France
|
44
|
|
45
|
Li C, Hou L, Sharma BY, Li H, Chen C, Li Y, Zhao X, Huang H, Cai Z, Chen H. Developing a new intelligent system for the diagnosis of tuberculous pleural effusion. Computer Methods and Programs in Biomedicine 2018; 153:211-225. [PMID: 29157454 DOI: 10.1016/j.cmpb.2017.10.022] [Citation(s) in RCA: 73] [Impact Index Per Article: 12.2] [Received: 07/22/2017] [Revised: 10/05/2017] [Accepted: 10/12/2017] [Indexed: 05/15/2023]
Abstract
BACKGROUND AND OBJECTIVE: In countries with a high prevalence of tuberculosis (TB), clinicians often diagnose tuberculous pleural effusion (TPE) using diagnostic tests that have not only poor sensitivity but also poor availability. The aim of our study is to develop a new artificial intelligence based diagnostic model that is accurate, fast, non-invasive and cost-effective for diagnosing TPE. It is expected that a tool derived from the model can be installed on simple computing devices (such as smartphones and tablets) and be widely used by clinicians. METHODS: For this study, clinical signs, routine blood test results, blood biochemistry markers, pleural fluid cell type and count, and pleural fluid biochemical test results of 140 patients were prospectively collected into a database. An artificial intelligence based diagnostic model, which employs a moth flame optimization based support vector machine with feature selection (FS-MFO-SVM), is constructed to predict the diagnosis of TPE. RESULTS: The optimal FS-MFO-SVM model achieves, on average, 95% accuracy (ACC), an area under the receiver operating characteristic curve (AUC) of 0.9564, 93.35% sensitivity, and 97.57% specificity. CONCLUSIONS: The proposed artificial intelligence based diagnostic model is found to be highly reliable for diagnosing TPE based on simple clinical signs, blood samples and pleural effusion samples. Therefore, the proposed model can be widely used in clinical practice and further evaluated for use as a substitute for invasive pleural biopsies.
Affiliation(s)
- Chengye Li
- Department of Pulmonary and Critical Care Medicine, The First Affiliated Hospital of Wenzhou Medical University, Wenzhou 325035, China
| | - Lingxian Hou
- Department of Neurology, Wenzhou Hospital of Integrated Traditional Chinese and Western Medicine, Wenzhou 325027, China
| | - Bishundat Yanesh Sharma
- Department of Pulmonary and Critical Care Medicine, The First Affiliated Hospital of Wenzhou Medical University, Wenzhou 325035, China; Jawaharlal Nehru Hospital, Rose Belle, Grand-Port District 00230, Mauritius
| | - Huaizhong Li
- Department of Computing, Lishui University, Lishui 323000, Zhejiang, China
| | - ChengShui Chen
- Department of Pulmonary and Critical Care Medicine, The First Affiliated Hospital of Wenzhou Medical University, Wenzhou 325035, China
| | - Yuping Li
- Department of Pulmonary and Critical Care Medicine, The First Affiliated Hospital of Wenzhou Medical University, Wenzhou 325035, China
| | - Xuehua Zhao
- School of Digital Media, Shenzhen Institute of Information Technology, Shenzhen 518172, China
| | - Hui Huang
- College of Physics and Electronic Information Engineering, Wenzhou University, Wenzhou 325035, China
| | - Zhennao Cai
- College of Physics and Electronic Information Engineering, Wenzhou University, Wenzhou 325035, China
| | - Huiling Chen
- College of Physics and Electronic Information Engineering, Wenzhou University, Wenzhou 325035, China.
|
46
|
Wang Y, Guo Y, Pu X, Li M. Effective prediction of bacterial type IV secreted effectors by combined features of both C-termini and N-termini. J Comput Aided Mol Des 2017; 31:1029-1038. [PMID: 29127583 DOI: 10.1007/s10822-017-0080-z] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.3] [Received: 01/25/2017] [Accepted: 11/01/2017] [Indexed: 12/15/2022]
Abstract
Various bacterial pathogens can deliver their secreted substrates, also called effectors, through type IV secretion systems (T4SSs) into host cells and cause disease. Since T4SS secreted effectors (T4SEs) play important roles in pathogen-host interactions, identifying them is crucial to our understanding of the pathogenic mechanisms of T4SSs. A few computational methods using machine learning algorithms for T4SE prediction have been developed using features of C-terminal residues. However, recent studies have shown that targeting information can also be encoded in the N-terminal region of at least some T4SEs. In this study, we present an effective method for T4SE prediction that integrates both N-terminal and C-terminal sequence information. First, we collected from the literature a comprehensive dataset of known T4SEs and non-T4SEs across multiple bacterial species. Then, three types of distinctive features, namely amino acid composition; composition, transition and distribution (CTD); and position-specific scoring matrices, were calculated for 50 N-terminal and 100 C-terminal residues. After that, we employed information gain to rank the importance of the 150 residue positions for T4SE secretion signaling. Finally, 125 distinctive residue positions were singled out for the prediction model to classify T4SEs and non-T4SEs. The support vector machine model yields an area under the receiver operating characteristic curve of 0.916 in fivefold cross-validation and an accuracy of 85.29% on the independent test set.
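As an illustration of the simplest of the three feature types named in this abstract, the following Python sketch computes amino acid composition separately for the first 50 and last 100 residues of a sequence and concatenates the two vectors. It is a sketch under stated assumptions, not the authors' pipeline: the function names, the handling of non-standard residues, and the behaviour for short sequences are all illustrative choices, and the CTD, PSSM and information-gain steps are not shown.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes


def amino_acid_composition(segment: str) -> list[float]:
    """Return the fraction of each standard amino acid in the segment (20 values).

    Non-standard letters (e.g. X) count toward the segment length but not toward
    any of the 20 fractions; this is an assumption made for the sketch.
    """
    if not segment:
        raise ValueError("segment must not be empty")
    total = len(segment)
    return [segment.count(residue) / total for residue in AMINO_ACIDS]


def terminal_features(sequence: str, n_len: int = 50, c_len: int = 100) -> list[float]:
    """Concatenate the compositions of the N-terminal and C-terminal regions.

    For sequences shorter than n_len or c_len, slicing simply returns the whole
    sequence, so the two regions may overlap.
    """
    n_terminal = sequence[:n_len]
    c_terminal = sequence[-c_len:]
    return amino_acid_composition(n_terminal) + amino_acid_composition(c_terminal)


if __name__ == "__main__":
    # Artificial sequence: 50 methionines followed by 100 alanines.
    features = terminal_features("M" * 50 + "A" * 100)
    print(len(features))                       # 40 values: 20 per terminal region
    print(features[AMINO_ACIDS.index("M")])    # methionine fraction, N-terminal region
    print(features[20 + AMINO_ACIDS.index("A")])  # alanine fraction, C-terminal region
```

A vector like this could be fed to any standard classifier; keeping the two termini as separate blocks preserves the distinction between N-terminal and C-terminal signals that the abstract emphasizes.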
Affiliation(s)
- Yu Wang
- College of Chemistry, Sichuan University, Chengdu, 610064, China
- College of Materials and Chemistry & Chemical Engineering, Chengdu University of Technology, Chengdu, 610059, China
- Yanzhi Guo
- College of Chemistry, Sichuan University, Chengdu, 610064, China
- Xuemei Pu
- College of Chemistry, Sichuan University, Chengdu, 610064, China
- Menglong Li
- College of Chemistry, Sichuan University, Chengdu, 610064, China
|
47
|
Pradhan D, Padhy S, Sahoo B. Enzyme classification using multiclass support vector machine and feature subset selection. Comput Biol Chem 2017; 70:211-219. [PMID: 28934693 DOI: 10.1016/j.compbiolchem.2017.08.009] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/25/2017] [Revised: 07/15/2017] [Accepted: 08/15/2017] [Indexed: 10/19/2022]
Abstract
Proteins are the macromolecules responsible for almost all biological processes in a cell. With a large number of protein sequences available from different sequencing projects, the challenge for scientists is to characterize their functions. Because wet lab methods are time consuming and expensive, many computational methods such as FASTA, PSI-BLAST, DNA microarray clustering, and Nearest Neighborhood classification on protein-protein interaction networks have been proposed. Support vector machine (SVM) is one such method that has been used successfully for several problems, such as protein fold recognition and protein structure prediction. Cai et al. (2003) used SVM to classify proteins into different functional classes and to predict their function. They used the physico-chemical properties of proteins to represent the protein sequences. In this paper, a model comprising feature subset selection followed by a multiclass Support Vector Machine is proposed to determine the functional class of a newly generated protein sequence. To train and test the model, 32 physico-chemical properties of enzymes from 6 enzyme classes are considered. To determine the features that contribute significantly to functional classification, the Sequential Forward Floating Selection (SFFS), Orthogonal Forward Selection (OFS), and SVM Recursive Feature Elimination (SVM-RFE) algorithms are used, and it is observed that, of the 32 properties considered initially, only 20 features are sufficient to classify the proteins into their functional classes with an accuracy ranging from 91% to 94%. On comparison, OFS followed by SVM performs better than the other methods. Our model generalizes the existing model to include multiclass classification and to identify the most significant features affecting protein function.
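The idea of feature subset selection in this abstract can be illustrated with plain greedy forward selection. This sketch is deliberately simpler than the SFFS, OFS, and SVM-RFE algorithms the abstract names (it has no floating or backward step), and the `score` callback is a hypothetical stand-in for something like the cross-validated accuracy of a multiclass SVM trained on the candidate subset.

```python
def forward_select(n_features, score, k):
    """Greedy sequential forward selection.

    n_features: number of candidate features, indexed 0..n_features-1.
    score: callable taking a list of feature indices and returning a number
           to maximize (hypothetical; e.g. cross-validated accuracy).
    k: number of features to keep.
    """
    selected = []
    remaining = list(range(n_features))
    while remaining and len(selected) < k:
        # Add the single feature whose inclusion gives the best score.
        best = max(remaining, key=lambda f: score(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected
```

With 32 candidate properties and `k=20`, the loop would evaluate `score` a few hundred times, which is why such wrapper methods are usually paired with a fast classifier or a small cross-validation budget.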
Affiliation(s)
- Debasmita Pradhan
- Department of Computer Science and Engineering, Silicon Institute of Technology, Silicon Hills, Patia, Bhubaneswar, 751024, India
- Sudarsan Padhy
- Department of Computer Science and Engineering, Silicon Institute of Technology, Silicon Hills, Patia, Bhubaneswar, 751024, India
- Biswajit Sahoo
- School of Computer Engineering, KIIT University, Bhubaneswar, 751024, India
|
48
|
Darnag R, Minaoui B, Fakir M. QSAR models for prediction study of HIV protease inhibitors using support vector machines, neural networks and multiple linear regression. ARAB J CHEM 2017. [DOI: 10.1016/j.arabjc.2012.10.021] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.9] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/25/2022] Open
|
49
|
Wang Y, Guo Y, Pu X, Li M. A sequence-based computational method for prediction of MoRFs. RSC Adv 2017. [DOI: 10.1039/c6ra27161h] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/21/2022] Open
Abstract
Molecular recognition features (MoRFs) are relatively short segments (10–70 residues) within intrinsically disordered regions (IDRs) that can undergo disorder-to-order transitions during binding to partner proteins.
Affiliation(s)
- Yu Wang
- College of Chemistry, Sichuan University, Chengdu, People's Republic of China
- Yanzhi Guo
- College of Chemistry, Sichuan University, Chengdu, People's Republic of China
- Xuemei Pu
- College of Chemistry, Sichuan University, Chengdu, People's Republic of China
- Menglong Li
- College of Chemistry, Sichuan University, Chengdu, People's Republic of China
|
50
|
Faraggi E, Kloczkowski A. Accurate Prediction of One-Dimensional Protein Structure Features Using SPINE-X. Methods Mol Biol 2017; 1484:45-53. [PMID: 27787819 DOI: 10.1007/978-1-4939-6406-2_5] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 06/06/2023]
Abstract
Accurate prediction of protein secondary structure and other one-dimensional structure features is essential for accurate sequence alignment, three-dimensional structure modeling, and function prediction. SPINE-X is a software package that predicts secondary structure as well as accessible surface area and the dihedral angles ϕ and ψ. For secondary structure, SPINE-X achieves an accuracy of between 81% and 84%, depending on the dataset and choice of tests. The Pearson correlation coefficient for accessible surface area prediction is 0.75, and the mean absolute errors for the ϕ and ψ dihedral angles are 20° and 33°, respectively. The source code and Linux executables for SPINE-X are available from Research and Information Systems at http://mamiris.com .
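The two evaluation measures quoted in this abstract can be sketched as below. This is an illustrative implementation rather than SPINE-X code: the function names are hypothetical, and wrapping angle differences around the 360° period is an assumption about how dihedral errors are measured, which the abstract does not state.

```python
import math


def angular_mae(predicted, actual):
    """Mean absolute error between angles in degrees.

    Differences are wrapped so that, for example, 170 and -170 degrees
    are 20 degrees apart rather than 340 (assumed convention).
    """
    errors = []
    for p, a in zip(predicted, actual):
        d = abs(p - a) % 360.0
        errors.append(min(d, 360.0 - d))
    return sum(errors) / len(errors)


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)
```

Without the wrap-around step, predictions near the ±180° boundary would be penalized as nearly a full turn off, inflating the reported error for ψ in particular.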
Affiliation(s)
- Eshel Faraggi
- Department of Biochemistry and Molecular Biology, Indiana University School of Medicine, Indianapolis, IN, 46032, USA
- Research and Information Systems, LLC, Indianapolis, IN, USA
- Andrzej Kloczkowski
- Battelle Center for Mathematical Medicine, Nationwide Children's Hospital, Columbus, OH, USA
- Department of Pediatrics, The Ohio State University College of Medicine, Columbus, OH, USA
|