1
Shazia, Ullah FUM, Rho S, Lee MY. Predictive modeling for ubiquitin proteins through advanced machine learning technique. Heliyon 2024; 10:e32517. [PMID: 38975176 PMCID: PMC11225741 DOI: 10.1016/j.heliyon.2024.e32517] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/29/2024] [Accepted: 06/05/2024] [Indexed: 07/09/2024]
Abstract
Ubiquitination is an essential post-translational modification mechanism involving the bonding of the ubiquitin protein to a substrate protein. It is crucial in a variety of physiological activities, including cell survival and differentiation, and innate and adaptive immunity. Any alteration in the ubiquitin system leads to the development of various human diseases. Numerous studies show that the ubiquitin system is highly reversible and dynamic, making experimental identification quite difficult. To solve this issue, this article develops a model using a machine learning approach, aiming to improve the precision of ubiquitin protein prediction. We investigate in depth the ubiquitination data, which is processed through different feature extraction methods, followed by classification. The evaluation and assessment are conducted using Jackknife tests and 10-fold cross-validation. The proposed method demonstrated remarkable performance, with 100 %, 99.88 %, and 99.84 % accuracy on Dataset-I, Dataset-II, and Dataset-III, respectively. Using the Jackknife test, the method achieves 100 %, 99.91 %, and 99.99 % for Dataset-I, Dataset-II, and Dataset-III, respectively. This analysis concludes that the proposed method outperforms state-of-the-art methods in identifying ubiquitination sites and can be helpful in the development of current clinical therapies. The source code and datasets will be made available on GitHub.
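The two validation schemes named in this abstract differ only in how the samples are partitioned. The following is a minimal sketch, not the authors' code: the function names are hypothetical, folds are taken as contiguous index ranges without shuffling, and the Jackknife test is treated as leave-one-out, i.e. the special case where the number of folds equals the number of samples.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds.

    Hypothetical helper for illustration; real pipelines usually shuffle
    and stratify the samples before splitting.
    """
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


def jackknife_splits(n_samples):
    """Jackknife (leave-one-out): every sample is the test set exactly once."""
    return k_fold_splits(n_samples, n_samples)


if __name__ == "__main__":
    ten_fold = list(k_fold_splits(25, 10))
    print([len(test) for _, test in ten_fold])
    print(len(list(jackknife_splits(25))))
```

With 10 folds each model is trained 10 times; with the Jackknife test it is trained once per sample, which is why the latter is far more expensive on large datasets.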
Affiliation(s)
- Shazia
  - Mardan College of Nursing, Bacha Khan Medical College, Mardan, Pakistan
- Fath U Min Ullah
  - Department of Computing, School of Engineering and Computing, University of Central Lancashire, Preston, United Kingdom
- Seungmin Rho
  - Department of Industrial Security, Chung-Ang University, Seoul 06974, Republic of Korea
- Mi Young Lee
  - Chung-Ang University, Seoul 06974, Republic of Korea
2
Khojasteh H, Pirgazi J, Ghanbari Sorkhi A. Improving prediction of drug-target interactions based on fusing multiple features with data balancing and feature selection techniques. PLoS One 2023; 18:e0288173. [PMID: 37535616 PMCID: PMC10399861 DOI: 10.1371/journal.pone.0288173] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 02/08/2023] [Accepted: 06/21/2023] [Indexed: 08/05/2023]
Abstract
Drug discovery relies on predicting drug-target interaction (DTI), which is an important and challenging task. The purpose of DTI prediction is to identify the interaction between drug chemical compounds and protein targets. Traditional wet lab experiments are time-consuming and expensive, so in recent years the use of computational methods based on machine learning has attracted the attention of many researchers. A dry lab environment focusing more on computational methods of interaction prediction can help limit the search space for wet lab experiments. In this paper, a novel multi-stage approach for DTI prediction, called SRX-DTI, is proposed. In the first stage, a combination of various descriptors from protein sequences and an FP2 fingerprint encoded from the drug are extracted as feature vectors. A major challenge in this application is the imbalanced data caused by the lack of known interactions. To deal with this problem, the One-SVM-US technique is proposed in the second stage. Next, the FFS-RF algorithm, a forward feature selection algorithm coupled with a random forest (RF) classifier, is developed to maximize the predictive performance. This feature selection algorithm removes irrelevant features to obtain optimal features. Finally, the balanced dataset with optimal features is given to the XGBoost classifier to identify DTIs. The experimental results demonstrate that the proposed approach, SRX-DTI, achieves higher performance than other existing methods in predicting DTIs. The datasets and source code are available at: https://github.com/Khojasteh-hb/SRX-DTI.
Affiliation(s)
- Hakimeh Khojasteh
  - Department of Computer Engineering, University of Zanjan, Zanjan, Iran
  - School of Biological Sciences Institute for Research in Fundamental Sciences (IPM), Tehran, Iran
- Jamshid Pirgazi
  - School of Biological Sciences Institute for Research in Fundamental Sciences (IPM), Tehran, Iran
  - Department of Computer Engineering, University of Science and Technology of Mazandaran, Behshahr, Iran
- Ali Ghanbari Sorkhi
  - Department of Computer Engineering, University of Science and Technology of Mazandaran, Behshahr, Iran
3
Alotaibi FM, Khan YD. A Framework for Prediction of Oncogenomic Progression Aiding Personalized Treatment of Gastric Cancer. Diagnostics (Basel) 2023; 13:2291. [PMID: 37443684 DOI: 10.3390/diagnostics13132291] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/15/2023] [Revised: 06/05/2023] [Accepted: 06/13/2023] [Indexed: 07/15/2023]
Abstract
Mutations in genes can alter their DNA patterns, and by recognizing these mutations, many carcinomas can be diagnosed in the progression stages. The human body contains many hidden and enigmatic features that humankind has not yet fully understood. A total of 7539 neoplasm cases were reported from 1 January 2021 to 31 December 2021. Of these, 3156 (41.9%) were seen in male patients and 4383 (58.1%) in female patients. Several machine learning and deep learning frameworks have already been implemented to detect mutations, but these techniques lack generalized datasets and need to be optimized for better results. Deep learning-based neural networks provide the computational power to calculate the complex structures of gastric carcinoma-driven gene mutations. This study proposes deep learning approaches such as long short-term memory (LSTM), gated recurrent units (GRU), and bidirectional LSTM (Bi-LSTM) to help identify the progression of gastric carcinoma in an optimized manner. This study includes 61 carcinogenic driver genes whose mutations can cause gastric cancer. The mutation information was downloaded from intOGen.org and normal gene sequences were downloaded from asia.ensembl.org, as explained in the data collection section. The proposed deep learning models are validated using the self-consistency test (SCT), 10-fold cross-validation test (FCVT), and independent set test (IST). The IST prediction metrics of accuracy, sensitivity, specificity, MCC, and AUC are 97.18%, 98.35%, 96.01%, 0.94, and 0.98 for LSTM; 99.46%, 98.93%, 100%, 0.989, and 1.00 for Bi-LSTM; and 99.46%, 98.93%, 100%, 0.989, and 1.00 for GRU.
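The accuracy, sensitivity, specificity, and MCC figures reported in this abstract are all functions of a binary confusion matrix. The sketch below shows the standard definitions. The function name and the example counts are hypothetical and are not taken from the paper, whose confusion matrices are not given here.

```python
import math


def binary_metrics(tp, tn, fp, fn):
    """Standard metrics from confusion-matrix counts.

    Assumes both classes are present (tp + fn > 0 and tn + fp > 0).
    MCC is reported as 0.0 when its denominator is zero.
    """
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)   # true positive rate (recall)
    specificity = tn / (tn + fp)   # true negative rate
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "mcc": mcc,
    }


if __name__ == "__main__":
    # Hypothetical counts, chosen only to exercise the formulas.
    for name, value in binary_metrics(tp=90, tn=95, fp=5, fn=10).items():
        print(f"{name}: {value:.4f}")
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, so it stays informative when the positive and negative classes are unbalanced.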
Affiliation(s)
- Fahad M Alotaibi
  - Department of Information System, Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah 21589, Saudi Arabia
- Yaser Daanial Khan
  - Department of Computer Science, University of Management and Technology, Lahore 54770, Pakistan
4
Attique M, Alkhalifah T, Alturise F, Khan YD. DeepBCE: Evaluation of deep learning models for identification of immunogenic B-cell epitopes. Comput Biol Chem 2023; 104:107874. [PMID: 37126975 DOI: 10.1016/j.compbiolchem.2023.107874] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 02/16/2023] [Revised: 04/17/2023] [Accepted: 04/20/2023] [Indexed: 05/03/2023]
Abstract
B-Cell epitopes (BCEs) can identify and bind with receptor proteins (antigens) to initiate an immune response against pathogens. Understanding antigen-antibody binding interactions has many applications in biotechnology and biomedicine, including designing antibodies, therapeutics, and vaccines. Lab-based experimental identification of these proteins is time-consuming and challenging. Computational techniques have been proposed to discover BCEs, but most lack significant accomplishments. This work uses classical and deep learning models (DLMs) with sequence-based features to predict immunity-stimulating BCEs from proteomics sequences. The proposed convolutional neural network-based model outperforms other models with an accuracy (ACC) of 0.878, an F-measure of 0.871, and an area under the receiver operating characteristic curve (AUC) of 0.945. The proposed strategy achieves 58.7% better results on average than other state-of-the-art approaches based on the Matthews correlation coefficient (MCC). The established model is accessible through a web application located at http://deeplbcepred.pythonanywhere.com.
Affiliation(s)
- Muhammad Attique
  - Department of Computer Science, University of Management and Technology, Lahore 54000, Pakistan; Department of Information Technology, University of Gujrat, Gujrat 50700, Pakistan
- Tamim Alkhalifah
  - Department of Computer, College of Science and Arts in Ar Rass Qassim University, Ar Rass, Qassim, Saudi Arabia
- Fahad Alturise
  - Department of Computer, College of Science and Arts in Ar Rass Qassim University, Ar Rass, Qassim, Saudi Arabia
- Yaser Daanial Khan
  - Department of Computer Science, University of Management and Technology, Lahore 54000, Pakistan
5
Perveen G, Alturise F, Alkhalifah T, Daanial Khan Y. Hemolytic-Pred: A machine learning-based predictor for hemolytic proteins using position and composition-based features. Digit Health 2023; 9:20552076231180739. [PMID: 37434723 PMCID: PMC10331097 DOI: 10.1177/20552076231180739] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/30/2022] [Accepted: 05/22/2023] [Indexed: 07/13/2023]
Abstract
Objective: The objective of this study is to propose a novel in-silico method called Hemolytic-Pred for identifying hemolytic proteins based on their sequences, using statistical moment-based features, along with position-relative and frequency-relative information. Methods: Primary sequences were transformed into feature vectors using statistical and position-relative moment-based features. Various machine learning algorithms were employed for classification. Computational models were rigorously evaluated using four different validation methods. The Hemolytic-Pred webserver is available for further analysis at http://ec2-54-160-229-10.compute-1.amazonaws.com/. Results: XGBoost outperformed the other six classifiers with accuracy values of 0.99, 0.98, 0.97, and 0.98 for the self-consistency test, 10-fold cross-validation, Jackknife test, and independent set test, respectively. The proposed method with the XGBoost classifier is a workable and robust solution for predicting hemolytic proteins efficiently and accurately. Conclusions: The proposed method of Hemolytic-Pred with the XGBoost classifier is a reliable tool for the timely identification of hemolytic cells and diagnosis of various related severe disorders. The application of Hemolytic-Pred can yield profound benefits in the medical field.
Affiliation(s)
- Gulnaz Perveen
  - Department of Computer Science, School of Systems and Technology, University of Management and Technology, Lahore, Punjab, Pakistan
- Fahad Alturise
  - Department of Computer, College of Science and Arts in Ar Rass Qassim University, Buraidah, Qassim, Saudi Arabia
- Tamim Alkhalifah
  - Department of Computer, College of Science and Arts in Ar Rass Qassim University, Buraidah, Qassim, Saudi Arabia
- Yaser Daanial Khan
  - Department of Computer Science, School of Systems and Technology, University of Management and Technology, Lahore, Punjab, Pakistan
6
Hassan A, Alkhalifah T, Alturise F, Khan YD. RCCC_Pred: A Novel Method for Sequence-Based Identification of Renal Clear Cell Carcinoma Genes through DNA Mutations and a Blend of Features. Diagnostics (Basel) 2022; 12:3036. [PMID: 36553042 PMCID: PMC9776995 DOI: 10.3390/diagnostics12123036] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 10/27/2022] [Revised: 11/24/2022] [Accepted: 11/30/2022] [Indexed: 12/07/2022]
Abstract
To save lives from cancer, it is crucial to diagnose it at its early stages. One solution to early diagnosis lies in the identification of the cancer driver genes and their mutations. Such diagnostics can substantially reduce the mortality rate of this deadly disease. However, the identification of cancer driver gene mutations through experimental mechanisms can be an expensive, slow, and laborious job. Computational strategies that can help in the early prediction of cancer growth effectively and accurately are thus highly needed for early diagnosis and a decrease in the mortality rates due to this disease. Herein, we aim to predict clear cell renal carcinoma (RCCC) at the level of the genes, using the genomic sequences. The dataset was taken from the IntOGen Cancer Mutations Browser and all genes' standard DNA sequences were taken from the NCBI database. Using cancer-associated mutation information from IntOGen, the benchmark dataset was generated by creating the mutations in the original sequences. After extensive feature extraction, the dataset was used to train an ANN + Hist Gradient Boosting model that could perform the classification of RCCC genes, other cancer-associated genes, and non-cancerous/unknown (non-tumor driver) genes. Through an independent dataset test, the accuracy observed was 83%, whereas the 10-fold cross-validation and Jackknife validation yielded 98% and 100% accurate results, respectively. The proposed predictor RCCC_Pred is able to identify RCCC genes with high accuracy and efficiency and can help scientists/researchers easily predict and diagnose cancer at its early stages.
Affiliation(s)
- Arfa Hassan
  - Department of Computer Science, School of Systems and Technology, University of Management and Technology, Lahore 54770, Pakistan
- Tamim Alkhalifah
  - Department of Computer, College of Science and Arts in Ar Rass, Qassim University, Ar Rass 58892, Qassim, Saudi Arabia
- Fahad Alturise
  - Department of Computer, College of Science and Arts in Ar Rass, Qassim University, Ar Rass 58892, Qassim, Saudi Arabia
- Yaser Daanial Khan
  - Department of Computer Science, School of Systems and Technology, University of Management and Technology, Lahore 54770, Pakistan
7
CNNLSTMac4CPred: A Hybrid Model for N4-Acetylcytidine Prediction. Interdiscip Sci 2022; 14:439-451. [PMID: 35106702 DOI: 10.1007/s12539-021-00500-0] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 07/25/2021] [Revised: 12/04/2021] [Accepted: 12/13/2021] [Indexed: 12/23/2022]
Abstract
N4-Acetylcytidine (ac4C) is a highly conserved and widespread post-transcriptional RNA modification that plays versatile roles in cellular processes. Due to the limitations of techniques and knowledge, large-scale identification of ac4C is still a challenging task. RNA sequences are like sentences containing semantics in natural language. Inspired by the semantics of language, we proposed a hybrid model for ac4C prediction. The model used long short-term memory and a convolutional neural network to extract the semantic features hidden in the sequences. The semantic features and two traditional features (k-nucleotide frequencies and pseudo tri-tuple nucleotide composition) were combined to represent ac4C or non-ac4C sequences. eXtreme Gradient Boosting was used as the learning algorithm. Five-fold cross-validation over the training set consisting of 1160 ac4C and 10,855 non-ac4C sequences obtained an area under the receiver operating characteristic curve (AUROC) of 0.9004, and the independent test over 469 ac4C and 4343 non-ac4C sequences reached an AUROC of 0.8825. The model obtained a sensitivity of 0.6474 in the five-fold cross-validation and 0.6290 in the independent test, outperforming two state-of-the-art methods. The performance of semantic features alone was better than that of k-nucleotide frequencies and pseudo tri-tuple nucleotide composition, implying that ac4C sequences carry semantic information. The proposed hybrid model was implemented in a user-friendly web server, which is freely available to scientific communities: http://47.113.117.61/ac4c/. The presented model and tool are beneficial for identifying ac4C on a large scale.
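Of the two traditional features named in this abstract, k-nucleotide frequencies are the simplest to illustrate. The sketch below is not the authors' implementation: the function name, the ACGU alphabet, and the choice to normalize by the number of valid windows are assumptions, since the abstract does not state how the frequencies were computed.

```python
from itertools import product


def k_nucleotide_frequencies(seq, k, alphabet="ACGU"):
    """Return the frequency of every k-mer over `alphabet`, in lexicographic
    product order, as a fixed-length list of 4**k values.

    Windows containing characters outside the alphabet (e.g. N) are skipped;
    this is an assumption made for the sketch.
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:
            counts[window] += 1
            total += 1
    return [counts[m] / total if total else 0.0 for m in kmers]


if __name__ == "__main__":
    features = k_nucleotide_frequencies("ACGUACGU", k=2)
    print(len(features))   # 16 dinucleotides
    print(features[1])     # frequency of "AC"
```

Because the vector length depends only on k, sequences of different lengths map to feature vectors of the same size, which is what allows them to be concatenated with other features before classification.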
8
Naseer S, Hussain W, Khan YD, Rasool N. iPhosS(Deep)-PseAAC: Identification of Phosphoserine Sites in Proteins Using Deep Learning on General Pseudo Amino Acid Compositions. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2022; 19:1703-1714. [PMID: 33242308 DOI: 10.1109/tcbb.2020.3040747] [Citation(s) in RCA: 13] [Impact Index Per Article: 6.5] [Indexed: 06/11/2023]
Abstract
Among all the PTMs, protein phosphorylation is pivotal for various pathological and physiological processes. About 30 percent of eukaryotic proteins undergo phosphorylation, leading to various changes in conformation, function, stability, localization, and so forth. In eukaryotic proteins, phosphorylation occurs on serine (S), threonine (T), and tyrosine (Y) residues. Among these, serine phosphorylation has its own importance as it is associated with various important biological processes, including energy metabolism, signal transduction pathways, cell cycling, and apoptosis. Its identification is therefore important; however, in vitro, ex vivo, and in vivo identification can be laborious, time-consuming, and costly. There is a dire need for an efficient and accurate computational model to help researchers and biologists identify these sites easily. Herein, we propose a novel predictor for identification of phosphoserine sites (PhosS) in proteins, by integrating Chou's Pseudo Amino Acid Composition (PseAAC) with deep features. We used well-known DNNs for both the tasks of learning a feature representation of peptide sequences and performing classification. Among the different DNNs, the best score is shown by the Convolutional Neural Network (CNN) based model, which makes the CNN-based prediction model the best for phosphoserine prediction. Based on these results, it is concluded that the proposed model can help to identify PhosS sites in a very efficient and accurate manner, which can help scientists understand the mechanism of this modification in proteins.
9
Malebary SJ, Alzahrani E, Khan YD. A comprehensive tool for accurate identification of methyl-Glutamine sites. J Mol Graph Model 2021; 110:108074. [PMID: 34768228 DOI: 10.1016/j.jmgm.2021.108074] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 07/01/2021] [Revised: 10/15/2021] [Accepted: 11/02/2021] [Indexed: 11/16/2022]
Abstract
Methylation is a biochemical process involved in nearly all human body functions. Glutamine is considered an indispensable amino acid that is susceptible to methylation via post-translational modification (PTM). Modern research has shown that methylation plays a momentous role in the progression of most types of cancer. Therefore, there is a need for an effective method to predict glutamine sites vulnerable to methylation accurately and inexpensively. The motive of this study is the formulation of a method that can predict such sites with high accuracy. Various computationally intelligent classifiers were employed for its formulation and evaluation. Rigorous validations show that deep learning performs best compared to the other classifiers. The accuracy (ACC) and the area under the receiver operating characteristic curve (AUC) obtained by 10-fold cross-validation were 0.962 and 0.981, while with jackknife testing they were 0.968 and 0.980, respectively. From these results, it is concluded that the proposed methodology works sufficiently well for the prediction of methyl-glutamine sites. The webserver's code, developed for the prediction of methyl-glutamine sites, is freely available at https://github.com/s20181080001/WebServer.git. The code can easily be set up by any intermediate-level Python user.
Affiliation(s)
- Sharaf J Malebary
  - Department of Information Technology, Faculty of Computing and Information Technology, King Abdulaziz University, P.O. Box 344, Rabigh, 21911, Saudi Arabia
- Ebraheem Alzahrani
  - Department of Mathematics, Faculty of Science, King Abdulaziz University, P.O. Box 80203, Jeddah, 21589, Saudi Arabia
- Yaser Daanial Khan
  - Department of Computer Science, School of Systems and Technology, University of Management and Technology, Lahore, Pakistan
10
Alzahrani E, Alghamdi W, Ullah MZ, Khan YD. Identification of stress response proteins through fusion of machine learning models and statistical paradigms. Sci Rep 2021; 11:21767. [PMID: 34741132 PMCID: PMC8571424 DOI: 10.1038/s41598-021-99083-5] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 06/18/2021] [Accepted: 09/13/2021] [Indexed: 11/08/2022]
Abstract
Proteins are a vital component of cells that perform physiological functions to ensure smooth operation of bodily functions. Identification of a protein's function involves a detailed understanding of the structure of proteins. Stress proteins are essential mediators of several responses to cellular stress and are categorized based on their structural characteristics. These proteins are found to be conserved across many eukaryotic and prokaryotic linkages and demonstrate varied crucial functional activities inside a cell. The in vivo, ex vivo, and in vitro identification of stress proteins is a time-consuming and costly task. This study is aimed at the identification of stress protein sequences with the aid of mathematical modelling and machine learning methods to supplement the aforementioned wet lab methods. The model developed using Random Forest showed remarkable results with 91.1% accuracy, while models based on a neural network and a support vector machine showed 87.7% and 47.0% accuracy, respectively. Based on the evaluation results, it was concluded that the random forest-based classifier surpassed all other predictors and is suitable for use in practical applications for the identification of stress proteins. A live web server is available at http://biopred.org/stressprotiens, while the webserver code is available at https://github.com/abdullah5naveed/SRP_WebServer.git.
Affiliation(s)
- Ebraheem Alzahrani
  - Department of Mathematics, Faculty of Science, King Abdulaziz University, P.O. Box 80203, Jeddah, 21589, Saudi Arabia
- Wajdi Alghamdi
  - Department of Information Technology, Faculty of Computing and Information Technology, King Abdulaziz University, P.O. Box 80221, Jeddah, 21589, Saudi Arabia
- Malik Zaka Ullah
  - Department of Mathematics, Faculty of Science, King Abdulaziz University, P.O. Box 80203, Jeddah, 21589, Saudi Arabia
- Yaser Daanial Khan
  - Department of Computer Science, University of Management and Technology, Lahore, 54770, Pakistan
11
Alghamdi W, Alzahrani E, Ullah MZ, Khan YD. 4mC-RF: Improving the prediction of 4mC sites using composition and position relative features and statistical moment. Anal Biochem 2021; 633:114385. [PMID: 34571005 DOI: 10.1016/j.ab.2021.114385] [Citation(s) in RCA: 14] [Impact Index Per Article: 4.7] [Received: 12/17/2020] [Revised: 09/09/2021] [Accepted: 09/13/2021] [Indexed: 01/28/2023]
Abstract
N4-methylcytosine (4mC) is an important epigenetic modification that occurs enzymatically by the action of DNA methyltransferases. 4mC sites exist in prokaryotes and eukaryotes and play a vital role in regulating gene expression, DNA replication, and the cell cycle. Efficient and accurate prediction of 4mC sites is important for understanding the biological properties and functions of 4mC. Therefore, a sequence-based predictor, namely 4mC-RF, is proposed for identifying 4mC sites through the integration of statistical moments along with position- and composition-dependent features. Relative and absolute position-based features are computed to extract optimal features. A popular machine learning classifier, Random Forest, was used for training the model. Validation results were obtained through rigorous processes of self-consistency, 10-fold cross-validation, independent set testing, and Jackknife, yielding 95.1%, 95.2%, 97.0%, and 94.7% accuracies, respectively. Our proposed model shows the highest prediction accuracies compared to existing models. Subsequently, the developed 4mC-RF model was deployed as a web server. A more accurate predictor of 4mC sites helps experimental scientists obtain faster, more efficient, and more cost-effective results.
Affiliation(s)
- Wajdi Alghamdi
  - Department of Information Technology, Faculty of Computing and Information Technology, King Abdulaziz University, P.O. Box 80221, Jeddah 21589, Saudi Arabia
- Ebraheem Alzahrani
  - Department of Mathematics, Faculty of Science, King Abdulaziz University, P.O. Box 80203, Jeddah 21589, Saudi Arabia
- Malik Zaka Ullah
  - Department of Mathematics, Faculty of Science, King Abdulaziz University, P.O. Box 80203, Jeddah 21589, Saudi Arabia
- Yaser Daanial Khan
  - Department of Computer Science, University of Management and Technology, Lahore 54770, Pakistan
12
Akmal MA, Hussain W, Rasool N, Khan YD, Khan SA, Chou KC. Using CHOU'S 5-Steps Rule to Predict O-Linked Serine Glycosylation Sites by Blending Position Relative Features and Statistical Moment. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2021; 18:2045-2056. [PMID: 31985438 DOI: 10.1109/tcbb.2020.2968441] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Indexed: 06/10/2023]
Abstract
Glycosylation of proteins in eukaryotic cells is an important and complicated post-translational modification due to its pivotal role and association with crucial physiological functions within most proteins. Identification of glycosylation sites in a polypeptide chain is not an easy task due to multiple impediments. Analytical identification of these sites is expensive and laborious. There is a dire need to develop a reliable computational method for precise determination of such sites, which can help researchers save time and effort. Herein, we propose a novel predictor, namely iGlycoS-PseAAC, by integrating Chou's Pseudo Amino Acid Composition (PseAAC) and relative/absolute position-based features. The self-consistency results show that the accuracy of the model on the benchmark dataset for prediction of O-linked glycosylation at serine sites is 98.8 percent. The overall accuracy of the predictor achieved through 10-fold cross-validation by combining the positive and negative results is 97.2 percent. The overall accuracy achieved through the Jackknife test is 96.195 percent by aggregating all the prediction results. Thus the proposed predictor can help in predicting O-linked glycosylated serine sites in an efficient and accurate way. The overall results show that the accuracy of iGlycoS-PseAAC is higher than that of the existing tools.
13
Malebary SJ, Khan YD. Evaluating machine learning methodologies for identification of cancer driver genes. Sci Rep 2021; 11:12281. [PMID: 34112883 PMCID: PMC8192921 DOI: 10.1038/s41598-021-91656-8] [Citation(s) in RCA: 35] [Impact Index Per Article: 11.7] [Received: 01/06/2021] [Accepted: 05/19/2021] [Indexed: 02/06/2023]
Abstract
Cancer is driven by distinctive sorts of changes and basic variations in genes. Recognizing cancer driver genes is basic for accurate oncological analysis. Numerous methodologies to distinguish and identify drivers presently exist, but efficient tools to combine and optimize them on huge datasets are few. Most strategies for prioritizing transformations depend basically on frequency-based criteria. Strategies are required to dependably prioritize organically dynamic driver changes over inert passengers in high-throughput sequencing cancer information sets. This study proposes a model, namely PCDG-Pred, which works as a utility capable of distinguishing cancer driver and passenger attributes of genes based on sequencing data. Keeping in view the significance of the cancer driver genes, an efficient method is proposed to identify them. Further, various validation techniques are applied at different levels to establish the effectiveness of the model and to obtain metrics such as accuracy, Matthews correlation coefficient, sensitivity, and specificity. The results of the study strongly indicate that the proposed strategy provides a fundamental functional advantage over other existing strategies for cancer driver gene identification. Careful experiments show that the accuracy metrics obtained for self-consistency, independent set, and cross-validation tests are 91.08%, 87.26%, and 92.48%, respectively.
Affiliation(s)
- Sharaf J Malebary
  - Department of Information Technology, Faculty of Computing and Information Technology, King Abdulaziz University, P.O. Box 344, Rabigh, 21911, Saudi Arabia
- Yaser Daanial Khan
  - Department of Computer Science, School of Systems and Technology, University of Management and Technology, Lahore, Pakistan
14
Naseer S, Hussain W, Khan YD, Rasool N. NPalmitoylDeep-PseAAC: A Predictor of N-Palmitoylation Sites in Proteins Using Deep Representations of Proteins and PseAAC via Modified 5-Steps Rule. Curr Bioinform 2021. [DOI: 10.2174/1574893615999200605142828] [Citation(s) in RCA: 28] [Impact Index Per Article: 9.3] [Indexed: 11/22/2022]
Abstract
Background:
Among all the major post-translational modifications, lipid modifications possess special significance due to their widespread functional importance in eukaryotic cells. There exist multiple types of lipid modifications, and Palmitoylation is one of the broader types, itself having three different types. N-Palmitoylation is carried out by attachment of palmitic acid to an N-terminal cysteine. Due to the association of N-Palmitoylation with various biological functions and diseases such as Alzheimer's and other neurodegenerative diseases, its identification is very important.
Objective:
The in vitro, ex vivo, and in vivo identification of Palmitoylation is laborious, time-consuming, and costly. There is a dire need for an efficient and accurate computational model to help researchers and biologists identify these sites easily. Herein, we propose a novel prediction model for the identification of N-Palmitoylation sites in proteins.
Method:
The proposed prediction model is developed by combining Chou's Pseudo Amino Acid Composition (PseAAC) with deep neural networks. We used well-known deep neural networks (DNNs) for both the tasks of learning a feature representation of peptide sequences and developing a prediction model to perform classification.
Results:
Among the different DNNs, the Gated Recurrent Unit (GRU) based RNN model showed the highest scores in terms of accuracy and all other computed measures, and outperformed all the previously reported predictors.
Conclusion:
The proposed GRU-based RNN model can help to identify N-Palmitoylation in a very efficient and accurate manner, which can help scientists understand the mechanism of this modification in proteins.
Affiliation(s)
- Sheraz Naseer
- Department of Computer Science, School of Systems and Technology, University of Management and Technology, P.O. Box 10033, C-II, Johar Town, Lahore 54770, Pakistan
| | - Waqar Hussain
- National Center of Artificial Intelligence, Punjab University College of Information Technology, University of the Punjab, Lahore, Pakistan
| | - Yaser Daanial Khan
- Department of Computer Science, School of Systems and Technology, University of Management and Technology, P.O. Box 10033, C-II, Johar Town, Lahore 54770, Pakistan
| | - Nouman Rasool
- Dr Panjwani Center for Molecular Medicine and Drug Research, International Center for Chemical and Biological Sciences, University of Karachi, Karachi, 75270, Pakistan
|
15
|
Wang Y, Li Z, Zhang Y, Ma Y, Huang Q, Chen X, Dai Z, Zou X. Performance improvement for a 2D convolutional neural network by using SSC encoding on protein-protein interaction tasks. BMC Bioinformatics 2021; 22:184. [PMID: 33845759 PMCID: PMC8042949 DOI: 10.1186/s12859-021-04111-w] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 08/22/2020] [Accepted: 03/30/2021] [Indexed: 11/18/2022] Open
Abstract
BACKGROUND The interactions of proteins are determined by their sequences and affect the regulation of the cell cycle, signal transduction and metabolism, which is of extraordinary significance to modern proteomics research. Despite advances in experimental technology, it is still expensive, laborious, and time-consuming to determine protein-protein interactions (PPIs), and there is a strong demand for effective bioinformatics approaches to identify potential PPIs. Considering the large amount of PPI data, a high-performance processor can be utilized to enhance the capability of the deep learning method and directly predict protein sequences. RESULTS We propose the Sequence-Statistics-Content protein sequence encoding format (SSC) based on information extraction from the original sequence for further performance improvement of the convolutional neural network. The original protein sequences are encoded in the three-channel format by introducing statistical information (the second channel) and bigram encoding information (the third channel), which can increase the unique sequence features to enhance the performance of the deep learning model. On predicting protein-protein interaction tasks, the results using the 2D convolutional neural network (2D CNN) with the SSC encoding method are better than those of the 1D CNN with one hot encoding. The independent validation of new interactions from the HIPPIE database (version 2.1 published on July 18, 2017) and the validation of directly predicted results by applying a molecular docking tool indicate the effectiveness of the proposed protein encoding improvement in the CNN model. CONCLUSION The proposed protein sequence encoding method is efficient at improving the capability of the CNN model on protein sequence-related tasks and may also be effective at enhancing the capability of other machine learning or deep learning methods. 
Prediction accuracy and molecular docking validation showed considerable improvement compared to the existing one-hot encoding method, indicating that the SSC encoding method may be useful for analyzing protein sequence-related tasks. The source code of the proposed methods is freely available for academic research at https://github.com/wangy496/SSC-format/ .
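To make the contrast between one-hot encoding and the extra statistical and bigram information concrete, the sketch below shows three simple sequence encodings in plain Python. It is an illustration under stated assumptions, not the SSC format itself: the abstract does not specify the exact channel layout, and all function names here are hypothetical.

```python
# The 20 standard amino acids; non-standard residues (e.g. "X") are not
# handled in this sketch and would raise a KeyError.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot(sequence):
    """One row per residue, with a single 1 in the column of that residue."""
    rows = []
    for aa in sequence:
        row = [0] * len(AMINO_ACIDS)
        row[AA_INDEX[aa]] = 1
        rows.append(row)
    return rows

def residue_frequencies(sequence):
    """Simple statistical information: relative frequency of each residue."""
    return [sequence.count(aa) / len(sequence) for aa in AMINO_ACIDS]

def bigram_ids(sequence):
    """Map each overlapping residue pair to an integer in [0, 400)."""
    n = len(AMINO_ACIDS)
    return [AA_INDEX[a] * n + AA_INDEX[b] for a, b in zip(sequence, sequence[1:])]

# Toy peptide for illustration only.
seq = "ACDA"
encoded = one_hot(seq)
freqs = residue_frequencies(seq)
pairs = bigram_ids(seq)
```

One-hot rows carry only residue identity, whereas the frequency and bigram views add composition and local-order information, which is the kind of enrichment the abstract attributes to the second and third channels.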
Affiliation(s)
- Yang Wang
- School of Chemistry, Sun Yat-Sen University, Guangzhou, 510275, People's Republic of China
| | - Zhanchao Li
- School of Chemistry and Chemical Engineering, Guangdong Pharmaceutical University, Guangzhou, 510006, People's Republic of China
| | - Yanfei Zhang
- School of Chemistry, Sun Yat-Sen University, Guangzhou, 510275, People's Republic of China
| | - Yingjun Ma
- School of Chemistry, Sun Yat-Sen University, Guangzhou, 510275, People's Republic of China
| | - Qixing Huang
- School of Chemistry and Chemical Engineering, Guangdong Pharmaceutical University, Guangzhou, 510006, People's Republic of China
| | - Xingyu Chen
- School of Chemistry and Chemical Engineering, Guangdong Pharmaceutical University, Guangzhou, 510006, People's Republic of China
| | - Zong Dai
- School of Chemistry, Sun Yat-Sen University, Guangzhou, 510275, People's Republic of China
- Research Institute of Sun Yat-Sen University in Shenzhen, Shenzhen, 518000, People's Republic of China
| | - Xiaoyong Zou
- School of Chemistry, Sun Yat-Sen University, Guangzhou, 510275, People's Republic of China.
- Research Institute of Sun Yat-Sen University in Shenzhen, Shenzhen, 518000, People's Republic of China.
|
16
|
iAmideV-Deep: Valine Amidation Site Prediction in Proteins Using Deep Learning and Pseudo Amino Acid Compositions. Symmetry (Basel) 2021. [DOI: 10.3390/sym13040560] [Citation(s) in RCA: 17] [Impact Index Per Article: 5.7] [Indexed: 02/01/2023] Open
Abstract
Amidation is an important post-translational modification in which a peptide ends with an amide group (–NH2) rather than a carboxyl group (–COOH). These amidated peptides are less sensitive to proteolytic degradation and have an extended half-life in the bloodstream. Amides are used in different industries like pharmaceuticals, natural products, and biologically active compounds. The in-vivo, ex-vivo, and in-vitro identification of amidation sites is a costly and time-consuming but important task for studying the physicochemical properties of amidated peptides. A less costly and efficient alternative is to supplement wet lab experiments with accurate computational models. Hence, an urgent need exists for efficient and accurate computational models to easily identify amidated sites in peptides. In this study, we present a new predictor, based on deep neural networks (DNN) and Pseudo Amino Acid Compositions (PseAAC), to learn efficient, task-specific, and effective representations for valine amidation site identification. Well-known DNN architectures are used in this contribution to learn peptide sequence representations and classify peptide chains. Of all the DNN based predictors developed in this study, the convolutional neural network-based model showed the best performance, surpassing all other DNN based models and contributions reported in the literature. The proposed model will supplement in-vivo methods and help scientists to determine valine amidation very efficiently and accurately, which in turn will enhance understanding of valine amidation in different biological processes.
|
17
|
Awais M, Hussain W, Khan YD, Rasool N, Khan SA, Chou KC. iPhosH-PseAAC: Identify Phosphohistidine Sites in Proteins by Blending Statistical Moments and Position Relative Features According to the Chou's 5-Step Rule and General Pseudo Amino Acid Composition. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2021; 18:596-610. [PMID: 31144645 DOI: 10.1109/tcbb.2019.2919025] [Citation(s) in RCA: 48] [Impact Index Per Article: 16.0] [Indexed: 06/09/2023]
Abstract
Protein phosphorylation is one of the key mechanisms in prokaryotes and eukaryotes and is responsible for various biological functions such as protein degradation, intracellular localization, a multitude of cellular processes, molecular association, cytoskeletal dynamics, and enzymatic inhibition/activation. Phosphohistidine (PhosH) has a key role in a number of biological processes, from central metabolism to signalling, in eukaryotes and bacteria. Thus, identification of phosphohistidine sites in a protein sequence is crucial, yet experimental identification can be expensive, time-consuming, and laborious. To address this problem, we propose a novel computational model, namely iPhosH-PseAAC, for prediction of phosphohistidine sites in a given protein sequence using pseudo amino acid composition (PseAAC), statistical moments, and position relative features. The results of the proposed predictor are validated through self-consistency testing, 10-fold cross-validation, and jackknife testing. Self-consistency validation gave 100 percent accuracy, whereas the accuracy achieved for cross-validation is 94.26 percent. Moreover, jackknife testing gave 97.07 percent accuracy for the proposed model. Thus, the proposed model iPhosH-PseAAC is well able to predict PhosH sites in given proteins.
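The validation schemes named in this abstract differ only in how samples are split. The following is a minimal sketch of the index splits for k-fold cross-validation and the jackknife (leave-one-out) test. It is not the authors' code, the function names are hypothetical, and shuffling and class stratification are deliberately left out.

```python
def k_fold_splits(n_samples, k):
    """Return k (train_indices, test_indices) pairs covering every sample once."""
    indices = list(range(n_samples))
    # Spread the remainder so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    splits, start = [], 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        splits.append((train, test))
        start += size
    return splits

def jackknife_splits(n_samples):
    """Jackknife test: each sample is held out exactly once (k equals n)."""
    return k_fold_splits(n_samples, n_samples)

# Toy sizes for illustration only.
ten_fold = k_fold_splits(25, 10)
leave_one_out = jackknife_splits(4)
```

Self-consistency testing, by contrast, trains and tests on the same samples, which is why it typically reports the highest accuracy of the three.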
|
18
|
Khan YD, Alzahrani E, Alghamdi W, Ullah MZ. Sequence-based Identification of Allergen Proteins Developed by Integration of PseAAC and Statistical Moments via 5-Step Rule. Curr Bioinform 2021. [DOI: 10.2174/1574893615999200424085947] [Citation(s) in RCA: 21] [Impact Index Per Article: 7.0] [Indexed: 01/24/2023]
Abstract
Background:
Allergens are antigens that can stimulate an atopic type I human hypersensitivity reaction by an immunoglobulin E (IgE) reaction. Some proteins are naturally more allergenic than others. The challenge for toxicologists is to identify properties that allow proteins to cause allergic sensitization and allergic diseases. The identification of allergen proteins is a very critical and pivotal task. The experimental identification of protein functions is a hectic, laborious and costly task; therefore, computer scientists have proposed various methods in the field of computational biology and bioinformatics using various data science approaches.
Objectives:
Herein, we report a novel predictor for the identification of allergen proteins.
Methods:
For feature extraction, statistical moments and various position-based features have been incorporated into Chou’s pseudo amino acid composition (PseAAC), and are used for training of a neural network.
Results:
The predictor is validated through 10-fold cross-validation and Jackknife testing, which gave accuracies of 99.43% and 99.87%, respectively.
Conclusions:
Thus, the proposed predictor can help in predicting allergen proteins in an efficient and accurate way and can provide baseline data for the discovery of new drugs and biomarkers.
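As a rough illustration of how composition and statistical moments can be combined into a fixed-length feature vector, consider the sketch below. It is a simplified stand-in, not the authors' PseAAC formulation: the integer coding of residues, the choice of moment orders, and all function names are assumptions made for illustration.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition(sequence):
    """Relative frequency of each of the 20 standard amino acids."""
    return [sequence.count(aa) / len(sequence) for aa in AMINO_ACIDS]

def raw_moments(values, max_order=3):
    """Raw moments of orders 1..max_order: the mean of v**k."""
    n = len(values)
    return [sum(v ** k for v in values) / n for k in range(1, max_order + 1)]

def central_moments(values, max_order=3):
    """Central moments of orders 2..max_order: the mean of (v - mean)**k."""
    n = len(values)
    mean = sum(values) / n
    return [sum((v - mean) ** k for v in values) / n for k in range(2, max_order + 1)]

def feature_vector(sequence):
    """Concatenate composition with moments of an assumed integer residue coding."""
    # Hypothetical coding: each residue maps to its 1-based index in AMINO_ACIDS.
    codes = [AMINO_ACIDS.index(aa) + 1 for aa in sequence]
    return composition(sequence) + raw_moments(codes) + central_moments(codes)

# Toy peptide for illustration only.
features = feature_vector("ACDK")
```

The resulting vector has the same length for every sequence, which is what allows variable-length proteins to be fed to a neural network classifier.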
Affiliation(s)
- Yaser Daanial Khan
- Department of Computer Science, School of Systems and Technology, University of Management and Technology, C II Johar Town, Lahore 54770, Pakistan
| | - Ebraheem Alzahrani
- Department of Mathematics, Faculty of Science, King Abdulaziz University, P.O. Box 80203, Jeddah 21589, Saudi Arabia
| | - Wajdi Alghamdi
- Department of Information Technology, Faculty of Computing and Information Technology, King Abdulaziz University, P.O. Box 80221, Jeddah, Saudi Arabia
| | - Malik Zaka Ullah
- Department of Mathematics, Faculty of Science, King Abdulaziz University, P.O. Box 80203, Jeddah 21589, Saudi Arabia
|
19
|
Abstract
Background Many transcripts have been generated due to the development of sequencing technologies, and lncRNA is an important type of transcript. Predicting lncRNAs from transcripts is a challenging and important task. Traditional experimental lncRNA prediction methods are time-consuming and labor-intensive. Efficient computational methods for lncRNA prediction are in demand. Results In this paper, we propose two lncRNA prediction methods based on feature ensemble learning strategies named LncPred-IEL and LncPred-ANEL. Specifically, we encode sequences into six different types of features including transcript-specified features and general sequence-derived features. Then we consider two feature ensemble strategies to utilize and integrate the information in different feature types, the iterative ensemble learning (IEL) and the attention network ensemble learning (ANEL). IEL employs a supervised iterative way to ensemble base predictors built on six different types of features. ANEL introduces an attention mechanism-based deep learning model to ensemble features by adaptively learning the weight of individual feature types. Experiments demonstrate that both LncPred-IEL and LncPred-ANEL can effectively separate lncRNAs and other transcripts in feature space. Moreover, comparison experiments demonstrate that LncPred-IEL and LncPred-ANEL outperform several state-of-the-art methods when evaluated by 5-fold cross-validation. Both methods have good performances in cross-species lncRNA prediction. Conclusions LncPred-IEL and LncPred-ANEL are promising lncRNA prediction tools that can effectively utilize and integrate the information in different types of features.
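The idea of weighting individual feature types can be sketched with a softmax over per-feature-type scores followed by a weighted sum of base-predictor outputs. This is only an illustration of the combination step: in an attention network the scores would be learned, whereas here they are supplied by hand, and the function names are hypothetical.

```python
import math

def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_ensemble(probabilities, scores):
    """Weighted average of per-feature-type predicted probabilities."""
    weights = softmax(scores)
    return sum(w * p for w, p in zip(weights, probabilities))

# Hypothetical outputs of six base predictors (one per feature type)
# and hand-picked attention scores, for illustration only.
base_probs = [0.9, 0.7, 0.8, 0.4, 0.6, 0.95]
attn_scores = [1.2, 0.3, 0.8, -1.0, 0.0, 1.5]
combined = attention_ensemble(base_probs, attn_scores)
```

Because the weights are positive and sum to one, the combined probability always lies between the smallest and largest base prediction.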
Affiliation(s)
- Yanzhen Xu
- College of Informatics, Huazhong Agricultural University, Wuhan, 430070, China
| | - Xiaohan Zhao
- College of Informatics, Huazhong Agricultural University, Wuhan, 430070, China
| | - Shuai Liu
- College of Informatics, Huazhong Agricultural University, Wuhan, 430070, China
| | - Wen Zhang
- College of Informatics, Huazhong Agricultural University, Wuhan, 430070, China.
|
20
|
Naseer S, Hussain W, Khan YD, Rasool N. Optimization of serine phosphorylation prediction in proteins by comparing human engineered features and deep representations. Anal Biochem 2020; 615:114069. [PMID: 33340540 DOI: 10.1016/j.ab.2020.114069] [Citation(s) in RCA: 28] [Impact Index Per Article: 7.0] [Received: 04/13/2020] [Revised: 11/15/2020] [Accepted: 12/14/2020] [Indexed: 02/01/2023]
Abstract
Deep representations can be used to replace human-engineered representations, as such features are constrained by certain limitations. For the prediction of protein post-translational modification (PTM) sites, the research community uses different feature extraction techniques applied to pseudo amino acid compositions (PseAAC). Serine phosphorylation is one of the most important PTMs, as it is the most frequently occurring and is important for various biological functions. Creating efficient representations from large protein sequences to predict PTM sites is a time- and resource-intensive task. In this study we propose, implement and evaluate the use of deep learning to learn effective protein data representations from PseAAC in order to develop data-driven PTM detection systems, and compare these with two human-engineered representations. The comparisons are performed by training an XGBoost-based classifier on each representation. The best scores were achieved by the RNN-LSTM based deep representation and the CNN based representation, with accuracy scores of 81.1% and 78.3% respectively. Human-engineered representations scored 77.3% and 74.9% respectively. Based on these results, it is concluded that deep features are a promising replacement for feature engineering to identify PhosS sites in a very efficient and accurate manner, which can help scientists understand the mechanism of this modification in proteins.
Affiliation(s)
- Sheraz Naseer
- Department of Computer Science, University of Management and Technology, Lahore, Pakistan.
| | - Waqar Hussain
- National Center of Artificial Intelligence, Punjab University College of Information Technology, University of the Punjab, Lahore, Pakistan; Center for Professional & Applied Studies, Lahore, Pakistan
| | - Yaser Daanial Khan
- Department of Computer Science, University of Management and Technology, Lahore, Pakistan
| | - Nouman Rasool
- Center for Professional & Applied Studies, Lahore, Pakistan
|
21
|
Liu GH, Zhang BW, Qian G, Wang B, Mao B, Bichindaritz I. Bioimage-Based Prediction of Protein Subcellular Location in Human Tissue with Ensemble Features and Deep Networks. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2020; 17:1966-1980. [PMID: 31107658 DOI: 10.1109/tcbb.2019.2917429] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Indexed: 06/09/2023]
Abstract
Prediction of protein subcellular location has become a hot topic because it has proven useful both for understanding disease mechanisms and for novel drug design. With the rapid development of automated microscopic imaging technology in recent years, bioimage-based methods for classifying protein subcellular location have attracted considerable attention, because images can describe the protein distribution intuitively and in detail. In the current study, a prediction method for protein subcellular location was proposed based on multi-view image features extracted from three different views: the four texture features of the original image, the global and local features of the protein extracted from the protein channel images after color segmentation, and the global features of DNA extracted from the DNA channel image. The extracted features were then combined to improve the performance of subcellular localization prediction. By comparing the performance of different feature combinations under the same classifier, the best ensemble features could be obtained. In this work, a classifier based on Stacked Auto-encoders and the random forest was also put forward. To improve the prediction results, the deep network was combined with traditional statistical classification methods. Stringent cross-validation and independent validation tests on the benchmark dataset demonstrated the efficacy of the proposed method.
|
22
|
Liu Y, Yu Z, Chen C, Han Y, Yu B. Prediction of protein crotonylation sites through LightGBM classifier based on SMOTE and elastic net. Anal Biochem 2020; 609:113903. [DOI: 10.1016/j.ab.2020.113903] [Citation(s) in RCA: 21] [Impact Index Per Article: 5.3] [Received: 01/11/2020] [Revised: 07/27/2020] [Accepted: 08/05/2020] [Indexed: 12/18/2022]
|
23
|
Amanat S, Ashraf A, Hussain W, Rasool N, Khan YD. Identification of Lysine Carboxylation Sites in Proteins by Integrating Statistical Moments and Position Relative Features via General PseAAC. Curr Bioinform 2020. [DOI: 10.2174/1574893614666190723114923] [Citation(s) in RCA: 29] [Impact Index Per Article: 7.3] [Indexed: 01/13/2023]
Abstract
Background:
Carboxylation is one of the most biologically important post-translational modifications and occurs on lysine, arginine, and glutamine residues of a protein. Among these three, the covalent attachment of the carboxyl group to the lysine side chain is the most frequent and biologically important type of carboxylation. For studying such biological functions, it is essential to correctly determine the lysine sites sensitive to carboxylation.
Objective:
Herein, we present a computational model, based on machine learning, for the prediction of carboxylysine sites.
Methods:
Various position and composition relative features have been incorporated into the PseAAC for construction of feature vectors, and a neural network is employed as a classifier. The model is validated by jackknife, cross-validation, self-consistency, and independent testing.
Results:
The results of the self-consistency test showed that the model has 99.76% Acc, 99.76% Sp, 99.76% Sp, and 0.99 MCC. Using the jackknife method, prediction model validation gave 97.07% Acc, while for 10-fold cross-validation, prediction model validation gave 95.16% Acc.
Conclusion:
The result of independent dataset testing was 94.3%, which illustrates that the proposed model performs better than the existing model PreLysCar; however, the accuracy can be improved further in the future, as the number of known carboxylysine sites in proteins increases.
Affiliation(s)
- Saba Amanat
- Department of Computer Science, School of Systems and Technology, University of Management and Technology, Lahore, Pakistan
| | - Adeel Ashraf
- Department of Computer Science, School of Systems and Technology, University of Management and Technology, Lahore, Pakistan
| | - Waqar Hussain
- Department of Computer Science, School of Systems and Technology, University of Management and Technology, Lahore, Pakistan
| | - Nouman Rasool
- Department of Life Sciences, School of Science University of Management and Technology, Lahore, Pakistan
| | - Yaser D. Khan
- Department of Computer Science, School of Systems and Technology, University of Management and Technology, Lahore, Pakistan
|
24
|
Shah AA, Khan YD. Identification of 4-carboxyglutamate residue sites based on position based statistical feature and multiple classification. Sci Rep 2020; 10:16913. [PMID: 33037248 PMCID: PMC7547663 DOI: 10.1038/s41598-020-73107-y] [Citation(s) in RCA: 26] [Impact Index Per Article: 6.5] [Received: 03/08/2020] [Accepted: 08/20/2020] [Indexed: 11/08/2022] Open
Abstract
Glutamic acid is an alpha-amino acid used by all living beings in protein biosynthesis. One of the important glutamic acid modifications is post-translationally modified 4-carboxyglutamate. It has a significant role in blood coagulation. 4-carboxyglutamates are required for the binding of calcium ions. On the other hand, this modification can also cause different diseases such as bone resorption, osteoporosis, papilloma, and plaque atherosclerosis. Considering its importance, it is necessary to predict the occurrence of glutamic acid carboxylation in amino acid stretches. As there is no computational prediction model available to identify 4-carboxyglutamate modification, this study is designed to predict 4-carboxyglutamate sites at a lower computational cost. A machine learning model is devised with a Multilayered Perceptron (MLP) classifier using Chou's 5-step rule. The model learns from statistical moments and, based on this learning, predicts whether a given residue site is a 4-carboxyglutamate site or not. The prediction accuracy of the proposed model is 94% using an independent set test, while the prediction accuracy obtained by self-consistency tests is 99%.
Affiliation(s)
- Asghar Ali Shah
- Department of Computer Sciences, Bahria University Lahore Campus, Lahore, 25000, Pakistan.
|
25
|
Behbahani M, Rabiei P, Mohabatkar H. A Comparative Analysis of Allergen Proteins between Plants and Animals Using Several Computational Tools and Chou's PseAAC Concept. Int Arch Allergy Immunol 2020; 181:813-821. [PMID: 32906141 DOI: 10.1159/000509084] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 03/14/2020] [Accepted: 05/29/2020] [Indexed: 11/19/2022] Open
Abstract
BACKGROUND A large number of allergens are derived from plant and animal proteins. A major challenge for researchers is to study the possible allergenic properties of proteins. The aim of this study was in silico analysis and comparison of several physicochemical and structural features of plant- and animal-derived allergen proteins, as well as classifying these proteins based on Chou's pseudo-amino acid composition (PseAAC) concept combined with bioinformatics algorithms. METHODS The physicochemical properties and secondary structure of plant and animal allergens were studied. The classification of the sequences was done using the PseAAC concept combined with a deep learning algorithm. Conserved motifs of plant and animal proteins were discovered using the MEME tool. B-cell and T-cell epitopes of the proteins were predicted in conserved motifs. Allergenicity and amino acid composition of epitopes were also analyzed via bioinformatics servers. RESULTS In the comparison of physicochemical features of animal and plant allergens, the extinction coefficient differed significantly. Secondary structure prediction showed more random coil structure in plant allergen proteins compared with animal proteins. Classification of proteins based on PseAAC achieved 88.24% accuracy. The amino acid composition study of predicted B- and T-cell epitopes revealed a higher aliphatic index in plant-derived epitopes. CONCLUSIONS The results indicated that bioinformatics-based studies could be useful in comparing plant and animal allergens.
Affiliation(s)
- Mandana Behbahani
- Department of Biotechnology, Faculty of Biological Science and Technology, University of Isfahan, Isfahan, Iran
| | - Parisa Rabiei
- Department of Biotechnology, Faculty of Biological Science and Technology, University of Isfahan, Isfahan, Iran
| | - Hassan Mohabatkar
- Department of Biotechnology, Faculty of Biological Science and Technology, University of Isfahan, Isfahan, Iran
|
26
|
Hussain W, Rasool N, Khan YD. Insights into Machine Learning-based Approaches for Virtual Screening in Drug Discovery: Existing Strategies and Streamlining Through FP-CADD. Curr Drug Discov Technol 2020; 18:463-472. [PMID: 32767944 DOI: 10.2174/1570163817666200806165934] [Citation(s) in RCA: 19] [Impact Index Per Article: 4.8] [Received: 04/03/2020] [Revised: 07/01/2020] [Accepted: 07/03/2020] [Indexed: 11/22/2022]
Abstract
BACKGROUND Machine learning is an active area of research in computer science, driven by the availability of big data collections of all sorts, which has prompted interest in the development of novel tools for data mining. Machine learning methods have wide applications in computer-aided drug discovery. Many machine learning approaches are used in drug design, which further aid the process of biological modelling in drug discovery. Two main categories exist, Ligand-Based Virtual Screening (LBVS) and Structure-Based Virtual Screening (SBVS); however, machine learning approaches fall mostly into the category of LBVS. OBJECTIVES This study describes the major machine learning approaches used in LBVS. Moreover, we introduce a protocol named FP-CADD (the four protocols of computer-aided drug discovery), which sets out a 4-step rule of thumb for drug discovery. Various important aspects of FP-CADD, along with a SWOT analysis, are also discussed in this article. CONCLUSION Through this study, we observed that among LBVS algorithms, Support Vector Machines (SVM) and Random Forest (RF) are the most widely used owing to their high accuracy and efficiency. These virtual screening approaches have the potential to revolutionize the field of drug design. We also believe that the process flow presented in this study, named FP-CADD, can streamline the whole process of computer-aided drug discovery. By adopting this rule, studies related to drug discovery can be made homogeneous, and the protocol can also be considered as an evaluation criterion in the peer-review process of research articles.
Affiliation(s)
| | | | - Yaser Daanial Khan
- Department of Computer Science, University of Management and Technology, Lahore, Pakistan
|
27
|
Chou KC. An Insightful 10-year Recollection Since the Emergence of the 5-steps Rule. Curr Pharm Des 2020; 25:4223-4234. [PMID: 31782354 DOI: 10.2174/1381612825666191129164042] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 09/18/2019] [Accepted: 11/25/2019] [Indexed: 11/22/2022]
Abstract
OBJECTIVE One of the most challenging problems is how to formulate a biological sequence as a vector while still keeping considerable sequence-order information. METHODS To address such a problem, the approach of Pseudo Amino Acid Components or PseAAC has been developed. RESULTS AND CONCLUSION It has become increasingly clear via the 10-year recollection that the aforementioned proposal has indeed been very powerful.
Affiliation(s)
- Kuo-Chen Chou
- Gordon Life Science Institute, Boston, Massachusetts 02478, United States; Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
|
28
|
Hu Y, Lu Y, Wang S, Zhang M, Qu X, Niu B. Application of Machine Learning Approaches for the Design and Study of Anticancer Drugs. Curr Drug Targets 2020; 20:488-500. [PMID: 30091413 DOI: 10.2174/1389450119666180809122244] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Received: 05/17/2018] [Revised: 06/19/2018] [Accepted: 06/25/2018] [Indexed: 12/14/2022]
Abstract
BACKGROUND Globally, the numbers of cancer patients and cancer deaths continue to increase yearly, and cancer has therefore become one of the world's leading causes of morbidity and mortality. In recent years, the study of anticancer drugs has become one of the most popular medical topics. OBJECTIVE In this review, in order to study the application of machine learning in predicting anticancer drug activity, several machine learning approaches were selected, namely Linear Discriminant Analysis (LDA), Principal Component Analysis (PCA), Support Vector Machine (SVM), Random Forest (RF), k-Nearest Neighbor (kNN), and Naïve Bayes (NB), and examples of their applications in anticancer drug design are listed. RESULTS Machine learning contributes a lot to anticancer drug design and helps researchers by saving time and cost. However, it can only be an assisting tool for drug design. CONCLUSION This paper introduces the application of machine learning approaches in anticancer drug design. Many examples of successful identification and prediction in the area of anticancer drug activity are discussed, and anticancer drug research is still in active progress. Moreover, the merits of some web servers related to anticancer drugs are mentioned.
Affiliation(s)
- Yan Hu
- School of Life Sciences, Shanghai University, Shanghai 200444, China
| | - Yi Lu
- School of Life Sciences, Shanghai University, Shanghai 200444, China
| | - Shuo Wang
- School of Life Sciences, Shanghai University, Shanghai 200444, China
| | - Mengying Zhang
- School of Life Sciences, Shanghai University, Shanghai 200444, China
| | - Xiaosheng Qu
- National Engineering Laboratory of Southwest Endangered Medicinal Resources Development, Guangxi Botanical Garden of Medicinal Plants, Nanning 530023, China
| | - Bing Niu
- School of Life Sciences, Shanghai University, Shanghai 200444, China
| |
|
29
|
Feng P, Wang Z. Recent Advances in Computational Methods for Identifying Anticancer Peptides. Curr Drug Targets 2020; 20:481-487. [PMID: 30068270 DOI: 10.2174/1389450119666180801121548] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 05/10/2018] [Revised: 05/28/2018] [Accepted: 05/28/2018] [Indexed: 01/10/2023]
Abstract
Anticancer peptides (ACPs) are small peptides that can kill cancer cells without damaging normal cells. In recent years, ACPs have been used pre-clinically for cancer treatment. Accurate identification of ACPs will therefore promote their clinical application. As an alternative to labor-intensive experimental techniques, a series of computational methods have been proposed for identifying ACPs. In this review, we briefly summarize the current progress in computational identification of ACPs. The challenges and future perspectives in developing reliable methods for identifying ACPs are also discussed. We anticipate that this review could provide novel insights for future research on anticancer peptides.
Affiliation(s)
- Pengmian Feng
- School of Public Health, North China University of Science and Technology, Tangshan, 063000, China
| | - Zhenyi Wang
- Center for Genomics and Computational Biology, School of Life Science, North China University of Science and Technology, Tangshan, 063000, China
| |
|
30
|
|
31
|
Zheng L, Huang S, Mu N, Zhang H, Zhang J, Chang Y, Yang L, Zuo Y. RAACBook: a web server of reduced amino acid alphabet for sequence-dependent inference by using Chou's five-step rule. Database: The Journal of Biological Databases and Curation 2020; 2019:5650975. [PMID: 31802128 PMCID: PMC6893003 DOI: 10.1093/database/baz131] [Citation(s) in RCA: 41] [Impact Index Per Article: 10.3] [Received: 06/20/2019] [Revised: 10/16/2019] [Accepted: 10/17/2019] [Indexed: 12/12/2022]
Abstract
By reducing the amino acid alphabet, protein complexity can be significantly simplified, which can improve computational efficiency, decrease information redundancy and reduce the chance of overfitting. Although some reduced alphabets have been proposed, different classification rules can produce distinctive results for protein sequence analysis. Thus, a systematic framework for reduced alphabets is urgently needed. In this work, we constructed a comprehensive web server called RAACBook for protein sequence analysis and machine learning applications by integrating reduced alphabets. The web server contains three parts: (i) 74 types of reduced amino acid alphabet were manually extracted to generate 673 reduced amino acid clusters (RAACs) for dealing with unique protein problems. Users can easily select the desired RAACs from a multilayer browser tool. (ii) An online tool was developed to analyze the primary sequence of a protein. The tool can produce K-tuple reduced amino acid composition by defining three correlation parameters (K-tuple, g-gap, λ-correlation). The results are visualized as a sequence alignment, a merged RAA composition, a feature distribution and a logo of the reduced sequence. (iii) A machine learning server is provided to train protein classification models based on K-tuple RAAC. The optimal model can be selected according to the evaluation indexes (ROC, AUC, MCC, etc.). In conclusion, RAACBook provides a powerful and user-friendly service for protein sequence analysis and computational proteomics. RAACBook is freely available at http://bioinfor.imu.edu.cn/raacbook. Database URL: http://bioinfor.imu.edu.cn/raacbook
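The K-tuple reduced amino acid composition described in this abstract can be illustrated with a minimal sketch. The five-cluster alphabet below is a made-up example for illustration, not one of the RAACs catalogued by RAACBook, and the function names are hypothetical. The gap handling is also an assumption (each residue in a tuple is separated from the next by `gap` skipped positions), and the λ-correlation parameter is not modelled.

```python
from collections import Counter
from itertools import product

# Hypothetical 5-cluster reduction of the 20 amino acids (illustration only).
CLUSTERS = ["AVLIMC", "FWYH", "STNQ", "KR", "DEGP"]


def reduce_sequence(seq, clusters=CLUSTERS):
    """Replace each residue by the first letter of its cluster; skip unknown symbols."""
    lookup = {aa: cluster[0] for cluster in clusters for aa in cluster}
    return "".join(lookup[aa] for aa in seq.upper() if aa in lookup)


def ktuple_composition(seq, k=2, gap=0, clusters=CLUSTERS):
    """Return the normalized frequency of every gapped k-tuple over the reduced alphabet."""
    reduced = reduce_sequence(seq, clusters)
    step = gap + 1                    # distance between consecutive tuple positions
    span = (k - 1) * step + 1         # window length covered by one tuple
    counts = Counter(
        reduced[i:i + span:step] for i in range(len(reduced) - span + 1)
    )
    total = sum(counts.values())
    letters = [cluster[0] for cluster in clusters]
    # Fixed-length feature vector: one entry per possible tuple, zero if unseen.
    return {
        "".join(t): (counts["".join(t)] / total if total else 0.0)
        for t in product(letters, repeat=k)
    }


features = ktuple_composition("AVKR", k=2, gap=0)
print(len(features), features["AA"], features["AK"], features["KK"])
```

With five clusters and k = 2 the feature vector has 25 entries regardless of sequence length, which is the dimensionality reduction the abstract refers to (the full 20-letter alphabet would give 400).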
Affiliation(s)
- Lei Zheng
- State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, College of Life Sciences, Inner Mongolia University, Zhaojun Road No.24, Hohhot, 010070, China
| | - Shenghui Huang
- State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, College of Life Sciences, Inner Mongolia University, Zhaojun Road No.24, Hohhot, 010070, China
| | - Nengjiang Mu
- State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, College of Life Sciences, Inner Mongolia University, Zhaojun Road No.24, Hohhot, 010070, China
| | - Haoyue Zhang
- State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, College of Life Sciences, Inner Mongolia University, Zhaojun Road No.24, Hohhot, 010070, China
| | - Jiayu Zhang
- State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, College of Life Sciences, Inner Mongolia University, Zhaojun Road No.24, Hohhot, 010070, China
| | - Yu Chang
- State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, College of Life Sciences, Inner Mongolia University, Zhaojun Road No.24, Hohhot, 010070, China
| | - Lei Yang
- College of Bioinformatics Science and Technology, Harbin Medical University, Baojian Road No.157, Harbin 150081, China
| | - Yongchun Zuo
- State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, College of Life Sciences, Inner Mongolia University, Zhaojun Road No.24, Hohhot, 010070, China
| |
|
32
|
Hussain W, Rasool N, Khan YD. A Sequence-Based Predictor of Zika Virus Proteins Developed by Integration of PseAAC and Statistical Moments. Comb Chem High Throughput Screen 2020; 23:797-804. [PMID: 32342804 DOI: 10.2174/1386207323666200428115449] [Citation(s) in RCA: 19] [Impact Index Per Article: 4.8] [Received: 12/16/2019] [Revised: 03/17/2020] [Accepted: 03/19/2020] [Indexed: 12/20/2022]
Abstract
BACKGROUND: ZIKV has been a well-known global threat; it hit almost all American countries and posed a serious threat to the entire globe in 2016. The first outbreak of ZIKV was reported in 2007 in the Pacific area, followed by another severe outbreak in 2013/2014, after which ZIKV spread to all other Pacific islands. A broad spectrum of ZIKV-associated neurological malformations in neonates and adults has driven this deadly virus into the limelight. Though tremendous efforts have been focused on understanding the molecular basis of ZIKV, its viral proteins have still not been studied extensively. OBJECTIVES: Herein, we report a novel predictor, the first of its kind, for the identification of ZIKV proteins. METHODS: We have employed Chou's pseudo amino acid composition (PseAAC), statistical moments and various position-based features. RESULTS: The predictor is validated through 10-fold cross-validation and Jackknife testing. In 10-fold cross-validation, 94.09% accuracy, 93.48% specificity, 94.20% sensitivity and 0.80 MCC were achieved, while in Jackknife testing, 96.62% accuracy, 94.57% specificity, 97.00% sensitivity and 0.88 MCC were achieved. CONCLUSION: Thus, ZIKVPred-PseAAC can help in predicting ZIKV proteins efficiently and accurately and can provide baseline data for the discovery of new drugs and biomarkers against ZIKV.
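The four figures reported in this abstract (accuracy, specificity, sensitivity and MCC) all derive from the counts of a binary confusion matrix. The sketch below uses the standard textbook definitions; the function name and the example counts are hypothetical and are not taken from the paper.

```python
import math


def binary_metrics(tp, tn, fp, fn):
    """Compute accuracy, sensitivity, specificity and MCC from confusion-matrix counts."""
    def ratio(num, den):
        # Return 0.0 instead of raising when a class is absent from the data.
        return num / den if den else 0.0

    accuracy = ratio(tp + tn, tp + tn + fp + fn)
    sensitivity = ratio(tp, tp + fn)   # true positive rate
    specificity = ratio(tn, tn + fp)   # true negative rate
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = ratio(tp * tn - fp * fn, denom)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "mcc": mcc,
    }


# Made-up counts for illustration only.
print(binary_metrics(tp=90, tn=80, fp=20, fn=10))
```

MCC ranges from -1 to 1 and, unlike accuracy, stays informative when positive and negative classes are imbalanced, which is one reason it is commonly reported alongside the other three measures.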
Affiliation(s)
- Waqar Hussain
- National Center of Artificial Intelligence, Punjab University College of Information Technology, University of the Punjab, Lahore, Pakistan,Center for Professional Studies, Lahore, Pakistan
| | | | - Yaser D Khan
- Department of Computer Science, University of Management and Technology, Lahore, Pakistan
| |
|
33
|
Identifying FL11 subtype by characterizing tumor immune microenvironment in prostate adenocarcinoma via Chou's 5-steps rule. Genomics 2020; 112:1500-1515. [DOI: 10.1016/j.ygeno.2019.08.021] [Citation(s) in RCA: 28] [Impact Index Per Article: 7.0] [Received: 06/28/2019] [Revised: 08/03/2019] [Accepted: 08/26/2019] [Indexed: 12/14/2022]
|
34
|
Ahmad A, Lin H, Shatabda S. Locate-R: Subcellular localization of long non-coding RNAs using nucleotide compositions. Genomics 2020; 112:2583-2589. [PMID: 32068122 DOI: 10.1016/j.ygeno.2020.02.011] [Citation(s) in RCA: 23] [Impact Index Per Article: 5.8] [Received: 07/04/2019] [Revised: 11/11/2019] [Accepted: 02/12/2020] [Indexed: 12/12/2022]
Abstract
Knowledge of the sub-cellular localization of the most diverse class of transcribed RNA, long non-coding RNAs (lncRNAs), will help identify different types of cancers and other diseases, as lncRNAs play a key role in related cellular functions. With the recent exponential growth of known records, it has become essential to establish new machine learning based techniques to identify new ones, since such techniques are faster and cheaper than laboratory methods. In this paper, we propose Locate-R, a novel method for predicting the sub-cellular location of lncRNAs. We used only n-gapped l-mer composition and l-mer composition as features and selected the best 655 features to build the model. The model is based on locally deep support vector machines, which significantly enhance prediction accuracy with respect to existing state-of-the-art methods. Our predictor is readily available for use as a stand-alone web application from: http://locate-r.azurewebsites.net/.
Affiliation(s)
- Ahsan Ahmad
- Department of Computer Science and Engineering, United International University, Plot 2, United City, Madani Avenue, Satarkul, Badda, Dhaka 1212, Bangladesh
| | - Hao Lin
- School of Life Science and Technology, University of Electronic Science and Technology of China, China
| | - Swakkhar Shatabda
- Department of Computer Science and Engineering, United International University, Plot 2, United City, Madani Avenue, Satarkul, Badda, Dhaka 1212, Bangladesh.
| |
|
35
|
Ju Z, Wang SY. Prediction of lysine formylation sites using the composition of k-spaced amino acid pairs via Chou's 5-steps rule and general pseudo components. Genomics 2020; 112:859-866. [DOI: 10.1016/j.ygeno.2019.05.027] [Citation(s) in RCA: 54] [Impact Index Per Article: 13.5] [Received: 04/21/2019] [Revised: 05/13/2019] [Accepted: 05/30/2019] [Indexed: 11/30/2022]
|
36
|
Lu B, Liu XH, Liao SM, Lu ZL, Chen D, Troy FA II, Huang RB, Zhou GP. A Possible Modulation Mechanism of Intramolecular and Intermolecular Interactions for NCAM Polysialylation and Cell Migration. Curr Top Med Chem 2019; 19:2271-2282. [PMID: 31648641 DOI: 10.2174/1568026619666191018094805] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 07/19/2019] [Revised: 08/01/2019] [Accepted: 08/06/2019] [Indexed: 12/31/2022]
Abstract
Polysialic acid (polySia) is a novel glycan that posttranslationally modifies neural cell adhesion molecules (NCAMs) in mammalian cells. Up-regulation of polySia-NCAM expression or NCAM polysialylation is associated with tumor cell migration and progression in many metastatic cancers and with neurocognition. It is known that two highly homologous mammalian polysialyltransferases (polySTs), ST8Sia II (STX) and ST8Sia IV (PST), can catalyze polysialylation of NCAM, and that two polybasic domains in polySTs, the polybasic region (PBR) and the polysialyltransferase domain (PSTD), play key roles in affecting polyST activity or NCAM polysialylation. However, the molecular mechanisms of NCAM polysialylation and cell migration are still not entirely clear. This minireview summarizes recent research results on the intermolecular interactions between the PBR and NCAM, the PSTD and cytidine monophosphate-sialic acid (CMP-Sia), and the PSTD and polySia, as well as the intramolecular interaction between the PBR and the PSTD within the polyST. Based on these cooperative interactions, we have built a novel model of NCAM polysialylation and cell migration mechanisms, which may be helpful for designing and developing new polysialyltransferase inhibitors.
Affiliation(s)
- Bo Lu
- The National Engineering Research Center for Non-Food Biorefinery, Guangxi Academy of Sciences, Nanning, Guangxi 530007, China
| | - Xue-Hui Liu
- Institute of Biophysics, Chinese Academy of Sciences, Beijing 100101, China
| | - Si-Ming Liao
- The National Engineering Research Center for Non-Food Biorefinery, Guangxi Academy of Sciences, Nanning, Guangxi 530007, China
| | - Zhi-Long Lu
- The National Engineering Research Center for Non-Food Biorefinery, Guangxi Academy of Sciences, Nanning, Guangxi 530007, China
| | - Dong Chen
- The National Engineering Research Center for Non-Food Biorefinery, Guangxi Academy of Sciences, Nanning, Guangxi 530007, China
| | - Frederic A Troy II
- Department of Biochemistry and Molecular Medicine, University of California School of Medicine, Davis, CA, 95817, United States
| | - Ri-Bo Huang
- The National Engineering Research Center for Non-Food Biorefinery, Guangxi Academy of Sciences, Nanning, Guangxi 530007, China.,Life Science and Biotechnology College, Guangxi University, Nanning, Guangxi 530004, China
| | - Guo-Ping Zhou
- The National Engineering Research Center for Non-Food Biorefinery, Guangxi Academy of Sciences, Nanning, Guangxi 530007, China
| |
|
37
|
pLoc_bal-mHum: Predict subcellular localization of human proteins by PseAAC and quasi-balancing training dataset. Genomics 2019; 111:1274-1282. [DOI: 10.1016/j.ygeno.2018.08.007] [Citation(s) in RCA: 56] [Impact Index Per Article: 11.2] [Received: 05/03/2018] [Revised: 08/14/2018] [Accepted: 08/16/2018] [Indexed: 12/17/2022]
|
38
|
Chou KC. Impacts of Pseudo Amino Acid Components and 5-steps Rule to Proteomics and Proteome Analysis. Curr Top Med Chem 2019; 19:2283-2300. [DOI: 10.2174/1568026619666191018100141] [Citation(s) in RCA: 24] [Impact Index Per Article: 4.8] [Received: 08/07/2019] [Revised: 08/18/2019] [Accepted: 08/26/2019] [Indexed: 01/27/2023]
Abstract
Stimulated by the 5-steps rule during the last decade or so, computational proteomics has achieved remarkable progress in the following three areas: (1) protein structural class prediction; (2) protein subcellular location prediction; (3) post-translational modification (PTM) site prediction. The results obtained by these predictions are very useful not only for an in-depth study of the functions of proteins and their biological processes in a cell, but also for developing novel drugs against major diseases such as cancers, Alzheimer’s, and Parkinson’s. Moreover, since the targets to be predicted may have the multi-label feature, two sets of metrics are introduced: one for inspecting the global prediction quality, the other for the local prediction quality. All the predictors covered in this review have a user-friendly web server, through which the majority of experimental scientists can easily obtain their desired data without needing to go through the complicated mathematics.
Affiliation(s)
- Kuo-Chen Chou
- Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, 610054, China
| |
|
39
|
Malebary SJ, Rehman MSU, Khan YD. iCrotoK-PseAAC: Identify lysine crotonylation sites by blending position relative statistical features according to the Chou's 5-step rule. PLoS One 2019; 14:e0223993. [PMID: 31751380 PMCID: PMC6874067 DOI: 10.1371/journal.pone.0223993] [Citation(s) in RCA: 25] [Impact Index Per Article: 5.0] [Received: 07/08/2019] [Accepted: 10/02/2019] [Indexed: 01/22/2023] Open
Abstract
Among the different post-translational modifications (PTMs), one of the most important is lysine crotonylation in proteins. Its importance in different diseases and essential biological processes cannot be underestimated. The key step for finding the hidden mechanisms of crotonylation, along with their occurrence sites, is to fully understand the mechanism behind this biological process. In previously reported studies, researchers have used different techniques, such as position weight matrix (PWM), support vector machine (SVM), k nearest neighbors (KNN), and many others. However, the maximum prediction accuracy achieved was not very high. To address this, we propose an improved predictor for lysine crotonylation sites named iCrotoK-PseAAC, in which we have incorporated various position and composition relative features along with statistical moments into PseAAC. The results of self-consistency testing were 100% accurate, while 10-fold cross-validation gave 99.0% accuracy. Based on the validation and comparison of the model, it is concluded that iCrotoK-PseAAC is more accurate than the previously proposed models.
Affiliation(s)
- Sharaf Jameel Malebary
- Department of Information Technology, King Abdul Aziz University, Rabigh, Kingdom of Saudi Arabia
| | - Muhammad Safi ur Rehman
- Department of Computer Science, School of Systems and Technology, University of Management and Technology, Lahore, Pakistan
| | - Yaser Daanial Khan
- Department of Computer Science, School of Systems and Technology, University of Management and Technology, Lahore, Pakistan
| |
|
40
|
SDBP-Pred: Prediction of single-stranded and double-stranded DNA-binding proteins by extending consensus sequence and K-segmentation strategies into PSSM. Anal Biochem 2019; 589:113494. [PMID: 31693872 DOI: 10.1016/j.ab.2019.113494] [Citation(s) in RCA: 24] [Impact Index Per Article: 4.8] [Received: 01/18/2019] [Revised: 10/24/2019] [Accepted: 10/31/2019] [Indexed: 11/24/2022]
Abstract
Identification of DNA-binding proteins (DNA-BPs) is a hot issue in protein science due to their key role in various biological processes. These processes depend strongly on the type of DNA-binding protein. DNA-BPs are classified into single-stranded DNA-binding proteins (SSBs) and double-stranded DNA-binding proteins (DSBs). SSBs are mainly involved in DNA recombination, replication, and repair, while DSBs regulate transcription, DNA cleavage, and chromosome packaging. Despite this significance, few methods have been proposed for discriminating SSBs from DSBs. Therefore, more predictors with favorable performance are needed. In this work, we present an innovative predictor, called SDBP-Pred, with a novel feature descriptor named consensus sequence-based K-segmentation position-specific scoring matrix (CSKS-PSSM). We encoded the local discriminative features concealed in the PSSM via a K-segmentation strategy and the global potential features by applying the notion of the consensus sequence. The resulting feature vector is then input to a support vector machine (SVM) with linear, polynomial and radial basis function (RBF) kernels. Our model with SVM-RBF achieved higher accuracies than the recent method on three tests, namely the jackknife, 10-fold cross-validation, and independent tests. The obtained prediction results illustrate the superior prediction performance of SDBP-Pred over existing studies in the literature.
|
41
|
Behbahani M, Nosrati M, Moradi M, Mohabatkar H. Using Chou's General Pseudo Amino Acid Composition to Classify Laccases from Bacterial and Fungal Sources via Chou's Five-Step Rule. Appl Biochem Biotechnol 2019; 190:1035-1048. [PMID: 31659712 DOI: 10.1007/s12010-019-03141-8] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 06/14/2019] [Accepted: 09/12/2019] [Indexed: 01/28/2023]
Abstract
Laccases are a group of enzymes with a critical activity in the degradation of both phenolic and non-phenolic compounds. These enzymes are present in a diverse array of species, including fungi and bacteria. Since laccases are marketed for uses ranging from industry to medicine, better knowledge of their structures and properties from diverse sources will be useful for selecting the most appropriate candidate for each purpose. In the current study, sequence- and structure-based characteristics of these enzymes from fungi and bacteria, including pseudo amino acid composition (PseAAC), physicochemical characteristics, and secondary structures, are compared and classified. Autodock 4 software was used for docking analysis between these laccases and some phenolic and non-phenolic compounds. The results indicated that features including molecular weight, aliphatic, extinction coefficient, and random coil percentage of these protein groups show high degrees of diversity in most cases. Categorization of these enzymes by PseAAC showed over 96% accuracy. The binding free energy between fungal laccases and their substrates was shown to be considerably higher than that of bacterial ones. According to the outcomes of the current study, data mining methods using different machine learning algorithms, especially neural networks, could provide valuable information for a fair comparison between fungal and bacterial laccases. These results also suggest an association between efficacy and physicochemical features of laccase enzymes from different sources.
Affiliation(s)
- Mandana Behbahani
- Department of Biotechnology, Faculty of Biological Science and Technology, University of Isfahan, Isfahan, Iran
| | - Mokhtar Nosrati
- Department of Biotechnology, Faculty of Biological Science and Technology, University of Isfahan, Isfahan, Iran
| | - Mohammad Moradi
- Department of Biotechnology, Faculty of Biological Science and Technology, University of Isfahan, Isfahan, Iran
| | - Hassan Mohabatkar
- Department of Biotechnology, Faculty of Biological Science and Technology, University of Isfahan, Isfahan, Iran.
| |
|
42
|
Kang C. 19F-NMR in Target-based Drug Discovery. Curr Med Chem 2019; 26:4964-4983. [PMID: 31187703 DOI: 10.2174/0929867326666190610160534] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.6] [Received: 06/30/2018] [Revised: 08/14/2018] [Accepted: 03/13/2019] [Indexed: 02/06/2023]
Abstract
Solution NMR spectroscopy plays important roles in understanding protein structures, dynamics and protein-protein/ligand interactions. In a target-based drug discovery project, NMR can serve an important function in hit identification and lead optimization. Fluorine is a valuable probe for evaluating protein conformational changes and protein-ligand interactions. Accumulated studies demonstrate that 19F-NMR can play important roles in fragment-based drug discovery (FBDD) and probing protein-ligand interactions. This review summarizes the application of 19F-NMR in understanding protein-ligand interactions and drug discovery. Several examples are included to show the roles of 19F-NMR in confirming identified hits/leads in the drug discovery process. In addition to identifying hits from fluorine-containing compound libraries, 19F-NMR will play an important role in drug discovery by providing a fast and robust way in novel hit identification. This technique can be used for ranking compounds with different binding affinities and is particularly useful for screening competitive compounds when a reference ligand is available.
Affiliation(s)
- CongBao Kang
- Experimental Drug Development Centre (EDDC), Agency for Science, Technology and Research (A*STAR), 10 Biopolis Road, #05-01, Singapore, 138670, Singapore
| |
|
43
|
Su ZD, Huang Y, Zhang ZY, Zhao YW, Wang D, Chen W, Chou KC, Lin H. iLoc-lncRNA: predict the subcellular location of lncRNAs by incorporating octamer composition into general PseKNC. Bioinformatics 2019; 34:4196-4204. [PMID: 29931187 DOI: 10.1093/bioinformatics/bty508] [Citation(s) in RCA: 129] [Impact Index Per Article: 25.8] [Received: 05/02/2018] [Accepted: 06/19/2018] [Indexed: 12/20/2022] Open
Abstract
Motivation: Long non-coding RNAs (lncRNAs) are a class of RNA molecules with more than 200 nucleotides. They have important functions in cell development and metabolism, such as genetic markers, genome rearrangements, chromatin modifications, cell cycle regulation, transcription and translation. Their functions are generally closely related to their localization in the cell. Therefore, knowledge about their subcellular locations can provide very useful clues or preliminary insight into their biological functions. Although biochemical experiments could determine the localization of lncRNAs in a cell, they are both time-consuming and expensive. Therefore, it is highly desirable to develop bioinformatics tools for fast and effective identification of their subcellular locations. Results: We developed a sequence-based bioinformatics tool called 'iLoc-lncRNA' to predict the subcellular locations of lncRNAs by incorporating the 8-tuple nucleotide features into the general PseKNC (Pseudo K-tuple Nucleotide Composition) via the binomial distribution approach. Rigorous jackknife tests have shown that the overall accuracy achieved by the new predictor on a stringent benchmark dataset is 86.72%, which is over 20% higher than that by the existing state-of-the-art predictor evaluated on the same tests. Availability and implementation: A user-friendly webserver has been established at http://lin-group.cn/server/iLoc-LncRNA, by which users can easily obtain their desired results. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Zhen-Dong Su
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
| | - Yan Huang
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
| | - Zhao-Yue Zhang
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
| | - Ya-Wei Zhao
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
| | - Dong Wang
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China.,College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
| | - Wei Chen
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China.,Department of Physics, School of Sciences, and Center for Genomics and Computational Biology, North China University of Science and Technology, Tangshan, China.,Gordon Life Science Institute, Boston, MA, USA
| | - Kuo-Chen Chou
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China.,Gordon Life Science Institute, Boston, MA, USA
| | - Hao Lin
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China.,Gordon Life Science Institute, Boston, MA, USA
| |
|
44
|
Khan YD, Amin N, Hussain W, Rasool N, Khan SA, Chou KC. iProtease-PseAAC(2L): A two-layer predictor for identifying proteases and their types using Chou's 5-step-rule and general PseAAC. Anal Biochem 2019; 588:113477. [PMID: 31654612 DOI: 10.1016/j.ab.2019.113477] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.6] [Received: 02/04/2019] [Revised: 10/02/2019] [Accepted: 10/18/2019] [Indexed: 12/16/2022]
Abstract
Proteases are a type of enzyme that performs proteolysis. Proteolysis normally refers to protein and peptide degradation, which is crucial for the survival, growth and wellbeing of a cell. Moreover, proteases have a strong association with therapeutics and drug development. Proteases are classified into five different types according to their nature and physicochemical characteristics. Most methods used to differentiate proteases from other proteins and identify their class require a clinical test, which is usually time-consuming and operator dependent. Herein, we report a classifier named iProtease-PseAAC (2L) for identifying proteases and their classes. The predictor is developed following the flow of the 5-step rule, starting from the collection of a benchmark dataset and ending with the development of the predictor. Rigorous verification and validation tests are performed and metrics are collected to assess the reliability of the trained model. Self-consistency validation gives 98.32% accuracy, cross-validation gives 90.71% accuracy, and the jackknife test gives 96.07% accuracy. The average accuracy for level 2, i.e. protease classification, is 95.77%. Based on these results, it is concluded that iProtease-PseAAC (2L) is well able to identify proteases and their classes from a given protein sequence.
Affiliation(s)
- Yaser Daanial Khan
- Department of Computer Science, School of Systems and Technology, University of Management and Technology, P.O. Box 10033, C-II, Johar Town, Lahore, 54770, Pakistan.
| | - Najm Amin
- Department of Computer Science, School of Systems and Technology, University of Management and Technology, P.O. Box 10033, C-II, Johar Town, Lahore, 54770, Pakistan
| | - Waqar Hussain
- National Center of Artificial Intelligence, Punjab University College of Information Technology, University of the Punjab, Lahore, Pakistan
| | - Nouman Rasool
- Dr Panjwani Center for Molecular Medicine and Drug Research, International Center for Chemical and Biological Sciences, University of Karachi, Karachi, 75270, Pakistan
| | - Sher Afzal Khan
- Faculty of Computing and Information Technology in Rabigh, Jeddah, 21577, Saudi Arabia; Abdul Wali Khan University, Department of Computer Sciences, Mardan, Pakistan
| | - Kuo-Chen Chou
- Gordon Life Science Institute, Boston, MA, 02478, USA
| |
Collapse
|
45
|
Liang R, Xie J, Zhang C, Zhang M, Huang H, Huo H, Cao X, Niu B. Identifying Cancer Targets Based on Machine Learning Methods via Chou's 5-steps Rule and General Pseudo Components. Curr Top Med Chem 2019; 19:2301-2317. [PMID: 31622219 DOI: 10.2174/1568026619666191016155543] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 07/14/2019] [Revised: 07/19/2019] [Accepted: 08/26/2019] [Indexed: 01/09/2023]
Abstract
In recent years, the successful implementation of the Human Genome Project has made people realize that, because cancer is complex and takes many forms, genetic, environmental and lifestyle factors should be studied together. The increasing availability and growth rate of 'big data' derived from various omics open a new window for the study and therapy of cancer. In this paper, we introduce the application of machine learning methods to cancer big data, including artificial neural networks, support vector machines, ensemble learning and naïve Bayes classifiers.
Affiliation(s)
- Ruirui Liang
  - School of Life Sciences, Shanghai University, Shanghai, 200444, China
- Jiayang Xie
  - School of Life Sciences, Shanghai University, Shanghai, 200444, China
- Chi Zhang
  - Foshan Huaxia Eye Hospital, Huaxia Eye Hospital Group, Foshan 528000, China
- Mengying Zhang
  - School of Life Sciences, Shanghai University, Shanghai, 200444, China
- Hai Huang
  - School of Life Sciences, Shanghai University, Shanghai, 200444, China
- Haizhong Huo
  - Department of General Surgery, Shanghai Ninth People's Hospital Affiliated to Shanghai Jiao Tong University School of Medicine, Shanghai 200011, China
- Xin Cao
  - Zhongshan Hospital, Institute of Clinical Science, Shanghai Medical College, Fudan University, Shanghai 200032, China
- Bing Niu
  - School of Life Sciences, Shanghai University, Shanghai, 200444, China
|
46
|
Liu B, Li K, Huang DS, Chou KC. iEnhancer-EL: identifying enhancers and their strength with ensemble learning approach. Bioinformatics 2019; 34:3835-3842. [PMID: 29878118 DOI: 10.1093/bioinformatics/bty458] [Citation(s) in RCA: 138] [Impact Index Per Article: 27.6] [Received: 01/23/2018] [Accepted: 06/06/2018] [Indexed: 11/14/2022] Open
Abstract
Motivation: Identification of enhancers and their strength is important because they play a critical role in controlling gene expression. Although some bioinformatics tools have been developed, they are limited to discriminating enhancers from non-enhancers. Recently, a two-layer predictor called 'iEnhancer-2L' was developed that can also predict an enhancer's strength. However, its prediction quality needs further improvement to increase its practical value. Results: A new predictor called 'iEnhancer-EL' was proposed that contains two layers of predictors: the first one (for identifying enhancers) is formed by fusing an array of six key individual classifiers, and the second one (for their strength) is formed by fusing an array of ten key individual classifiers. All these key classifiers were selected from 171 elementary classifiers formed by SVM (Support Vector Machine) based on kmer, subsequence profile and PseKNC (Pseudo K-tuple Nucleotide Composition), respectively. Rigorous cross-validations have indicated that the proposed predictor is remarkably superior to the existing state-of-the-art one in this area. Availability and implementation: A web server for iEnhancer-EL has been established at http://bioinformatics.hitsz.edu.cn/iEnhancer-EL/, by which users can easily get their desired results without the need to go through the mathematical details. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Bin Liu
  - School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong, China; Gordon Life Science Institute, Belmont, MA, USA
- Kai Li
  - School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong, China
- De-Shuang Huang
  - Institute of Machine Learning and Systems Biology, School of Electronics and Information Engineering, Tongji University, Shanghai, China
- Kuo-Chen Chou
  - Gordon Life Science Institute, Belmont, MA, USA; Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
|
47
|
Prediction of S-Sulfenylation Sites Using Statistical Moments Based Features via CHOU’S 5-Step Rule. Int J Pept Res Ther 2019. [DOI: 10.1007/s10989-019-09931-2] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Indexed: 12/21/2022]
|
48
|
Chou KC. Proposing Pseudo Amino Acid Components is an Important Milestone for Proteome and Genome Analyses. Int J Pept Res Ther 2019. [DOI: 10.1007/s10989-019-09910-7] [Citation(s) in RCA: 17] [Impact Index Per Article: 3.4] [Indexed: 12/28/2022]
|
49
|
|
50
|
Xiao X, Cheng X, Chen G, Mao Q, Chou KC. pLoc_bal-mVirus: Predict Subcellular Localization of Multi-Label Virus Proteins by Chou's General PseAAC and IHTS Treatment to Balance Training Dataset. Med Chem 2019; 15:496-509. [DOI: 10.2174/1573406415666181217114710] [Citation(s) in RCA: 44] [Impact Index Per Article: 8.8] [Received: 09/20/2018] [Revised: 10/23/2018] [Accepted: 12/12/2018] [Indexed: 12/17/2022]
Abstract
Background/Objective: Knowledge of protein subcellular localization is vitally important for both basic research and drug development. Facing the avalanche of protein sequences emerging in the post-genomic age, it is urgent to develop computational tools for timely and effectively identifying their subcellular localization based on the sequence information alone. Recently, a predictor called “pLoc-mVirus” was developed for identifying the subcellular localization of virus proteins. Its performance is overwhelmingly better than that of the other predictors for the same purpose, particularly in dealing with multi-label systems in which some proteins, known as “multiplex proteins”, may simultaneously occur in, or move between two or more subcellular location sites. Despite the fact that it is indeed a very powerful predictor, more efforts are definitely needed to further improve it. This is because pLoc-mVirus was trained by an extremely skewed dataset in which some subset was over 10 times the size of the other subsets. Accordingly, it cannot avoid the biased consequence caused by such an uneven training dataset. Methods: Using the Chou's general PseAAC (Pseudo Amino Acid Composition) approach and the IHTS (Inserting Hypothetical Training Samples) treatment to balance out the training dataset, we have developed a new predictor called “pLoc_bal-mVirus” for predicting the subcellular localization of multi-label virus proteins. Results: Cross-validation tests on exactly the same experiment-confirmed dataset have indicated that the proposed new predictor is remarkably superior to pLoc-mVirus, the existing state-of-the-art predictor for the same purpose. Conclusion: Its user-friendly web-server is available at http://www.jci-bioinfo.cn/pLoc_balmVirus/, by which the majority of experimental scientists can easily get their desired results without the need to go through the detailed complicated mathematics. Accordingly, pLoc_bal-mVirus will become a very useful tool for designing multi-target drugs and in-depth understanding of the biological process in a cell.
Affiliation(s)
- Xuan Xiao
  - Gordon Life Science Institute, Boston, MA 02478, United States
- Xiang Cheng
  - Gordon Life Science Institute, Boston, MA 02478, United States
- Genqiang Chen
  - College of Chemistry, Chemical Engineering and Biotechnology, Donghua University, Shanghai 201620, China
- Qi Mao
  - College of Information Science and Technology, Donghua University, Shanghai, China
- Kuo-Chen Chou
  - Gordon Life Science Institute, Boston, MA 02478, United States
|