1
|
Sabir MJ, Kamli MR, Atef A, Alhibshi AM, Edris S, Hajarah NH, Bahieldin A, Manavalan B, Sabir JSM. Computational prediction of phosphorylation sites of SARS-CoV-2 infection using feature fusion and optimization strategies. Methods 2024; 229:1-8. [PMID: 38768932 DOI: 10.1016/j.ymeth.2024.04.021] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/10/2023] [Revised: 03/15/2024] [Accepted: 04/30/2024] [Indexed: 05/22/2024] Open
Abstract
SARS-CoV-2's global spread has instigated a critical health and economic emergency, impacting countless individuals. Understanding the virus's phosphorylation sites is vital to unravel the molecular intricacies of the infection and subsequent changes in host cellular processes. Several computational methods have been proposed to identify phosphorylation sites, typically focusing on either S/T or Y phosphorylation sites. Unfortunately, current predictive tools perform best on these specific residues and may not extend their efficacy to other residues, emphasizing the urgent need for enhanced methodologies. In this study, we developed a novel predictor that integrates phosphorylation-site information for all three residues (S, T, and Y). We extracted ten different feature descriptors, primarily derived from composition, evolutionary, and position-specific information, and assessed their discriminative power through five classifiers. Our results indicated that Light Gradient Boosting (LGB) showed superior performance, and five descriptors displayed excellent discriminative capabilities. Subsequently, we identified the top two integrated features with high discriminative capability and trained them with LGB to develop the final prediction model, LGB-IPs. The proposed approach shows excellent performance on 10-fold cross-validation, with ACC, MCC, and AUC values of 0.831, 0.662, and 0.907, respectively. Notably, this performance is replicated in the independent evaluation. Consequently, our approach may provide valuable insights into the phosphorylation mechanisms in SARS-CoV-2 infection for biomedical researchers.
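The ACC and MCC values quoted in the abstract above are standard summaries of a binary confusion matrix. As a minimal sketch (not the authors' evaluation code; the function names and the example counts are hypothetical), they could be computed as follows:

```python
import math


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Common convention: report 0 when any marginal total is zero.
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical counts for illustration only (not taken from the paper).
counts = dict(tp=40, tn=45, fp=5, fn=10)
acc = accuracy(**counts)
mcc = matthews_corrcoef(**counts)
print(f"ACC={acc:.3f} MCC={mcc:.3f}")
```

Unlike ACC, MCC uses all four cells of the confusion matrix, which is why it is often reported alongside accuracy when positive and negative sites are imbalanced.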
Affiliation(s)
- Mumdooh J Sabir
  - Department of Computer Science, Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah 21589, Saudi Arabia
- Majid Rasool Kamli
  - Centre of Excellence in Bionanoscience Research, King Abdulaziz University, Jeddah, Saudi Arabia; Department of Biological Sciences, Faculty of Science, King Abdulaziz University, Jeddah, Saudi Arabia
- Ahmed Atef
  - Centre of Excellence in Bionanoscience Research, King Abdulaziz University, Jeddah, Saudi Arabia; Department of Biological Sciences, Faculty of Science, King Abdulaziz University, Jeddah, Saudi Arabia
- Alawiah M Alhibshi
  - Centre of Excellence in Bionanoscience Research, King Abdulaziz University, Jeddah, Saudi Arabia; Department of Biological Sciences, Faculty of Science, King Abdulaziz University, Jeddah, Saudi Arabia
- Sherif Edris
  - Centre of Excellence in Bionanoscience Research, King Abdulaziz University, Jeddah, Saudi Arabia; Department of Biological Sciences, Faculty of Science, King Abdulaziz University, Jeddah, Saudi Arabia
- Nahid H Hajarah
  - Centre of Excellence in Bionanoscience Research, King Abdulaziz University, Jeddah, Saudi Arabia; Department of Biological Sciences, Faculty of Science, King Abdulaziz University, Jeddah, Saudi Arabia
- Ahmed Bahieldin
  - Centre of Excellence in Bionanoscience Research, King Abdulaziz University, Jeddah, Saudi Arabia; Department of Biological Sciences, Faculty of Science, King Abdulaziz University, Jeddah, Saudi Arabia
- Balachandran Manavalan
  - Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon 16419, Gyeonggi-do, Republic of Korea
- Jamal S M Sabir
  - Centre of Excellence in Bionanoscience Research, King Abdulaziz University, Jeddah, Saudi Arabia; Department of Biological Sciences, Faculty of Science, King Abdulaziz University, Jeddah, Saudi Arabia
|
2
|
Wang X, Quinn D, Moody TS, Huang M. ALDELE: All-Purpose Deep Learning Toolkits for Predicting the Biocatalytic Activities of Enzymes. J Chem Inf Model 2024; 64:3123-3139. [PMID: 38573056 DOI: 10.1021/acs.jcim.4c00058] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/05/2024]
Abstract
Rapidly predicting enzyme properties for catalyzing specific substrates is essential for identifying potential enzymes for industrial transformations. The demand for sustainable production of valuable industry chemicals utilizing biological resources raised a pressing need to speed up biocatalyst screening using machine learning techniques. In this research, we developed an all-purpose deep-learning-based multiple-toolkit (ALDELE) workflow for screening enzyme catalysts. ALDELE incorporates both structural and sequence representations of proteins, alongside representations of ligands by subgraphs and overall physicochemical properties. Comprehensive evaluation demonstrated that ALDELE can predict the catalytic activities of enzymes, and particularly, it identifies residue-based hotspots to guide enzyme engineering and generates substrate heat maps to explore the substrate scope for a given biocatalyst. Moreover, our models notably match empirical data, reinforcing the practicality and reliability of our approach through the alignment with confirmed mutation sites. ALDELE offers a facile and comprehensive solution by integrating different toolkits tailored for different purposes at affordable computational cost and therefore would be valuable to speed up the discovery of new functional enzymes for their exploitation by the industry.
Affiliation(s)
- Xiangwen Wang
  - School of Chemistry and Chemical Engineering, Queen's University Belfast, Belfast BT9 5AG, Northern Ireland, U.K.
  - Department of Biocatalysis and Isotope Chemistry, Almac Sciences, Craigavon BT63 5QD, Northern Ireland, U.K.
- Derek Quinn
  - Department of Biocatalysis and Isotope Chemistry, Almac Sciences, Craigavon BT63 5QD, Northern Ireland, U.K.
- Thomas S Moody
  - Department of Biocatalysis and Isotope Chemistry, Almac Sciences, Craigavon BT63 5QD, Northern Ireland, U.K.
  - Arran Chemical Company Limited, Unit 1 Monksland Industrial Estate, Athlone, Co. Roscommon N37 DN24, Ireland
- Meilan Huang
  - School of Chemistry and Chemical Engineering, Queen's University Belfast, Belfast BT9 5AG, Northern Ireland, U.K.
|
3
|
Zhuang J, Midgley AC, Wei Y, Liu Q, Kong D, Huang X. Machine-Learning-Assisted Nanozyme Design: Lessons from Materials and Engineered Enzymes. Adv Mater 2024; 36:e2210848. [PMID: 36701424 DOI: 10.1002/adma.202210848] [Citation(s) in RCA: 14] [Impact Index Per Article: 14.0] [Received: 11/21/2022] [Revised: 01/03/2023] [Indexed: 05/11/2023]
Abstract
Nanozymes are nanomaterials that exhibit enzyme-like biomimicry. In combination with intrinsic characteristics of nanomaterials, nanozymes have broad applicability in materials science, chemical engineering, bioengineering, biochemistry, and disease theranostics. Recently, the heterogeneity of published results has highlighted the complexity and diversity of nanozymes in terms of consistency of catalytic capacity. Machine learning (ML) shows promising potential for discovering new materials, yet it remains challenging for the design of new nanozymes based on ML approaches. Alternatively, ML is employed to promote optimization of intelligent design and application of catalytic materials and engineered enzymes. Incorporation of the successful ML algorithms used in the intelligent design of catalytic materials and engineered enzymes can concomitantly facilitate the guided development of next-generation nanozymes with desirable properties. Here, recent progress in ML, its utilization in the design of catalytic materials and enzymes, and how emergent ML applications serve as promising strategies to circumvent challenges associated with time-expensive and laborious testing in nanozyme research and development are summarized. The potential applications of successful examples of ML-aided catalytic materials and engineered enzymes in nanozyme design are also highlighted, with special focus on the unified aims in enhancing design and recapitulation of substrate selectivity and catalytic activity.
Affiliation(s)
- Jie Zhuang
  - School of Medicine and State Key Laboratory of Medicinal Chemical Biology, Nankai University, Tianjin, 300071, China
- Adam C Midgley
  - Key Laboratory of Bioactive Materials for the Ministry of Education, College of Life Sciences, State Key Laboratory of Medicinal Chemical Biology, and Frontiers Science Center for Cell Responses, Nankai University, Tianjin, 300071, China
- Yonghua Wei
  - Key Laboratory of Bioactive Materials for the Ministry of Education, College of Life Sciences, State Key Laboratory of Medicinal Chemical Biology, and Frontiers Science Center for Cell Responses, Nankai University, Tianjin, 300071, China
- Qiqi Liu
  - Key Laboratory of Bioactive Materials for the Ministry of Education, College of Life Sciences, State Key Laboratory of Medicinal Chemical Biology, and Frontiers Science Center for Cell Responses, Nankai University, Tianjin, 300071, China
- Deling Kong
  - Key Laboratory of Bioactive Materials for the Ministry of Education, College of Life Sciences, State Key Laboratory of Medicinal Chemical Biology, and Frontiers Science Center for Cell Responses, Nankai University, Tianjin, 300071, China
- Xinglu Huang
  - Key Laboratory of Bioactive Materials for the Ministry of Education, College of Life Sciences, State Key Laboratory of Medicinal Chemical Biology, and Frontiers Science Center for Cell Responses, Nankai University, Tianjin, 300071, China
|
4
|
Ribeiro AJM, Riziotis IG, Borkakoti N, Thornton JM. Enzyme function and evolution through the lens of bioinformatics. Biochem J 2023; 480:1845-1863. [PMID: 37991346 PMCID: PMC10754289 DOI: 10.1042/bcj20220405] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/20/2023] [Revised: 11/09/2023] [Accepted: 11/14/2023] [Indexed: 11/23/2023]
Abstract
Enzymes have been shaped by evolution over billions of years to catalyse the chemical reactions that support life on earth. Dispersed in the literature, or organised in online databases, knowledge about enzymes can be structured in distinct dimensions, either related to their quality as biological macromolecules, such as their sequence and structure, or related to their chemical functions, such as the catalytic site, kinetics, mechanism, and overall reaction. The evolution of enzymes can only be understood when each of these dimensions is considered. In addition, many of the properties of enzymes only make sense in the light of evolution. We start this review by outlining the main paradigms of enzyme evolution, including gene duplication and divergence, convergent evolution, and evolution by recombination of domains. In the second part, we overview the current collective knowledge about enzymes, as organised by different types of data and collected in several databases. We also highlight some increasingly powerful computational tools that can be used to close gaps in understanding, in particular for types of data that require laborious experimental protocols. We believe that recent advances in protein structure prediction will be a powerful catalyst for the prediction of binding, mechanism, and ultimately, chemical reactions. A comprehensive mapping of enzyme function and evolution may be attainable in the near future.
Affiliation(s)
- Antonio J. M. Ribeiro
  - European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, U.K.
- Ioannis G. Riziotis
  - European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, U.K.
- Neera Borkakoti
  - European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, U.K.
- Janet M. Thornton
  - European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, U.K.
|
5
|
Prešern U, Goličnik M. Enzyme Databases in the Era of Omics and Artificial Intelligence. Int J Mol Sci 2023; 24:16918. [PMID: 38069254 PMCID: PMC10707154 DOI: 10.3390/ijms242316918] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/03/2023] [Revised: 11/24/2023] [Accepted: 11/26/2023] [Indexed: 12/18/2023] Open
Abstract
Enzyme research is important for the development of various scientific fields such as medicine and biotechnology. Enzyme databases facilitate this research by providing a wide range of information relevant to research planning and data analysis. Over the years, various databases that cover different aspects of enzyme biology (e.g., kinetic parameters, enzyme occurrence, and reaction mechanisms) have been developed. Most of the databases are curated manually, which improves reliability of the information; however, such curation cannot keep pace with the exponential growth in published data. Lack of data standardization is another obstacle for data extraction and analysis. Improving machine readability of databases is especially important in the light of recent advances in deep learning algorithms that require big training datasets. This review provides information regarding the current state of enzyme databases, especially in relation to the ever-increasing amount of generated research data and recent advancements in artificial intelligence algorithms. Furthermore, it describes several enzyme databases, providing the reader with necessary information for their use.
Affiliation(s)
- Marko Goličnik
  - Institute of Biochemistry and Molecular Genetics, Faculty of Medicine, University of Ljubljana, Vrazov trg 2, 1000 Ljubljana, Slovenia
|
6
|
Shen X, Zhang S, Long J, Chen C, Wang M, Cui Z, Chen B, Tan T. A Highly Sensitive Model Based on Graph Neural Networks for Enzyme Key Catalytic Residue Prediction. J Chem Inf Model 2023; 63:4277-4290. [PMID: 37399293 DOI: 10.1021/acs.jcim.3c00273] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/05/2023]
Abstract
Determining the catalytic site of enzymes is a great help for understanding the relationship between protein sequence, structure, and function, which provides the basis and targets for designing, modifying, and enhancing enzyme activity. The unique local spatial configuration bound to the substrate at the active center of the enzyme determines the catalytic ability of enzymes and plays an important role in the catalytic site prediction. As a suitable tool, the graph neural network can better understand and identify the residue sites with unique local spatial configurations due to its remarkable ability to characterize the three-dimensional structural features of proteins. Consequently, a novel model for predicting enzyme catalytic sites has been developed, which incorporates a uniquely designed adaptive edge-gated graph attention neural network (AEGAN). This model is capable of effectively handling sequential and structural characteristics of proteins at various levels, and the extracted features enable an accurate description of the local spatial configuration of the enzyme active site by sampling the local space around candidate residues and special design of amino acid physical and chemical properties. To evaluate its performance, the model was compared with existing catalytic site prediction models using different benchmark datasets and achieved the best results on each benchmark dataset. The model exhibited a sensitivity of 0.9659, accuracy of 0.9226, and area under the precision-recall curve (AUPRC) of 0.9241 on the independent test set constructed for evaluation. Furthermore, the F1-score of this model is nearly four times higher than that of the best-performing similar model in previous studies. This research can serve as a valuable tool to help researchers understand protein sequence-structure-function relationships while facilitating the characterization of novel enzymes of unknown function.
Affiliation(s)
- Xiaowei Shen
  - National Energy R&D Center for Biorefinery, Beijing University of Chemical Technology, 100029, Beijing, China
- Shiding Zhang
  - National Energy R&D Center for Biorefinery, Beijing University of Chemical Technology, 100029, Beijing, China
- Jianyu Long
  - National Energy R&D Center for Biorefinery, Beijing University of Chemical Technology, 100029, Beijing, China
- Changjing Chen
  - National Energy R&D Center for Biorefinery, Beijing University of Chemical Technology, 100029, Beijing, China
- Meng Wang
  - National Energy R&D Center for Biorefinery, Beijing University of Chemical Technology, 100029, Beijing, China
- Ziheng Cui
  - National Energy R&D Center for Biorefinery, Beijing University of Chemical Technology, 100029, Beijing, China
- Biqiang Chen
  - National Energy R&D Center for Biorefinery, Beijing University of Chemical Technology, 100029, Beijing, China
- Tianwei Tan
  - National Energy R&D Center for Biorefinery, Beijing University of Chemical Technology, 100029, Beijing, China
|
7
|
Feehan R, Copeland M, Franklin MW, Slusky JSG. MAHOMES II: A webserver for predicting if a metal binding site is enzymatic. Protein Sci 2023; 32:e4626. [PMID: 36916762 PMCID: PMC10044107 DOI: 10.1002/pro.4626] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/30/2022] [Revised: 03/08/2023] [Accepted: 03/10/2023] [Indexed: 03/15/2023]
Abstract
Recent advances have enabled high-quality computationally generated structures for proteins with no solved crystal structures. However, protein function data remains largely limited to experimental methods and homology mapping. Since structure determines function, it is natural that methods capable of using computationally generated structures for functional annotations need to be advanced. Our laboratory recently developed a method to distinguish between metalloenzyme and nonenzyme sites. Here we report improvements to this method by upgrading our physicochemical features to alleviate the need for structures with sub-angstrom precision and using machine learning to reduce training data labeling error. Our improved classifier identifies protein bound metal sites as enzymatic or nonenzymatic with 94% precision and 92% recall. We demonstrate that both adjustments increased predictive performance and reliability on sites with sub-angstrom variations. We constructed a set of predicted metalloprotein structures with no solved crystal structures and no detectable homology to our training data. Our model had an accuracy of 90%-97.5% depending on the quality of the predicted structures included in our test. Finally, we found the physicochemical trends that drove this model's successful performance were local protein density, second shell ionizable residue burial, and the pocket's accessibility to the site. We anticipate that our model's ability to correctly identify catalytic metal sites could enable identification of new enzymatic mechanisms and improve de novo metalloenzyme design success rates.
Affiliation(s)
- Ryan Feehan
  - Center for Computational Biology, The University of Kansas, 2030 Becker Dr., Lawrence, Kansas 66047, USA
- Matthew Copeland
  - Center for Computational Biology, The University of Kansas, 2030 Becker Dr., Lawrence, Kansas 66047, USA
- Meghan W. Franklin
  - Center for Computational Biology, The University of Kansas, 2030 Becker Dr., Lawrence, Kansas 66047, USA
- Joanna S. G. Slusky
  - Center for Computational Biology, The University of Kansas, 2030 Becker Dr., Lawrence, Kansas 66047, USA
  - Department of Molecular Biosciences, The University of Kansas, 1200 Sunnyside Ave., Lawrence, KS 66045-3101, USA
  - Present address: Generate Biomedicines, Somerville, Massachusetts, USA
|
8
|
Feehan R, Copeland M, Franklin MW, Slusky JSG. MAHOMES II: A webserver for predicting if a metal binding site is enzymatic. bioRxiv 2023:2023.03.08.531790. [PMID: 36945603 PMCID: PMC10028950 DOI: 10.1101/2023.03.08.531790] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/18/2023]
Abstract
Recent advances have enabled high-quality computationally generated structures for proteins with no solved crystal structures. However, protein function data remains largely limited to experimental methods and homology mapping. Since structure determines function, it is natural that methods capable of using computationally generated structures for functional annotations need to be advanced. Our laboratory recently developed a method to distinguish between metalloenzyme and non-enzyme sites. Here we report improvements to this method by upgrading our physicochemical features to alleviate the need for structures with sub-angstrom precision and using machine learning to reduce training data labeling error. Our improved classifier identifies protein bound metal sites as enzymatic or non-enzymatic with 94% precision and 92% recall. We demonstrate that both adjustments increased predictive performance and reliability on sites with sub-angstrom variations. We constructed a set of predicted metalloprotein structures with no solved crystal structures and no detectable homology to our training data. Our model had an accuracy of 90%-97.5% depending on the quality of the predicted structures included in our test. Finally, we found the physicochemical trends that drove this model's successful performance were local protein density, second shell ionizable residue burial, and the pocket's accessibility to the site. We anticipate that our model's ability to correctly identify catalytic metal sites could enable identification of new enzymatic mechanisms and improve de novo metalloenzyme design success rates.
Significance statement: Identification of enzyme active sites on proteins with unsolved crystallographic structures can accelerate discovery of novel biochemical reactions, which can impact healthcare, industrial processes, and environmental remediation. Our lab has developed an ML tool for predicting sites on computationally generated protein structures as enzymatic and non-enzymatic. We have made our tool available on a webserver, allowing the scientific community to rapidly search previously unknown protein function space.
Affiliation(s)
- Ryan Feehan
  - Center for Computational Biology, The University of Kansas, 2030 Becker Dr., Lawrence, KS 66047
- Matthew Copeland
  - Center for Computational Biology, The University of Kansas, 2030 Becker Dr., Lawrence, KS 66047
- Meghan W. Franklin
  - Center for Computational Biology, The University of Kansas, 2030 Becker Dr., Lawrence, KS 66047
- Joanna S. G. Slusky
  - Center for Computational Biology, The University of Kansas, 2030 Becker Dr., Lawrence, KS 66047
  - Department of Molecular Biosciences, The University of Kansas, 1200 Sunnyside Ave., Lawrence, KS 66045-3101
|
9
|
Yin YH, Shen LC, Jiang Y, Gao S, Song J, Yu DJ. Improving the prediction of DNA-protein binding by integrating multi-scale dense convolutional network with fault-tolerant coding. Anal Biochem 2022; 656:114878. [DOI: 10.1016/j.ab.2022.114878] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/20/2022] [Revised: 08/18/2022] [Accepted: 08/23/2022] [Indexed: 11/01/2022]
|
10
|
Mathur G, Pandey A, Goyal S. A comprehensive tool for rapid and accurate prediction of disease using DNA sequence classifier. J Ambient Intell Humaniz Comput 2022; 14:1-17. [PMID: 35789598 PMCID: PMC9243743 DOI: 10.1007/s12652-022-04099-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/10/2021] [Accepted: 06/06/2022] [Indexed: 06/15/2023]
Abstract
In the current pandemic, the coronavirus is spreading very fast and can jump from one human to another. In addition, there are millions of viruses, for example Ebola and SARS, that can spread as fast as the coronavirus due to the mobilization and globalization of the population and are equally deadly. Earlier identification of these viruses can prevent outbreaks such as the one we currently face and can help in the earlier design of drugs. Identification of disease at an early stage can be achieved through DNA sequence classification, as DNA carries most of the genetic information about organisms. This is why the classification of DNA sequences plays an important role in computational biology. This paper presents a solution in which samples collected from NCBI are used for the classification of DNA sequences. DNA sequence classification in turn gives the patterns of various diseases; these patterns are then compared with the samples of a newly infected person and can help in the earlier identification of disease. However, feature extraction always remains a big issue. In this paper, a machine learning-based classifier and a new technique for extracting features from DNA sequences based on a hot vector matrix are proposed. In the hot vector representation of the DNA sequence, each pair of the word is represented using a binary matrix which represents the position of each nucleotide in the DNA sequence. The resultant matrix is then given as an input to the traditional CNN for feature extraction. The results of the proposed method have been compared with five well-known classifiers, namely Convolutional Neural Network (CNN), Support Vector Machine (SVM), K-Nearest Neighbor (KNN), Decision Trees, and Recurrent Neural Network (RNN), on several parameters including precision rate and accuracy, and the results show that the proposed method gives an accuracy of 93.9%, the highest among the compared classifiers.
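The binary "hot vector" matrix mentioned in the abstract above is related to ordinary one-hot encoding. The sketch below shows plain per-nucleotide one-hot encoding only; it is a simplification under stated assumptions, not the authors' pair-based scheme, and the names `ALPHABET` and `one_hot` are hypothetical.

```python
ALPHABET = "ACGT"  # assumed nucleotide order for the columns


def one_hot(sequence):
    """Return one row per nucleotide, with a single 1 in the matching column.

    Characters outside ALPHABET (for example the ambiguity code N) are
    encoded as an all-zero row.
    """
    return [
        [1 if base == letter else 0 for letter in ALPHABET]
        for base in sequence.upper()
    ]


matrix = one_hot("ACGTN")
for base, row in zip("ACGTN", matrix):
    print(base, row)
```

A matrix of this shape (sequence length by 4) is the kind of input a 1D or 2D convolutional network can consume directly.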
Affiliation(s)
- Garima Mathur
  - Department of Computer Science and Engineering, UIT, RGPV, Bhopal, India
- Anjana Pandey
  - Department of Information Technology, UIT, RGPV, Bhopal, India
- Sachin Goyal
  - Department of Information Technology, UIT, RGPV, Bhopal, India
|
11
|
Naseer S, Hussain W, Khan YD, Rasool N. iPhosS(Deep)-PseAAC: Identification of Phosphoserine Sites in Proteins Using Deep Learning on General Pseudo Amino Acid Compositions. IEEE/ACM Trans Comput Biol Bioinform 2022; 19:1703-1714. [PMID: 33242308 DOI: 10.1109/tcbb.2020.3040747] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Indexed: 06/11/2023]
Abstract
Among all the PTMs, protein phosphorylation is pivotal for various pathological and physiological processes. About 30 percent of eukaryotic proteins undergo phosphorylation, leading to various changes in conformation, function, stability, localization, and so forth. In eukaryotic proteins, phosphorylation occurs on serine (S), threonine (T), and tyrosine (Y) residues. Among these, serine phosphorylation has its own importance, as it is associated with various important biological processes, including energy metabolism, signal transduction pathways, cell cycling, and apoptosis. Thus, its identification is important; however, in vitro, ex vivo, and in vivo identification can be laborious, time-consuming, and costly. There is a dire need for an efficient and accurate computational model to help researchers and biologists identify these sites easily. Herein, we propose a novel predictor for identification of phosphoserine sites (PhosS) in proteins by integrating Chou's Pseudo Amino Acid Composition (PseAAC) with deep features. We used well-known DNNs both for learning a feature representation of peptide sequences and for performing classification. Among the different DNNs, the best score is shown by the Convolutional Neural Network (CNN) based model, which makes the CNN-based prediction model the best for phosphoserine prediction. Based on these results, it is concluded that the proposed model can help to identify PhosS sites in a very efficient and accurate manner, which can help scientists understand the mechanism of this modification in proteins.
|
12
|
Hot spots-making directed evolution easier. Biotechnol Adv 2022; 56:107926. [DOI: 10.1016/j.biotechadv.2022.107926] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 08/24/2021] [Revised: 01/04/2022] [Accepted: 02/07/2022] [Indexed: 01/20/2023]
|
13
|
Shahid M, Ilyas M, Hussain W, Khan YD. ORI-Deep: improving the accuracy for predicting origin of replication sites by using a blend of features and long short-term memory network. Brief Bioinform 2022; 23:6511972. [PMID: 35048955 DOI: 10.1093/bib/bbac001] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 11/04/2021] [Revised: 12/30/2021] [Accepted: 01/02/2022] [Indexed: 11/14/2022] Open
Abstract
Replication of DNA is an important process for the cell division cycle, gene expression regulation and other biological evolution processes. It also has a crucial role in a living organism's physical growth and structure. Replication of DNA comprises three stages known as initiation, elongation and termination, and the origin of replication site (ORI) is the location at which the DNA replication process is initiated. Various methodologies exist to identify ORIs in genomic sequences; however, these methods either require extensive computation or have limited optimization for large datasets. Herein, a model called ORI-Deep is proposed to identify ORIs from the multiple cell type genomic sequence benchmark data. An efficient method is proposed using a deep neural network to identify ORIs for four different eukaryotic species. For better representation of data, a feature vector is constructed using statistical moments for the training and testing of data and is further fed to a long short-term memory (LSTM) network. To prove the effectiveness of the proposed model, we applied several validation techniques at different levels to obtain seven accuracy metrics, and the accuracy scores for self-consistency, 10-fold cross-validation, jackknife and the independent set test are observed to be 0.977, 0.948, 0.976 and 0.977, respectively. Based on the results, it can be concluded that ORI-Deep can efficiently predict the sites of origin of replication in DNA sequences with high accuracy. Webserver for ORI-Deep is available at (https://share.streamlit.io/waqarhusain/orideep/main/app.py), whereas source code is available at (https://github.com/WaqarHusain/OriDeep).
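The abstract above reports both 10-fold cross-validation and jackknife scores. The two differ only in how samples are partitioned: jackknife (leave-one-out) is k-fold with k equal to the number of samples. The sketch below illustrates the partitioning only, under the assumption of unshuffled round-robin folds; `k_fold_indices` is a hypothetical helper, not code from ORI-Deep.

```python
def k_fold_indices(n_samples, k):
    """Yield (train, test) index lists for k-fold cross-validation.

    Samples are assigned to folds round-robin without shuffling; real
    pipelines usually shuffle (and often stratify) first.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must satisfy 2 <= k <= n_samples")
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for held_out in range(k):
        test = folds[held_out]
        train = [
            idx
            for fold_id, fold in enumerate(folds)
            if fold_id != held_out
            for idx in fold
        ]
        yield train, test


# 10-fold split of 20 samples, and the jackknife (leave-one-out) case.
ten_fold = list(k_fold_indices(20, 10))
jackknife = list(k_fold_indices(20, 20))
print(len(ten_fold), len(jackknife))
```

Each sample appears in exactly one test fold, so every prediction used in the reported score comes from a model that never saw that sample during training.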
Affiliation(s)
- Mahwish Shahid
- School of Systems and Technologies, University of Management and Technology, Lahore, Pakistan
| | - Maham Ilyas
- University of Management and Technology, Lahore, Pakistan
| | - Waqar Hussain
- University of Management and Technology, Lahore, Pakistan
| | - Yaser Daanial Khan
- Department of Computer Science, University of Management and Technology, Lahore, Pakistan
| |
|
14
|
Singh N, Bhatnagar S. Machine Learning for Prediction of Drug Targets in Microbe Associated Cardiovascular Diseases by Incorporating Host-pathogen Interaction Network Parameters. Mol Inform 2021; 41:e2100115. [PMID: 34676983 DOI: 10.1002/minf.202100115] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 09/17/2021] [Accepted: 10/01/2021] [Indexed: 12/20/2022]
Abstract
Host-pathogen interactions play a crucial role in invasion, infection, and induction of immune response in humans. In this work, four machine learning algorithms, namely Logistic regression, K-nearest neighbor, Support Vector Machine, and Random Forest were implemented for the classification of drug targets. The algorithms were trained using 3400 host and 3800 pathogen drug and non-drug target proteins as learning instances. For each protein, 68 pathogen and 73 host features were computed that included sequence, structure, biological and host-pathogen network centrality characteristics. The Random Forest classifier model achieved the best accuracy after 10-fold cross-validation. An accuracy of 99% was achieved with a ROC-AUC score of 0.99±0.01 for both pathogen and host training sets. The Eigenvector Centrality of host-pathogen interactions and host-host interactions was the top feature in performing classification of pathogen and host targets, respectively. Other features important for classification were the presence of catalytic and binding sites, low instability/aliphatic index, and cellular location. The Random Forest classifier was then used for prediction of drug targets involved in Microbe Associated Cardiovascular Diseases. In total, 331 host and 743 pathogen proteins were predicted as drug targets by the random forest model and can be validated experimentally for therapeutic intervention in Microbe Associated Cardiovascular Diseases.
Affiliation(s)
- Nirupma Singh
- Department of Biotechnology, Netaji Subhas Institute of Technology, Dwarka, New Delhi, 110078, India
| | - Sonika Bhatnagar
- Department of Biotechnology, Netaji Subhas Institute of Technology, Dwarka, New Delhi, 110078, India.,Computational and Structural Biology Laboratory, Department of Biological Sciences and Engineering, Netaji Subhas University of Technology Dwarka, New Delhi, 110078, India
| |
|
15
|
Han K, Shen LC, Zhu YH, Xu J, Song J, Yu DJ. MAResNet: predicting transcription factor binding sites by combining multi-scale bottom-up and top-down attention and residual network. Brief Bioinform 2021; 23:6399874. [PMID: 34664074 DOI: 10.1093/bib/bbab445] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 07/25/2021] [Revised: 09/06/2021] [Accepted: 09/28/2021] [Indexed: 11/14/2022] Open
Abstract
Accurate identification of transcription factor binding sites is of great significance in understanding gene expression, biological development and drug design. Although a variety of methods based on deep-learning models and large-scale data have been developed to predict transcription factor binding sites in DNA sequences, there is room for further improvement in prediction performance. In addition, effective interpretation of deep-learning models is greatly desirable. Here we present MAResNet, a new deep-learning method, for predicting transcription factor binding sites on 690 ChIP-seq datasets. More specifically, MAResNet combines the bottom-up and top-down attention mechanisms and a state-of-the-art feed-forward network (ResNet), which is constructed by stacking attention modules that generate attention-aware features. In particular, the multi-scale attention mechanism is utilized at the first stage to extract rich and representative sequence features. We further discuss the attention-aware features learned from different attention modules in accordance with the changes as the layers go deeper. The features learned by MAResNet are also visualized through the TMAP tool to illustrate that the method can extract the unique characteristics of transcription factor binding sites. The performance of MAResNet is extensively tested on 690 test subsets with an average AUC of 0.927, which is higher than that of the current state-of-the-art methods. Overall, this study provides a new and useful framework for the prediction of transcription factor binding sites by combining the funnel attention modules with the residual network.
Affiliation(s)
- Ke Han
- School of Computer Science and Engineering, Nanjing University of Science and Technology, 200 Xiaolingwei, Nanjing, 210094, China
| | - Long-Chen Shen
- School of Computer Science and Engineering, Nanjing University of Science and Technology, 200 Xiaolingwei, Nanjing, 210094, China
| | - Yi-Heng Zhu
- School of Computer Science and Engineering, Nanjing University of Science and Technology, 200 Xiaolingwei, Nanjing, 210094, China
| | - Jian Xu
- School of Computer Science and Engineering, Nanjing University of Science and Technology, 200 Xiaolingwei, Nanjing, 210094, China
| | - Jiangning Song
- Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia.,Monash Centre for Data Science, Faculty of Information Technology, Monash University, Melbourne, VIC 3800, Australia
| | - Dong-Jun Yu
- School of Computer Science and Engineering, Nanjing University of Science and Technology, 200 Xiaolingwei, Nanjing, 210094, China
| |
|
16
|
Zhang S, Zhao L, Zheng CH, Xia J. A feature-based approach to predict hot spots in protein-DNA binding interfaces. Brief Bioinform 2021; 21:1038-1046. [PMID: 30957840 DOI: 10.1093/bib/bbz037] [Citation(s) in RCA: 21] [Impact Index Per Article: 7.0] [Received: 01/14/2019] [Revised: 02/20/2019] [Accepted: 03/07/2019] [Indexed: 12/21/2022] Open
Abstract
DNA-binding hot spot residues of proteins are dominant and fundamental interface residues that contribute most of the binding free energy of protein-DNA interfaces. As experimental methods for identifying hot spots are expensive and time consuming, computational approaches are urgently required in predicting hot spots on a large scale. In this work, we systematically assessed a wide variety of 114 features from a combination of the protein sequence, structure, network and solvent accessible information and their combinations along with various feature selection strategies for hot spot prediction. We then trained and compared four commonly used machine learning models, namely, support vector machine (SVM), random forest, Naïve Bayes and k-nearest neighbor, for the identification of hot spots using 10-fold cross-validation and the independent test set. Our results show that (1) features based on the solvent accessible surface area have significant effect on hot spot prediction; (2) different but complementary features generally enhance the prediction performance; and (3) SVM outperforms other machine learning methods on both training and independent test sets. In an effort to improve predictive performance, we developed a feature-based method, namely, PrPDH (Prediction of Protein-DNA binding Hot spots), for the prediction of hot spots in protein-DNA binding interfaces using SVM based on the selected 10 optimal features. Comparative results on benchmark data sets indicate that our predictor is able to achieve generally better performance in predicting hot spots compared to the state-of-the-art predictors. A user-friendly web server for PrPDH is well established and is freely available at http://bioinfo.ahu.edu.cn:8080/PrPDH.
Affiliation(s)
- Sijia Zhang
- Institutes of Physical Science and Information Technology, School of Computer Science and Technology, Anhui University, Hefei, Anhui, China
| | - Le Zhao
- Institutes of Physical Science and Information Technology, School of Computer Science and Technology, Anhui University, Hefei, Anhui, China
| | - Chun-Hou Zheng
- Institutes of Physical Science and Information Technology, School of Computer Science and Technology, Anhui University, Hefei, Anhui, China
| | - Junfeng Xia
- Institutes of Physical Science and Information Technology, School of Computer Science and Technology, Anhui University, Hefei, Anhui, China
| |
|
17
|
Xie J, Zhao C, Sun J, Li J, Yang F, Wang J, Nie Q. Prediction of Essential Genes in Comparison States Using Machine Learning. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2021; 18:1784-1792. [PMID: 32991286 DOI: 10.1109/tcbb.2020.3027392] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/11/2023]
Abstract
Identifying essential genes in comparison states (EGS) is vital to understanding cell differentiation, performing drug discovery, and identifying disease causes. Here, we present a machine learning method termed Prediction of Essential Genes in Comparison States (PreEGS). To capture the alteration of the network in comparison states, PreEGS extracts topological and gene expression features of each gene in a five-dimensional vector. PreEGS also recruits a positive sample expansion method to address the problem of unbalanced positive and negative samples, which is often encountered in practical applications. Different classifiers are applied to the simulated datasets, and the PreEGS based on the random forests model (PreEGSRF) was chosen for optimal performance. PreEGSRF was then compared with six other methods, including three machine learning methods, to predict EGS in a specific state. On real datasets with four gene regulatory networks, PreEGSRF predicted five essential genes related to leukemia and five enriched KEGG pathways. Four of the predicted essential genes and all predicted pathways were consistent with previous studies and highly correlated with leukemia. With high prediction accuracy and generalization ability, PreEGSRF is broadly applicable for the discovery of disease-causing genes, driver genes for cell fate decisions, and complex biomarkers of biological systems.
|
18
|
Shen Y, Zhang Y, Xue W, Yue Z. dbMCS: A Database for Exploring the Mutation Markers of Anti-Cancer Drug Sensitivity. IEEE J Biomed Health Inform 2021; 25:4229-4237. [PMID: 34314366 DOI: 10.1109/jbhi.2021.3100424] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/08/2022]
Abstract
The identification of mutation markers and the selection of appropriate treatment for patients with specific genome mutations are important steps in the development of targeted therapies and the realization of precision medicine for human cancers. To investigate the baseline characteristics of drug sensitivity markers and develop computational methods of mutation effect prediction, we present a manually curated online database of mutation Markers for anti-Cancer drug Sensitivity (dbMCS). Currently, dbMCS contains 1271 mutations and 4427 mutation-disease-drug associations (3151 and 1276 for sensitivity and resistance, respectively) with their PubMed indexed articles. By comparing the mutations in dbMCS with the putative neutral polymorphisms, we investigated the characteristics of drug sensitivity markers. We found that the mutation markers tend to significantly impact highly conserved regions in both DNA sequences and protein domains. Some of them also presented pleiotropic effects depending on the tumor context, appearing concurrently in the sensitivity and resistance categories. In addition, we preliminarily explored machine learning-based methods for identifying mutation markers of anti-cancer drug sensitivity and produced optimistic results, which suggests that a reliable dataset may provide new insights and essential clues for future cancer pharmacogenomics studies. dbMCS is available at http://bioinfo.aielab.cc/dbMCS/.
|
19
|
Rauer C, Sen N, Waman VP, Abbasian M, Orengo CA. Computational approaches to predict protein functional families and functional sites. Curr Opin Struct Biol 2021; 70:108-122. [PMID: 34225010 DOI: 10.1016/j.sbi.2021.05.012] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 03/29/2021] [Revised: 05/13/2021] [Accepted: 05/25/2021] [Indexed: 01/06/2023]
Abstract
Understanding the mechanisms of protein function is indispensable for many biological applications, such as protein engineering and drug design. However, experimental annotations are sparse, and therefore, theoretical strategies are needed to fill the gap. Here, we present the latest developments in building functional subclassifications of protein superfamilies and using evolutionary conservation to detect functional determinants, for example, catalytic-, binding- and specificity-determining residues important for delineating the functional families. We also briefly review other features exploited for functional site detection and new machine learning strategies for combining multiple features.
Affiliation(s)
- Clemens Rauer
- Institute of Structural and Molecular Biology, University College London, London, WC1E 6BT, UK
| | - Neeladri Sen
- Institute of Structural and Molecular Biology, University College London, London, WC1E 6BT, UK
| | - Vaishali P Waman
- Institute of Structural and Molecular Biology, University College London, London, WC1E 6BT, UK
| | - Mahnaz Abbasian
- Institute of Structural and Molecular Biology, University College London, London, WC1E 6BT, UK
| | - Christine A Orengo
- Institute of Structural and Molecular Biology, University College London, London, WC1E 6BT, UK.
| |
|
20
|
Brysbaert G, Lensink MF. Centrality Measures in Residue Interaction Networks to Highlight Amino Acids in Protein–Protein Binding. FRONTIERS IN BIOINFORMATICS 2021; 1:684970. [PMID: 36303777 PMCID: PMC9581030 DOI: 10.3389/fbinf.2021.684970] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 03/24/2021] [Accepted: 05/17/2021] [Indexed: 12/21/2022] Open
Abstract
Residue interaction networks (RINs) describe a protein structure as a network of interacting residues. Central nodes in these networks, identified by centrality analyses, highlight those residues that play a role in the structure and function of the protein. However, little is known about the capability of such analyses to identify residues involved in the formation of macromolecular complexes. Here, we performed six different centrality measures on the RINs generated from the complexes of the SKEMPI 2 database of changes in protein–protein binding upon mutation in order to evaluate the capability of each of these measures to identify major binding residues. The analyses were performed with and without the crystallographic water molecules, in addition to the protein residues. We also investigated the use of a weight factor based on the inter-residue distances to improve the detection of these residues. We show that for the identification of major binding residues, closeness, degree, and PageRank result in good precision, whereas betweenness, eigenvector, and residue centrality analyses give a higher sensitivity. Including water in the analysis improves the sensitivity of all measures without losing precision. Applying weights only slightly raises the sensitivity of eigenvector centrality analysis. We finally show that a combination of multiple centrality analyses is the optimal approach to identify residues that play a role in protein–protein interaction.
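Two of the centrality measures named above, degree and closeness, can be sketched on a toy residue network. The adjacency-dictionary representation, the residue labels and the function names are hypothetical, and the graph is assumed to be undirected, unweighted and connected with at least two nodes; this is not the pipeline used in the study.

```python
from collections import deque


def degree_centrality(graph):
    """Fraction of the other nodes that each node is directly connected to."""
    n = len(graph)
    return {node: len(neighbours) / (n - 1) for node, neighbours in graph.items()}


def closeness_centrality(graph):
    """Number of reachable nodes divided by the sum of shortest-path distances to them."""
    result = {}
    for source in graph:
        # Breadth-first search gives shortest path lengths in an unweighted graph.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            current = queue.popleft()
            for neighbour in graph[current]:
                if neighbour not in dist:
                    dist[neighbour] = dist[current] + 1
                    queue.append(neighbour)
        total = sum(dist.values())
        result[source] = (len(dist) - 1) / total if total else 0.0
    return result


# Toy network of three residues in a chain: ALA1 - GLY2 - SER3 (labels are made up).
toy_rin = {"ALA1": ["GLY2"], "GLY2": ["ALA1", "SER3"], "SER3": ["GLY2"]}
```

In this toy chain the middle residue scores highest on both measures, which is the kind of signal a centrality analysis uses to flag candidate residues.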
|
21
|
Feehan R, Franklin MW, Slusky JSG. Machine learning differentiates enzymatic and non-enzymatic metals in proteins. Nat Commun 2021; 12:3712. [PMID: 34140507 PMCID: PMC8211803 DOI: 10.1038/s41467-021-24070-3] [Citation(s) in RCA: 17] [Impact Index Per Article: 5.7] [Received: 02/16/2021] [Accepted: 06/02/2021] [Indexed: 11/09/2022] Open
Abstract
Metalloenzymes are 40% of all enzymes and can perform all seven classes of enzyme reactions. Because of the physicochemical similarities between the active sites of metalloenzymes and inactive metal binding sites, it is challenging to differentiate between them. Yet distinguishing these two classes is critical for the identification of both native and designed enzymes. Because of similarities between catalytic and non-catalytic metal binding sites, finding physicochemical features that distinguish these two types of metal sites can indicate aspects that are critical to enzyme function. In this work, we develop the largest structural dataset of enzymatic and non-enzymatic metalloprotein sites to date. We then use a decision-tree ensemble machine learning model to classify metals bound to proteins as enzymatic or non-enzymatic with 92.2% precision and 90.1% recall. Our model scores electrostatic and pocket lining features as more important than pocket volume, despite the fact that volume is the most quantitatively different feature between enzyme and non-enzymatic sites. Finally, we find our model has overall better performance in a side-to-side comparison against other methods that differentiate enzymatic from non-enzymatic sequences. We anticipate that our model's ability to correctly identify which metal sites are responsible for enzymatic activity could enable identification of new enzymatic mechanisms and de novo enzyme design.
Affiliation(s)
- Ryan Feehan
- Center for Computational Biology, The University of Kansas, Lawrence, KS, USA
| | - Meghan W Franklin
- Center for Computational Biology, The University of Kansas, Lawrence, KS, USA
| | - Joanna S G Slusky
- Center for Computational Biology, The University of Kansas, Lawrence, KS, USA.
- Department of Molecular Biosciences, The University of Kansas, Lawrence, KS, USA.
| |
|
22
|
Malebary SJ, Khan YD. Evaluating machine learning methodologies for identification of cancer driver genes. Sci Rep 2021; 11:12281. [PMID: 34112883 PMCID: PMC8192921 DOI: 10.1038/s41598-021-91656-8] [Citation(s) in RCA: 28] [Impact Index Per Article: 9.3] [Received: 01/06/2021] [Accepted: 05/19/2021] [Indexed: 02/06/2023] Open
Abstract
Cancer is driven by distinctive sorts of changes and basic variations in genes. Recognizing cancer driver genes is basic for accurate oncological analysis. Numerous methodologies to distinguish and identify drivers presently exist, but efficient tools to combine and optimize them on huge datasets are few. Most strategies for prioritizing transformations depend basically on frequency-based criteria. Strategies are required to dependably prioritize organically dynamic driver changes over inert passengers in high-throughput sequencing cancer information sets. This study proposes a model, PCDG-Pred, which works as a utility capable of distinguishing cancer driver and passenger attributes of genes based on sequencing data. Keeping in view the significance of the cancer driver genes, an efficient method is proposed to identify them. Further, various validation techniques are applied at different levels to establish the effectiveness of the model and to obtain metrics such as accuracy, Matthews correlation coefficient, sensitivity, and specificity. The results of the study strongly indicate that the proposed strategy provides a fundamental functional advantage over other existing strategies for cancer driver gene identification. Subsequently, careful experiments show that the accuracy metrics obtained for self-consistency, independent set, and cross-validation tests are 91.08%, 87.26%, and 92.48%, respectively.
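The four metrics named in this abstract have standard definitions in terms of a binary confusion matrix. The sketch below applies those definitions; the function name `binary_metrics` and the convention of returning 0.0 when a denominator is zero are assumptions, and the code is not taken from PCDG-Pred.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity, specificity and Matthews correlation coefficient
    from the four cells of a binary confusion matrix."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    # Sensitivity (recall): share of actual positives that were predicted positive.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    # Specificity: share of actual negatives that were predicted negative.
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: report 0.0 when the MCC denominator is zero.
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "mcc": mcc,
    }
```

Unlike accuracy, the Matthews correlation coefficient stays near zero for a classifier that guesses at random, which is why it is often reported alongside accuracy on unbalanced data.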
Affiliation(s)
- Sharaf J Malebary
- Department of Information Technology, Faculty of Computing and Information Technology, King Abdulaziz University, P.O. Box 344, Rabigh, 21911, Saudi Arabia
| | - Yaser Daanial Khan
- Department of Computer Science, School of Systems and Technology, University of Management and Technology, Lahore, Pakistan.
| |
|
23
|
Naseer S, Hussain W, Khan YD, Rasool N. NPalmitoylDeep-PseAAC: A Predictor of N-Palmitoylation Sites in Proteins Using Deep Representations of Proteins and PseAAC via Modified 5-Steps Rule. Curr Bioinform 2021. [DOI: 10.2174/1574893615999200605142828] [Citation(s) in RCA: 28] [Impact Index Per Article: 9.3] [Indexed: 11/22/2022]
Abstract
Background:
Among all the major post-translational modifications, lipid modifications possess special significance due to their widespread functional importance in eukaryotic cells. There exist multiple types of lipid modifications, and palmitoylation is one of the broader types, itself having three different types. N-Palmitoylation is carried out by attachment of palmitic acid to an N-terminal cysteine. Due to the association of N-Palmitoylation with various biological functions and diseases such as Alzheimer’s and other neurodegenerative diseases, its identification is very important.
Objective:
The in vitro, ex vivo and in vivo identification of palmitoylation is laborious, time-consuming and costly. There is a dire need for an efficient and accurate computational model to help researchers and biologists identify these sites easily. Herein, we propose a novel prediction model for the identification of N-Palmitoylation sites in proteins.
Method:
The proposed prediction model is developed by combining Chou’s Pseudo Amino Acid Composition (PseAAC) with deep neural networks. We used well-known deep neural networks (DNNs) both for learning a feature representation of peptide sequences and for developing a prediction model to perform classification.
Results:
Among the different DNNs, the Gated Recurrent Unit (GRU) based RNN model showed the highest scores in terms of accuracy and all other computed measures, and outperformed all previously reported predictors.
Conclusion:
The proposed GRU based RNN model can help to identify N-Palmitoylation in a very efficient and accurate manner, which can help scientists understand the mechanism of this modification in proteins.
Affiliation(s)
- Sheraz Naseer
- Department of Computer Science, School of Systems and Technology, University of Management and Technology, P.O. Box 10033, C-II, Johar Town, Lahore 54770, Pakistan
| | - Waqar Hussain
- National Center of Artificial Intelligence, Punjab University College of Information Technology, University of the Punjab, Lahore, Pakistan
| | - Yaser Daanial Khan
- Department of Computer Science, School of Systems and Technology, University of Management and Technology, P.O. Box 10033, C-II, Johar Town, Lahore 54770, Pakistan
| | - Nouman Rasool
- Dr Panjwani Center for Molecular Medicine and Drug Research, International Center for Chemical and Biological Sciences, University of Karachi, Karachi, 75270, Pakistan
| |
|
24
|
Awais M, Hussain W, Khan YD, Rasool N, Khan SA, Chou KC. iPhosH-PseAAC: Identify Phosphohistidine Sites in Proteins by Blending Statistical Moments and Position Relative Features According to the Chou's 5-Step Rule and General Pseudo Amino Acid Composition. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2021; 18:596-610. [PMID: 31144645 DOI: 10.1109/tcbb.2019.2919025] [Citation(s) in RCA: 48] [Impact Index Per Article: 16.0] [Indexed: 06/09/2023]
Abstract
Protein phosphorylation is one of the key mechanisms in prokaryotes and eukaryotes and is responsible for various biological functions such as protein degradation, intracellular localization, the multitude of cellular processes, molecular association, cytoskeletal dynamics, and enzymatic inhibition/activation. Phosphohistidine (PhosH) has a key role in a number of biological processes, ranging from central metabolism to signalling in eukaryotes and bacteria. Thus, identification of phosphohistidine sites in a protein sequence is crucial, and experimental identification can be expensive, time-consuming, and laborious. To address this problem, we propose a novel computational model, iPhosH-PseAAC, for prediction of phosphohistidine sites in a given protein sequence using pseudo amino acid composition (PseAAC), statistical moments, and position relative features. The results of the proposed predictor are validated through self-consistency testing, 10-fold cross-validation, and jackknife testing. Self-consistency validation gave 100 percent accuracy, whereas the accuracy achieved for cross-validation is 94.26 percent. Moreover, jackknife testing gave 97.07 percent accuracy for the proposed model. Thus, the proposed model iPhosH-PseAAC has a strong ability to predict PhosH sites in given proteins.
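The difference between 10-fold cross-validation and jackknife testing comes down to how samples are held out: jackknife (leave-one-out) is the special case in which the number of folds equals the number of samples. A minimal sketch of the index splitting follows; the function name `k_fold_indices` and the interleaved, unshuffled fold assignment are assumptions for illustration, not the procedure used by the authors.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation.

    Fold i holds out samples i, i + k, i + 2k, ... (interleaved assignment,
    no shuffling). Setting k == n_samples gives jackknife (leave-one-out).
    """
    if not 1 < k <= n_samples:
        raise ValueError("k must satisfy 1 < k <= n_samples")
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(n_samples) if i not in held_out]
        yield train_idx, test_idx


# 10-fold cross-validation over 100 samples: each fold holds out 10 samples.
ten_fold = list(k_fold_indices(100, 10))
# Jackknife over 5 samples: each fold holds out exactly one sample.
jackknife = list(k_fold_indices(5, 5))
```

Self-consistency testing, by contrast, evaluates the model on the same samples it was trained on, which is why it tends to give the most optimistic accuracy of the three.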
|
25
|
Feehan R, Montezano D, Slusky JSG. Machine learning for enzyme engineering, selection and design. Protein Eng Des Sel 2021; 34:gzab019. [PMID: 34296736 PMCID: PMC8299298 DOI: 10.1093/protein/gzab019] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 12/14/2020] [Revised: 06/18/2021] [Accepted: 06/23/2021] [Indexed: 11/15/2022] Open
Abstract
Machine learning is a useful computational tool for large and complex tasks such as those in the field of enzyme engineering, selection and design. In this review, we examine enzyme-related applications of machine learning. We start by comparing tools that can identify the function of an enzyme and the site responsible for that function. Then we detail methods for optimizing important experimental properties, such as the enzyme environment and enzyme reactants. We describe recent advances in enzyme systems design and enzyme design itself. Throughout we compare and contrast the data and algorithms used for these tasks to illustrate how the algorithms and data can be best used by future designers.
Affiliation(s)
- Ryan Feehan
- Center for Computational Biology, The University of Kansas, 2030 Becker Dr., Lawrence, KS 66047-1620, USA
| | - Daniel Montezano
- Center for Computational Biology, The University of Kansas, 2030 Becker Dr., Lawrence, KS 66047-1620, USA
| | - Joanna S G Slusky
- Center for Computational Biology, The University of Kansas, 2030 Becker Dr., Lawrence, KS 66047-1620, USA
- Department of Molecular Biosciences, The University of Kansas, 1200 Sunnyside Ave. Lawrence, KS 66045-7600, USA
| |
|
26
|
iDRP-PseAAC: Identification of DNA Replication Proteins Using General PseAAC and Position Dependent Features. Int J Pept Res Ther 2021; 27:1315-1329. [PMID: 33584161 PMCID: PMC7869428 DOI: 10.1007/s10989-021-10170-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 01/18/2021] [Indexed: 10/25/2022]
Abstract
DNA replication is one of the specific processes to be considered in all living organisms, specifically eukaryotes. The prevalence of DNA replication is significant for an evolutionary transition at the beginning of life. DNA replication proteins are those proteins which support the process of replication and are also reported to be important in drug design and discovery. This indicates that DNA replication proteins have a very important role in the human body; however, to study their mechanism, their identification is necessary. Experimental identification is time-consuming, costly and laborious. To cope with this issue, a computational methodology is required for prediction of these proteins; however, no prior method exists. This study covers the construction of a novel prediction model to serve this purpose. The prediction model is based on an artificial neural network trained on position relative features and sequence statistical moments integrated in PseAAC. The highest overall accuracy was achieved through tenfold cross-validation and jackknife testing, computed to be 96.22% and 98.56%, respectively. Our experimental results demonstrate that the proposed predictor surpasses the existing models and can serve as a time- and cost-effective strategy for designing novel drugs against contemporary bacterial infections.
|
27
|
|
28
|
Nguyen TTD, Le NQK, Ho QT, Phan DV, Ou YY. TNFPred: identifying tumor necrosis factors using hybrid features based on word embeddings. BMC Med Genomics 2020; 13:155. [PMID: 33087125 PMCID: PMC7579990 DOI: 10.1186/s12920-020-00779-w] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Indexed: 12/23/2022] Open
Abstract
Background Cytokines are a class of small proteins that act as chemical messengers and play a significant role in essential cellular processes including immunity regulation, hematopoiesis, and inflammation. As one important family of cytokines, tumor necrosis factors are associated with the regulation of various biological processes such as proliferation and differentiation of cells, apoptosis, lipid metabolism, and coagulation. The implication of these cytokines can also be seen in various diseases such as insulin resistance, autoimmune diseases, and cancer. Considering the interdependence between this kind of cytokine and others, classifying tumor necrosis factors from other cytokines is a challenge for biological scientists. Methods In this research, we employed a word embedding technique to create hybrid features, which proved to efficiently identify tumor necrosis factors given cytokine sequences. We segmented each protein sequence into protein words and created a corresponding word embedding for each word. Then, a word embedding-based vector for each sequence was created and input into machine learning classification models. When extracting feature sets, we not only diversified the segmentation sizes of protein sequences but also conducted different combinations among split grams to find the best features, which generated the optimal prediction. Furthermore, our methodology follows a well-defined procedure to build a reliable classification tool. Results With our proposed hybrid features, prediction models obtain more promising performance compared to seven prominent sequence-based feature kinds. Results from 10 independent runs on the surveyed dataset show that, on average, our optimal models obtain an area under the curve of 0.984 and 0.998 on 5-fold cross-validation and independent test, respectively. Conclusions These results show that biologists can use our model to identify tumor necrosis factors from other cytokines efficiently. Moreover, this study proves that natural language processing techniques can be applied reasonably to help biologists solve bioinformatics problems efficiently.
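As a rough illustration of the segmentation step described in the Methods of this abstract, the Python sketch below splits a protein sequence into n-gram "protein words", averages their embedding vectors into one fixed-length vector, and concatenates the vectors obtained for several word sizes. The function names, the overlapping-window segmentation, the averaging step, and the toy embedding tables are all assumptions made for illustration; the abstract does not specify them, and a trained embedding model would replace the hand-written tables in practice.

```python
from typing import Dict, List

Embedding = Dict[str, List[float]]


def protein_words(seq: str, n: int) -> List[str]:
    """Split a sequence into overlapping n-grams (assumed segmentation scheme)."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]


def sequence_vector(seq: str, n: int, embeddings: Embedding, dim: int) -> List[float]:
    """Average the embeddings of all known words; return zeros if none is known."""
    known = [embeddings[w] for w in protein_words(seq, n) if w in embeddings]
    if not known:
        return [0.0] * dim
    return [sum(vec[j] for vec in known) / len(known) for j in range(dim)]


def hybrid_vector(seq: str, tables: Dict[int, Embedding], dim: int) -> List[float]:
    """Concatenate the per-size sequence vectors (one block per word size n)."""
    features: List[float] = []
    for n in sorted(tables):
        features.extend(sequence_vector(seq, n, tables[n], dim))
    return features


# Toy, hand-written embedding tables (hypothetical values, not trained).
toy_tables = {
    1: {"M": [1.0, 0.0], "K": [0.0, 1.0]},
    2: {"MK": [1.0, 0.0], "KV": [0.0, 1.0]},
}
features = hybrid_vector("MKVL", toy_tables, dim=2)
```

The resulting `features` list has one block of `dim` values per word size and could be passed to any standard classifier.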
Affiliation(s)
| | - Nguyen-Quoc-Khanh Le
- Professional Master Program in Artificial Intelligence in Medicine, College of Medicine, Taipei Medical University, Taipei City, 106, Taiwan.,Research Center for Artificial Intelligence in Medicine, Taipei Medical University, Taipei City, 106, Taiwan
| | - Quang-Thai Ho
- Department of Computer Science and Engineering, Yuan Ze University, Taoyuan, 32003, Taiwan
| | - Dinh-Van Phan
- University of Economics, The University of Danang, Danang, 550000, Vietnam
| | - Yu-Yen Ou
- Department of Computer Science and Engineering, Yuan Ze University, Taoyuan, 32003, Taiwan.
|
29
|
Hussain W, Rasool N, Khan YD. Insights into Machine Learning-based Approaches for Virtual Screening in Drug Discovery: Existing Strategies and Streamlining Through FP-CADD. Curr Drug Discov Technol 2020; 18:463-472. [PMID: 32767944 DOI: 10.2174/1570163817666200806165934] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 04/03/2020] [Revised: 07/01/2020] [Accepted: 07/03/2020] [Indexed: 11/22/2022]
Abstract
BACKGROUND Machine learning is an active area of research in computer science, driven by the availability of big data collections of all sorts, which has prompted interest in the development of novel tools for data mining. Machine learning methods have wide applications in computer-aided drug discovery. Many machine learning approaches are used in drug design, where they further aid the process of biological modelling in drug discovery. There are two main categories, Ligand-Based Virtual Screening (LBVS) and Structure-Based Virtual Screening (SBVS); however, machine learning approaches fall mostly into the LBVS category. OBJECTIVES This study reviews the major machine learning approaches being used in LBVS. Moreover, we introduce a protocol named FP-CADD (the four protocols of computer-aided drug discovery), which describes a 4-step rule of thumb for drug discovery. Various important aspects, along with a SWOT analysis of FP-CADD, are also discussed in this article. CONCLUSION Through this study, we observed that among LBVS algorithms, Support Vector Machines (SVM) and Random Forest (RF) are widely used due to their high accuracy and efficiency. These virtual screening approaches have the potential to revolutionize the field of drug design. We also believe that the process flow presented in this study, named FP-CADD, can streamline the whole process of computer-aided drug discovery. By adopting this rule, studies related to drug discovery can be made homogeneous, and this protocol can also be considered as an evaluation criterion in the peer-review process of research articles.
Affiliation(s)
| | | | - Yaser Daanial Khan
- Department of Computer Science, University of Management and Technology, Lahore, Pakistan
|
30
|
Xu ZC, Feng PM, Yang H, Qiu WR, Chen W, Lin H. iRNAD: a computational tool for identifying D modification sites in RNA sequence. Bioinformatics 2020; 35:4922-4929. [PMID: 31077296 DOI: 10.1093/bioinformatics/btz358] [Citation(s) in RCA: 71] [Impact Index Per Article: 17.8] [Received: 11/28/2018] [Revised: 03/01/2019] [Accepted: 04/27/2019] [Indexed: 12/19/2022] Open
Abstract
MOTIVATION Dihydrouridine (D) is a common RNA post-transcriptional modification found in eukaryotes, bacteria and a few archaea. The modification can promote the conformational flexibility of individual nucleotide bases, and its levels are increased in cancerous tissues. Therefore, it is necessary to detect D in RNA to further understand its functional roles. Since wet-experimental techniques for this purpose are time-consuming and laborious, it is urgent to develop computational models to identify D modification sites in RNA. RESULTS We constructed a predictor, called iRNAD, for identifying D modification sites in RNA sequences. In this predictor, the RNA samples derived from five species were encoded by nucleotide chemical property and nucleotide density. A support vector machine was utilized to perform the classification. The final model achieved an overall accuracy of 96.18% with an area under the receiver operating characteristic curve of 0.9839 in the jackknife cross-validation test. Furthermore, we performed a series of validations from several aspects and demonstrated the robustness and reliability of the proposed model. AVAILABILITY AND IMPLEMENTATION A user-friendly web-server called iRNAD is freely accessible at http://lin-group.cn/server/iRNAD, which will provide convenience and guidance to users for further studying D modification.
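The encoding named in this abstract (nucleotide chemical property plus nucleotide density) can be sketched in Python as follows. The three-bit property codes used here (ring structure, functional group, hydrogen-bond strength) are the convention commonly seen in the RNA-modification literature and are an assumption, as is the cumulative-frequency definition of density and the function name `encode_ncp_nd`; the abstract itself does not give these details.

```python
from typing import Dict, List, Tuple

# Assumed chemical-property codes: (purine ring, amino group, weak hydrogen bond).
NCP: Dict[str, Tuple[int, int, int]] = {
    "A": (1, 1, 1),
    "C": (0, 1, 0),
    "G": (1, 0, 0),
    "U": (0, 0, 1),
}


def encode_ncp_nd(seq: str) -> List[float]:
    """Encode each position as 3 property bits plus a running density value.

    Density at position i (1-based) is the number of times the nucleotide at
    position i has occurred in seq[0:i], divided by i (assumed definition).
    Raises KeyError for characters outside A, C, G, U (T is mapped to U).
    """
    seq = seq.upper().replace("T", "U")
    seen: Dict[str, int] = {}
    vector: List[float] = []
    for i, nt in enumerate(seq, start=1):
        seen[nt] = seen.get(nt, 0) + 1
        vector.extend(float(bit) for bit in NCP[nt])
        vector.append(seen[nt] / i)
    return vector


encoded = encode_ncp_nd("AAC")  # 3 positions x 4 values = 12 features
```

A sequence window of length L therefore yields a 4L-dimensional vector that can be fed to a support vector machine or another classifier.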
Affiliation(s)
- Zhao-Chun Xu
- Computer Department, Jingdezhen Ceramic Institute, Jingdezhen, China.,Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
| | - Peng-Mian Feng
- Innovative Institute of Chinese Medicine and Pharmacy, Chengdu University of Traditional Chinese Medicine, Chengdu, China
| | - Hui Yang
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
| | - Wang-Ren Qiu
- Computer Department, Jingdezhen Ceramic Institute, Jingdezhen, China
| | - Wei Chen
- Innovative Institute of Chinese Medicine and Pharmacy, Chengdu University of Traditional Chinese Medicine, Chengdu, China
| | - Hao Lin
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
|
31
|
Beyene SS, Ling T, Ristevski B, Chen M. A novel riboswitch classification based on imbalanced sequences achieved by machine learning. PLoS Comput Biol 2020; 16:e1007760. [PMID: 32687488 PMCID: PMC7392346 DOI: 10.1371/journal.pcbi.1007760] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 02/26/2020] [Revised: 07/30/2020] [Accepted: 05/13/2020] [Indexed: 11/24/2022] Open
Abstract
Riboswitch, a part of regulatory mRNA (50-250nt in length), has two main classes: aptamer and expression platform. One of the main challenges raised during the classification of riboswitches is imbalanced data, that is, a circumstance in which the number of sequences in one group is very small compared to the others. Such circumstances lead a classifier to ignore the minority group and emphasize the majority ones, which results in a skewed classification. We considered sixteen riboswitch families that contain imbalanced sequences, to be in accord with recent riboswitch classification work. The sequences were split into training and test sets using a newly developed pipeline. From the 5460 k-mers (k value 1 to 6) produced, 156 features were selected based on the CfsSubsetEval and BestFirst functions found in WEKA 3.8. Statistical testing showed a significant difference between balanced and imbalanced sequences (p < 0.05). Besides, each algorithm also showed a significant difference in sensitivity, specificity, accuracy, and macro F-score when used in both groups (p < 0.05). Several k-mers clustered from the heat map were discovered to have biological functions and motifs at different positions such as interior loops, terminal loops and helices. They were validated to have a biological function, and some are riboswitch motifs. The analysis has revealed the importance of solving the challenges of majority bias and overfitting. The presented results are a generalized evaluation of both balanced and imbalanced models, which implies their ability to classify novel riboswitches. The Python source code is available at https://github.com/Seasonsling/riboswitch.
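The figure of 5460 k-mers quoted in this abstract matches the number of all possible k-mers over a four-letter alphabet for k = 1 to 6: 4 + 16 + 64 + 256 + 1024 + 4096 = 5460. The Python sketch below enumerates that vocabulary and computes a simple frequency vector for one sequence. It is an illustration only and is not taken from the authors' repository; the function names and the per-k normalization by the number of windows are assumptions.

```python
from collections import Counter
from itertools import product
from typing import List

ALPHABET = "ACGU"


def kmer_vocabulary(k_min: int = 1, k_max: int = 6) -> List[str]:
    """List every possible k-mer over ALPHABET for k_min <= k <= k_max."""
    return [
        "".join(letters)
        for k in range(k_min, k_max + 1)
        for letters in product(ALPHABET, repeat=k)
    ]


def kmer_frequencies(seq: str, k_min: int = 1, k_max: int = 6) -> List[float]:
    """Return one frequency per vocabulary k-mer (count / number of windows)."""
    seq = seq.upper().replace("T", "U")
    counts: Counter = Counter()
    for k in range(k_min, k_max + 1):
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    features = []
    for kmer in kmer_vocabulary(k_min, k_max):
        windows = len(seq) - len(kmer) + 1
        # Sequences shorter than k have no windows of that size.
        features.append(counts[kmer] / windows if windows > 0 else 0.0)
    return features


vocab = kmer_vocabulary()
feats = kmer_frequencies("ACGU")
```

A feature-selection step, such as the WEKA procedure mentioned in the abstract, would then reduce this 5460-dimensional vector to a smaller subset.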
Affiliation(s)
- Solomon Shiferaw Beyene
- Department of Bioinformatics, College of Life Sciences, Zhejiang University, Hangzhou, China
| | - Tianyi Ling
- Department of Bioinformatics, College of Life Sciences, Zhejiang University, Hangzhou, China
- School of Medicine, Zhejiang University, Hangzhou, Zhejiang, China
| | - Blagoj Ristevski
- Faculty of Information and Communication Technologies, Bitola, St. Kliment Ohridski University Bitola, ul. Partizanska Bitola, Republic of North Macedonia
| | - Ming Chen
- Department of Bioinformatics, College of Life Sciences, Zhejiang University, Hangzhou, China
|
32
|
Yue Z, Chu X, Xia J. PredCID: prediction of driver frameshift indels in human cancer. Brief Bioinform 2020; 22:5860690. [PMID: 32591774 DOI: 10.1093/bib/bbaa119] [Citation(s) in RCA: 20] [Impact Index Per Article: 5.0] [Received: 03/21/2020] [Revised: 05/14/2020] [Accepted: 05/16/2020] [Indexed: 11/12/2022] Open
Abstract
The discrimination of driver from passenger mutations has been a hot topic in the field of cancer biology. Although recent advances have improved the identification of driver mutations in cancer genomic research, there is no computational method specific for the cancer frameshift indels (insertions or/and deletions) yet. In addition, existing pathogenic frameshift indel predictors may suffer from plenty of missing values because of different choices of transcripts during the variant annotation processes. In this study, we proposed a computational model, called PredCID (Predictor for Cancer driver frameshift InDels), for accurately predicting cancer driver frameshift indels. Gene, DNA, transcript and protein level features are combined together and selected for classification with eXtreme Gradient Boosting classifier. Benchmarking results on the cross-validation dataset and independent dataset showed that PredCID achieves better and robust performance compared with existing noncancer-specific methods in distinguishing cancer driver frameshift indels from passengers and is therefore a valuable method for deeper understanding of frameshift indels in human cancer. PredCID is freely available for academic research at http://bioinfo.ahu.edu.cn:8080/PredCID.
Affiliation(s)
| | - Xinlu Chu
- Institutes of Physical Science and Information Technology, Anhui University
| | - Junfeng Xia
- Key Laboratory of Intelligent Computing and Signal Processing of Ministry of Education, Institutes of Physical Science and Information Technology, Anhui University
|
33
|
Chou KC. An Insightful 10-year Recollection Since the Emergence of the 5-steps Rule. Curr Pharm Des 2020; 25:4223-4234. [PMID: 31782354 DOI: 10.2174/1381612825666191129164042] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 09/18/2019] [Accepted: 11/25/2019] [Indexed: 11/22/2022]
Abstract
OBJECTIVE One of the most challenging and also the most difficult problems is how to formulate a biological sequence with a vector but considerably keep its sequence order information. METHODS To address such a problem, the approach of Pseudo Amino Acid Components or PseAAC has been developed. RESULTS AND CONCLUSION It has become increasingly clear via the 10-year recollection that the aforementioned proposal has been indeed very powerful.
Affiliation(s)
- Kuo-Chen Chou
- Gordon Life Science Institute, Boston, Massachusetts 02478, United States.,Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
|
34
|
|
35
|
Li F, Fan C, Marquez-Lago TT, Leier A, Revote J, Jia C, Zhu Y, Smith AI, Webb GI, Liu Q, Wei L, Li J, Song J. PRISMOID: a comprehensive 3D structure database for post-translational modifications and mutations with functional impact. Brief Bioinform 2020; 21:1069-1079. [PMID: 31161204 PMCID: PMC7299293 DOI: 10.1093/bib/bbz050] [Citation(s) in RCA: 25] [Impact Index Per Article: 6.3] [Received: 02/21/2019] [Revised: 03/26/2019] [Accepted: 03/29/2019] [Indexed: 12/26/2022] Open
Abstract
Post-translational modifications (PTMs) play very important roles in various cell signaling pathways and biological process. Due to PTMs' extremely important roles, many major PTMs have been studied, while the functional and mechanical characterization of major PTMs is well documented in several databases. However, most currently available databases mainly focus on protein sequences, while the real 3D structures of PTMs have been largely ignored. Therefore, studies of PTMs 3D structural signatures have been severely limited by the deficiency of the data. Here, we develop PRISMOID, a novel publicly available and free 3D structure database for a wide range of PTMs. PRISMOID represents an up-to-date and interactive online knowledge base with specific focus on 3D structural contexts of PTMs sites and mutations that occur on PTMs and in the close proximity of PTM sites with functional impact. The first version of PRISMOID encompasses 17 145 non-redundant modification sites on 3919 related protein 3D structure entries pertaining to 37 different types of PTMs. Our entry web page is organized in a comprehensive manner, including detailed PTM annotation on the 3D structure and biological information in terms of mutations affecting PTMs, secondary structure features and per-residue solvent accessibility features of PTM sites, domain context, predicted natively disordered regions and sequence alignments. In addition, high-definition JavaScript packages are employed to enhance information visualization in PRISMOID. PRISMOID equips a variety of interactive and customizable search options and data browsing functions; these capabilities allow users to access data via keyword, ID and advanced options combination search in an efficient and user-friendly way. A download page is also provided to enable users to download the SQL file, computational structural features and PTM sites' data. 
We anticipate PRISMOID will swiftly become an invaluable online resource, assisting both biologists and bioinformaticians to conduct experiments and develop applications supporting discovery efforts in the sequence-structural-functional relationship of PTMs and providing important insight into mutations and PTM sites interaction mechanisms. The PRISMOID database is freely accessible at http://prismoid.erc.monash.edu/. The database and web interface are implemented in MySQL, JSP, JavaScript and HTML with all major browsers supported.
Affiliation(s)
- Fuyi Li
- Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Victoria, Australia
- Monash Centre for Data Science, Faculty of Information Technology, Monash University, Melbourne, VIC, Australia
| | - Cunshuo Fan
- College of Information Engineering, Northwest A&F University, Yangling, China
| | - Tatiana T Marquez-Lago
- Department of Genetics and Department of Cell, Developmental and Integrative Biology, School of Medicine, University of Alabama at Birmingham, AL, USA
| | - André Leier
- Department of Genetics and Department of Cell, Developmental and Integrative Biology, School of Medicine, University of Alabama at Birmingham, AL, USA
| | - Jerico Revote
- Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Victoria, Australia
| | - Cangzhi Jia
- College of Science, Dalian Maritime University, Dalian, China
- School of Computer Science and Engineering, Nanyang Technological University, Singapore, Singapore
| | - Yan Zhu
- Biomedicine Discovery Institute and Department of Microbiology, Monash University, Melbourne, Victoria, Australia
| | - A Ian Smith
- Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Victoria, Australia
| | - Geoffrey I Webb
- Monash Centre for Data Science, Faculty of Information Technology, Monash University, Melbourne, VIC, Australia
| | - Quanzhong Liu
- College of Information Engineering, Northwest A&F University, Yangling, China
| | - Leyi Wei
- School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, China
| | - Jian Li
- Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Victoria, Australia
| | - Jiangning Song
- Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Victoria, Australia
- Monash Centre for Data Science, Faculty of Information Technology, Monash University, Melbourne, VIC, Australia
|
36
|
Contreras-Torres E, Marrero-Ponce Y, Terán JE, García-Jacas CR, Brizuela CA, Sánchez-Rodríguez JC. MuLiMs-MCoMPAs: A Novel Multiplatform Framework to Compute Tensor Algebra-Based Three-Dimensional Protein Descriptors. J Chem Inf Model 2020; 60:1042-1059. [PMID: 31663741 DOI: 10.1021/acs.jcim.9b00629] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Indexed: 01/08/2023]
Abstract
This report introduces the MuLiMs-MCoMPAs software (acronym for Multi-Linear Maps based on N-Metric and Contact Matrices of 3D Protein and Amino-acid weightings), designed to compute tensor-based 3D protein structural descriptors by applying two- and three-linear algebraic forms. Moreover, these descriptors contemplate generalizing components such as novel 3D protein structural representations, (dis)similarity metrics, and multimetrics to extract geometrical related information between two and three amino acids, weighting schemes based on amino acid properties, matrix normalization procedures that consider simple-stochastic and mutual probability transformations, topological and geometrical cutoffs, amino acid, and group-based MD calculations, and aggregation operators for merging amino acidic and group MDs. The MuLiMs-MCoMPAs software, which belongs to the ToMoCoMD-CAMPS suite, was developed in Java (version 1.8) using the Chemistry Development Kit (CDK) (version 1.4.19) and the Jmol libraries. This software implemented a divide-and-conquer strategy to parallelize the computation of the indices as well as modules for data preprocessing and batch computing functionalities. Furthermore, it consists of two components: (i) a desktop-graphical user interface (GUI) and (ii) an API library. The relevance of this novel approach is demonstrated through two analyses that considered Shannon's entropy-based variability and a principal component analysis. These studies showed that the MuLiMs-MCoMPAs' three-linear descriptor family contains higher informational entropy than several other descriptors generated with available computation tools. Moreover, the MuLiMs-MCoMPAs indices capture additional orthogonal information to the one codified by the available calculation approaches. As a result, two sets of suggested theoretical configurations that contain 13648 two-linear indices and 20263 three-linear indices are available for download at tomocomd.com . 
Furthermore, as a demonstration of the applicability and easy integration of the MuLiMs library into a QSAR-based expert system, a software application (ProStAF) was generated to predict SCOP protein structural classes and folding rate. It can thus be anticipated that the MuLiMs-MCoMPAs framework will turn into a valuable contribution to the chem- and bioinformatics research fields.
Affiliation(s)
- Ernesto Contreras-Torres
- Computer-Aided Molecular "Biosilico" Discovery and Bioinformatics Research International Network (CAMD-BIR IN) , Cumbayá, Quito , Ecuador.,Grupo de Medicina Molecular y Traslacional (MeM&T), Colegio de Ciencias de la Salud (COCSA), Escuela de Medicina, Edificio de Especialidades Médicas; and Instituto de Simulación Computacional (ISC-USFQ) , Universidad San Francisco de Quito (USFQ) , Diego de Robles y vía Interoceánica , Quito 170157 , Pichincha , Ecuador
| | - Yovani Marrero-Ponce
- Grupo de Medicina Molecular y Traslacional (MeM&T), Colegio de Ciencias de la Salud (COCSA), Escuela de Medicina, Edificio de Especialidades Médicas; and Instituto de Simulación Computacional (ISC-USFQ) , Universidad San Francisco de Quito (USFQ) , Diego de Robles y vía Interoceánica , Quito 170157 , Pichincha , Ecuador.,Grupo GINUMED, Facultad de Salud, Programa de Medicina , Corporacion Universitaria Rafal Nuñez , Cartagena , Colombia.,Unidad de Investigación de Diseño de Fármacos y Conectividad Molecular, Departamento de Química Física, Facultad de Farmacia , Universitat de València , 46010 Valéncia , Spain
| | - Julio E Terán
- Grupo de Medicina Molecular y Traslacional (MeM&T), Colegio de Ciencias de la Salud (COCSA), Escuela de Medicina, Edificio de Especialidades Médicas; and Instituto de Simulación Computacional (ISC-USFQ) , Universidad San Francisco de Quito (USFQ) , Diego de Robles y vía Interoceánica , Quito 170157 , Pichincha , Ecuador.,Grupo de Química Computacional y Teórica, Departamento de Ingeniería Química , Universidad San Francisco de Quito (USFQ) , Diego de Robles y vía Interoceánica , Quito 170157 , Pichincha Ecuador
| | - César R García-Jacas
- Cátedras Conacyt-Departamento de Ciencias de la Computación , Centro de Investigación Científica y de Educación Superior de Ensenada (CICESE) , Ensenada , Baja California , México
| | - Carlos A Brizuela
- Departamento de Ciencias de la Computación , Centro de Investigación Científica y de Educación Superior de Ensenada (CICESE) , Ensenada , Baja California , México
|
37
|
Wei L, Luan S, Nagai LAE, Su R, Zou Q. Exploring sequence-based features for the improved prediction of DNA N4-methylcytosine sites in multiple species. Bioinformatics 2020; 35:1326-1333. [PMID: 30239627 DOI: 10.1093/bioinformatics/bty824] [Citation(s) in RCA: 123] [Impact Index Per Article: 30.8] [Received: 08/13/2018] [Revised: 09/12/2018] [Accepted: 09/18/2018] [Indexed: 12/20/2022] Open
Abstract
MOTIVATION As one of the important epigenetic modifications, DNA N4-methylcytosine (4mC) has recently been shown to play crucial roles in restriction-modification systems. For a better understanding of its functional mechanisms, it is fundamentally important to identify 4mC modification. Machine learning methods have recently emerged as an effective and efficient approach for the high-throughput identification of 4mC sites, although high predictive error rates are still challenging for existing methods. Therefore, it is highly desirable to develop a computational method to more accurately identify 4mC sites. RESULTS In this study, we propose a machine learning based predictor, namely 4mcPred-SVM, for the genome-wide detection of DNA 4mC sites. In this predictor, we present a new feature representation algorithm that sufficiently exploits sequence-based information. To improve the feature representation ability, we use a two-step feature optimization strategy, thereby obtaining the most representative features. Using the resulting features and Support Vector Machine (SVM), we adaptively train the optimal models for different species. Comparative results on benchmark datasets from six species indicate that our predictor is able to achieve generally better performance in predicting 4mC sites as compared to the state-of-the-art predictors. Importantly, the sequence-based features can reliably and robustly predict 4mC sites, facilitating the discovery of potentially important sequence characteristics for the prediction of 4mC sites. AVAILABILITY AND IMPLEMENTATION The user-friendly webserver that implements the proposed 4mcPred-SVM is well established, and is freely accessible at http://server.malab.cn/4mcPred-SVM. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Leyi Wei
- School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, China
| | - Shasha Luan
- School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, China
| | - Luis Augusto Eijy Nagai
- Lab of Functional Analysis In Silico, Institute of Medical Science, University of Tokyo, Tokyo, Japan
| | - Ran Su
- School of Computer Software, College of Intelligence and Computing, Tianjin University, Tianjin, China
| | - Quan Zou
- School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, China.,Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
|
38
|
Ahmad A, Lin H, Shatabda S. Locate-R: Subcellular localization of long non-coding RNAs using nucleotide compositions. Genomics 2020; 112:2583-2589. [PMID: 32068122 DOI: 10.1016/j.ygeno.2020.02.011] [Citation(s) in RCA: 23] [Impact Index Per Article: 5.8] [Received: 07/04/2019] [Revised: 11/11/2019] [Accepted: 02/12/2020] [Indexed: 12/12/2022]
Abstract
Knowledge of the sub-cellular localization of the most diverse class of transcribed RNA, long non-coding RNAs (lncRNAs), will help us to identify different types of cancers and other diseases, as lncRNAs play a key role in related cellular functions. With the recent exponential growth of known records, it has become essential to establish new machine learning based techniques to identify new ones, because they provide faster and cheaper solutions than laboratory methods. In this paper, we propose Locate-R, a novel method for predicting the sub-cellular location of lncRNAs. We used only n-gapped l-mer composition and l-mer composition as features and selected the best 655 features to build the model. This model is based on locally deep support vector machines, which significantly enhance the prediction accuracy with respect to existing state-of-the-art methods. Our predictor is readily available for use as a stand-alone web application from: http://locate-r.azurewebsites.net/.
Affiliation(s)
- Ahsan Ahmad
- Department of Computer Science and Engineering, United International University, Plot 2, United City, Madani Avenue, Satarkul, Badda, Dhaka 1212, Bangladesh
| | - Hao Lin
- School of Life Science and Technology, University of Electronic Science and Technology of China, China
| | - Swakkhar Shatabda
- Department of Computer Science and Engineering, United International University, Plot 2, United City, Madani Avenue, Satarkul, Badda, Dhaka 1212, Bangladesh.
|
39
|
Li F, Leier A, Liu Q, Wang Y, Xiang D, Akutsu T, Webb GI, Smith AI, Marquez-Lago T, Li J, Song J. Procleave: Predicting Protease-specific Substrate Cleavage Sites by Combining Sequence and Structural Information. GENOMICS, PROTEOMICS & BIOINFORMATICS 2020; 18:52-64. [PMID: 32413515 PMCID: PMC7393547 DOI: 10.1016/j.gpb.2019.08.002] [Citation(s) in RCA: 53] [Impact Index Per Article: 13.3] [Received: 04/30/2019] [Revised: 08/08/2019] [Accepted: 10/23/2019] [Indexed: 10/29/2022]
Abstract
Proteases are enzymes that cleave and hydrolyse the peptide bonds between two specific amino acid residues of target substrate proteins. Protease-controlled proteolysis plays a key role in the degradation and recycling of proteins, which is essential for various physiological processes. Thus, solving the substrate identification problem will have important implications for the precise understanding of functions and physiological roles of proteases, as well as for therapeutic target identification and pharmaceutical applicability. Consequently, there is a great demand for bioinformatics methods that can predict novel substrate cleavage events with high accuracy by utilizing both sequence and structural information. In this study, we present Procleave, a novel bioinformatics approach for predicting protease-specific substrates and specific cleavage sites by taking into account both their sequence and 3D structural information. Structural features of known cleavage sites were represented by discrete values using a LOWESS data-smoothing optimization method, which turned out to be critical for the performance of Procleave. The optimal approximations of all structural parameter values were encoded in a conditional random field (CRF) computational framework, alongside sequence and chemical group-based features. Here, we demonstrate the outstanding performance of Procleave through extensive benchmarking and independent tests. Procleave is capable of correctly identifying most cleavage sites in the case study. Importantly, when applied to the human structural proteome encompassing 17,628 protein structures, Procleave suggests a number of potential novel target substrates and their corresponding cleavage sites of different proteases. Procleave is implemented as a webserver and is freely accessible at http://procleave.erc.monash.edu/.
Affiliation(s)
- Fuyi Li
- Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia; Monash Centre for Data Science, Faculty of Information Technology, Monash University, Melbourne, VIC 3800, Australia
| | - Andre Leier
- School of Medicine, University of Alabama at Birmingham, Birmingham, AL 35233, USA
| | - Quanzhong Liu
- College of Information Engineering, Northwest A&F University, Yangling 712100, China
| | - Yanan Wang
- Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia; Monash Centre for Data Science, Faculty of Information Technology, Monash University, Melbourne, VIC 3800, Australia
| | - Dongxu Xiang
- Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia; College of Information Engineering, Northwest A&F University, Yangling 712100, China
| | - Tatsuya Akutsu
- Bioinformatics Center, Institute for Chemical Research, Kyoto University, Uji, Kyoto 611-0011, Japan
| | - Geoffrey I Webb
- Monash Centre for Data Science, Faculty of Information Technology, Monash University, Melbourne, VIC 3800, Australia
| | - A Ian Smith
- Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia; ARC Centre of Excellence in Advanced Molecular Imaging, Monash University, Melbourne, VIC 3800, Australia
| | - Tatiana Marquez-Lago
- School of Medicine, University of Alabama at Birmingham, Birmingham, AL 35233, USA.
| | - Jian Li
- Biomedicine Discovery Institute and Department of Microbiology, Monash University, Melbourne, VIC 3800, Australia.
| | - Jiangning Song
- Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia; Monash Centre for Data Science, Faculty of Information Technology, Monash University, Melbourne, VIC 3800, Australia; ARC Centre of Excellence in Advanced Molecular Imaging, Monash University, Melbourne, VIC 3800, Australia.
| |
|
40
|
Ribeiro AJM, Tyzack JD, Borkakoti N, Holliday GL, Thornton JM. A global analysis of function and conservation of catalytic residues in enzymes. J Biol Chem 2019; 295:314-324. [PMID: 31796628 DOI: 10.1074/jbc.rev119.006289] [Citation(s) in RCA: 46] [Impact Index Per Article: 9.2] [Indexed: 11/06/2022] Open
Abstract
The catalytic residues of an enzyme comprise the amino acids located in the active center responsible for accelerating the enzyme-catalyzed reaction. These residues lower the activation energy of reactions by performing several catalytic functions. Decades of enzymology research have established general themes regarding the roles of specific residues in these catalytic reactions, but it has been more difficult to explore these roles in a more systematic way. Here, we review the data on the catalytic residues of 648 enzymes, as annotated in the Mechanism and Catalytic Site Atlas (M-CSA), and compare our results with those in previous studies. We structured this analysis around three key properties of the catalytic residues: amino acid type, catalytic function, and sequence conservation in homologous proteins. As expected, we observed that catalysis is mostly accomplished by a small set of residues performing a limited number of catalytic functions. Catalytic residues are typically highly conserved, but to a smaller degree in homologues that perform different reactions or are nonenzymes (pseudoenzymes). Cross-analysis yielded further insights revealing which residues perform particular functions and how often. We obtained more detailed specificity rules for certain functions by identifying the chemical group upon which the residue acts. Finally, we show the mutation tolerance of the catalytic residues based on their roles. The characterization of the catalytic residues, their functions, and conservation, as presented here, is key to understanding the impact of mutations in evolution, disease, and enzyme design. The tools developed for this analysis are available at the M-CSA website and allow for user-specific analysis of the same data.
Affiliation(s)
- António J M Ribeiro
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom.
| | - Jonathan D Tyzack
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom
| | - Neera Borkakoti
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom
| | - Gemma L Holliday
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom
| | - Janet M Thornton
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom
| |
|
41
|
pLoc_bal-mHum: Predict subcellular localization of human proteins by PseAAC and quasi-balancing training dataset. Genomics 2019; 111:1274-1282. [DOI: 10.1016/j.ygeno.2018.08.007] [Citation(s) in RCA: 56] [Impact Index Per Article: 11.2] [Received: 05/03/2018] [Revised: 08/14/2018] [Accepted: 08/16/2018] [Indexed: 12/17/2022]
|
42
|
Chou KC. Impacts of Pseudo Amino Acid Components and 5-steps Rule to Proteomics and Proteome Analysis. Curr Top Med Chem 2019; 19:2283-2300. [DOI: 10.2174/1568026619666191018100141] [Citation(s) in RCA: 24] [Impact Index Per Article: 4.8] [Received: 08/07/2019] [Revised: 08/18/2019] [Accepted: 08/26/2019] [Indexed: 01/27/2023]
Abstract
Stimulated by the 5-steps rule during the last decade or so, computational proteomics has achieved remarkable progress in the following three areas: (1) protein structural class prediction; (2) protein subcellular location prediction; (3) post-translational modification (PTM) site prediction. The results obtained by these predictions are very useful not only for an in-depth study of the functions of proteins and their biological processes in a cell, but also for developing novel drugs against major diseases such as cancers, Alzheimer’s, and Parkinson’s. Moreover, since the targets to be predicted may have the multi-label feature, two sets of metrics are introduced: one is for inspecting the global prediction quality, while the other is for the local prediction quality. All the predictors covered in this review have a user-friendly web-server, through which the majority of experimental scientists can easily obtain their desired data without the need to go through the complicated mathematics.
Affiliation(s)
- Kuo-Chen Chou
- Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, 610054, China
| |
|
43
|
Li F, Wang Y, Li C, Marquez-Lago TT, Leier A, Rawlings ND, Haffari G, Revote J, Akutsu T, Chou KC, Purcell AW, Pike RN, Webb GI, Ian Smith A, Lithgow T, Daly RJ, Whisstock JC, Song J. Twenty years of bioinformatics research for protease-specific substrate and cleavage site prediction: a comprehensive revisit and benchmarking of existing methods. Brief Bioinform 2019; 20:2150-2166. [PMID: 30184176 PMCID: PMC6954447 DOI: 10.1093/bib/bby077] [Citation(s) in RCA: 58] [Impact Index Per Article: 11.6] [Received: 06/14/2018] [Revised: 07/26/2018] [Accepted: 08/01/2018] [Indexed: 01/06/2023] Open
Abstract
The roles of proteolytic cleavage have been intensively investigated and discussed during the past two decades. This irreversible chemical process has been frequently reported to influence a number of crucial biological processes (BPs), such as cell cycle, protein regulation and inflammation. A number of advanced studies have been published aiming at deciphering the mechanisms of proteolytic cleavage. Given its significance and the large number of functionally enriched substrates targeted by specific proteases, many computational approaches have been established for accurate prediction of protease-specific substrates and their cleavage sites. Consequently, there is an urgent need to systematically assess the state-of-the-art computational approaches for protease-specific cleavage site prediction to further advance the existing methodologies and to improve the prediction performance. With this goal in mind, in this article, we carefully evaluated a total of 19 computational methods (including 8 scoring function-based methods and 11 machine learning-based methods) in terms of their underlying algorithm, calculated features, performance evaluation and software usability. Then, extensive independent tests were performed to assess the robustness and scalability of the reviewed methods using our carefully prepared independent test data sets with 3641 cleavage sites (specific to 10 proteases). The comparative experimental results demonstrate that PROSPERous is the most accurate generic method for predicting eight protease-specific cleavage sites, while GPS-CCD and LabCaS outperformed other predictors for calpain-specific cleavage sites. Based on our review, we then outlined some potential ways to improve the prediction performance and ease the computational burden by applying ensemble learning, deep learning, positive unlabeled learning and parallel and distributed computing techniques. 
We anticipate that our study will serve as a practical and useful guide for interested readers to further advance next-generation bioinformatics tools for protease-specific cleavage site prediction.
Affiliation(s)
- Fuyi Li
- Biomedicine Discovery Institute and Department of Biochemistry & Molecular Biology, Monash University, Melbourne, VIC 3800, Australia
| | - Yanan Wang
- Biomedicine Discovery Institute and Department of Biochemistry & Molecular Biology, Monash University, Melbourne, VIC 3800, Australia
- Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, Shanghai 200240, China
| | - Chen Li
- Biomedicine Discovery Institute and Department of Biochemistry & Molecular Biology, Monash University, Melbourne, VIC 3800, Australia
- Department of Biology, Institute of Molecular Systems Biology, ETH Zürich, Zürich 8093, Switzerland
| | - Tatiana T Marquez-Lago
- Department of Genetics and Department of Cell, Developmental and Integrative Biology, School of Medicine, University of Alabama at Birmingham, AL, USA
| | - André Leier
- Department of Genetics and Department of Cell, Developmental and Integrative Biology, School of Medicine, University of Alabama at Birmingham, AL, USA
| | - Neil D Rawlings
- EMBL European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK
| | - Gholamreza Haffari
- Monash Centre for Data Science, Faculty of Information Technology, Monash University, Melbourne, VIC 3800, Australia
| | - Jerico Revote
- Biomedicine Discovery Institute and Department of Biochemistry & Molecular Biology, Monash University, Melbourne, VIC 3800, Australia
| | - Tatsuya Akutsu
- Bioinformatics Center, Institute for Chemical Research, Kyoto University, Kyoto 611-0011, Japan
| | - Kuo-Chen Chou
- Gordon Life Science Institute, Boston, MA 02478, USA
- Center for Informational Biology, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Anthony W Purcell
- Biomedicine Discovery Institute and Department of Biochemistry & Molecular Biology, Monash University, Melbourne, VIC 3800, Australia
| | - Robert N Pike
- La Trobe Institute for Molecular Science, La Trobe University, Melbourne, VIC 3086, Australia
- ARC Centre of Excellence in Advanced Molecular Imaging, Monash University, Melbourne, VIC 3800, Australia
| | - Geoffrey I Webb
- Monash Centre for Data Science, Faculty of Information Technology, Monash University, Melbourne, VIC 3800, Australia
| | - A Ian Smith
- Biomedicine Discovery Institute and Department of Biochemistry & Molecular Biology, Monash University, Melbourne, VIC 3800, Australia
- ARC Centre of Excellence in Advanced Molecular Imaging, Monash University, Melbourne, VIC 3800, Australia
| | - Trevor Lithgow
- Biomedicine Discovery Institute and Department of Microbiology, Monash University, Melbourne, Victoria 3800, Australia
| | - Roger J Daly
- Biomedicine Discovery Institute and Department of Biochemistry & Molecular Biology, Monash University, Melbourne, VIC 3800, Australia
| | - James C Whisstock
- Biomedicine Discovery Institute and Department of Biochemistry & Molecular Biology, Monash University, Melbourne, VIC 3800, Australia
- ARC Centre of Excellence in Advanced Molecular Imaging, Monash University, Melbourne, VIC 3800, Australia
| | - Jiangning Song
- Biomedicine Discovery Institute and Department of Biochemistry & Molecular Biology, Monash University, Melbourne, VIC 3800, Australia
- Monash Centre for Data Science, Faculty of Information Technology, Monash University, Melbourne, VIC 3800, Australia
- ARC Centre of Excellence in Advanced Molecular Imaging, Monash University, Melbourne, VIC 3800, Australia
| |
|
44
|
Zhang Y, Xie R, Wang J, Leier A, Marquez-Lago TT, Akutsu T, Webb GI, Chou KC, Song J. Computational analysis and prediction of lysine malonylation sites by exploiting informative features in an integrative machine-learning framework. Brief Bioinform 2019; 20:2185-2199. [PMID: 30351377 PMCID: PMC6954445 DOI: 10.1093/bib/bby079] [Citation(s) in RCA: 63] [Impact Index Per Article: 12.6] [Received: 06/29/2018] [Revised: 07/28/2018] [Accepted: 08/01/2018] [Indexed: 11/15/2022] Open
Abstract
As a newly discovered post-translational modification (PTM), lysine malonylation (Kmal) regulates a myriad of cellular processes from prokaryotes to eukaryotes and has important implications in human diseases. Despite its functional significance, computational methods to accurately identify malonylation sites are still lacking and urgently needed. In particular, there is currently no comprehensive analysis and assessment of different features and machine learning (ML) methods that are required for constructing the necessary prediction models. Here, we review, analyze and compare 11 different feature encoding methods, with the goal of extracting key patterns and characteristics from residue sequences of Kmal sites. We identify optimized feature sets, with which four commonly used ML methods (random forest, support vector machines, K-nearest neighbor and logistic regression) and one recently proposed [Light Gradient Boosting Machine (LightGBM)] are trained on data from three species, namely, Escherichia coli, Mus musculus and Homo sapiens, and compared using randomized 10-fold cross-validation tests. We show that integration of the single method-based models through ensemble learning further improves the prediction performance and model robustness on the independent test. When compared to the existing state-of-the-art predictor, MaloPred, the optimal ensemble models were more accurate for all three species (AUC: 0.930, 0.923 and 0.944 for E. coli, M. musculus and H. sapiens, respectively). Using the ensemble models, we developed an accessible online predictor, kmal-sp, available at http://kmalsp.erc.monash.edu/. 
We hope that this comprehensive survey and the proposed strategy for building more accurate models can serve as a useful guide for inspiring future developments of computational methods for PTM site prediction, expedite the discovery of new malonylation and other PTM types and facilitate hypothesis-driven experimental validation of novel malonylated substrates and malonylation sites.
Affiliation(s)
- Yanju Zhang
- School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin 541004, China
| | - Ruopeng Xie
- School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin 541004, China
| | - Jiawei Wang
- Infection and Immunity Program, Biomedicine Discovery Institute and Department of Microbiology, Monash University, VIC 3800, Australia
| | - André Leier
- Department of Genetics, School of Medicine, University of Alabama at Birmingham, AL, USA
- Department of Cell, Developmental and Integrative Biology, School of Medicine, University of Alabama at Birmingham, AL, USA
| | - Tatiana T Marquez-Lago
- Department of Genetics, School of Medicine, University of Alabama at Birmingham, AL, USA
- Department of Cell, Developmental and Integrative Biology, School of Medicine, University of Alabama at Birmingham, AL, USA
| | - Tatsuya Akutsu
- Bioinformatics Center, Institute for Chemical Research, Kyoto University, Kyoto 611-0011, Japan
| | - Geoffrey I Webb
- Monash Centre for Data Science, Faculty of Information Technology, Monash University, VIC 3800, Australia
| | - Kuo-Chen Chou
- Gordon Life Science Institute, Boston, MA 02478, USA
- Center for Informational Biology, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Jiangning Song
- Monash Centre for Data Science, Faculty of Information Technology, Monash University, VIC 3800, Australia
- Infection and Immunity Program, Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, VIC 3800, Australia
- ARC Centre of Excellence in Advanced Molecular Imaging, Monash University, VIC 3800, Australia
| |
|
45
|
Chou KC. Advances in Predicting Subcellular Localization of Multi-label Proteins and its Implication for Developing Multi-target Drugs. Curr Med Chem 2019; 26:4918-4943. [PMID: 31060481 DOI: 10.2174/0929867326666190507082559] [Citation(s) in RCA: 78] [Impact Index Per Article: 15.6] [Received: 09/28/2018] [Revised: 01/29/2019] [Accepted: 01/31/2019] [Indexed: 12/16/2022]
Abstract
The smallest unit of life is a cell, which contains numerous protein molecules. Most of the functions critical to the cell’s survival are performed by these proteins located in its different organelles, usually called “subcellular locations”. Information on the subcellular localization of a protein can provide useful clues about its function. To reveal the intricate pathways at the cellular level, knowledge of the subcellular localization of proteins in a cell is a prerequisite. Therefore, one of the fundamental goals in molecular cell biology and proteomics is to determine the subcellular locations of proteins in an entire cell. It is also indispensable for prioritizing and selecting the right targets for drug development. Unfortunately, it is both time-consuming and costly to determine the subcellular locations of proteins purely based on experiments. With the avalanche of protein sequences generated in the post-genomic age, it is highly desirable to develop computational methods for rapidly and effectively identifying the subcellular locations of uncharacterized proteins based on their sequence information alone. Considerable progress has been achieved in this regard. This review focuses on those methods that have the capacity to deal with multi-label proteins, which may simultaneously exist in two or more subcellular location sites. Protein molecules with this kind of characteristic are vitally important for finding multi-target drugs, a current hot trend in drug development. The review also focuses on those methods that have user-friendly web-servers established so that the majority of experimental scientists can use them to get the desired results without the need to go through the detailed mathematics involved.
Affiliation(s)
- Kuo-Chen Chou
- Gordon Life Science Institute, Boston, MA 02478, United States
| |
|
46
|
Su ZD, Huang Y, Zhang ZY, Zhao YW, Wang D, Chen W, Chou KC, Lin H. iLoc-lncRNA: predict the subcellular location of lncRNAs by incorporating octamer composition into general PseKNC. Bioinformatics 2019; 34:4196-4204. [PMID: 29931187 DOI: 10.1093/bioinformatics/bty508] [Citation(s) in RCA: 124] [Impact Index Per Article: 24.8] [Received: 05/02/2018] [Accepted: 06/19/2018] [Indexed: 12/20/2022] Open
Abstract
Motivation: Long non-coding RNAs (lncRNAs) are a class of RNA molecules with more than 200 nucleotides. They have important functions in cell development and metabolism, such as genetic markers, genome rearrangements, chromatin modifications, cell cycle regulation, transcription and translation. Their functions are generally closely related to their localization in the cell. Therefore, knowledge about their subcellular locations can provide very useful clues or preliminary insight into their biological functions. Although biochemical experiments could determine the localization of lncRNAs in a cell, they are both time-consuming and expensive. Therefore, it is highly desirable to develop bioinformatics tools for fast and effective identification of their subcellular locations. Results: We developed a sequence-based bioinformatics tool called 'iLoc-lncRNA' to predict the subcellular locations of lncRNAs by incorporating the 8-tuple nucleotide features into the general PseKNC (Pseudo K-tuple Nucleotide Composition) via the binomial distribution approach. Rigorous jackknife tests have shown that the overall accuracy achieved by the new predictor on a stringent benchmark dataset is 86.72%, which is over 20% higher than that by the existing state-of-the-art predictor evaluated on the same tests. Availability and implementation: A user-friendly webserver has been established at http://lin-group.cn/server/iLoc-LncRNA, by which users can easily obtain their desired results. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Zhen-Dong Su
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
| | - Yan Huang
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
| | - Zhao-Yue Zhang
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
| | - Ya-Wei Zhao
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
| | - Dong Wang
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China.,College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
| | - Wei Chen
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China.,Department of Physics, School of Sciences, and Center for Genomics and Computational Biology, North China University of Science and Technology, Tangshan, China.,Gordon Life Science Institute, Boston, MA, USA
| | - Kuo-Chen Chou
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China.,Gordon Life Science Institute, Boston, MA, USA
| | - Hao Lin
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China.,Gordon Life Science Institute, Boston, MA, USA
| |
|
48
|
Liu B, Li K, Huang DS, Chou KC. iEnhancer-EL: identifying enhancers and their strength with ensemble learning approach. Bioinformatics 2019; 34:3835-3842. [PMID: 29878118 DOI: 10.1093/bioinformatics/bty458] [Citation(s) in RCA: 130] [Impact Index Per Article: 26.0] [Received: 01/23/2018] [Accepted: 06/06/2018] [Indexed: 11/14/2022] Open
Abstract
Motivation: Identification of enhancers and their strength is important because they play a critical role in controlling gene expression. Although some bioinformatics tools have been developed, they are limited to discriminating enhancers from non-enhancers. Recently, a two-layer predictor called 'iEnhancer-2L' was developed that can be used to predict the enhancer's strength as well. However, its prediction quality needs further improvement to enhance the practical application value. Results: A new predictor called 'iEnhancer-EL' was proposed that contains two layer predictors: the first one (for identifying enhancers) is formed by fusing an array of six key individual classifiers, and the second one (for their strength) is formed by fusing an array of ten key individual classifiers. All these key classifiers were selected from 171 elementary classifiers formed by SVM (Support Vector Machine) based on kmer, subsequence profile and PseKNC (Pseudo K-tuple Nucleotide Composition), respectively. Rigorous cross-validations have indicated that the proposed predictor is remarkably superior to the existing state-of-the-art one in this area. Availability and implementation: A web server for the iEnhancer-EL has been established at http://bioinformatics.hitsz.edu.cn/iEnhancer-EL/, by which users can easily get their desired results without the need to go through the mathematical details. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Bin Liu
- School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong, China.,Gordon Life Science Institute, Belmont, MA, USA
| | - Kai Li
- School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong, China
| | - De-Shuang Huang
- Institute of Machine Learning and Systems Biology, School of Electronics and Information Engineering, Tongji University, Shanghai, China
| | - Kuo-Chen Chou
- Gordon Life Science Institute, Belmont, MA, USA.,Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
| |
|
49
|
Li F, Li C, Marquez-Lago TT, Leier A, Akutsu T, Purcell AW, Ian Smith A, Lithgow T, Daly RJ, Song J, Chou KC. Quokka: a comprehensive tool for rapid and accurate prediction of kinase family-specific phosphorylation sites in the human proteome. Bioinformatics 2019; 34:4223-4231. [PMID: 29947803 DOI: 10.1093/bioinformatics/bty522] [Citation(s) in RCA: 120] [Impact Index Per Article: 24.0] [Received: 02/12/2018] [Accepted: 06/26/2018] [Indexed: 01/28/2023] Open
Abstract
Motivation: Kinase-regulated phosphorylation is a ubiquitous type of post-translational modification (PTM) in both eukaryotic and prokaryotic cells. Phosphorylation plays fundamental roles in many signalling pathways and biological processes, such as protein degradation and protein-protein interactions. Experimental studies have revealed that signalling defects caused by aberrant phosphorylation are highly associated with a variety of human diseases, especially cancers. In light of this, a number of computational methods aiming to accurately predict protein kinase family-specific or kinase-specific phosphorylation sites have been established, thereby facilitating phosphoproteomic data analysis. Results: In this work, we present Quokka, a novel bioinformatics tool that allows users to rapidly and accurately identify human kinase family-regulated phosphorylation sites. Quokka was developed by using a variety of sequence scoring functions combined with an optimized logistic regression algorithm. We evaluated Quokka based on well-prepared up-to-date benchmark and independent test datasets, curated from the Phospho.ELM and UniProt databases, respectively. The independent test demonstrates that Quokka improves the prediction performance compared with state-of-the-art computational tools for phosphorylation prediction. In summary, our tool provides users with high-quality predicted human phosphorylation sites for hypothesis generation and biological validation. Availability and implementation: The Quokka webserver and datasets are freely available at http://quokka.erc.monash.edu/. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Fuyi Li
- Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Clayton, VIC, Australia
| | - Chen Li
- Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Clayton, VIC, Australia.,Department of Biology, Institute of Molecular Systems Biology, ETH Zurich, Zurich, Switzerland
| | - Tatiana T Marquez-Lago
- Department of Genetics, School of Medicine, University of Alabama at Birmingham, Birmingham, AL, USA
| | - André Leier
- Department of Genetics, School of Medicine, University of Alabama at Birmingham, Birmingham, AL, USA
| | - Tatsuya Akutsu
- Bioinformatics Center, Institute for Chemical Research, Kyoto University, Kyoto, Japan
| | - Anthony W Purcell
- Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Clayton, VIC, Australia
| | - A Ian Smith
- Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Clayton, VIC, Australia.,ARC Centre of Excellence in Advanced Molecular Imaging, Monash University, Melbourne, VIC, Australia
| | - Trevor Lithgow
- Biomedicine Discovery Institute and Department of Microbiology, Monash University, Clayton, VIC, Australia
| | - Roger J Daly
- Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Clayton, VIC, Australia
| | - Jiangning Song
- Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Clayton, VIC, Australia.,Monash Centre for Data Science, Monash University, Clayton, VIC, Australia
| | | |
|
50
|
Zhang M, Li F, Marquez-Lago TT, Leier A, Fan C, Kwoh CK, Chou KC, Song J, Jia C. MULTiPly: a novel multi-layer predictor for discovering general and specific types of promoters. Bioinformatics 2019; 35:2957-2965. [PMID: 30649179 PMCID: PMC6736106 DOI: 10.1093/bioinformatics/btz016] [Citation(s) in RCA: 75] [Impact Index Per Article: 15.0] [Received: 10/30/2018] [Revised: 12/09/2018] [Accepted: 01/05/2019] [Indexed: 12/22/2022] Open
Abstract
MOTIVATION: Promoters are short DNA consensus sequences that are localized proximal to the transcription start sites of genes, allowing transcription initiation of particular genes. However, the precise prediction of promoters remains a challenging task because individual promoters often differ from the consensus at one or more positions. RESULTS: In this study, we present a new multi-layer computational approach, called MULTiPly, for recognizing promoters and their specific types. MULTiPly took into account the sequences themselves, including both local information such as k-tuple nucleotide composition, dinucleotide-based auto covariance and global information of the entire samples based on bi-profile Bayes and k-nearest neighbour feature encodings. Specifically, the F-score feature selection method was applied to identify the best unique type of feature prediction results, in combination with other types of features that were subsequently added to further improve the prediction performance of MULTiPly. Benchmarking experiments on the benchmark dataset and comparisons with five state-of-the-art tools show that MULTiPly can achieve a better prediction performance on 5-fold cross-validation and jackknife tests. Moreover, the superiority of MULTiPly was also validated on a newly constructed independent test dataset. MULTiPly is expected to be used as a useful tool that will facilitate the discovery of both general and specific types of promoters in the post-genomic era. AVAILABILITY AND IMPLEMENTATION: The MULTiPly webserver and curated datasets are freely available at http://flagshipnt.erc.monash.edu/MULTiPly/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Meng Zhang
- School of Science, Dalian Maritime University, Dalian, China
| | - Fuyi Li
- Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology
- Monash Centre for Data Science, Faculty of Information Technology, Monash University, Melbourne, VIC, Australia
| | - Tatiana T Marquez-Lago
- Department of Genetics, Developmental and Integrative Biology, School of Medicine, University of Alabama at Birmingham, Birmingham, AL, USA
- Department of Cell, Developmental and Integrative Biology, School of Medicine, University of Alabama at Birmingham, Birmingham, AL, USA
| | - André Leier
- Department of Genetics, Developmental and Integrative Biology, School of Medicine, University of Alabama at Birmingham, Birmingham, AL, USA
- Department of Cell, Developmental and Integrative Biology, School of Medicine, University of Alabama at Birmingham, Birmingham, AL, USA
| | - Cunshuo Fan
- College of Information Engineering, Northwest A&F University, Yangling, China
| | - Chee Keong Kwoh
- School of Computer Science and Engineering, Nanyang Technological University, Singapore, Singapore
| | | | - Jiangning Song
- Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology
- Monash Centre for Data Science, Faculty of Information Technology, Monash University, Melbourne, VIC, Australia
- ARC Centre of Excellence in Advanced Molecular Imaging, Monash University, Melbourne, VIC, Australia
| | - Cangzhi Jia
- School of Science, Dalian Maritime University, Dalian, China
- College of Information Engineering, Northwest A&F University, Yangling, China
| |
|