1
Weckbecker M, Anžel A, Yang Z, Hattab G. Interpretable molecular encodings and representations for machine learning tasks. Comput Struct Biotechnol J 2024; 23:2326-2336. [PMID: 38867722 PMCID: PMC11167246 DOI: 10.1016/j.csbj.2024.05.035] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/27/2024] [Revised: 05/13/2024] [Accepted: 05/19/2024] [Indexed: 06/14/2024] Open
Abstract
Molecular encodings and their usage in machine learning models have demonstrated significant breakthroughs in biomedical applications, particularly in the classification of peptides and proteins. To this end, we propose a new encoding method: Interpretable Carbon-based Array of Neighborhoods (iCAN). Designed to address machine learning models' need for more structured and less flexible input, it captures the neighborhoods of carbon atoms in a counting array and improves the utility of the resulting encodings for machine learning models. The iCAN method provides interpretable molecular encodings and representations, enabling the comparison of molecular neighborhoods, identification of repeating patterns, and visualization of relevance heat maps for a given data set. When reproducing a large biomedical peptide classification study, it outperforms its predecessor encoding. When extended to proteins, it outperforms a lead structure-based encoding on 71% of the data sets. Our method offers interpretable encodings that can be applied to all organic molecules, including exotic amino acids, cyclic peptides, and larger proteins, making it highly versatile across various domains and data sets. This work establishes a promising new direction for machine learning in peptide and protein classification in biomedicine and healthcare, potentially accelerating advances in drug discovery and disease diagnosis.
Affiliation(s)
- Moritz Weckbecker
  - Center for Artificial Intelligence in Public Health Research (ZKI-PH), Robert Koch Institute, Nordufer 20, Berlin, 13353, Berlin, Germany
- Aleksandar Anžel
  - Center for Artificial Intelligence in Public Health Research (ZKI-PH), Robert Koch Institute, Nordufer 20, Berlin, 13353, Berlin, Germany
- Zewen Yang
  - Center for Artificial Intelligence in Public Health Research (ZKI-PH), Robert Koch Institute, Nordufer 20, Berlin, 13353, Berlin, Germany
- Georges Hattab
  - Center for Artificial Intelligence in Public Health Research (ZKI-PH), Robert Koch Institute, Nordufer 20, Berlin, 13353, Berlin, Germany
  - Department of Mathematics and Computer Science, Freie Universität, Arnimallee 14, Berlin, 14195, Berlin, Germany
2
Li Q, Lenertz M, Armstrong Z, MacRae A, Feng L, Ugrinov A, Yang Z. A Protocol to Depict the Proteolytic Processes Using a Combination of Metal-Organic Materials (MOMs), Electron Paramagnetic Resonance (EPR), and Mass Spectrometry (MS). Bio Protoc 2024; 14:e4909. [PMID: 38213322 PMCID: PMC10777052 DOI: 10.21769/bioprotoc.4909] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/18/2023] [Revised: 11/20/2023] [Accepted: 11/21/2023] [Indexed: 01/13/2024] Open
Abstract
Proteolysis is a critical biochemical process yet a challenging field to study experimentally due to the self-degradation of a protease and the complex, dynamic degradation steps of a substrate. Mass spectrometry (MS) is the traditional way for proteolytic studies, yet it is challenging when time-resolved, step-by-step details of the degradation process are needed. We recently found a way to resolve the cleavage site, preference/selectivity of cleavage regions, and proteolytic kinetics by combining site-directed spin labeling (SDSL) of protein substrate, time-resolved two-dimensional (2D) electron paramagnetic resonance (EPR) spectroscopy, protease immobilization via metal-organic materials (MOMs), and MS. The method has been demonstrated on a model substrate and protease, yet there is a lack of details on the practical operations to carry out our strategy. Thus, this protocol summarizes the key steps and considerations when carrying out the EPR/MS study on proteolytic processes, which can be generalized to study other protein/polypeptide substrates in proteolysis. Details for the experimental operation and cautions of each step are reported with figures illustrating the concepts. This protocol provides an effective approach to understanding the proteolytic process with the advantages of offering time-resolved, residue-level resolution of structural basis underlying the process. Such information is important for revealing the cleavage site and proteolytic mechanisms of unknown proteases. The advantage of EPR, probing the target substrate regardless of the complexities caused by the proteases and their self-degradation, offers a practically effective, rapid, and easy-to-operate approach to studying proteolysis.
Key features
• Combining protease immobilization, EPR, spin labeling, and MS experimental methods allows for the analysis of proteolysis process in real time.
• Reveals cleavage site, kinetics of product generation, and preference of cleavage regions via time-resolved SDSL-EPR.
• MS confirms EPR findings and helps depict the sequences and populations of the cleaved segments in real time.
• The demonstrated method can be generalized to other proteins or polypeptide substrates upon proteolysis by other proteases.
Affiliation(s)
- Qiaobin Li
  - Department of Chemistry and Biochemistry, North Dakota State University, Fargo, ND, 58102, USA
- Mary Lenertz
  - Department of Chemistry and Biochemistry, North Dakota State University, Fargo, ND, 58102, USA
- Zoe Armstrong
  - Department of Chemistry and Biochemistry, North Dakota State University, Fargo, ND, 58102, USA
- Austin MacRae
  - Department of Chemistry and Biochemistry, North Dakota State University, Fargo, ND, 58102, USA
- Li Feng
  - Department of Chemistry and Biochemistry, North Dakota State University, Fargo, ND, 58102, USA
- Angel Ugrinov
  - Department of Chemistry and Biochemistry, North Dakota State University, Fargo, ND, 58102, USA
- Zhongyu Yang
  - Department of Chemistry and Biochemistry, North Dakota State University, Fargo, ND, 58102, USA
3
Heese R, Schmid J, Walczak M, Bortz M. Calibrated simplex-mapping classification. PLoS One 2023; 18:e0279876. [PMID: 36649243 PMCID: PMC9844900 DOI: 10.1371/journal.pone.0279876] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/13/2021] [Accepted: 12/16/2022] [Indexed: 01/18/2023] Open
Abstract
We propose a novel methodology for general multi-class classification in arbitrary feature spaces, which results in a potentially well-calibrated classifier. Calibrated classifiers are important in many applications because, in addition to the prediction of mere class labels, they also yield a confidence level for each of their predictions. In essence, the training of our classifier proceeds in two steps. In a first step, the training data is represented in a latent space whose geometry is induced by a regular (n - 1)-dimensional simplex, n being the number of classes. We design this representation in such a way that it well reflects the feature space distances of the datapoints to their own- and foreign-class neighbors. In a second step, the latent space representation of the training data is extended to the whole feature space by fitting a regression model to the transformed data. With this latent-space representation, our calibrated classifier is readily defined. We rigorously establish its core theoretical properties and benchmark its prediction and calibration properties by means of various synthetic and real-world data sets from different application domains.
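The latent-space geometry mentioned in this abstract can be illustrated with a small sketch. It is not the authors' implementation: it only builds one standard regular (n - 1)-simplex for n classes, using the unit basis vectors of R^n as vertices (an assumed construction), and assigns a latent point to the class of its nearest vertex. The function names `simplex_vertices` and `nearest_vertex` are hypothetical.

```python
import math
from itertools import combinations


def simplex_vertices(n_classes):
    """Return n_classes points that form a regular (n_classes - 1)-simplex.

    Assumption: the unit basis vectors of R^n are used as vertices. They all
    lie in the hyperplane sum(x) = 1 and are pairwise equidistant.
    """
    return [
        [1.0 if i == j else 0.0 for j in range(n_classes)]
        for i in range(n_classes)
    ]


def nearest_vertex(point, vertices):
    """Index of the vertex (class) closest to a latent-space point."""
    distances = [math.dist(point, vertex) for vertex in vertices]
    return distances.index(min(distances))


vertices = simplex_vertices(3)
# Every pair of vertices is the same distance apart (regularity).
pairwise = [math.dist(a, b) for a, b in combinations(vertices, 2)]
# A hypothetical latent point that sits closest to the second vertex.
predicted_class = nearest_vertex([0.1, 0.7, 0.2], vertices)
```

In this sketch the regularity of the simplex means no class is geometrically privileged; how the paper maps data into the latent space and derives calibrated confidences is not reproduced here.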
Affiliation(s)
- Raoul Heese
  - Fraunhofer Center for Machine Learning, Kaiserslautern, Germany
  - Fraunhofer Institute for Industrial Mathematics ITWM, Kaiserslautern, Germany
- Jochen Schmid
  - Fraunhofer Institute for Industrial Mathematics ITWM, Kaiserslautern, Germany
- Michał Walczak
  - Fraunhofer Center for Machine Learning, Kaiserslautern, Germany
  - Fraunhofer Institute for Industrial Mathematics ITWM, Kaiserslautern, Germany
- Michael Bortz
  - Fraunhofer Center for Machine Learning, Kaiserslautern, Germany
  - Fraunhofer Institute for Industrial Mathematics ITWM, Kaiserslautern, Germany
4
Onah E, Uzor PF, Ugwoke IC, Eze JU, Ugwuanyi ST, Chukwudi IR, Ibezim A. Prediction of HIV-1 protease cleavage site from octapeptide sequence information using selected classifiers and hybrid descriptors. BMC Bioinformatics 2022; 23:466. [DOI: 10.1186/s12859-022-05017-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/24/2022] [Accepted: 10/11/2022] [Indexed: 11/10/2022] Open
Abstract
Background
In most parts of the world, especially in underdeveloped countries, acquired immunodeficiency syndrome (AIDS) still remains a major cause of death, disability, and unfavorable economic outcomes. This has necessitated intensive research to develop effective therapeutic agents for the treatment of human immunodeficiency virus (HIV) infection, which is responsible for AIDS. Peptide cleavage by HIV-1 protease is an essential step in the replication of HIV-1. Thus, correct and timely prediction of the cleavage site of HIV-1 protease can significantly speed up and optimize the drug discovery process of novel HIV-1 protease inhibitors. In this work, we built and compared the performance of selected machine learning models for the prediction of HIV-1 protease cleavage site utilizing a hybrid of octapeptide sequence information comprising bond composition, amino acid binary profile (AABP), and physicochemical properties as numerical descriptors serving as input variables for some selected machine learning algorithms. Our work differs from antecedent studies exploring the same subject in the combination of octapeptide descriptors and method used. Instead of using various subsets of the dataset for training and testing the models, we combined the dataset, applied a 3-way data split, and then used a "stratified" 10-fold cross-validation technique alongside the testing set to evaluate the models.
Results
Among the 8 models evaluated in the “stratified” 10-fold CV experiment, logistic regression, multi-layer perceptron classifier, linear discriminant analysis, gradient boosting classifier, Naive Bayes classifier, and decision tree classifier with AUC, F-score, and B. Acc. scores in the ranges of 0.91–0.96, 0.81–0.88, and 80.1–86.4%, respectively, have the closest predictive performance to the state-of-the-art model (AUC 0.96, F-score 0.80 and B. Acc. ~ 80.0%). Whereas, the perceptron classifier and the K-nearest neighbors had statistically lower performance (AUC 0.77–0.82, F-score 0.53–0.69, and B. Acc. 60.0–68.5%) at p < 0.05. On the other hand, logistic regression, and multi-layer perceptron classifier (AUC of 0.97, F-score > 0.89, and B. Acc. > 90.0%) had the best performance on further evaluation on the testing set, though linear discriminant analysis, gradient boosting classifier, and Naive Bayes classifier equally performed well (AUC > 0.94, F-score > 0.87, and B. Acc. > 86.0%).
Conclusions
Logistic regression and multi-layer perceptron classifiers have comparable predictive performances to the state-of-the-art model when octapeptide sequence descriptors consisting of AABP, bond composition and standard physicochemical properties are used as input variables. In our future work, we hope to develop a standalone software for HIV-1 protease cleavage site prediction utilizing the linear regression algorithm and the aforementioned octapeptide sequence descriptors.
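The "stratified" 10-fold cross-validation described in this abstract can be sketched as follows. This is a minimal illustration, not the study's code: the helper `stratified_folds` is hypothetical, the labels are synthetic, and samples are dealt round-robin per class without shuffling so that each fold keeps roughly the same class ratio as the full set.

```python
from collections import defaultdict


def stratified_folds(labels, k):
    """Split sample indices into k folds that preserve class proportions.

    Indices of each class are dealt round-robin across the folds. A real
    experiment would normally shuffle within each class first (omitted here
    to keep the sketch deterministic).
    """
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


# Synthetic, imbalanced labels: 20 cleaved (1) and 80 non-cleaved (0) peptides.
labels = [1] * 20 + [0] * 80
folds = stratified_folds(labels, 10)

# Each fold serves once as the validation set; the rest form the training set.
for fold in folds:
    validation = set(fold)
    training = [i for i in range(len(labels)) if i not in validation]
```

Stratification matters for data like cleavage-site sets, where positives are a minority: without it, some folds could contain very few cleaved peptides and give unstable scores.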
5
Hu L, Li Z, Tang Z, Zhao C, Zhou X, Hu P. Effectively predicting HIV-1 protease cleavage sites by using an ensemble learning approach. BMC Bioinformatics 2022; 23:447. [PMID: 36303135 PMCID: PMC9608884 DOI: 10.1186/s12859-022-04999-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/26/2022] [Accepted: 10/13/2022] [Indexed: 11/10/2022] Open
Abstract
Background The site information of substrates that can be cleaved by human immunodeficiency virus 1 proteases (HIV-1 PRs) is of great significance for designing effective inhibitors against HIV-1 viruses. A variety of machine learning-based algorithms have been developed to predict HIV-1 PR cleavage sites by extracting relevant features from substrate sequences. However, only relying on the sequence information is not sufficient to ensure a promising performance due to the uncertainty in the way of separating the datasets used for training and testing. Moreover, the existence of noisy data, i.e., false positive and false negative cleavage sites, could negatively influence the accuracy performance. Results In this work, an ensemble learning algorithm for predicting HIV-1 PR cleavage sites, namely EM-HIV, is proposed by training a set of weak learners, i.e., biased support vector machine classifiers, with the asymmetric bagging strategy. By doing so, the impact of data imbalance and noisy data can thus be alleviated. Besides, in order to make full use of substrate sequences, the features used by EM-HIV are collected from three different coding schemes, including amino acid identities, chemical properties and variable-length coevolutionary patterns, for the purpose of constructing more relevant feature vectors of octamers. Experiment results on three independent benchmark datasets demonstrate that EM-HIV outperforms state-of-the-art prediction algorithm in terms of several evaluation metrics. Hence, EM-HIV can be regarded as a useful tool to accurately predict HIV-1 PR cleavage sites.
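Of the three coding schemes named in this abstract, the amino acid identity scheme is the simplest to illustrate. The sketch below assumes it corresponds to the common orthogonal (one-hot) encoding, in which each of the 8 octamer positions becomes a 20-dimensional indicator vector; this is an assumption, not a reproduction of EM-HIV. The function name `encode_octamer` and the example peptide string are for illustration only.

```python
# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def encode_octamer(peptide):
    """One-hot encode an octamer into a flat 8 x 20 = 160 element vector.

    Assumption: "amino acid identities" is taken to mean orthogonal
    encoding, with one indicator per (position, residue) pair.
    """
    if len(peptide) != 8:
        raise ValueError("expected a peptide of exactly 8 residues")
    vector = []
    for residue in peptide:
        if residue not in AMINO_ACIDS:
            raise ValueError(f"unknown residue: {residue!r}")
        vector.extend(1 if residue == aa else 0 for aa in AMINO_ACIDS)
    return vector


# Illustrative octamer string; any 8-residue sequence works the same way.
features = encode_octamer("SQNYPIVQ")
```

A vector like `features` could then be concatenated with other descriptors (for example chemical properties) before being passed to a classifier.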
Affiliation(s)
- Lun Hu
  - Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Ürümqi, China
- Zhenfeng Li
  - School of Computer Science and Artificial Intelligence, Wuhan University of Technology, Wuhan, China
- Zehai Tang
  - School of Computer Science and Artificial Intelligence, Wuhan University of Technology, Wuhan, China
- Cheng Zhao
  - School of Computer Science and Artificial Intelligence, Wuhan University of Technology, Wuhan, China
- Xi Zhou
  - Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Ürümqi, China
- Pengwei Hu
  - Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Ürümqi, China
6
Truong TT, Hayn M, Frich CK, Olari L, Ladefoged LK, Jarlstad Olesen MT, Jakobsen JH, Lunabjerg‐Vestergaard CK, Schiøtt B, Münch J, Zelikin AN. Potentiation of Drug Toxicity Through Virus Latency Reversal Promotes Preferential Elimination of HIV Infected Cells. ADVANCED THERAPEUTICS 2022. [DOI: 10.1002/adtp.202200113] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/12/2022]
Affiliation(s)
- Thanh Tung Truong
  - Department of Chemistry Aarhus University Langelandsgade 140 Aarhus C 8000 Denmark
- Manuel Hayn
  - Institute of Molecular Virology Ulm University Medical Center 89081 Ulm Germany
- Camilla Kaas Frich
  - Department of Chemistry Aarhus University Langelandsgade 140 Aarhus C 8000 Denmark
- Lia‐Raluca Olari
  - Institute of Molecular Virology Ulm University Medical Center 89081 Ulm Germany
- Josefine H. Jakobsen
  - Department of Chemistry Aarhus University Langelandsgade 140 Aarhus C 8000 Denmark
- Birgit Schiøtt
  - Department of Chemistry Aarhus University Langelandsgade 140 Aarhus C 8000 Denmark
  - iNano Interdisciplinary Nanoscience Centre Aarhus University Aarhus 8000 Denmark
- Jan Münch
  - Institute of Molecular Virology Ulm University Medical Center 89081 Ulm Germany
  - iNano Interdisciplinary Nanoscience Centre Aarhus University Aarhus 8000 Denmark
- Alexander N. Zelikin
  - Department of Chemistry Aarhus University Langelandsgade 140 Aarhus C 8000 Denmark
  - iNano Interdisciplinary Nanoscience Centre Aarhus University Aarhus 8000 Denmark
7
Li Z, Hu L, Tang Z, Zhao C. Predicting HIV-1 Protease Cleavage Sites With Positive-Unlabeled Learning. Front Genet 2021; 12:658078. [PMID: 33868387 PMCID: PMC8044780 DOI: 10.3389/fgene.2021.658078] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 01/25/2021] [Accepted: 03/08/2021] [Indexed: 11/13/2022] Open
Abstract
Understanding the substrate specificity of HIV-1 protease plays an essential role in the prevention of HIV infection. A variety of computational models have thus been developed to predict substrate sites that are cleaved by HIV-1 protease, but most of them follow a supervised learning scheme to build classifiers by considering experimentally verified cleavable sites as positive samples and unknown sites as negative samples. However, the negative set can contain noise, as false negative samples may exist. Hence, the classifiers are not as accurate as they could be, because their prediction results are biased. In this work, unknown substrate sites are regarded as unlabeled samples instead of negative ones. We propose a novel positive-unlabeled learning algorithm, namely PU-HIV, for an effective prediction of HIV-1 protease cleavage sites. Features used by PU-HIV are encoded from different perspectives of substrate sequences, including amino acid identities, coevolutionary patterns and chemical properties. By adjusting the weights of errors generated by positive and unlabeled samples, a biased support vector machine classifier can be built to complete the prediction task. In comparison with state-of-the-art prediction models, benchmarking experiments using cross-validation and independent tests demonstrated the superior performance of PU-HIV in terms of AUC, PR-AUC, and F-measure. Thus, with PU-HIV, it is possible to identify previously unknown but physiologically existing substrate sites that can be cleaved by HIV-1 protease, providing valuable insights into designing novel HIV-1 protease inhibitors for HIV treatment.
Affiliation(s)
- Zhenfeng Li
  - School of Computer Science and Technology, Wuhan University of Technology, Wuhan, China
- Lun Hu
  - Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Ürümqi, China
- Zehai Tang
  - School of Computer Science and Technology, Wuhan University of Technology, Wuhan, China
- Cheng Zhao
  - School of Computer Science and Technology, Wuhan University of Technology, Wuhan, China
8
Ochoa R, Magnitov M, Laskowski RA, Cossio P, Thornton JM. An automated protocol for modelling peptide substrates to proteases. BMC Bioinformatics 2020; 21:586. [PMID: 33375946 PMCID: PMC7771086 DOI: 10.1186/s12859-020-03931-6] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 09/07/2020] [Accepted: 12/09/2020] [Indexed: 11/21/2022] Open
Abstract
Background Proteases are key drivers in many biological processes, in part due to their specificity towards their substrates. However, depending on the family and molecular function, they can also display substrate promiscuity which can also be essential. Databases compiling specificity matrices derived from experimental assays have provided valuable insights into protease substrate recognition. Despite this, there are still gaps in our knowledge of the structural determinants. Here, we compile a set of protease crystal structures with bound peptide-like ligands to create a protocol for modelling substrates bound to protease structures, and for studying observables associated to the binding recognition.
Results As an application, we modelled a subset of protease–peptide complexes for which experimental cleavage data are available to compare with informational entropies obtained from protease–specificity matrices. The modelled complexes were subjected to conformational sampling using the Backrub method in Rosetta, and multiple observables from the simulations were calculated and compared per peptide position. We found that some of the calculated structural observables, such as the relative accessible surface area and the interaction energy, can help characterize a protease’s substrate recognition, giving insights for the potential prediction of novel substrates by combining additional approaches. Conclusion Overall, our approach provides a repository of protease structures with annotated data, and an open source computational protocol to reproduce the modelling and dynamic analysis of the protease–peptide complexes.
Affiliation(s)
- Rodrigo Ochoa
  - Biophysics of Tropical Diseases, Max Planck Tandem Group, University of Antioquia, 050010, Medellín, Colombia
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
- Mikhail Magnitov
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
  - Department of Biological and Medical Physics, Moscow Institute of Physics and Technology (National Research University), Dolgoprudny, Russia, 141701
- Roman A Laskowski
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
- Pilar Cossio
  - Biophysics of Tropical Diseases, Max Planck Tandem Group, University of Antioquia, 050010, Medellín, Colombia
  - Department of Theoretical Biophysics, Max Planck Institute of Biophysics, 60438, Frankfurt am Main, Germany
- Janet M Thornton
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
9
Hu L, Hu P, Luo X, Yuan X, You ZH. Incorporating the Coevolving Information of Substrates in Predicting HIV-1 Protease Cleavage Sites. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2020; 17:2017-2028. [PMID: 31056514 DOI: 10.1109/tcbb.2019.2914208] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Indexed: 06/09/2023]
Abstract
Human immunodeficiency virus 1 (HIV-1) protease (PR) plays a crucial role in the maturation of the virus. The study of substrate specificity of HIV-1 PR as a new endeavor strives to increase our ability to understand how HIV-1 PR recognizes its various cleavage sites. To predict HIV-1 PR cleavage sites, most of the existing approaches have been developed solely based on the homogeneity of substrate sequence information with supervised classification techniques. Although efficient, these approaches are found to be restricted to the ability of explaining their results and probably provide few insights into the mechanisms by which HIV-1 PR cleaves the substrates in a site-specific manner. In this work, a coevolutionary pattern-based prediction model for HIV-1 PR cleavage sites, namely EvoCleave, is proposed by integrating the coevolving information obtained from substrate sequences with a linear SVM classifier. The experiment results showed that EvoCleave yielded a very promising performance in terms of ROC analysis and f-measure. We also prospectively assessed the biological significance of coevolutionary patterns by applying them to study three fundamental issues of HIV-1 PR cleavage site. The analysis results demonstrated that the coevolutionary patterns offered valuable insights into the understanding of substrate specificity of HIV-1 PR.
10
Taylor EW, Radding W. Understanding Selenium and Glutathione as Antiviral Factors in COVID-19: Does the Viral Mpro Protease Target Host Selenoproteins and Glutathione Synthesis? Front Nutr 2020; 7:143. [PMID: 32984400 PMCID: PMC7492384 DOI: 10.3389/fnut.2020.00143] [Citation(s) in RCA: 41] [Impact Index Per Article: 10.3] [Received: 06/26/2020] [Accepted: 07/21/2020] [Indexed: 12/14/2022] Open
Abstract
Glutathione peroxidases (GPX), a family of antioxidant selenoenzymes, functionally link selenium and glutathione, which both show correlations with clinical outcomes in COVID-19. Thus, it is highly significant that cytosolic GPX1 has been shown to interact with an inactive C145A mutant of Mpro, the main cysteine protease of SARS-CoV-2, but not with catalytically active wild-type Mpro. This seemingly anomalous result is what might be expected if GPX1 is a substrate for the active protease, leading to its fragmentation. We show that the GPX1 active site sequence is substantially similar to a known Mpro cleavage site, and is identified as a potential cysteine protease site by the Procleave algorithm. Proteolytic knockdown of GPX1 is highly consistent with previously documented effects of recombinant SARS-CoV Mpro in transfected cells, including increased reactive oxygen species and NF-κB activation. Because NF-κB in turn activates many pro-inflammatory cytokines, this mechanism could contribute to increased inflammation and cytokine storms observed in COVID-19. Using web-based protease cleavage site prediction tools, we show that Mpro may be targeting not only GPX1, but several other selenoproteins including SELENOF and thioredoxin reductase 1, as well as glutamate-cysteine ligase, the rate-limiting enzyme for glutathione synthesis. This hypothesized proteolytic knockdown of components of both the thioredoxin and glutaredoxin systems is consistent with a viral strategy to inhibit DNA synthesis, to increase the pool of ribonucleotides for RNA synthesis, thereby enhancing virion production. The resulting "collateral damage" of increased oxidative stress and inflammation would be exacerbated by dietary deficiencies of selenium and glutathione precursors.
Affiliation(s)
- Ethan Will Taylor
  - Department of Chemistry and Biochemistry, The University of North Carolina at Greensboro, Greensboro, NC, United States
11
Singh D, Sisodia DS, Singh P. Multiobjective evolutionary-based multi-kernel learner for realizing transfer learning in the prediction of HIV-1 protease cleavage sites. Soft comput 2020. [DOI: 10.1007/s00500-019-04487-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/25/2022]
12
Singh D, Sisodia DS, Singh P. Compositional framework for multitask learning in the identification of cleavage sites of HIV-1 protease. J Biomed Inform 2020; 102:103376. [PMID: 31935461 DOI: 10.1016/j.jbi.2020.103376] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 07/30/2019] [Revised: 12/19/2019] [Accepted: 01/08/2020] [Indexed: 11/18/2022]
Abstract
Inadequate patient samples and costly generation of annotated data result in small datasets in the biomedical domain. As a result, predictions from a model trained on a single small dataset fail to yield robust insights. To cope with data sparsity, a promising strategy of combining data from different related tasks is exercised in various applications. Motivated by successful work in various bioinformatics applications, we propose a multi-kernel-based multitask learning model that exploits the dependencies among various related tasks. This work aims to combine the knowledge from experimental studies of different datasets to build stronger predictive models for HIV-1 protease cleavage site prediction. In this study, a set of peptide data from one source is referred to as a 'task', and to integrate interactions from multiple tasks, our method exploits common features and parameter sharing across the data sources. The proposed framework uses feature integration, feature selection, multi-kernel learning and a multifactorial evolutionary algorithm to model multitask learning. The framework considered seven different feature descriptors and four different kernel variants of support vector machines to form the optimal multi-kernel learning model. To validate the effectiveness of the model, performance parameters such as average accuracy and area under the curve were evaluated. We also carried out Friedman and post hoc statistical tests to substantiate the significant improvement achieved by the proposed framework. The results obtained from the extensive experiments confirm the belief that multitask learning in cleavage site identification can improve performance.
Affiliation(s)
- Deepak Singh
  - Department of Computer Science and Engineering, National Institute of Technology, Raipur, C.G, India.
- Dilip Singh Sisodia
  - Department of Computer Science and Engineering, National Institute of Technology, Raipur, C.G, India.
- Pradeep Singh
  - Department of Computer Science and Engineering, National Institute of Technology, Raipur, C.G, India.
13
Cognitive Framework for HIV-1 Protease Cleavage Site Classification Using Evolutionary Algorithm. ARABIAN JOURNAL FOR SCIENCE AND ENGINEERING 2019. [DOI: 10.1007/s13369-019-03871-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/26/2022]
14
Christou V, Tsipouras MG, Giannakeas N, Tzallas AT, Brown G. Hybrid extreme learning machine approach for heterogeneous neural networks. Neurocomputing 2019. [DOI: 10.1016/j.neucom.2019.04.092] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Indexed: 10/26/2022]
15
Evolutionary based ensemble framework for realizing transfer learning in HIV-1 Protease cleavage sites prediction. APPL INTELL 2018. [DOI: 10.1007/s10489-018-1323-y] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 12/14/2022]
16
Christou V, Tsipouras MG, Giannakeas N, Tzallas AT. Hybrid extreme learning machine approach for homogeneous neural networks. Neurocomputing 2018. [DOI: 10.1016/j.neucom.2018.05.064] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Indexed: 11/30/2022]
17
Manasa J, Varghese V, Pond SLK, Rhee SY, Tzou PL, Fessel WJ, Jang KS, White E, Rögnvaldsson T, Katzenstein DA, Shafer RW. Evolution of gag and gp41 in Patients Receiving Ritonavir-Boosted Protease Inhibitors. Sci Rep 2017; 7:11559. [PMID: 28912582 PMCID: PMC5599673 DOI: 10.1038/s41598-017-11893-8] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.1] [Received: 03/22/2017] [Accepted: 08/31/2017] [Indexed: 11/15/2022] Open
Abstract
Several groups have proposed that genotypic determinants in gag and the gp41 cytoplasmic domain (gp41-CD) reduce protease inhibitor (PI) susceptibility without PI-resistance mutations in protease. However, no gag and gp41-CD mutations definitively responsible for reduced PI susceptibility have been identified in individuals with virological failure (VF) while receiving a boosted PI (PI/r)-containing regimen. To identify gag and gp41 mutations under selective PI pressure, we sequenced gag and/or gp41 in 61 individuals with VF on a PI/r (n = 40) or NNRTI (n = 20) containing regimen. We quantified nonsynonymous and synonymous changes in both genes and identified sites exhibiting signal for directional or diversifying selection. We also used published gag and gp41 polymorphism data to highlight mutations displaying a high selection index, defined as changing from a conserved to an uncommon amino acid. Many amino acid mutations developed in gag and in gp41-CD in both the PI- and NNRTI-treated groups. However, in neither gene, were there discernable differences between the two groups in overall numbers of mutations, mutations displaying evidence of diversifying or directional selection, or mutations with a high selection index. If gag and/or gp41 encode PI-resistance mutations, they may not be confined to consistent mutations at a few sites.
Affiliation(s)
- Justen Manasa
- Division of Infectious Diseases, Department of Medicine, Stanford University, Stanford, CA, USA
- Vici Varghese
- Division of Infectious Diseases, Department of Medicine, Stanford University, Stanford, CA, USA
- Soo-Yon Rhee
- Division of Infectious Diseases, Department of Medicine, Stanford University, Stanford, CA, USA
- Philip L Tzou
- Division of Infectious Diseases, Department of Medicine, Stanford University, Stanford, CA, USA
- W Jeffrey Fessel
- Department of Internal Medicine, Kaiser Permanente Medical Care Program - Northern California, San Francisco, CA, United States
- Karen S Jang
- Division of Infectious Diseases, Department of Medicine, Stanford University, Stanford, CA, USA
- Elizabeth White
- Division of Infectious Diseases, Department of Medicine, Stanford University, Stanford, CA, USA
- David A Katzenstein
- Division of Infectious Diseases, Department of Medicine, Stanford University, Stanford, CA, USA
- Robert W Shafer
- Division of Infectious Diseases, Department of Medicine, Stanford University, Stanford, CA, USA
18
Singh O, Su ECY. Prediction of HIV-1 protease cleavage site using a combination of sequence, structural, and physicochemical features. BMC Bioinformatics 2016; 17:478. [PMID: 28155640 PMCID: PMC5259813 DOI: 10.1186/s12859-016-1337-6] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.1] [Indexed: 12/04/2022] Open
Abstract
Background: The human immunodeficiency virus type 1 (HIV-1) aspartic protease is an important enzyme owing to its imperative part in viral development and a causative agent of deadliest disease known as acquired immune deficiency syndrome (AIDS). Development of HIV-1 protease inhibitors can help understand the specificity of substrates which can restrain the replication of HIV-1, thus antagonize AIDS. However, experimental methods in identification of HIV-1 protease cleavage sites are generally time-consuming and labor-intensive. Therefore, using computational methods to predict cleavage sites has become highly desirable.
Results: In this study, we propose a prediction method in which sequence, structural, and physicochemical features are incorporated in various machine learning algorithms. Then, a bidirectional stepwise selection algorithm is incorporated in feature selection to identify discriminative features. Further, only the selected features are calculated by various encoding schemes and used as input for decision trees, logistic regression, and artificial neural networks. Moreover, a more rigorous three-way data split procedure is applied to evaluate the objective performance of cleavage site prediction. Four benchmark datasets collected from previous studies are used to evaluate the predictive performance.
Conclusions: Experiment results showed that combinations of sequence, structure, and physicochemical features performed better than single feature type for identification of HIV-1 protease cleavage sites. In addition, incorporation of stepwise feature selection is effective to identify interpretable biological features to depict specificity of the substrates. Moreover, artificial neural networks perform significantly better than the other two classifiers. Finally, the proposed method achieved 80.0% ~ 97.4% in accuracy and 0.815 ~ 0.995 evaluated by independent test sets in a three-way data split procedure.
Electronic supplementary material The online version of this article (doi:10.1186/s12859-016-1337-6) contains supplementary material, which is available to authorized users.
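The three-way data split mentioned in the abstract above can be illustrated with a minimal sketch. The abstract does not state the split proportions or the shuffling procedure, so the function name `three_way_split`, the 60/20/20 default fractions, and the seeded shuffle are all assumptions made for illustration, not the authors' protocol.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def three_way_split(
    items: Sequence[T],
    val_frac: float = 0.2,   # assumed fraction, not taken from the paper
    test_frac: float = 0.2,  # assumed fraction, not taken from the paper
    seed: int = 0,
) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle items and split them into disjoint train/validation/test lists."""
    if val_frac <= 0 or test_frac <= 0 or val_frac + test_frac >= 1:
        raise ValueError("fractions must be positive and sum to less than 1")
    shuffled = list(items)
    # A dedicated Random instance keeps the split reproducible for a given seed.
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


if __name__ == "__main__":
    # Placeholder sample identifiers standing in for labelled octamers.
    samples = [f"sample_{i}" for i in range(10)]
    train, val, test = three_way_split(samples)
    print(len(train), len(val), len(test))
```

In such a scheme the validation part would typically guide feature selection and model tuning, while the test part is held back for the final performance estimate.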
Affiliation(s)
- Onkar Singh
- Graduate Institute of Biomedical Informatics, College of Medical Science and Technology, Taipei Medical University, Taipei, Taiwan
- Emily Chia-Yu Su
- Graduate Institute of Biomedical Informatics, College of Medical Science and Technology, Taipei Medical University, Taipei, Taiwan
19
Koçak Y, Özyer T, Alhajj R. Utilizing maximal frequent itemsets and social network analysis for HIV data analysis. J Cheminform 2016. [PMCID: PMC5395515 DOI: 10.1186/s13321-016-0184-9] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 01/13/2023] Open
Abstract
Acquired immune deficiency syndrome is a deadly disease caused by human immunodeficiency virus (HIV). This virus attacks the patient's immune system and affects its ability to fight against diseases. Developing effective medicine requires understanding the life cycle and replication ability of the virus. HIV-1 protease enzyme is used to cleave an octamer peptide into peptides which are used to create proteins by the virus. In this paper, a novel feature extraction method is proposed for understanding important patterns in octamer’s cleavability. This feature extraction method is based on data mining techniques which are used to find important relations inside a dataset by comprehensively analyzing the given data. As demonstrated in this paper, using the extracted information in the classification process yields important results which may be taken into consideration when developing a new medicine. We have used 746 and 1625, Impens and Schilling data instances from the 746-dataset. In addition, we have performed social network analysis as a complementary alternative method.
20
Manning T, Walsh P. The importance of physicochemical characteristics and nonlinear classifiers in determining HIV-1 protease specificity. Bioengineered 2016; 7:65-78. [PMID: 27212259 PMCID: PMC4879986 DOI: 10.1080/21655979.2016.1149271] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 12/04/2015] [Revised: 01/25/2016] [Accepted: 01/26/2016] [Indexed: 10/21/2022] Open
Abstract
This paper reviews recent research relating to the application of bioinformatics approaches to determining HIV-1 protease specificity, outlines outstanding issues, and presents a new approach to addressing these issues. Leading machine learning theory for the problem currently suggests that the direct encoding of the physicochemical properties of the amino acid substrates is not required for optimal performance. A number of amino acid encoding approaches which incorporate potentially relevant physicochemical properties of the substrate are identified, and are evaluated using a nonlinear task decomposition based neuroevolution algorithm. The results are evaluated, and compared against a recent benchmark set on a nonlinear classifier using only amino acid sequence and identity information. Ensembles of these nonlinear classifiers using the physicochemical properties of the substrate are demonstrated to consistently outperform the recently published state-of-the-art linear support vector machine based approach in out-of-sample evaluations.
Affiliation(s)
- Timmy Manning
- Department of Computer Science, Cork Institute of Technology, Cork, Ireland
- Paul Walsh
- Department of Computer Science, Cork Institute of Technology, Cork, Ireland
- NSilico Ltd, Rubicon Innovation Center, Cork, Ireland
21
Bayden AS, Gomez EF, Audie J, Chakravorty DK, Diller DJ. A combined cheminformatic and bioinformatic approach to address the proteolytic stability challenge in peptide-based drug discovery. Biopolymers 2015; 104:775-89. [PMID: 26270398 DOI: 10.1002/bip.22711] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 04/17/2015] [Revised: 07/22/2015] [Accepted: 08/09/2015] [Indexed: 11/10/2022]
Abstract
We have created models to predict cleavage sites for several human proteases including caspase-1, caspase-3, caspase-6, caspase-7, cathepsin B, cathepsin D, cathepsin G, cathepsin K, cathepsin L, elastase-2, granzyme A, granzyme B, matrix metallopeptidase-2 (MMP2), MMP7, MMP9, thrombin, and trypsin-1. Rather than representing the sequence pattern around the potential cleavage site through a series of flags with each flag representing one of the 20 standard amino acids, we first represent each amino acid by its calculated properties. For these calculated properties, we use validated cheminformatic descriptors, such as molecular weight, logP, and polar surface area, of the individual amino acids. Finally, the cleavage site-specific descriptors are calculated through various combinations of the individual amino acid descriptors for the residues surrounding the cleavage site. Some of these combinations do not take into account the location of the residue, as long as it is in a prescribed neighborhood of the potential cleavage site, whereas others are sensitive to the precise order of the residues in the sequence. The key advantage of this approach is that it allows one to perform meaningful calculations with nonstandard amino acids for which little or no data exists. Finally, using both docking and molecular dynamics simulations, we examine the potential for and limitations of protease crystal structures to impact the design of proteolytically stable peptides.
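The distinction drawn in the abstract above, between site descriptors that ignore residue position and those that depend on residue order, can be made concrete with a short sketch. The table `TOY_DESCRIPTORS` holds made-up placeholder numbers, not validated cheminformatic descriptors, and both function names are hypothetical; the abstract does not specify which combinations the authors actually used, so the mean and the concatenation below are only two plausible examples.

```python
from typing import Dict, List, Sequence

# Placeholder per-residue descriptor vectors (two arbitrary properties each).
# These numbers are illustrative only and carry no chemical meaning.
TOY_DESCRIPTORS: Dict[str, List[float]] = {
    "A": [1.0, 0.5],
    "G": [0.5, 0.25],
    "L": [2.0, 1.5],
    "S": [1.25, -0.5],
}


def position_sensitive(window: Sequence[str],
                       table: Dict[str, List[float]]) -> List[float]:
    """Concatenate residue descriptors in sequence order (order matters)."""
    features: List[float] = []
    for residue in window:
        features.extend(table[residue])
    return features


def position_insensitive(window: Sequence[str],
                         table: Dict[str, List[float]]) -> List[float]:
    """Average each descriptor over the window (order does not matter)."""
    n_props = len(next(iter(table.values())))
    sums = [0.0] * n_props
    for residue in window:
        for i, value in enumerate(table[residue]):
            sums[i] += value
    return [total / len(window) for total in sums]


if __name__ == "__main__":
    # Two windows with the same residues in a different order.
    print(position_sensitive("AG", TOY_DESCRIPTORS))
    print(position_sensitive("GA", TOY_DESCRIPTORS))
    print(position_insensitive("AG", TOY_DESCRIPTORS))
    print(position_insensitive("GA", TOY_DESCRIPTORS))
```

Because the encoding only needs a descriptor vector per residue, a nonstandard amino acid can be handled by adding one more row to the table, which is the advantage the abstract highlights.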
Affiliation(s)
- Edwin F Gomez
- Department of Chemistry, University of New Orleans, New Orleans, LA
- Joseph Audie
- CMDBioscience Inc., 5 Science Park, New Haven, CT