1
Luukkonen S, Meijer E, Tricarico GA, Hofmans J, Stouten PFW, van Westen GJP, Lenselink EB. Large-Scale Modeling of Sparse Protein Kinase Activity Data. J Chem Inf Model 2023. [PMID: 37294674 DOI: 10.1021/acs.jcim.3c00132] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Indexed: 06/11/2023]
Abstract
Protein kinases are a protein family that plays an important role in several complex diseases such as cancer and cardiovascular and immunological diseases. Protein kinases have conserved ATP binding sites, which when targeted can lead to similar activities of inhibitors against different kinases. This can be exploited to create multitarget drugs. On the other hand, selectivity (lack of similar activities) is desirable in order to avoid toxicity issues. There is a vast amount of protein kinase activity data in the public domain, which can be used in many different ways. Multitask machine learning models are expected to excel for these kinds of data sets because they can learn from implicit correlations between tasks (in this case activities against a variety of kinases). However, multitask modeling of sparse data poses two major challenges: (i) creating a balanced train-test split without data leakage and (ii) handling missing data. In this work, we construct a protein kinase benchmark set composed of two balanced splits without data leakage, using random and dissimilarity-driven cluster-based mechanisms, respectively. This data set can be used for benchmarking and developing protein kinase activity prediction models. Overall, the performance on the dissimilarity-driven cluster-based split is lower than on random split-based sets for all models, indicating poor generalizability of models. Nevertheless, we show that multitask deep learning models, on this very sparse data set, outperform single-task deep learning and tree-based models. Finally, we demonstrate that data imputation does not improve the performance of (multitask) models on this benchmark set.
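One common way to train on a sparse compound-by-kinase activity matrix without imputing the missing values is to compute the loss over observed entries only. The following is a minimal sketch of that idea, not the authors' implementation: the function name `masked_mse` and the use of `None` to mark an unmeasured compound-kinase pair are assumptions made for illustration.

```python
from typing import Optional, Sequence


def masked_mse(
    y_true: Sequence[Sequence[Optional[float]]],
    y_pred: Sequence[Sequence[float]],
) -> float:
    """Mean squared error over observed entries only.

    Rows are compounds and columns are tasks (kinases). A value of None
    in y_true marks an unmeasured compound-kinase pair (an illustrative
    convention); such entries contribute nothing to the loss.
    """
    total = 0.0
    count = 0
    for row_true, row_pred in zip(y_true, y_pred):
        for observed, predicted in zip(row_true, row_pred):
            if observed is None:
                continue  # skip missing labels instead of imputing them
            total += (observed - predicted) ** 2
            count += 1
    if count == 0:
        raise ValueError("y_true contains no observed labels")
    return total / count


# Two compounds, two kinases; only two of the four activities were measured.
measured = [[1.0, None], [None, 3.0]]
predicted = [[0.0, 5.0], [2.0, 1.0]]
loss = masked_mse(measured, predicted)
```

In this example only the two measured entries enter the average, so the predictions for unmeasured pairs (5.0 and 2.0) have no effect on the loss.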
Affiliation(s)
- Sohvi Luukkonen
  - Leiden Academic Centre of Drug Research, Leiden University, Einsteinweg 55, 2333 CC Leiden, The Netherlands
- Erik Meijer
  - Leiden Academic Centre of Drug Research, Leiden University, Einsteinweg 55, 2333 CC Leiden, The Netherlands
- Johan Hofmans
  - Galapagos NV, Generaal De Wittelaan L11 A3, 2800 Mechelen, Belgium
- Pieter F W Stouten
  - Leiden Academic Centre of Drug Research, Leiden University, Einsteinweg 55, 2333 CC Leiden, The Netherlands
  - Galapagos NV, Generaal De Wittelaan L11 A3, 2800 Mechelen, Belgium
  - Stouten Pharma Consultancy BV, Kempenarestraat 47, 2860 Sint-Katelijne-Waver, Belgium
- Gerard J P van Westen
  - Leiden Academic Centre of Drug Research, Leiden University, Einsteinweg 55, 2333 CC Leiden, The Netherlands
2
Gopanenko AV, Kolobova AV, Tupikin AE, Kabilov MR, Malygin AA, Karpova GG. Knockdown of the Ribosomal Protein eL38 in HEK293 Cells Changes the Translational Efficiency of Specific Genes. Int J Mol Sci 2021; 22:ijms22094531. [PMID: 33926116 PMCID: PMC8123606 DOI: 10.3390/ijms22094531] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 03/29/2021] [Revised: 04/22/2021] [Accepted: 04/24/2021] [Indexed: 11/23/2022]
Abstract
The protein eL38 is one of the smallest proteins of the mammalian ribosome and is a component of its large (60S) subunit. The haploinsufficiency of eL38 in mice leads to the Tail-short mutant phenotype, characterized by defects in the development of the axial skeleton caused by the poor translation of mRNA subsets of Hox genes. Using ribosome profiling of eL38-knockdown HEK293 cells, we examined the effects of the lack of eL38 in 60S subunits on gene expression at the level of translation. A four-fold decrease in the cell content of eL38 was shown to result in significant changes in the translational efficiencies of 150 genes. The genes whose expression at the level of translation was enhanced were mainly those associated with basic metabolic processes, namely translation, protein folding, chromosome organization, splicing, and others. The set of genes with reduced translation efficiencies contained those that are mostly involved in processes related to the regulation of transcription, including the activation of Hox genes. Thus, we demonstrated that eL38 insufficiency significantly affects the expression of certain genes at the translational level. Our findings facilitate understanding of the possible causes of some anomalies in eL38-deficient animals.
3
Duan Y, Evans DS, Miller RA, Schork NJ, Cummings S, Girke T. signatureSearch: environment for gene expression signature searching and functional interpretation. Nucleic Acids Res 2020; 48:e124. [PMID: 33068417 PMCID: PMC7708038 DOI: 10.1093/nar/gkaa878] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 03/29/2020] [Revised: 08/19/2020] [Accepted: 09/25/2020] [Indexed: 12/14/2022]
Abstract
signatureSearch is an R/Bioconductor package that integrates a suite of existing and novel algorithms into an analysis environment for gene expression signature (GES) searching combined with functional enrichment analysis (FEA) and visualization methods to facilitate the interpretation of the search results. In a typical GES search (GESS), a query GES is searched against a database of GESs obtained from large numbers of measurements, such as different genetic backgrounds, disease states and drug perturbations. Database matches sharing correlated signatures with the query indicate related cellular responses frequently governed by connected mechanisms, such as drugs mimicking the expression responses of a disease. To identify which processes are predominantly modulated in the GESS results, we developed specialized FEA methods combined with drug-target network visualization tools. The provided analysis tools are useful for studying the effects of genetic, chemical and environmental perturbations on biological systems, as well as searching single cell GES databases to identify novel network connections or cell types. The signatureSearch software is unique in that it provides access to an integrated environment for GESS/FEA routines that includes several novel search and enrichment methods, efficient data structures, and access to pre-built GES databases, and allows users to work with custom databases.
Affiliation(s)
- Yuzhu Duan
  - Institute for Integrative Genome Biology, 1207F Genomics Building, University of California, Riverside, CA 92521, USA
- Daniel S Evans
  - California Pacific Medical Center Research Institute, 550 16th Street, 2nd floor, San Francisco, CA 94158, USA
- Richard A Miller
  - Department of Pathology, University of Michigan, Ann Arbor, MI 48109, USA
- Nicholas J Schork
  - Department of Quantitative Medicine and Systems Biology, The Translational Genomics Research Institute, 445 N. Fifth Street, Phoenix, AZ 85004, USA
- Steven R Cummings
  - California Pacific Medical Center Research Institute, 550 16th Street, 2nd floor, San Francisco, CA 94158, USA
- Thomas Girke
  - Institute for Integrative Genome Biology, 1207F Genomics Building, University of California, Riverside, CA 92521, USA
4
Lombard DB, Kohler WJ, Guo AH, Gendron C, Han M, Ding W, Lyu Y, Ching TT, Wang FY, Chakraborty TS, Nikolovska-Coleska Z, Duan Y, Girke T, Hsu AL, Pletcher SD, Miller RA. High-throughput small molecule screening reveals Nrf2-dependent and -independent pathways of cellular stress resistance. Sci Adv 2020; 6:eaaz7628. [PMID: 33008901 PMCID: PMC7852388 DOI: 10.1126/sciadv.aaz7628] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Received: 10/07/2019] [Accepted: 08/14/2020] [Indexed: 05/03/2023]
Abstract
Aging is the dominant risk factor for most chronic diseases. Development of antiaging interventions offers the promise of preventing many such illnesses simultaneously. Cellular stress resistance is an evolutionarily conserved feature of longevity. Here, we identify compounds that induced resistance to the superoxide generator paraquat (PQ), the heavy metal cadmium (Cd), and the DNA alkylator methyl methanesulfonate (MMS). Some rescue compounds conferred resistance to a single stressor, while others provoked multiplex resistance. Induction of stress resistance in fibroblasts was predictive of longevity extension in a published large-scale longevity screen in Caenorhabditis elegans, although not in testing performed in worms and flies with a more restricted set of compounds. Transcriptomic analysis and genetic studies implicated Nrf2/SKN-1 signaling in stress resistance provided by two protective compounds, cardamonin and AEG 3482. Small molecules identified in this work may represent attractive tools to elucidate mechanisms of stress resistance in mammalian cells.
Affiliation(s)
- David B Lombard
  - Department of Pathology, University of Michigan, Ann Arbor, MI, USA
  - Geriatrics Center, University of Michigan, Ann Arbor, MI, USA
- William J Kohler
  - Department of Pathology, University of Michigan, Ann Arbor, MI, USA
- Angela H Guo
  - Department of Pathology, University of Michigan, Ann Arbor, MI, USA
- Christi Gendron
  - Department of Molecular and Integrative Physiology, University of Michigan, Ann Arbor, MI, USA
- Melissa Han
  - Department of Pathology, University of Michigan, Ann Arbor, MI, USA
- Weiqiao Ding
  - Department of Internal Medicine, Division of Geriatric and Palliative Medicine, University of Michigan, Ann Arbor, MI, USA
- Yang Lyu
  - Department of Molecular and Integrative Physiology, University of Michigan, Ann Arbor, MI, USA
- Tsui-Ting Ching
  - Institute of Biopharmaceutical Sciences, National Yang Ming University, Taipei 112, Taiwan
- Feng-Yung Wang
  - Institute of Biochemistry and Molecular Biology, National Yang Ming University, Taipei 112, Taiwan
- Tuhin S Chakraborty
  - Department of Molecular and Integrative Physiology, University of Michigan, Ann Arbor, MI, USA
- Yuzhu Duan
  - Institute for Integrative Genome Biology, University of California Riverside, Riverside, CA, USA
- Thomas Girke
  - Institute for Integrative Genome Biology, University of California Riverside, Riverside, CA, USA
- Ao-Lin Hsu
  - Department of Molecular and Integrative Physiology, University of Michigan, Ann Arbor, MI, USA
  - Department of Internal Medicine, Division of Geriatric and Palliative Medicine, University of Michigan, Ann Arbor, MI, USA
  - Research Center for Healthy Aging, China Medical University, Taichung, Taiwan
- Scott D Pletcher
  - Geriatrics Center, University of Michigan, Ann Arbor, MI, USA
  - Department of Molecular and Integrative Physiology, University of Michigan, Ann Arbor, MI, USA
- Richard A Miller
  - Department of Pathology, University of Michigan, Ann Arbor, MI, USA
  - Geriatrics Center, University of Michigan, Ann Arbor, MI, USA
5
Dong J, Zhu MF, Yun YH, Lu AP, Hou TJ, Cao DS. BioMedR: an R/CRAN package for integrated data analysis pipeline in biomedical study. Brief Bioinform 2019; 22:474-484. [PMID: 31885044 DOI: 10.1093/bib/bbz150] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 09/24/2019] [Revised: 10/22/2019] [Accepted: 10/30/2019] [Indexed: 02/01/2023]
Abstract
BACKGROUND: With the increasing development of biotechnology and information technology, publicly available data in chemistry and biology are undergoing explosive growth. The wealth of information in these resources needs to be extracted and then transformed into useful knowledge by various data mining methods. However, a main computational challenge is how to effectively represent or encode the molecular objects under investigation, such as chemicals, proteins, DNAs and even complicated interactions, when data mining methods are employed. To further explore these complicated data, an integrated toolkit that represents different types of molecular objects and supports various data mining algorithms is urgently needed.
RESULTS: We developed a freely available R/CRAN package, called BioMedR, for molecular representations of chemicals, proteins, DNAs and pairwise samples of their interactions. The current version of BioMedR can calculate 293 molecular descriptors and 13 kinds of molecular fingerprints for small molecules, 9920 protein descriptors based on protein sequences and six types of generalized scale-based descriptors for proteochemometric modeling, more than 6000 DNA descriptors from nucleotide sequences and six types of interaction descriptors using three different combining strategies. Moreover, the package implements five similarity calculation methods and four clustering algorithms as well as several auxiliary tools, with the aim of building an integrated analysis pipeline for data acquisition, data checking, descriptor calculation and data modeling.
CONCLUSION: BioMedR provides a comprehensive and uniform R package that links different representations of molecular objects with each other and will benefit cheminformatics/bioinformatics and other biomedical users. It is available at https://CRAN.R-project.org/package=BioMedR and https://github.com/wind22zhu/BioMedR/.
Affiliation(s)
- Jie Dong
  - National Engineering Laboratory for Deep Processing of Rice and Byproducts, Hunan Key Laboratory of Processed Food for Special Medical Purpose, College of Food Science and Engineering, Central South University of Forestry and Technology, Changsha, 410003 P. R. China
  - Xiangya School of Pharmaceutical Sciences, Central South University, Changsha, 410003 P. R. China
- Min-Feng Zhu
  - Xiangya School of Pharmaceutical Sciences, Central South University, Changsha, 410003 P. R. China
- Yong-Huan Yun
  - College of Food Science and Engineering, Hainan University, Haikou, 570228 P. R. China
- Ai-Ping Lu
  - Institute for Advancing Translational Medicine in Bone & Joint Diseases, School of Chinese Medicine, Hong Kong Baptist University, Hong Kong SAR, P. R. China
- Ting-Jun Hou
  - College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310058 Zhejiang, P. R. China
- Dong-Sheng Cao
  - Xiangya School of Pharmaceutical Sciences, Central South University, Changsha, 410003 P. R. China
  - Institute for Advancing Translational Medicine in Bone & Joint Diseases, School of Chinese Medicine, Hong Kong Baptist University, Hong Kong SAR, P. R. China
6
CSgator: an integrated web platform for compound set analysis. J Cheminform 2019; 11:17. [PMID: 30830479 PMCID: PMC6419788 DOI: 10.1186/s13321-019-0339-6] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 07/25/2018] [Accepted: 02/26/2019] [Indexed: 12/13/2022]
Abstract
Drug discovery typically involves investigation of a set of compounds (e.g. drug screening hits) in terms of target, disease, and bioactivity. CSgator is a comprehensive analytic tool for set-wise interpretation of compounds. It has two unique analytic features, Compound Set Enrichment Analysis (CSEA) and Compound Cluster Analysis (CCA), which allow batch analysis of a compound set in terms of (i) target, (ii) bioactivity, (iii) disease, and (iv) structure. CSEA and CCA present enriched profiles of targets and bioactivities in a compound set, which leads to novel insights into the underlying drug mode-of-action and potential targets. Notably, we propose a novel concept of "Hit Enriched Assays", i.e. bioassays whose hits are enriched among a given set of compounds. As an example, we show its utility in revealing drug mode-of-action or identifying hidden targets for anti-lymphangiogenesis screening hits. CSgator is available at http://csgator.ewha.ac.kr, and most analytic results are downloadable.
7
Wase N, Black P, DiRusso C. Innovations in improving lipid production: Algal chemical genetics. Prog Lipid Res 2018; 71:101-123. [DOI: 10.1016/j.plipres.2018.07.001] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.2] [Received: 04/02/2018] [Revised: 06/25/2018] [Accepted: 07/06/2018] [Indexed: 01/01/2023]
8
de la Vega de León A, Chen B, Gillet VJ. Effect of missing data on multitask prediction methods. J Cheminform 2018; 10:26. [PMID: 29789977 PMCID: PMC5964064 DOI: 10.1186/s13321-018-0281-z] [Citation(s) in RCA: 21] [Impact Index Per Article: 3.5] [Received: 03/19/2018] [Accepted: 05/14/2018] [Indexed: 01/05/2023]
Abstract
There has been a growing interest in multitask prediction in chemoinformatics, helped by the increasing use of deep neural networks in this field. This technique is applied to multitarget data sets, where compounds have been tested against different targets, with the aim of developing models to predict a profile of biological activities for a given compound. However, multitarget data sets tend to be sparse; i.e., not all compound-target combinations have experimental values. There has been little research on the effect of missing data on the performance of multitask methods. We have used two complete data sets to simulate sparseness by removing data from the training set. Different models to remove the data were compared. These sparse sets were used to train two different multitask methods, deep neural networks and Macau, which is a Bayesian probabilistic matrix factorization technique. Results from both methods were remarkably similar and showed that the performance decrease because of missing data is at first small before accelerating after large amounts of data are removed. This work provides a first approximation to assess how much data is required to produce good performance in multitask prediction exercises.
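The simulation described above, removing values from a complete compound-by-target matrix to mimic sparseness, can be sketched as follows. This is a minimal illustration under stated assumptions: it uses uniform random removal, which is only one possible removal model (the abstract notes that several were compared without naming them), and the function name `simulate_sparseness`, the `None` marker for removed values, and the `seed` parameter are hypothetical.

```python
import random
from typing import List, Optional, Sequence


def simulate_sparseness(
    matrix: Sequence[Sequence[float]],
    fraction: float,
    seed: int = 0,
) -> List[List[Optional[float]]]:
    """Return a copy of a complete activity matrix with a given fraction
    of entries replaced by None, chosen uniformly at random.

    Uniform removal is an assumption made for this sketch; other removal
    models (for example, removing whole compounds per target) would need
    a different sampling step.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    rng = random.Random(seed)  # local generator keeps runs reproducible
    cells = [(i, j) for i, row in enumerate(matrix) for j in range(len(row))]
    n_remove = round(fraction * len(cells))
    removed = set(rng.sample(cells, n_remove))
    return [
        [None if (i, j) in removed else value for j, value in enumerate(row)]
        for i, row in enumerate(matrix)
    ]


# A complete 2 x 3 matrix (two compounds, three targets); remove half of it.
complete = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
sparse = simulate_sparseness(complete, fraction=0.5, seed=1)
```

Repeating the call with increasing `fraction` values and retraining a model on each result gives the kind of performance-versus-sparseness curve the abstract refers to; the input matrix itself is left unmodified.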
Affiliation(s)
- Beining Chen
  - Department of Chemistry, University of Sheffield, Dainton Building, Brook Hill, Sheffield, S3 7HF, UK
- Valerie J Gillet
  - Information School, University of Sheffield, Regent Court, 211 Portobello, Sheffield, S1 4DP, UK
9
Backman TWH, Evans DS, Girke T. Large-scale bioactivity analysis of the small-molecule assayed proteome. PLoS One 2017; 12:e0171413. [PMID: 28178331 PMCID: PMC5298297 DOI: 10.1371/journal.pone.0171413] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Received: 09/02/2016] [Accepted: 01/20/2017] [Indexed: 12/12/2022]
Abstract
This study presents an analysis of the small molecule bioactivity profiles across large quantities of diverse protein families represented in PubChem BioAssay. We compared the bioactivity profiles of FDA approved drugs to non-FDA approved compounds, and report several distinct patterns characteristic of the approved drugs. We found that a large fraction of the previously reported higher target promiscuity among FDA approved compounds, compared to non-FDA approved bioactives, was frequently due to cross-reactivity within rather than across protein families. We identified 804 potentially novel protein target candidates for FDA approved drugs, as well as 901 potentially novel target candidates with active non-FDA approved compounds, but no FDA approved drugs with activity against these targets. We also identified 486348 potentially novel compounds active against the same targets as FDA approved drugs, as well as 153402 potentially novel compounds active against targets without active FDA approved drugs. By quantifying the agreement among replicated screens, we estimated that more than half of these novel outcomes are reproducible. Using biclustering, we identified many dense clusters of FDA approved drugs with enriched activity against a common set of protein targets. We also report the distribution of compound promiscuity using a Bayesian statistical model, and report the sensitivity and specificity of two common methods for identifying promiscuous compounds. Aggregator assays exhibited greater accuracy in identifying highly promiscuous compounds, while PAINS substructures were able to identify a much larger set of "middle range" promiscuous compounds. Additionally, we report a large number of promiscuous compounds not identified as aggregators or PAINS. In summary, the results of this study represent a rich reference for selecting novel drug and target protein candidates, as well as for eliminating candidate compounds with unselective activities.
Affiliation(s)
- Tyler William H. Backman
  - Department of Bioengineering, University of California Riverside, Riverside, California, United States of America
  - Institute for Integrative Genome Biology, University of California Riverside, Riverside, California, United States of America
- Daniel S. Evans
  - California Pacific Medical Center Research Institute, San Francisco, California, United States of America
- Thomas Girke
  - Institute for Integrative Genome Biology, University of California Riverside, Riverside, California, United States of America