1
Fralish Z, Reker D. Finding the most potent compounds using active learning on molecular pairs. Beilstein J Org Chem 2024; 20:2152-2162. [PMID: 39224230 PMCID: PMC11368049 DOI: 10.3762/bjoc.20.185] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/08/2024] [Accepted: 08/02/2024] [Indexed: 09/04/2024]
Abstract
Active learning allows algorithms to steer iterative experimentation to accelerate and de-risk molecular optimizations, but actively trained models might still exhibit poor performance during early project stages where the training data is limited and model exploitation might lead to analog identification with limited scaffold diversity. Here, we present ActiveDelta, an adaptive approach that leverages paired molecular representations to predict improvements from the current best training compound to prioritize further data acquisition. We apply the ActiveDelta concept to both graph-based deep (Chemprop) and tree-based (XGBoost) models during exploitative active learning for 99 Ki benchmarking datasets. We show that both ActiveDelta implementations excel at identifying more potent inhibitors compared to the standard exploitative active learning implementations of Chemprop, XGBoost, and Random Forest. The ActiveDelta approach is also able to identify more chemically diverse inhibitors in terms of their Murcko scaffolds. Finally, deep models such as Chemprop trained on data selected through ActiveDelta approaches can more accurately identify inhibitors in test data created through simulated time-splits. Overall, this study highlights the large potential for molecular pairing approaches to further improve popular active learning strategies in low data regimes by enabling faster and more accurate identification of more diverse molecular hits against critical drug targets.
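The selection step described in this abstract (predict the improvement of each candidate over the current best training compound, then acquire the top candidate) can be sketched as follows. This is a minimal illustration, not the authors' implementation: `select_next`, `toy_predict_delta`, and the descriptor table are hypothetical, and the toy predictor stands in for a trained paired model such as Chemprop or XGBoost. It also assumes that larger potency values are better.

```python
from typing import Callable, Dict, Iterable


def select_next(
    train: Dict[str, float],
    pool: Iterable[str],
    predict_delta: Callable[[str, str], float],
) -> str:
    """Pick the pool compound with the largest predicted improvement
    over the current best training compound (exploitative selection).

    Assumption: larger potency values are better.
    """
    best = max(train, key=train.get)  # most potent compound measured so far
    return max(pool, key=lambda candidate: predict_delta(best, candidate))


# Hypothetical stand-in for a trained paired model: it "predicts" the
# potency difference from a made-up one-dimensional descriptor.
TOY_DESCRIPTOR = {"cpd_a": 5.1, "cpd_b": 6.3, "cpd_c": 7.0, "cpd_d": 6.8}


def toy_predict_delta(reference: str, candidate: str) -> float:
    return TOY_DESCRIPTOR[candidate] - TOY_DESCRIPTOR[reference]


measured = {"cpd_a": 5.0, "cpd_b": 6.2}  # training set: compound -> potency
unmeasured = ["cpd_c", "cpd_d"]          # learning pool
next_compound = select_next(measured, unmeasured, toy_predict_delta)
```

In a real loop, the selected compound would be measured, moved from the pool into the training set, and the paired model retrained before the next selection.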
Affiliation(s)
- Zachary Fralish
- Department of Biomedical Engineering, Duke University, Durham, NC 27708, USA
- Daniel Reker
- Department of Biomedical Engineering, Duke University, Durham, NC 27708, USA
2
Kim S, Bong H, Jeon M. Dr.Emb Appyter: A web platform for drug discovery using embedding vectors. J Comput Chem 2024. [PMID: 39072889 DOI: 10.1002/jcc.27469] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/07/2024] [Revised: 06/30/2024] [Accepted: 07/04/2024] [Indexed: 07/30/2024]
Abstract
Using embedding methods, compounds with similar properties are located close together in latent space, and the resulting embedding vectors can be used to find other compounds with similar properties based on the distance between compounds. However, these methods often require computational resources and programming skills. Here we develop Dr.Emb Appyter, a user-friendly web-based chemical compound search platform for drug discovery without any technical barriers. It uses embedding vectors to identify compounds similar to a given query in the embedding space. Dr.Emb Appyter provides various types of embedding methods, such as fingerprinting, SMILES, and transcriptional response-based methods, and embeds numerous compounds using them. The Faiss-based search system efficiently finds the library compounds closest to the query. Additionally, Dr.Emb Appyter offers information on the top compounds; visualizes the results with 3D scatter plots, heatmaps, and UpSet plots; and analyses the results using a drug-set enrichment analysis. Dr.Emb Appyter is freely available at https://dremb.korea.ac.kr.
Affiliation(s)
- Songhyeon Kim
- Department of Medicine, Korea University College of Medicine, Seoul, South Korea
- Hyunsu Bong
- Department of Medicine, Korea University College of Medicine, Seoul, South Korea
- Minji Jeon
- Department of Medicine, Korea University College of Medicine, Seoul, South Korea
- Department of Biomedical Informatics, Korea University College of Medicine, Seoul, South Korea
- Biomedical Research Center, Korea University Anam Hospital, Seoul, South Korea
3
Fralish Z, Skaluba P, Reker D. Leveraging bounded datapoints to classify molecular potency improvements. RSC Med Chem 2024; 15:2474-2482. [PMID: 39026630 PMCID: PMC11253865 DOI: 10.1039/d4md00325j] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/06/2024] [Accepted: 05/19/2024] [Indexed: 07/20/2024]
Abstract
Molecular machine learning algorithms are becoming increasingly powerful at predicting the potency of potential drug candidates to guide molecular discovery, lead series prioritization, and structural optimization. However, a substantial amount of inhibition data is bounded and inaccessible to traditional regression algorithms. Here, we develop a novel molecular pairing approach to process this data. This creates a new classification task of predicting which one of two paired molecules is more potent. This novel classification task can be accurately solved by various established molecular machine learning algorithms, including XGBoost and Chemprop. Across 230 ChEMBL IC50 datasets, both tree-based and neural network-based "DeltaClassifiers" show improvements over traditional regression approaches in correctly classifying molecular potency improvements. The Chemprop-based deep DeltaClassifier outperformed all regression approaches evaluated here for paired molecules with shared and with distinct scaffolds, highlighting the promise of this approach for molecular optimization and scaffold-hopping.
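The pairing idea in this abstract can be illustrated with a small sketch showing how bounded measurements (for example, IC50 > 10) still yield usable classification labels. This is not the authors' code: the function names, the label convention (1 means the second molecule is more potent, i.e. has the lower IC50), and the choice to skip pairs whose ordering cannot be determined are all assumptions made for illustration.

```python
from itertools import permutations
from typing import Dict, List, Tuple

Measurement = Tuple[float, str]  # (IC50 value, qualifier: "=", ">" or "<")


def ic50_interval(value: float, qualifier: str) -> Tuple[float, float]:
    """Return the (low, high) range of IC50 values a measurement allows."""
    if qualifier == "=":
        return (value, value)
    if qualifier == ">":
        return (value, float("inf"))
    if qualifier == "<":
        return (0.0, value)
    raise ValueError(f"unknown qualifier: {qualifier!r}")


def label_pairs(data: Dict[str, Measurement]) -> List[Tuple[str, str, int]]:
    """Build (first, second, label) triples for every ordered pair whose
    potency ordering is unambiguous. Label 1: second is more potent."""
    intervals = {name: ic50_interval(*m) for name, m in data.items()}
    pairs = []
    for first, second in permutations(intervals, 2):
        low1, high1 = intervals[first]
        low2, high2 = intervals[second]
        if high2 < low1:      # second is certainly below first
            pairs.append((first, second, 1))
        elif high1 < low2:    # first is certainly below second
            pairs.append((first, second, 0))
        # otherwise the ordering is ambiguous and the pair is skipped
    return pairs


data = {"mol_a": (5.0, "="), "mol_b": (10.0, ">"), "mol_c": (20.0, ">")}
pairs = label_pairs(data)
```

Here `mol_b` and `mol_c` cannot be used by a regression model because their exact values are unknown, yet both can be paired with `mol_a`. The pair of `mol_b` and `mol_c` is dropped because two lower bounds do not fix an ordering.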
Affiliation(s)
- Zachary Fralish
- Department of Biomedical Engineering, Duke University, Durham, NC 27708, USA
- Paul Skaluba
- Department of Biomedical Engineering, Duke University, Durham, NC 27708, USA
- Daniel Reker
- Department of Biomedical Engineering, Duke University, Durham, NC 27708, USA
4
Zhang Z, Bian Y, Xie A, Han P, Zhou S. Can Pretrained Models Really Learn Better Molecular Representations for AI-Aided Drug Discovery? J Chem Inf Model 2024; 64:2921-2930. [PMID: 38145387 PMCID: PMC11005046 DOI: 10.1021/acs.jcim.3c01707] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/23/2023] [Revised: 11/29/2023] [Accepted: 11/29/2023] [Indexed: 12/26/2023]
Abstract
Self-supervised pretrained models are becoming increasingly popular in AI-aided drug discovery, leading to more and more pretrained models with the promise that they can extract better feature representations for molecules. Yet, the quality of learned representations has not been fully explored. In this work, inspired by the two phenomena of Activity Cliffs (ACs) and Scaffold Hopping (SH) in traditional Quantitative Structure-Activity Relationship analysis, we propose a method named Representation-Property Relationship Analysis (RePRA) to evaluate the quality of the representations extracted by the pretrained model and visualize the relationship between the representations and properties. The concepts of ACs and SH are generalized from the structure-activity context to the representation-property context, and the underlying principles of RePRA are analyzed theoretically. Two scores are designed to measure the generalized ACs and SH detected by RePRA, and therefore, the quality of representations can be evaluated. In experiments, representations of molecules from 10 target tasks generated by 7 pretrained models are analyzed. The results indicate that the state-of-the-art pretrained models can overcome some shortcomings of canonical Extended-Connectivity FingerPrints, while the correlation between the basis of the representation space and specific molecular substructures is not explicit. Thus, some representations could be even worse than the canonical fingerprints. Our method enables researchers to evaluate the quality of molecular representations generated by their proposed self-supervised pretrained models. Our findings can guide the community to develop better pretraining techniques to regularize the occurrence of ACs and SH.
Affiliation(s)
- Ziqiao Zhang
- Shanghai Key Lab of Intelligent Information Processing, and School of Computer Science, Fudan University, Shanghai 200438, China
- Ailin Xie
- Shanghai Key Lab of Intelligent Information Processing, and School of Computer Science, Fudan University, Shanghai 200438, China
- Pengju Han
- Shanghai Key Lab of Intelligent Information Processing, and School of Computer Science, Fudan University, Shanghai 200438, China
- Shuigeng Zhou
- Shanghai Key Lab of Intelligent Information Processing, and School of Computer Science, Fudan University, Shanghai 200438, China
5
Jiang L, Qu S, Yu Z, Wang J, Liu X. MOASL: Predicting drug mechanism of actions through similarity learning with transcriptomic signature. Comput Biol Med 2024; 169:107853. [PMID: 38104518 DOI: 10.1016/j.compbiomed.2023.107853] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/19/2023] [Revised: 11/02/2023] [Accepted: 12/11/2023] [Indexed: 12/19/2023]
Abstract
Understanding the mechanisms of action (MOAs) of compounds is crucial in drug discovery. A common step in drug MOA annotation is to query the dysregulated gene signatures induced by drugs in a reference library of pre-defined signatures. However, traditional similarity-based computational strategies face challenges when dealing with high-dimensional and noisy transcriptional signature data. To address this issue, we introduce MOASL (MOAs prediction via Similarity Learning), a novel approach that uses contrastive learning to automatically learn similarity embeddings among signatures with shared MOAs. We evaluated the accuracy of signature matching on various transcriptional activity score (TAS) datasets and individual cell lines by using MOASL. The results show that MOASL achieved higher performance than several statistical and machine learning methods. Furthermore, we provided the rationale of our model by visualizing the signature annotation procedure. Using MOASL, the MOA label of a query signature could be conveniently defined by calculating the similarity between the query embedding and the reference embeddings. Finally, we applied MOASL to repurpose thousands of compounds as glucocorticoid receptor (GR) agonists, accurately identifying 8 out of the top 10 compounds. MOASL is conveniently accessible on GitHub at https://github.com/jianglikun/MOASL, empowering researchers and practitioners in the field of drug discovery to predict the MOAs of drugs.
Affiliation(s)
- Likun Jiang
- Department of Computer Science, Xiamen University, Xiamen 361005, PR China; National Institute for Data Science in Health and Medicine, Xiamen University, Xiamen 361005, PR China
- Susu Qu
- Academy for Advanced Interdisciplinary Studies, Peking University, Beijing 100871, PR China; Chinese Institute for Brain Research, Beijing 102206, PR China
- Zhengqiu Yu
- National Institute for Data Science in Health and Medicine, Xiamen University, Xiamen 361005, PR China; School of Medicine, Xiamen University, Xiamen 361005, PR China
- Jianmin Wang
- The Interdisciplinary Graduate Program in Integrative Biotechnology and Translational Medicine, Yonsei University, Incheon 21983, South Korea
- Xiangrong Liu
- Department of Computer Science, Xiamen University, Xiamen 361005, PR China; National Institute for Data Science in Health and Medicine, Xiamen University, Xiamen 361005, PR China
6
Bustillo L, Laino T, Rodrigues T. The rise of automated curiosity-driven discoveries in chemistry. Chem Sci 2023; 14:10378-10384. [PMID: 37799997 PMCID: PMC10548516 DOI: 10.1039/d3sc03367h] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/02/2023] [Accepted: 09/07/2023] [Indexed: 10/07/2023]
Abstract
The quest for generating novel chemistry knowledge is critical in scientific advancement, and machine learning (ML) has emerged as an asset in this pursuit. Through interpolation among learned patterns, ML can tackle tasks that were previously deemed demanding to machines. This distinctive capacity of ML provides invaluable aid to bench chemists in their daily work. However, current ML tools are typically designed to prioritize experiments with the highest likelihood of success, i.e., higher predictive confidence. In this perspective, we build on current trends that suggest a future in which ML could be just as beneficial in exploring uncharted search spaces through simulated curiosity. We discuss how low and 'negative' data can catalyse one-/few-shot learning, and how the broader use of curious ML and novelty detection algorithms can propel the next wave of chemical discoveries. We anticipate that ML for curiosity-driven research will help the community overcome potentially biased assumptions and uncover unexpected findings in the chemical sciences at an accelerated pace.
Affiliation(s)
- Latimah Bustillo
- Research Institute for Medicines (iMed), Faculdade de Farmácia, Universidade de Lisboa, Lisbon, Portugal
- Teodoro Laino
- IBM Research Europe, Säumerstrasse 4, 8803 Rüschlikon, Switzerland
- National Center for Competence in Research-Catalysis (NCCR-Catalysis), Zurich, Switzerland
- Tiago Rodrigues
- Research Institute for Medicines (iMed), Faculdade de Farmácia, Universidade de Lisboa, Lisbon, Portugal
7
Zhang Y, Menke J, He J, Nittinger E, Tyrchan C, Koch O, Zhao H. Similarity-based pairing improves efficiency of siamese neural networks for regression tasks and uncertainty quantification. J Cheminform 2023; 15:75. [PMID: 37649050 PMCID: PMC10469421 DOI: 10.1186/s13321-023-00744-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/07/2022] [Accepted: 08/10/2023] [Indexed: 09/01/2023]
Abstract
Siamese networks, representing a novel class of neural networks, consist of two identical subnetworks sharing weights but receiving different inputs. Here we present a similarity-based pairing method for generating compound pairs to train Siamese neural networks for regression tasks. In comparison with conventional exhaustive pairing, it reduces the algorithm complexity from O(n²) to O(n). It also consistently yields better prediction performance on the three physicochemical datasets, using a multilayer perceptron with the circular fingerprint as a proof of concept. We further incorporate the transformer-based Chemformer into a Siamese neural network; Chemformer extracts task-specific features from the simplified molecular-input line-entry system representation of compounds. Additionally, we propose a means to measure the prediction uncertainty by utilizing the variance in predictions from a set of reference compounds. Our results demonstrate that high prediction accuracy correlates with high confidence. Finally, we investigate implications of the similarity property principle in machine learning.
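The difference between exhaustive and similarity-based pairing can be sketched as follows. This is a minimal illustration under stated assumptions rather than the paper's implementation: fingerprints are represented as plain Python sets of "on" bit indices, similarity is the Tanimoto coefficient, and each compound is paired with its `k` most similar neighbours (`k` and all function names are hypothetical). Note that this naive neighbour search still compares every compound with every other; what shrinks from quadratic to linear here is the number of training pairs.

```python
from typing import List, Set, Tuple


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto similarity between two fingerprints given as bit-index sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def exhaustive_pairs(n: int) -> List[Tuple[int, int]]:
    """All ordered pairs of distinct compounds: n * (n - 1) pairs."""
    return [(i, j) for i in range(n) for j in range(n) if i != j]


def similarity_pairs(fps: List[Set[int]], k: int = 1) -> List[Tuple[int, int]]:
    """Pair each compound with its k most similar compounds: about n * k pairs."""
    pairs = []
    for i, fp in enumerate(fps):
        others = [j for j in range(len(fps)) if j != i]
        ranked = sorted(others, key=lambda j: tanimoto(fp, fps[j]), reverse=True)
        pairs.extend((i, j) for j in ranked[:k])
    return pairs


# Toy fingerprints: compounds 0 and 1 share bits, as do compounds 2 and 3.
fingerprints = [{1, 2, 3}, {1, 2, 4}, {7, 8}, {7, 8, 9}]
all_pairs = exhaustive_pairs(len(fingerprints))
near_pairs = similarity_pairs(fingerprints, k=1)
```

With four compounds the exhaustive scheme yields 12 ordered pairs, while pairing each compound with its single nearest neighbour yields 4.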
Affiliation(s)
- Yumeng Zhang
- Medicinal Chemistry, Research and Early Development, Respiratory and Immunology (R&I), BioPharmaceuticals R&D, AstraZeneca, 43183, Gothenburg, Sweden
- Department of Pharmaceutical Biosciences, Uppsala University, Uppsala, Sweden
- Janosch Menke
- Medicinal Chemistry, Research and Early Development, Respiratory and Immunology (R&I), BioPharmaceuticals R&D, AstraZeneca, 43183, Gothenburg, Sweden
- Institute of Pharmaceutical and Medicinal Chemistry, Westfälische Wilhelms-Universität Münster, 48149, Münster, Germany
- Jiazhen He
- Molecular AI, Discovery Sciences, R&D, AstraZeneca, 43183, Gothenburg, Sweden
- Eva Nittinger
- Medicinal Chemistry, Research and Early Development, Respiratory and Immunology (R&I), BioPharmaceuticals R&D, AstraZeneca, 43183, Gothenburg, Sweden
- Christian Tyrchan
- Medicinal Chemistry, Research and Early Development, Respiratory and Immunology (R&I), BioPharmaceuticals R&D, AstraZeneca, 43183, Gothenburg, Sweden
- Oliver Koch
- Institute of Pharmaceutical and Medicinal Chemistry, Westfälische Wilhelms-Universität Münster, 48149, Münster, Germany
- Hongtao Zhao
- Medicinal Chemistry, Research and Early Development, Respiratory and Immunology (R&I), BioPharmaceuticals R&D, AstraZeneca, 43183, Gothenburg, Sweden
8
Visheratina A, Visheratin A, Kumar P, Veksler M, Kotov NA. Chirality Analysis of Complex Microparticles using Deep Learning on Realistic Sets of Microscopy Images. ACS Nano 2023; 17:7431-7442. [PMID: 37058327 DOI: 10.1021/acsnano.2c12056] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Indexed: 06/19/2023]
Abstract
Nanoscale chirality is an actively growing research field spurred by the giant chiroptical activity, enantioselective biological activity, and asymmetric catalytic activity of chiral nanostructures. Compared to chiral molecules, the handedness of chiral nano- and microstructures can be directly established via electron microscopy, which can be utilized for the automatic analysis of chiral nanostructures and prediction of their properties. However, chirality in complex materials may have multiple geometric forms and scales. Computational identification of chirality from electron microscopy images rather than optical measurements is convenient but is fundamentally challenging, too, because (1) image features differentiating left- and right-handed particles can be ambiguous and (2) three-dimensional structure essential for chirality is 'flattened' into two-dimensional projections. Here, we show that deep learning algorithms can identify twisted bowtie-shaped microparticles with nearly 100% accuracy and classify them as left- and right-handed with as high as 99% accuracy. Importantly, such accuracy was achieved with as few as 30 original electron microscopy images of bowties. Furthermore, after training on bowtie particles with complex nanostructured features, the model can recognize other chiral shapes with different geometries without retraining for their specific chiral geometry with 93% accuracy, indicating the true learning abilities of the employed neural networks. These findings indicate that our algorithm trained on a practically feasible set of experimental data enables automated analysis of microscopy data for the accelerated discovery of chiral particles and their complex systems for multiple applications.
Affiliation(s)
- Anastasia Visheratina
- Department of Chemical Engineering and Biointerfaces Institute, University of Michigan, Ann Arbor, Michigan 48109, United States
- Prashant Kumar
- Department of Chemical Engineering and Biointerfaces Institute, University of Michigan, Ann Arbor, Michigan 48109, United States
- Michael Veksler
- Department of Chemical Engineering and Biointerfaces Institute, University of Michigan, Ann Arbor, Michigan 48109, United States
- Nicholas A Kotov
- Department of Chemical Engineering and Biointerfaces Institute, University of Michigan, Ann Arbor, Michigan 48109, United States
- Department of Aeronautics, Faculty of Engineering, Imperial College London, South Kensington Campus London, SW7 2AZ, United Kingdom
9
Hajamohideen F, Shaffi N, Mahmud M, Subramanian K, Al Sariri A, Vimbi V, Abdesselam A. Four-way classification of Alzheimer's disease using deep Siamese convolutional neural network with triplet-loss function. Brain Inform 2023; 10:5. [PMID: 36806042 PMCID: PMC9937523 DOI: 10.1186/s40708-023-00184-w] [Citation(s) in RCA: 9] [Impact Index Per Article: 9.0] [Received: 11/29/2022] [Accepted: 01/03/2023] [Indexed: 02/19/2023]
Abstract
Alzheimer's disease (AD) is a neurodegenerative disease that causes irreversible damage to several brain regions, including the hippocampus, causing impairment in cognition, function, and behaviour. Early diagnosis of the disease will reduce the suffering of the patients and their family members. Towards this aim, in this paper, we propose a Siamese Convolutional Neural Network (SCNN) architecture that employs the triplet-loss function for the representation of input MRI images as k-dimensional embeddings. We used both pre-trained and non-pretrained CNNs to transform images into the embedding space. These embeddings are subsequently used for the 4-way classification of Alzheimer's disease. The model efficacy was tested using the ADNI and OASIS datasets, which produced accuracies of 91.83% and 93.85%, respectively. Furthermore, the obtained results are compared with similar methods proposed in the literature.
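For readers unfamiliar with the triplet-loss function mentioned in this abstract, a common formulation is max(0, d(anchor, positive) - d(anchor, negative) + margin), which pushes embeddings of the same class closer together than embeddings of different classes. The sketch below uses that common form with Euclidean distance on plain Python vectors; the margin value, the distance choice, and the toy embeddings are assumptions, and the paper's exact variant may differ.

```python
import math
from typing import Sequence


def euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean distance between two embedding vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def triplet_loss(
    anchor: Sequence[float],
    positive: Sequence[float],
    negative: Sequence[float],
    margin: float = 1.0,
) -> float:
    """Zero once the negative is at least `margin` farther from the anchor
    than the positive is; positive otherwise."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


anchor = (0.0, 0.0)
same_class = (0.0, 1.0)          # distance 1.0 from the anchor
far_other_class = (3.0, 4.0)     # distance 5.0 from the anchor
near_other_class = (0.0, 1.5)    # distance 1.5 from the anchor

easy = triplet_loss(anchor, same_class, far_other_class)   # margin satisfied
hard = triplet_loss(anchor, same_class, near_other_class)  # margin violated
```

Only the second triplet contributes to the loss, so training concentrates on cases where a different-class embedding sits too close to the anchor.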
Affiliation(s)
- Faizal Hajamohideen
- College of Computing and Information Sciences, University of Technology and Applied Sciences, Jamia Street, 311 Sohar, Sultanate of Oman
- Noushath Shaffi
- College of Computing and Information Sciences, University of Technology and Applied Sciences, Jamia Street, 311 Sohar, Sultanate of Oman
- Mufti Mahmud
- Department of Computer Science, Nottingham Trent University, Clifton Lane, NG11 8NS Nottingham, UK
- Medical Technologies Innovation Facility, Nottingham Trent University, Clifton Lane, NG11 8NS Nottingham, UK
- Computing and Informatics Research Centre, Nottingham Trent University, Clifton Lane, NG11 8NS Nottingham, UK
- Karthikeyan Subramanian
- College of Computing and Information Sciences, University of Technology and Applied Sciences, Jamia Street, 311 Sohar, Sultanate of Oman
- Arwa Al Sariri
- College of Computing and Information Sciences, University of Technology and Applied Sciences, Jamia Street, 311 Sohar, Sultanate of Oman
- Viswan Vimbi
- College of Computing and Information Sciences, University of Technology and Applied Sciences, Jamia Street, 311 Sohar, Sultanate of Oman
- Abdelhamid Abdesselam
- Department of Computer Science, Sultan Qaboos University, 123 Muscat, Sultanate of Oman
- for the Alzheimer’s Disease Neuroimaging Initiative
- College of Computing and Information Sciences, University of Technology and Applied Sciences, Jamia Street, 311 Sohar, Sultanate of Oman
- Department of Computer Science, Nottingham Trent University, Clifton Lane, NG11 8NS Nottingham, UK
- Medical Technologies Innovation Facility, Nottingham Trent University, Clifton Lane, NG11 8NS Nottingham, UK
- Computing and Informatics Research Centre, Nottingham Trent University, Clifton Lane, NG11 8NS Nottingham, UK
- Department of Computer Science, Sultan Qaboos University, 123 Muscat, Sultanate of Oman
10
Li TH, Wang CC, Zhang L, Chen X. SNRMPACDC: computational model focused on Siamese network and random matrix projection for anticancer synergistic drug combination prediction. Brief Bioinform 2023; 24:6843566. [PMID: 36418927 DOI: 10.1093/bib/bbac503] [Citation(s) in RCA: 16] [Impact Index Per Article: 16.0] [Received: 07/19/2022] [Revised: 09/22/2022] [Accepted: 10/24/2022] [Indexed: 11/25/2022]
Abstract
Synergistic drug combinations can improve the therapeutic effect and reduce the drug dosage to avoid toxicity. In previous years, an in vitro approach was utilized to screen synergistic drug combinations. However, the in vitro method is time-consuming and expensive. With the rapid growth of high-throughput data, computational methods are becoming efficient tools to predict potential synergistic drug combinations. Considering the limitations of the previous computational methods, we developed a new model named Siamese Network and Random Matrix Projection for AntiCancer Drug Combination prediction (SNRMPACDC). Firstly, the Siamese convolutional network and random matrix projection were used to process the features of the two drugs into drug combination features. Then, the features of the cancer cell line were processed through the convolutional network. Finally, the processed features were integrated and input into the multi-layer perceptron network to get the predicted score. Compared with the traditional method of splicing drug features into drug combination features, SNRMPACDC improved the interpretability of drug combination features to a certain extent. In addition, the introduction of convolutional networks can better extract the potential information in the features. SNRMPACDC achieved a root mean-squared error of 15.01 and a Pearson correlation coefficient of 0.75 in 5-fold cross-validation of regression prediction for response data. In addition, SNRMPACDC achieved an AUC of 0.91 ± 0.03 and an AUPR of 0.62 ± 0.05 in 5-fold cross-validation of classification prediction of whether a combination is synergistic. These results are better than those of almost all previous models. SNRMPACDC would be an effective approach to infer potential anticancer synergistic drug combinations.
Affiliation(s)
- Tian-Hao Li
- School of Information and Control Engineering, China University of Mining and Technology, Xuzhou, 221116, China
- Chun-Chun Wang
- School of Information and Control Engineering, China University of Mining and Technology, Xuzhou, 221116, China
- Li Zhang
- School of Information and Control Engineering, China University of Mining and Technology, Xuzhou, 221116, China
- Xing Chen
- Artificial Intelligence Research Institute, China University of Mining and Technology, Xuzhou, 221116, China
11
Li Y, Hu XG, Wang L, Li PP, You ZH. MNMDCDA: prediction of circRNA-disease associations by learning mixed neighborhood information from multiple distances. Brief Bioinform 2022; 23:6831006. [PMID: 36384071 DOI: 10.1093/bib/bbac479] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/11/2022] [Revised: 09/25/2022] [Accepted: 10/10/2022] [Indexed: 11/18/2022]
Abstract
Emerging evidence suggests that circular RNA (circRNA) is an important regulator of a variety of pathological processes and serves as a promising biomarker for many complex human diseases. Nevertheless, there are relatively few known circRNA-disease associations, and uncovering new circRNA-disease associations by wet-lab methods is time consuming and costly. Considering the limitations of existing computational methods, we propose a novel approach named MNMDCDA, which combines high-order graph convolutional networks (high-order GCNs) and deep neural networks to infer associations between circRNAs and diseases. Firstly, we computed different biological attribute information of circRNA and disease separately and used them to construct multiple multi-source similarity networks. Then, we used the high-order GCN algorithm to learn feature embedding representations with high-order mixed neighborhood information of circRNA and disease from the constructed multi-source similarity networks, respectively. Finally, the deep neural network classifier was implemented to predict associations of circRNAs with diseases. The MNMDCDA model obtained AUC scores of 95.16%, 94.53%, 89.80% and 91.83% on four benchmark datasets, i.e., CircR2Disease, CircAtlas v2.0, Circ2Disease and CircRNADisease, respectively, using the 5-fold cross-validation approach. Furthermore, 25 of the top 30 circRNA-disease pairs with the best scores of MNMDCDA in the case study were validated by recent literature. Numerous experimental results indicate that MNMDCDA can be used as an effective computational tool to predict circRNA-disease associations and can provide the most promising candidates for biological experiments.
Affiliation(s)
- Yang Li
- School of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230601, China
- Xue-Gang Hu
- School of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230601, China
- Lei Wang
- Big Data and Intelligent Computing Research Center, Guangxi Academy of Sciences, Nanning 530007, China; College of Information Science and Engineering, Zaozhuang University, Shandong 277100, China
- Pei-Pei Li
- School of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230601, China
- Zhu-Hong You
- Big Data and Intelligent Computing Research Center, Guangxi Academy of Sciences, Nanning 530007, China; School of Computer Science, Northwestern Polytechnical University, Xi'an Shaanxi 710129, China
12
Li Y, Sun C, Wei JM, Liu J. Drug-Protein interaction prediction by correcting the effect of incomplete information in heterogeneous information. Bioinformatics 2022; 38:5073-5080. [PMID: 36111859 DOI: 10.1093/bioinformatics/btac629] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/12/2022] [Revised: 08/30/2022] [Accepted: 09/15/2022] [Indexed: 12/24/2022]
Abstract
MOTIVATION: Large-scale heterogeneous data provide diverse perspectives for predicting drug-protein interactions (DPIs). However, the available information on molecular interactions and clinical associations related to drugs or proteins is incomplete because there may be unproven interactions and associations. This incomplete information in the available data is presented in the form of non-interaction and non-correlation, which may mislead the prediction model. Existing methods fuse incomplete and complete information without considering their integrity, so the negative effects of incomplete information still exist.
RESULTS: We develop a network-based DPI prediction method named BRWCP, which uses the complete information network to correct the prediction results acquired by the incomplete information network. By integrating relevant heterogeneous information that may be incomplete, the feature similarities of drugs and proteins are obtained. Combining the feature similarities and known DPIs, an incomplete information-based drug-protein heterogeneous network is constructed. Then, a bidirectional random walk with pruning algorithm is adopted in this heterogeneous network to predict potential DPIs. Next, the predicted DPIs are combined with the chemical fingerprint similarity of drugs and amino acid sequence similarity of proteins to construct the complete information network. The bidirectional random walk with pruning algorithm is applied in the new network to obtain the final prediction results until it converges. Experimental results show that BRWCP is superior to several state-of-the-art DPI prediction methods, and case studies further confirm its ability to tap potential DPIs.
AVAILABILITY AND IMPLEMENTATION: The code and data used in BRWCP are available at https://github.com/lyfdomain/BRWCP.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Yanfei Li
- College of Computer Science, Nankai University, Tianjin 300071, China; Institute of Big Data, Nankai University, Tianjin 300071, China
- Chang Sun
- College of Computer Science, Nankai University, Tianjin 300071, China; Institute of Big Data, Nankai University, Tianjin 300071, China
- Jin-Mao Wei
- College of Computer Science, Nankai University, Tianjin 300071, China; Institute of Big Data, Nankai University, Tianjin 300071, China
- Jian Liu
- College of Computer Science, Nankai University, Tianjin 300071, China; Institute of Big Data, Nankai University, Tianjin 300071, China
| |
|
13
|
Shin SH, Oh SM, Yoon Park JH, Lee KW, Yang H. OptNCMiner: a deep learning approach for the discovery of natural compounds modulating disease-specific multi-targets. BMC Bioinformatics 2022; 23:218. [PMID: 35672685 PMCID: PMC9175487 DOI: 10.1186/s12859-022-04752-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/19/2022] [Accepted: 05/25/2022] [Indexed: 11/22/2022] Open
Abstract
Background Due to their diverse bioactivity, natural products (NPs) have been developed as commercial products in the pharmaceutical, food and cosmetic sectors as natural compounds (NCs) and in the form of extracts. Following administration, NCs typically interact with multiple target proteins to elicit their effects. Various machine learning models have been developed to predict multi-target modulating NCs with desired physiological effects. However, due to deficiencies with existing chemical-protein interaction datasets, which are mostly single-labeled and limited, the existing models struggle to predict new chemical-protein interactions. New techniques are needed to overcome these limitations. Results We propose a novel NC discovery model called OptNCMiner that offers various advantages. The model is trained via end-to-end learning with a feature extraction step implemented, and it predicts multi-target modulating NCs through multi-label learning. In addition, it offers a few-shot learning approach to predict NC-protein interactions using a small training dataset. OptNCMiner achieved better prediction performance in terms of recall than conventional classification models. It was tested for the prediction of NC-protein interactions using small datasets and for a use case scenario to identify multi-target modulating NCs for type 2 diabetes mellitus complications. Conclusions OptNCMiner identifies NCs that modulate multiple target proteins, which facilitates the discovery and the understanding of biological activity of novel NCs with desirable health benefits.
Supplementary Information The online version contains supplementary material available at 10.1186/s12859-022-04752-5.
Affiliation(s)
- Seo Hyun Shin
- Department of Agricultural Biotechnology, Seoul National University, Seoul, 08826, Republic of Korea
| | - Seung Man Oh
- Department of Agricultural Biotechnology, Seoul National University, Seoul, 08826, Republic of Korea
| | - Jung Han Yoon Park
- Bio-MAX Institute, Seoul National University, Seoul, 08826, Republic of Korea
| | - Ki Won Lee
- Department of Agricultural Biotechnology, Seoul National University, Seoul, 08826, Republic of Korea.,Bio-MAX Institute, Seoul National University, Seoul, 08826, Republic of Korea.,Research Institute of Agriculture and Life Sciences, Seoul National University, Seoul, 08826, Republic of Korea.
| | - Hee Yang
- Bio-MAX Institute, Seoul National University, Seoul, 08826, Republic of Korea.
| |
|
14
|
Altalib MK, Salim N. Similarity-Based Virtual Screen Using Enhanced Siamese Deep Learning Methods. ACS OMEGA 2022; 7:4769-4786. [PMID: 35187297 PMCID: PMC8851658 DOI: 10.1021/acsomega.1c04587] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/23/2021] [Accepted: 01/17/2022] [Indexed: 06/14/2023]
Abstract
Traditional drug production is a long and complex process that leads to new drug production. The virtual screening technique is a computational method that allows chemical compounds to be screened at an acceptable time and cost. Several databases contain information on various aspects of biologically active substances. Simple statistical tools are difficult to use because of the enormous amount of information and the complex, structurally heterogeneous data samples of molecules recorded in these databases. Many techniques for capturing the biological similarity between a test compound and a known target ligand in ligand-based virtual screening (LBVS) have been established. However, despite the good performance of these methods compared to their predecessors, especially when dealing with molecules that have homogeneous active structural elements, they are not satisfactory when dealing with molecules that are structurally heterogeneous. Deep learning models have recently achieved considerable success in a variety of disciplines due to their powerful generalization and feature extraction capabilities. Also, the Siamese network has been used in similarity models for more complicated data samples, especially with heterogeneous data samples. The main aim of this study is to enhance the performance of similarity searching, especially with molecules that are structurally heterogeneous. The Siamese architecture will be enhanced using two similarity distance layers with one fusion layer to further improve the similarity measurements between molecules and then adding many layers after the fusion layer for some models to improve the retrieval recall. In this architecture, several deep learning methods have been used: long short-term memory (LSTM), gated recurrent unit (GRU), one-dimensional convolutional neural network (CNN1D), and two-dimensional convolutional neural network (CNN2D). A series of experiments have been carried out on real-world data sets, and the results have shown that the proposed methods outperformed the existing methods.
Affiliation(s)
- Mohammed Khaldoon Altalib
- School of Computing, Universiti Teknologi Malaysia, Johor Bahru 81310, Malaysia
- Computer Science Department, College of Education for Pure Sciences, University of Mosul, 41002 Mosul, Iraq
| | - Naomie Salim
- School of Computing, Universiti Teknologi Malaysia, Johor Bahru 81310, Malaysia
| |
|
15
|
Altalib MK, Salim N. Similarity-Based Virtual Screen Using Enhanced Siamese Multi-Layer Perceptron. Molecules 2021; 26:6669. [PMID: 34771076 PMCID: PMC8588560 DOI: 10.3390/molecules26216669] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 09/17/2021] [Revised: 10/24/2021] [Accepted: 11/01/2021] [Indexed: 11/30/2022] Open
Abstract
Traditional drug development is a slow and costly process that leads to the production of new drugs. Virtual screening (VS) is a computational procedure that measures the similarity of molecules as one of its primary tasks. Many techniques for capturing the biological similarity between a test compound and a known target ligand have been established in ligand-based virtual screens (LBVSs). However, despite the good performance of these methods compared to their predecessors, especially when dealing with molecules that have structurally homogeneous active elements, they are not satisfactory when dealing with molecules that are structurally heterogeneous. The main aim of this study is to improve the performance of similarity searching, especially with molecules that are structurally heterogeneous. The Siamese network will be used due to its capability to deal with complicated data samples in many fields. The Siamese multi-layer perceptron architecture will be enhanced by using two similarity distance layers with one fused layer, then multiple layers will be added after the fusion layer, and then the nodes of the model that contribute less or nothing during inference according to their signal-to-noise ratio values will be pruned. Several benchmark datasets will be used: the MDL Drug Data Report (MDDR-DS1, MDDR-DS2, and MDDR-DS3), the Maximum Unbiased Validation (MUV), and the Directory of Useful Decoys (DUD). The results show that the proposed method outperforms the standard Tanimoto coefficient (TAN) and other methods. Additionally, it is possible to reduce the number of nodes in the Siamese multilayer perceptron model while still keeping the effectiveness of recall on the same level.
Affiliation(s)
- Mohammed Khaldoon Altalib
- School of Computing, Universiti Teknologi Malaysia, Johor Bahru 81310, Malaysia
- Computer Science Department, Education for Pure Science College, University of Mosul, Mosul 41002, Iraq
| | - Naomie Salim
- School of Computing, Universiti Teknologi Malaysia, Johor Bahru 81310, Malaysia
| |
|
16
|
Kpanou R, Osseni MA, Tossou P, Laviolette F, Corbeil J. On the robustness of generalization of drug-drug interaction models. BMC Bioinformatics 2021; 22:477. [PMID: 34607569 PMCID: PMC8489092 DOI: 10.1186/s12859-021-04398-9] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 04/16/2021] [Accepted: 09/10/2021] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Deep learning methods are a proven commodity in many fields and endeavors. One of these endeavors is predicting the presence of adverse drug-drug interactions (DDIs). The models generated can predict, with reasonable accuracy, the phenotypes arising from the drug interactions using their molecular structures. Nevertheless, this task requires improvement to be truly useful. Given the complexity of the predictive task, an extensive benchmarking on structure-based models for DDIs prediction was performed to evaluate their drawbacks and advantages. RESULTS We rigorously tested various structure-based models that predict drug interactions using different splitting strategies to simulate different real-world scenarios. In addition to the effects of different training and testing setups on the robustness and generalizability of the models, we then explore the contribution of traditional approaches such as multitask learning and data augmentation. CONCLUSION Structure-based models tend to generalize poorly to unseen drugs despite their ability to identify new DDIs among drugs seen during training accurately. Indeed, they efficiently propagate information between known drugs and could be valuable for discovering new DDIs in a database. However, these models will most probably fail when exposed to unknown drugs. While multitask learning does not help in our case to solve the problem, the use of data augmentation does at least mitigate it. Therefore, researchers must be cautious of the bias of the random evaluation scheme, especially if their goal is to discover new DDIs.
Affiliation(s)
- Rogia Kpanou
- Computer Science and Software Engineering, Université Laval, 1065, av. de la Médecine, Quebec, CA Canada
- InVivo AI, Mila - 180 Corporate Lab L, 6650, 01 Rue Saint-Urbain, Montreal, CA H2S 3G9 Canada
| | - Mazid Abiodoun Osseni
- Computer Science and Software Engineering, Université Laval, 1065, av. de la Médecine, Quebec, CA Canada
| | - Prudencio Tossou
- Computer Science and Software Engineering, Université Laval, 1065, av. de la Médecine, Quebec, CA Canada
- InVivo AI, Mila - 180 Corporate Lab L, 6650, 01 Rue Saint-Urbain, Montreal, CA H2S 3G9 Canada
| | - Francois Laviolette
- Computer Science and Software Engineering, Université Laval, 1065, av. de la Médecine, Quebec, CA Canada
| | - Jacques Corbeil
- Department of Molecular Medicine, Université Laval, 1065, av. de la Médecine, Quebec, CA Canada
| |
|
17
|
An X, Chen X, Yi D, Li H, Guan Y. Representation of molecules for drug response prediction. Brief Bioinform 2021; 23:6375515. [PMID: 34571534 DOI: 10.1093/bib/bbab393] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Received: 04/08/2021] [Revised: 08/28/2021] [Accepted: 08/30/2021] [Indexed: 12/18/2022] Open
Abstract
The rapid development of machine learning and deep learning algorithms in the recent decade has spurred an outburst of their applications in many research fields. In the chemistry domain, machine learning has been widely used to aid in drug screening, drug toxicity prediction, quantitative structure-activity relationship prediction, anti-cancer synergy score prediction, etc. This review is dedicated to the application of machine learning in drug response prediction. Specifically, we focus on molecular representations, which is a crucial element to the success of drug response prediction and other chemistry-related prediction tasks. We introduce three types of commonly used molecular representation methods, together with their implementation and application examples. This review will serve as a brief introduction of the broad field of molecular representations.
Affiliation(s)
- Xin An
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA
| | - Xi Chen
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA
| | - Daiyao Yi
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA
| | - Hongyang Li
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA
| | - Yuanfang Guan
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA
| |
|
18
|
Chen Y, Zhang L. How much can deep learning improve prediction of the responses to drugs in cancer cell lines? Brief Bioinform 2021; 23:6370847. [PMID: 34529029 DOI: 10.1093/bib/bbab378] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 05/11/2021] [Revised: 08/21/2021] [Accepted: 08/24/2021] [Indexed: 12/24/2022] Open
Abstract
The drug response prediction problem arises from personalized medicine and drug discovery. Deep neural networks have been applied to the multi-omics data available for over 1000 cancer cell lines and tissues for better drug response prediction. We summarize and examine state-of-the-art deep learning methods that have been published recently. Although significant progress has been made in deep learning approaches to drug response prediction, deep learning methods show their weakness in predicting the response of a drug that does not appear in the training dataset. In particular, all five evaluated deep learning methods performed worse than the similarity-regularized matrix factorization (SRMF) method in our drug blind test. We outline the challenges in applying deep learning approaches to drug response prediction and suggest unique opportunities for deep learning integrated with established bioinformatics analyses to overcome some of these challenges.
Affiliation(s)
- Yurui Chen
- Department of Mathematics and Computational Biology Programme, National University of Singapore, 119076, Singapore
| | - Louxin Zhang
- Department of Mathematics and Computational Biology Programme, National University of Singapore, 119076, Singapore
| |
|
19
|
Xiong Z, Jeon M, Allaway RJ, Kang J, Park D, Lee J, Jeon H, Ko M, Jiang H, Zheng M, Tan AC, Guo X, Dang KK, Tropsha A, Hecht C, Das TK, Carlson HA, Abagyan R, Guinney J, Schlessinger A, Cagan R. Crowdsourced identification of multi-target kinase inhibitors for RET- and TAU- based disease: The Multi-Targeting Drug DREAM Challenge. PLoS Comput Biol 2021; 17:e1009302. [PMID: 34520464 PMCID: PMC8483411 DOI: 10.1371/journal.pcbi.1009302] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 02/26/2021] [Revised: 09/30/2021] [Accepted: 07/23/2021] [Indexed: 01/22/2023] Open
Abstract
A continuing challenge in modern medicine is the identification of safer and more efficacious drugs. Precision therapeutics, which have one molecular target, have been long promised to be safer and more effective than traditional therapies. This approach has proven to be challenging for multiple reasons including lack of efficacy, rapidly acquired drug resistance, and narrow patient eligibility criteria. An alternative approach is the development of drugs that address the overall disease network by targeting multiple biological targets ('polypharmacology'). Rational development of these molecules will require improved methods for predicting single chemical structures that target multiple drug targets. To address this need, we developed the Multi-Targeting Drug DREAM Challenge, in which we challenged participants to predict single chemical entities that target pro-targets but avoid anti-targets for two unrelated diseases: RET-based tumors and a common form of inherited Tauopathy. Here, we report the results of this DREAM Challenge and the development of two neural network-based machine learning approaches that were applied to the challenge of rational polypharmacology. Together, these platforms provide a potentially useful first step towards developing lead therapeutic compounds that address disease complexity through rational polypharmacology.
Affiliation(s)
- Zhaoping Xiong
- Shanghai Institute for Advanced Immunochemical Studies, ShanghaiTech University, Shanghai, China
| | - Minji Jeon
- Department of Computer Science and Engineering, Korea University, Seoul, Republic of Korea
| | | | - Jaewoo Kang
- Department of Computer Science and Engineering, Korea University, Seoul, Republic of Korea
- Interdisciplinary Graduate Program in Bioinformatics, Korea University, Seoul, Republic of Korea
| | - Donghyeon Park
- Department of Computer Science and Engineering, Korea University, Seoul, Republic of Korea
| | - Jinhyuk Lee
- Department of Computer Science and Engineering, Korea University, Seoul, Republic of Korea
| | - Hwisang Jeon
- Interdisciplinary Graduate Program in Bioinformatics, Korea University, Seoul, Republic of Korea
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai, China
| | - Miyoung Ko
- Department of Computer Science and Engineering, Korea University, Seoul, Republic of Korea
| | - Hualiang Jiang
- Shanghai Institute for Advanced Immunochemical Studies, ShanghaiTech University, Shanghai, China
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai, China
| | - Mingyue Zheng
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai, China
| | - Aik Choon Tan
- Department of Biostatistics and Bioinformatics, Moffitt Cancer Center, Tampa, Florida, United States of America
| | - Xindi Guo
- Sage Bionetworks, Seattle, Washington, United States of America
| | | | - Kristen K. Dang
- Sage Bionetworks, Seattle, Washington, United States of America
| | - Alex Tropsha
- Laboratory for Molecular Modeling, Division of Chemical Biology and Medicinal Chemistry, UNC Eshelman School of Pharmacy, University of North Carolina, Chapel Hill, North Carolina, United States of America
| | - Chana Hecht
- Department of Cell, Developmental, and Regenerative Biology, Icahn School of Medicine at Mount Sinai, New York City, New York, United States of America
| | - Tirtha K. Das
- Department of Cell, Developmental, and Regenerative Biology, Icahn School of Medicine at Mount Sinai, New York City, New York, United States of America
| | - Heather A. Carlson
- Department of Medicinal Chemistry, University of Michigan, Ann Arbor, Michigan, United States of America
| | - Ruben Abagyan
- Skaggs School of Pharmacy and Pharmaceutical Sciences, University of California, San Diego, California, United States of America
| | - Justin Guinney
- Sage Bionetworks, Seattle, Washington, United States of America
| | - Avner Schlessinger
- Department of Pharmacological Sciences, Icahn School of Medicine at Mount Sinai, New York City, New York, United States of America
| | - Ross Cagan
- Department of Cell, Developmental, and Regenerative Biology, Icahn School of Medicine at Mount Sinai, New York City, New York, United States of America
- Institute of Cancer Sciences, University of Glasgow; Glasgow, Scotland, United Kingdom
| |
|
20
|
Tynes M, Gao W, Burrill DJ, Batista ER, Perez D, Yang P, Lubbers N. Pairwise Difference Regression: A Machine Learning Meta-algorithm for Improved Prediction and Uncertainty Quantification in Chemical Search. J Chem Inf Model 2021; 61:3846-3857. [PMID: 34347460 DOI: 10.1021/acs.jcim.1c00670] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Indexed: 11/28/2022]
Abstract
Machine learning (ML) plays a growing role in the design and discovery of chemicals, aiming to reduce the need to perform expensive experiments and simulations. ML for such applications is promising but difficult, as models must generalize to vast chemical spaces from small training sets and must have reliable uncertainty quantification metrics to identify and prioritize unexplored regions. Ab initio computational chemistry and chemical intuition alike often take advantage of differences between chemical conditions, rather than their absolute structure or state, to generate more reliable results. We have developed an analogous comparison-based approach for ML regression, called pairwise difference regression (PADRE), which is applicable to arbitrary underlying learning models and operates on pairs of input data points. During training, the model learns to predict differences between all possible pairs of input points. During prediction, the test points are paired with all training set points, giving rise to a set of predictions that can be treated as a distribution of which the mean is treated as a final prediction and the dispersion is treated as an uncertainty measure. Pairwise difference regression was shown to reliably improve the performance of the random forest algorithm across five chemical ML tasks. Additionally, the pair-derived dispersion is both well correlated with model error and performs well in active learning. We also show that this method is competitive with state-of-the-art neural network techniques. Thus, pairwise difference regression is a promising tool for candidate selection algorithms used in chemical discovery.
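Illustrative sketch (not from the cited paper): the pairwise scheme described in the abstract above can be outlined in a few lines of Python. The pair featurization (both points plus their element-wise difference), the 1-nearest-neighbour stand-in for the base learner, and every function and class name here are assumptions made for illustration; the authors report results with random forests, and their implementation may differ.

```python
from itertools import product
from statistics import mean, pstdev


def pair_features(x1, x2):
    # Assumed featurization: both points plus their element-wise difference.
    return list(x1) + list(x2) + [a - b for a, b in zip(x1, x2)]


class NearestNeighborRegressor:
    """Tiny stand-in base learner (1-NN); any regressor could be used instead."""

    def fit(self, X, y):
        self.X, self.y = list(X), list(y)
        return self

    def predict_one(self, x):
        # Squared Euclidean distance to every stored training row.
        dists = [sum((a - b) ** 2 for a, b in zip(x, xi)) for xi in self.X]
        return self.y[dists.index(min(dists))]


def padre_fit(X_train, y_train, base):
    # Train on all ordered pairs (i, j); the target is the difference y_i - y_j.
    idx = list(product(range(len(X_train)), repeat=2))
    P = [pair_features(X_train[i], X_train[j]) for i, j in idx]
    d = [y_train[i] - y_train[j] for i, j in idx]
    return base.fit(P, d)


def padre_predict(model, X_train, y_train, x_new):
    # Pair the query with every training point; each pair yields one estimate
    # y_j + predicted(y_new - y_j). The mean is the prediction and the
    # dispersion serves as an uncertainty measure.
    estimates = [
        y_j + model.predict_one(pair_features(x_new, x_j))
        for x_j, y_j in zip(X_train, y_train)
    ]
    return mean(estimates), pstdev(estimates)


X = [[0.0], [1.0], [2.0], [3.0]]
y = [0.0, 2.0, 4.0, 6.0]
model = padre_fit(X, y, NearestNeighborRegressor())
prediction, spread = padre_predict(model, X, y, [2.0])
```

Because the number of training pairs grows quadratically with the training-set size, a sketch like this is only practical for the small datasets the abstract targets.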
Affiliation(s)
- Michael Tynes
- Theoretical Division, Los Alamos National Laboratory, Los Alamos, New Mexico 87545, United States.,Center for Nonlinear Studies, Los Alamos National Laboratory, Los Alamos, New Mexico 87545, United States
| | - Wenhao Gao
- Computer, Computational, and Statistical Sciences Division, Los Alamos National Laboratory, Los Alamos, New Mexico 87545, United States.,Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
| | - Daniel J Burrill
- Theoretical Division, Los Alamos National Laboratory, Los Alamos, New Mexico 87545, United States.,Center for Nonlinear Studies, Los Alamos National Laboratory, Los Alamos, New Mexico 87545, United States
| | - Enrique R Batista
- Theoretical Division, Los Alamos National Laboratory, Los Alamos, New Mexico 87545, United States.,Center for Nonlinear Studies, Los Alamos National Laboratory, Los Alamos, New Mexico 87545, United States
| | - Danny Perez
- Theoretical Division, Los Alamos National Laboratory, Los Alamos, New Mexico 87545, United States
| | - Ping Yang
- Theoretical Division, Los Alamos National Laboratory, Los Alamos, New Mexico 87545, United States
| | - Nicholas Lubbers
- Computer, Computational, and Statistical Sciences Division, Los Alamos National Laboratory, Los Alamos, New Mexico 87545, United States
| |
|
21
|
Jang G, Park S, Lee S, Kim S, Park S, Kang J. Predicting mechanism of action of novel compounds using compound structure and transcriptomic signature coembedding. Bioinformatics 2021; 37:i376-i382. [PMID: 34252937 PMCID: PMC8275331 DOI: 10.1093/bioinformatics/btab275] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Accepted: 04/23/2021] [Indexed: 12/27/2022] Open
Abstract
MOTIVATION Identifying the mechanisms of action (MoAs) of novel compounds is crucial in drug discovery. Careful understanding of MoAs can avoid potential side effects of drug candidates. Efforts have been made to identify MoAs using the transcriptomic signatures induced by compounds. However, these approaches fail to reveal MoAs in the absence of actual compound signatures. RESULTS We present MoAble, which predicts MoAs without requiring compound signatures. We train a deep learning-based coembedding model to map compound signatures and compound structures into the same embedding space. The model generates low-dimensional compound signature representations from the compound structures. To predict MoAs, pathway enrichment analysis is performed based on the connectivity between embedding vectors of compounds and those of genetic perturbations. Results show that MoAble is comparable to the methods that use actual compound signatures. We demonstrate that MoAble can be used to reveal MoAs of novel compounds without measuring compound signatures, with the same prediction accuracy as when measuring them. AVAILABILITY AND IMPLEMENTATION MoAble is available at https://github.com/dmis-lab/moable. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Gwanghoon Jang
- Department of Computer Science and Engineering, Korea University, Seoul, Republic of Korea
| | - Sungjoon Park
- Department of Computer Science and Engineering, Korea University, Seoul, Republic of Korea
| | - Sanghoon Lee
- Department of Computer Science and Engineering, Korea University, Seoul, Republic of Korea
| | - Sunkyu Kim
- Department of Computer Science and Engineering, Korea University, Seoul, Republic of Korea
| | - Sejeong Park
- Department of Computer Science and Engineering, Korea University, Seoul, Republic of Korea
| | - Jaewoo Kang
- Department of Computer Science and Engineering, Korea University, Seoul, Republic of Korea.,Interdisciplinary Graduate Program in Bioinformatics, Korea University, Seoul, Republic of Korea
| |
|
22
|
Sun C, Cao Y, Wei JM, Liu J. Autoencoder-based Drug-Target Interaction Prediction by Preserving the Consistency of Chemical Properties and Functions of Drugs. Bioinformatics 2021; 37:3618-3625. [PMID: 34019069 DOI: 10.1093/bioinformatics/btab384] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Received: 12/26/2020] [Revised: 05/06/2021] [Accepted: 05/18/2021] [Indexed: 11/12/2022] Open
Abstract
MOTIVATION Exploring the potential drug-target interactions (DTIs) is a key step in drug discovery and repurposing. In recent years, predicting the probable DTIs through computational methods has gradually become a research hot spot. However, most of the previous studies failed to judiciously take into account the consistency between the chemical properties of drug and its functions. The changes of these relationships may lead to a severely negative effect on the prediction of DTIs. RESULTS We propose an autoencoder-based method, AEFS, under spatial consistency constraints to predict DTIs. A heterogeneous network is established to integrate the information of drugs, proteins and diseases. The original drug features are projected to an embedding (protein) space by a multi-layer encoder, and further projected into label (disease) space by a decoder. In this process, the clinical information of drugs is introduced to assist the DTI prediction. By maintaining the distribution of drug correlation in the original feature, embedding and label space, AEFS keeps the consistency between chemical properties and functions of drugs. Experimental comparisons indicate that AEFS is more robust for imbalanced data and of significantly superior performance in DTI prediction. Case studies further confirm its ability to mine the latent drug-target interactions. AVAILABILITY The code of AEFS is available at https://github.com/JackieSun818/AEFS. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Chang Sun
- College of Computer Science, Nankai University, Tianjin, 300071, China.,Institute of Big Data, Nankai University, Tianjin, 300071, China
| | - Yangkun Cao
- School of Artificial Intelligence, Jilin University, Changchun, 130012, China
| | - Jin-Mao Wei
- College of Computer Science, Nankai University, Tianjin, 300071, China.,Institute of Big Data, Nankai University, Tianjin, 300071, China
| | - Jian Liu
- College of Computer Science, Nankai University, Tianjin, 300071, China.,Institute of Big Data, Nankai University, Tianjin, 300071, China
| |
|
23
|
Mostavi M, Chiu YC, Chen Y, Huang Y. CancerSiamese: one-shot learning for predicting primary and metastatic tumor types unseen during model training. BMC Bioinformatics 2021; 22:244. [PMID: 33980137 PMCID: PMC8117642 DOI: 10.1186/s12859-021-04157-w] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 11/09/2020] [Accepted: 04/27/2021] [Indexed: 02/06/2023] Open
Abstract
BACKGROUND The state-of-the-art deep learning based cancer type prediction can only predict cancer types whose samples are available during the training where the sample size is commonly large. In this paper, we consider how to utilize the existing training samples to predict cancer types unseen during the training. We hypothesize the existence of a set of type-agnostic expression representations that define the similarity/dissimilarity between samples of the same/different types and propose a novel one-shot learning model called CancerSiamese to learn this common representation. CancerSiamese accepts a pair of query and support samples (gene expression profiles) and learns the representation of similar or dissimilar cancer types through two parallel convolutional neural networks joined by a similarity function. RESULTS We trained CancerSiamese for cancer type prediction for primary and metastatic tumors using samples from the Cancer Genome Atlas (TCGA) and MET500. Network transfer learning was utilized to facilitate the training of the CancerSiamese models. CancerSiamese was tested for different N-way predictions and yielded an average accuracy improvement of 8% and 4% over the benchmark 1-Nearest Neighbor (1-NN) classifier for primary and metastatic tumors, respectively. Moreover, we applied the guided gradient saliency map and feature selection to CancerSiamese to examine 100 and 200 top marker-gene candidates for the prediction of primary and metastatic cancers, respectively. Functional analysis of these marker genes revealed several cancer related functions between primary and metastatic tumors. CONCLUSION This work demonstrated, for the first time, the feasibility of predicting unseen cancer types whose samples are limited. Thus, it could inspire new and ingenious applications of one-shot and few-shot learning solutions for improving cancer diagnosis, prognostic, and our understanding of cancer.
Affiliation(s)
- Milad Mostavi
- Greehey Children's Cancer Research Institute, University of Texas Health San Antonio, San Antonio, TX, 78229, USA
- Department of Electrical and Computer Engineering, University of Texas at San Antonio, San Antonio, TX, 78249, USA
- Yu-Chiao Chiu
- Greehey Children's Cancer Research Institute, University of Texas Health San Antonio, San Antonio, TX, 78229, USA
- Yidong Chen
- Greehey Children's Cancer Research Institute, University of Texas Health San Antonio, San Antonio, TX, 78229, USA
- Department of Population Health Sciences, University of Texas Health San Antonio, San Antonio, TX, 78229, USA
- Yufei Huang
- Department of Electrical and Computer Engineering, University of Texas at San Antonio, San Antonio, TX, 78249, USA
- Department of Population Health Sciences, University of Texas Health San Antonio, San Antonio, TX, 78229, USA
|
24
|
Fernández-Llaneza D, Ulander S, Gogishvili D, Nittinger E, Zhao H, Tyrchan C. Siamese Recurrent Neural Network with a Self-Attention Mechanism for Bioactivity Prediction. ACS Omega 2021; 6:11086-11094. [PMID: 34056263 PMCID: PMC8153912 DOI: 10.1021/acsomega.1c01266] [Received: 03/10/2021] [Accepted: 04/01/2021] [Indexed: 05/05/2023]
Abstract
Activity prediction plays an essential role in drug discovery by directing search of drug candidates in the relevant chemical space. Despite being applied successfully to image recognition and semantic similarity, the Siamese neural network has rarely been explored in drug discovery where modelling faces challenges such as insufficient data and class imbalance. Here, we present a Siamese recurrent neural network model (SiameseCHEM) based on bidirectional long short-term memory architecture with a self-attention mechanism, which can automatically learn discriminative features from the SMILES representations of small molecules. Subsequently, it is used to categorize bioactivity of small molecules via N-shot learning. Trained on random SMILES strings, it proves robust across five different datasets for the task of binary or categorical classification of bioactivity. Benchmarking against two baseline machine learning models which use the chemistry-rich ECFP fingerprints as the input, the deep learning model outperforms on three datasets and achieves comparable performance on the other two. The failure of both baseline methods on SMILES strings highlights that the deep learning model may learn task-specific chemistry features encoded in SMILES strings.
|
25
|
Lim S, Lu Y, Cho CY, Sung I, Kim J, Kim Y, Park S, Kim S. A review on compound-protein interaction prediction methods: Data, format, representation and model. Comput Struct Biotechnol J 2021; 19:1541-1556. [PMID: 33841755 PMCID: PMC8008185 DOI: 10.1016/j.csbj.2021.03.004] [Received: 08/31/2020] [Revised: 02/28/2021] [Accepted: 03/01/2021] [Indexed: 01/27/2023]
Abstract
There has recently been rapid progress in computational methods for determining the protein targets of small-molecule drugs, a task termed compound-protein interaction (CPI) prediction. In this review, we comprehensively review topics related to computational prediction of CPI. Data for CPI have been accumulated and curated significantly in both quantity and quality. Computational methods have become more powerful than ever for analyzing such complex data. Thus, recent successes in the improved quality of CPI prediction are due to the use of both sophisticated computational techniques and higher-quality information in the databases. The goal of this article is to provide reviews of topics related to CPI, from data, format, and representation to computational models, so that researchers can take full advantage of these resources to develop novel prediction methods. Chemical compound and protein data from various resources were discussed in terms of data formats and encoding schemes. For the CPI methods, we grouped prediction methods into five categories, from traditional machine learning techniques to state-of-the-art deep learning techniques. In closing, we discussed emerging machine learning topics to help both experimental and computational scientists leverage the current knowledge and strategies to develop more powerful and accurate CPI prediction methods.
Affiliation(s)
- Sangsoo Lim
- Bioinformatics Institute, Seoul National University, Seoul, Republic of Korea
- Yijingxiu Lu
- Department of Computer Science and Engineering, College of Engineering, Seoul National University, Seoul, Republic of Korea
- Chang Yun Cho
- Institute of Engineering Research, Seoul National University, Seoul, Republic of Korea
- Inyoung Sung
- Institute of Engineering Research, Seoul National University, Seoul, Republic of Korea
- Jungwoo Kim
- Department of Computer Science and Engineering, College of Engineering, Seoul National University, Seoul, Republic of Korea
- Youngkuk Kim
- Department of Computer Science and Engineering, College of Engineering, Seoul National University, Seoul, Republic of Korea
- Sungjoon Park
- Department of Computer Science and Engineering, College of Engineering, Seoul National University, Seoul, Republic of Korea
- Sun Kim
- Bioinformatics Institute, Seoul National University, Seoul, Republic of Korea
- Department of Computer Science and Engineering, College of Engineering, Seoul National University, Seoul, Republic of Korea
- Institute of Engineering Research, Seoul National University, Seoul, Republic of Korea
- Interdisciplinary Program in Bioinformatics, College of Natural Sciences, Seoul National University, Seoul, Republic of Korea
|
26
|
Kim H, Kim E, Lee I, Bae B, Park M, Nam H. Artificial Intelligence in Drug Discovery: A Comprehensive Review of Data-driven and Machine Learning Approaches. Biotechnol Bioproc E 2021; 25:895-930. [PMID: 33437151 PMCID: PMC7790479 DOI: 10.1007/s12257-020-0049-y] [Received: 02/13/2020] [Revised: 05/27/2020] [Accepted: 06/03/2020] [Indexed: 02/07/2023]
Abstract
As expenditure on drug development increases exponentially, the overall drug discovery process requires a sustainable revolution. Since artificial intelligence (AI) is leading the fourth industrial revolution, AI can be considered a viable solution for unstable drug research and development. Generally, AI is applied to fields with sufficient data such as computer vision and natural language processing, but there are many efforts to revolutionize the existing drug discovery process by applying AI. This review provides a comprehensive, organized summary of the recent research trends in the AI-guided drug discovery process, including target identification, hit identification, ADMET prediction, lead optimization, and drug repositioning. The main data sources in each field are also summarized in this review. In addition, an in-depth analysis of the remaining challenges and limitations is provided, along with proposals for promising future directions in each of the aforementioned areas.
Affiliation(s)
- Hyunho Kim
- School of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology (GIST), Gwangju, 61005 Korea
- Eunyoung Kim
- School of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology (GIST), Gwangju, 61005 Korea
- Ingoo Lee
- School of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology (GIST), Gwangju, 61005 Korea
- Bongsung Bae
- School of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology (GIST), Gwangju, 61005 Korea
- Minsu Park
- School of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology (GIST), Gwangju, 61005 Korea
- Hojung Nam
- School of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology (GIST), Gwangju, 61005 Korea
|
27
|
Abstract
Similarity has always been a key aspect in computer science and statistics. Any time two element vectors are compared, many different similarity approaches can be used, depending on the final goal of the comparison (Euclidean distance, Pearson correlation coefficient, Spearman's rank correlation coefficient, and others). But if the comparison has to be applied to more complex data samples, with features having different dimensionality and types which might need compression before processing, these measures would be unsuitable. In these cases, a siamese neural network may be the best choice: it consists of two identical artificial neural networks, each capable of learning the hidden representation of an input vector. The two neural networks are both feedforward perceptrons and employ error back-propagation during training; they work in parallel and compare their outputs at the end, usually through a cosine distance. The output generated by a siamese neural network execution can be considered the semantic similarity between the projected representations of the two input vectors. In this overview, we first describe the siamese neural network architecture, and then we outline its main applications in a number of computational fields since its appearance in 1994. Additionally, we list the programming languages, software packages, tutorials, and guides that can be practically used by readers to implement this powerful machine learning model.
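The architecture summarized in this abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not code from the overview: the helper names, the single tanh layer, and the random untrained weights are hypothetical, and training by back-propagation is omitted. The point it shows is that both inputs pass through the same weights and the two embeddings are compared with a cosine measure (cosine distance is 1 minus the similarity computed here).

```python
import math
import random


def make_weights(n_in, n_out, seed=0):
    """Random weight matrix for one dense layer (untrained, illustration only)."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]


def embed(weights, x):
    """One feedforward layer with tanh activation; used by BOTH branches."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in weights]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(ai * bi for ai, bi in zip(a, b))
    norm_a = math.sqrt(sum(ai * ai for ai in a))
    norm_b = math.sqrt(sum(bi * bi for bi in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def siamese_similarity(weights, x1, x2):
    """Project both inputs with the shared weights, then compare the outputs."""
    return cosine_similarity(embed(weights, x1), embed(weights, x2))


weights = make_weights(n_in=3, n_out=4)
score = siamese_similarity(weights, [1.0, 2.0, 3.0], [0.5, 1.5, 2.5])
print(score)
```

Because the two branches share one weight matrix, the comparison is symmetric in its inputs, and identical inputs always receive the maximal similarity.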
|
28
|
Fotis C, Meimetis N, Sardis A, Alexopoulos LG. DeepSIBA: chemical structure-based inference of biological alterations using deep learning. Mol Omics 2020; 17:108-120. [PMID: 33188379 DOI: 10.1039/d0mo00129e] [Indexed: 11/21/2022]
Abstract
Predicting whether a chemical structure leads to a desired or adverse biological effect can have a significant impact for in silico drug discovery. In this study, we developed a deep learning model where compound structures are represented as graphs and then linked to their biological footprint. To make this complex problem computationally tractable, compound differences were mapped to biological effect alterations using Siamese Graph Convolutional Neural Networks. The proposed model was able to encode molecular graph pairs and identify structurally dissimilar compounds that affect similar biological processes with high precision. Additionally, by utilizing deep ensembles to estimate uncertainty, we were able to provide reliable and accurate predictions for chemical structures that are very different from the ones used during training. Finally, we present a novel inference approach, where the trained models are used to estimate the signaling pathway signature of a compound perturbation, using only its chemical structure as input, and subsequently identify which substructures influenced the predicted pathways. As a use case, this approach was used to infer important substructures and affected signaling pathways of FDA-approved anticancer drugs.
Affiliation(s)
- C Fotis
- Biomedical Systems Laboratory, National Technical University of Athens, Athens, Greece
|
29
|
Memon SA, Khan KA, Naveed H. HECNet: a hierarchical approach to enzyme function classification using a Siamese Triplet Network. Bioinformatics 2020; 36:4583-4589. [DOI: 10.1093/bioinformatics/btaa536] [Received: 01/18/2020] [Revised: 04/13/2020] [Accepted: 05/18/2020] [Indexed: 01/14/2023]
Abstract
Motivation
Understanding an enzyme’s function is one of the most crucial problem domains in computational biology. Enzymes are a key component in all organisms and many industrial processes as they help in fighting diseases and speed up essential chemical reactions. They have wide applications and therefore, the discovery of new enzymatic proteins can accelerate biological research and commercial productivity. Biological experiments, to determine an enzyme’s function, are time-consuming and resource expensive.
Results
In this study, we propose a novel computational approach to predict an enzyme’s function up to the fourth level of the Enzyme Commission (EC) Number. Many studies have attempted to predict an enzyme’s function. Yet, no approach has properly tackled the fourth and final level of the EC number. The fourth level holds great significance as it gives us the most specific information of how an enzyme performs its function. Our method uses innovative deep learning approaches along with an efficient hierarchical classification scheme to predict an enzyme’s precise function. On a dataset of 11 353 enzymes and 402 classes, we achieved a hierarchical accuracy and Macro-F1 score of 91.2% and 81.9%, respectively, on the 4th level. Moreover, our method can be used to predict the function of enzyme isoforms with considerable success. This methodology is broadly applicable for genome-wide prediction that can subsequently lead to automated annotation of enzyme databases and the identification of better/cheaper enzymes for commercial activities.
Availability and implementation
The web-server can be freely accessed at http://hecnet.cbrlab.org/.
Supplementary information
Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Safyan Aman Memon
- Computational Biology Research Lab (CBRL), Department of Computer Science, National University of Computer and Emerging Sciences, Islamabad 44000, Pakistan
- Kinaan Aamir Khan
- Computational Biology Research Lab (CBRL), Department of Computer Science, National University of Computer and Emerging Sciences, Islamabad 44000, Pakistan
- Hammad Naveed
- Computational Biology Research Lab (CBRL), Department of Computer Science, National University of Computer and Emerging Sciences, Islamabad 44000, Pakistan
|