1
|
Tomal JH, Welch WJ, Zamar RH. Robust ranking by ensembling of diverse models and assessment metrics. J Stat Comput Sim 2022. [DOI: 10.1080/00949655.2022.2093873] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/17/2022]
Affiliation(s)
- Jabed H. Tomal
- Department of Mathematics and Statistics, Thompson Rivers University, Kamloops, British Columbia, Canada
| | - William J. Welch
- Department of Statistics, The University of British Columbia, Vancouver, British Columbia, Canada
| | - Ruben H. Zamar
- Department of Statistics, The University of British Columbia, Vancouver, British Columbia, Canada
| |
|
2
|
Chowdhury RI, Tomal JH. Risk prediction for repeated measures health outcomes: A divide and recombine framework. Informatics in Medicine Unlocked 2022. [DOI: 10.1016/j.imu.2022.100847] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/26/2022] Open
|
3
|
Mao J, Akhtar J, Zhang X, Sun L, Guan S, Li X, Chen G, Liu J, Jeon HN, Kim MS, No KT, Wang G. Comprehensive strategies of machine-learning-based quantitative structure-activity relationship models. iScience 2021; 24:103052. [PMID: 34553136 PMCID: PMC8441174 DOI: 10.1016/j.isci.2021.103052] [Citation(s) in RCA: 38] [Impact Index Per Article: 12.7] [Indexed: 11/21/2022] Open
Abstract
Early quantitative structure-activity relationship (QSAR) technologies have unsatisfactory versatility and accuracy in fields such as drug discovery because they are based on traditional machine learning and interpretive expert features. The development of Big Data and deep learning technologies has significantly improved the processing of unstructured data and unleashed the great potential of QSAR. Here we discuss the integration of wet experiments (which provide experimental data and reliable verification), molecular dynamics simulation (which provides mechanistic interpretation at the atomic/molecular levels), and machine learning (including deep learning) techniques to improve QSAR models. We first review the history of traditional QSAR and point out its problems. We then propose a better QSAR model characterized by a new iterative framework to integrate machine learning with disparate data input. Finally, we discuss the application of QSAR and machine learning to many practical research fields, including drug development and clinical trials.
Affiliation(s)
- Jiashun Mao
- The Interdisciplinary Graduate Program in Integrative Biotechnology and Translational Medicine, Yonsei University, Incheon 21983, Republic of Korea
- Department of Biology, School of Life Sciences, Southern University of Science and Technology, 1088 Xueyuan Avenue, Shenzhen, Guangdong 518055, China
- Guangdong Provincial Key Laboratory of Computational Science and Material Design, Shenzhen, Guangdong 518055, China
| | - Javed Akhtar
- Department of Biology, School of Life Sciences, Southern University of Science and Technology, 1088 Xueyuan Avenue, Shenzhen, Guangdong 518055, China
- Guangdong Provincial Key Laboratory of Cell Microenvironment and Disease Research, Shenzhen, Guangdong 518055, China
| | - Xiao Zhang
- Shanghai Rural Commercial Bank Co., Ltd, Shanghai 200002, China
| | - Liang Sun
- Department of Physics, City University of Hong Kong, 83 Tat Chee Avenue, Kowloon, Hong Kong, China
| | - Shenghui Guan
- Department of Biology, School of Life Sciences, Southern University of Science and Technology, 1088 Xueyuan Avenue, Shenzhen, Guangdong 518055, China
- Guangdong Provincial Key Laboratory of Computational Science and Material Design, Shenzhen, Guangdong 518055, China
| | - Xinyu Li
- School of Life and Health Sciences and Warshel Institute for Computational Biology, The Chinese University of Hong Kong, Shenzhen 518172, China
| | - Guangming Chen
- Department of Biology, School of Life Sciences, Southern University of Science and Technology, 1088 Xueyuan Avenue, Shenzhen, Guangdong 518055, China
- Guangdong Provincial Key Laboratory of Cell Microenvironment and Disease Research, Shenzhen, Guangdong 518055, China
| | - Jiaxin Liu
- Biotechnology, College of Life Science and Biotechnology, Yonsei University, Seoul 03722, Republic of Korea
| | - Hyeon-Nae Jeon
- Biotechnology, College of Life Science and Biotechnology, Yonsei University, Seoul 03722, Republic of Korea
| | - Min Sung Kim
- Biotechnology, College of Life Science and Biotechnology, Yonsei University, Seoul 03722, Republic of Korea
| | - Kyoung Tai No
- The Interdisciplinary Graduate Program in Integrative Biotechnology and Translational Medicine, Yonsei University, Incheon 21983, Republic of Korea
| | - Guanyu Wang
- Department of Biology, School of Life Sciences, Southern University of Science and Technology, 1088 Xueyuan Avenue, Shenzhen, Guangdong 518055, China
- Guangdong Provincial Key Laboratory of Computational Science and Material Design, Shenzhen, Guangdong 518055, China
- Guangdong Provincial Key Laboratory of Cell Microenvironment and Disease Research, Shenzhen, Guangdong 518055, China
| |
|
4
|
Hsu GG, Tomal JH, Welch WJ. EPX: An R package for the ensemble of subsets of variables for highly unbalanced binary classification. Comput Biol Med 2021; 136:104760. [PMID: 34416572 DOI: 10.1016/j.compbiomed.2021.104760] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 07/06/2021] [Revised: 08/06/2021] [Accepted: 08/07/2021] [Indexed: 11/26/2022]
Abstract
BACKGROUND AND OBJECTIVE In binary classification problems with a rare class of interest, there is relatively little information available for the rare class to build a model. On the other hand, the number of useful variables to develop a model for classification can be high-dimensional. For example, in drug discovery, there are usually very few bioactive compounds in a large chemical library, whereas thousands of potentially useful explanatory variables characterize a compound's chemical structure. The sparsity of information for the rare class of interest makes it difficult for the standard classification models to exploit the richness of the useful feature variables. Thus, the objective of this paper is to develop an R package which clusters the feature variables into diverse subsets to be aggregated into a powerful ensemble for the detection of a rare class object. METHODS The ensemble of phalanxes (EPX) builds a classifier by exploiting the richness of feature variables using several diverse subsets of variables, called phalanxes, and outperforms many competitive state-of-the-art classification methods in terms of predictive ranking of the rare class of interest. RESULTS We present an R package EPX which implements the algorithm to form the ensemble of phalanxes as well as its associated functions. We further show how the ensemble of phalanxes can be constructed using parallel computing to lower the computational burden given high-dimensional data. CONCLUSIONS The R package EPX shows a flexible way of clustering feature variable space into smaller and diverse subsets of variables to develop an ensemble of phalanxes which better ranks a rare class object in a highly unbalanced two class classification problem. The ensemble EPX will be useful to detect the rare drug-like active biomolecules for development in drug discovery (Tomal et al., Mar. 2016) [1] and homologous proteins using similarity scores of amino acid sequences in protein homology (Tomal et al., 2019) [2]. The package EPX is freely available to download from CRAN (https://CRAN.R-project.org/package=EPX).
Affiliation(s)
- Grace G Hsu
- Department of Statistics, University of British Columbia, 3182 Earth Sciences Building, 2207 Main Mall, Vancouver, BC, V6T 1Z4, Canada
| | - Jabed H Tomal
- Department of Mathematics and Statistics, Thompson Rivers University, 805 TRU Way, Kamloops, BC, V2C 0C8, Canada.
| | - William J Welch
- Department of Statistics, University of British Columbia, 3182 Earth Sciences Building, 2207 Main Mall, Vancouver, BC, V6T 1Z4, Canada
| |
|
5
|
Zhang X, Niu W, Tang T, Hou C, Guo Y, Kong R. A Strategy to Find Novel Candidate DKAs Inhibitors Using Modified QSAR Model with Favorable Druggability Properties. Chem Res Chin Univ 2019. [DOI: 10.1007/s40242-019-9183-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/24/2022]
|
6
|
Choi H, Kang H, Chung KC, Park H. Development and application of a comprehensive machine learning program for predicting molecular biochemical and pharmacological properties. Phys Chem Chem Phys 2019; 21:5189-5199. [PMID: 30775759 DOI: 10.1039/c8cp07002d] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Indexed: 12/19/2022]
Abstract
We establish a comprehensive quantitative structure-activity relationship (QSAR) model termed AlphaQ through the machine learning algorithm to associate the fully quantum mechanical molecular descriptors with various biochemical and pharmacological properties. Preliminarily, a novel method for molecular structural alignments was developed in such a way to maximize the quantum mechanical cross correlations among the molecules. Besides the improvement of structural alignments, three-dimensional (3D) distribution of the molecular electrostatic potential was introduced as the unique numerical descriptor for individual molecules. These dual modifications lead to a substantial accuracy enhancement in multifarious 3D-QSAR prediction models of AlphaQ. Most remarkably, AlphaQ has been proven to be applicable to structurally diverse molecules to the extent that it outperforms the conventional QSAR methods in estimating the inhibitory activity against thrombin, the water-cyclohexane distribution coefficient, the permeability across the membrane of the Caco-2 cell, and the metabolic stability in human liver microsomes. Due to the simplicity in model building and the high predictive capability for varying biochemical and pharmacological properties, AlphaQ is anticipated to serve as a valuable screening tool at both early and late stages of drug discovery.
Affiliation(s)
- Hwanho Choi
- Department of Bioscience and Biotechnology, Sejong University, 209 Neungdong-ro, Kwangjin-gu, Seoul 05006, Korea.
| |
|
7
|
Subramanian G, Poda G. In silico ligand-based modeling of hBACE-1 inhibitors. Chem Biol Drug Des 2017; 91:817-827. [PMID: 29139199 DOI: 10.1111/cbdd.13147] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 07/27/2017] [Revised: 10/24/2017] [Accepted: 11/01/2017] [Indexed: 02/06/2023]
Abstract
Alzheimer's disease is a chronic neurodegenerative disease affecting more than 30 million people worldwide. Development of small molecule inhibitors of human β-secretase 1 (hBACE-1) has been the focus of the pharmaceutical industry for the past 15-20 years. Here, we successfully applied multiple ligand-based in silico modeling techniques to understand the inhibitory activities of a diverse set of small molecule hBACE-1 inhibitors reported in the scientific literature. Strikingly, the use of only a small subset of 230 (13%) molecules allowed us to develop quality models that performed reasonably well on the validation set of 1,476 (87%) inhibitors. Varying the descriptor sets and the complexity of the modeling techniques resulted in only minor improvements to the model's performance. The current results demonstrate that predictive models can be built by choosing appropriate modeling techniques in spite of using small datasets consisting of diverse chemical classes, a scenario typical in triaging of high-throughput screening results to identify false negatives. We hope that these encouraging results will help the community to develop more predictive models that would support research efforts for the debilitating Alzheimer's disease. Additionally, the integrated diversity of the techniques employed will stimulate scientists in the field to use in silico statistical modeling techniques like these to derive better models to help advance drug discovery projects faster.
Affiliation(s)
| | - Gennady Poda
- Drug Discovery Program, Ontario Institute for Cancer Research, Toronto, ON, Canada; Leslie Dan Faculty of Pharmacy, University of Toronto, Toronto, ON, Canada
| |
|
8
|
Tromelin A, Chabanet C, Audouze K, Koensgen F, Guichard E. Multivariate statistical analysis of a large odorants database aimed at revealing similarities and links between odorants and odors. Flavour Frag J 2017. [DOI: 10.1002/ffj.3430] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.1] [Indexed: 12/15/2022]
Affiliation(s)
- Anne Tromelin
- UMR CSGA: CNRS, INRA; Université de Bourgogne Franche-Comté; 21000 Dijon France
| | - Claire Chabanet
- UMR CSGA: CNRS, INRA; Université de Bourgogne Franche-Comté; 21000 Dijon France
| | - Karine Audouze
- MTi, Sorbonne Paris Cité; Université Paris Diderot; INSERM UMR-S 973 75013 Paris France
| | - Florian Koensgen
- UMR CSGA: CNRS, INRA; Université de Bourgogne Franche-Comté; 21000 Dijon France
| | - Elisabeth Guichard
- UMR CSGA: CNRS, INRA; Université de Bourgogne Franche-Comté; 21000 Dijon France
| |
|
9
|
Qi M, Wang T, Yi Y, Gao N, Kong J, Wang J. Joint L2,1 Norm and Fisher Discrimination Constrained Feature Selection for Rational Synthesis of Microporous Aluminophosphates. Mol Inform 2016; 36. [PMID: 27863104 DOI: 10.1002/minf.201600076] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 05/25/2016] [Accepted: 10/21/2016] [Indexed: 11/11/2022]
Abstract
Feature selection has been regarded as an effective tool to help researchers understand the generating process of data. For mining the synthesis mechanism of microporous AlPOs, this paper proposes a novel feature selection method by joint l2,1 norm and Fisher discrimination constraints (JNFDC). In order to obtain a more effective feature subset, the proposed method proceeds in two steps. The first step is to rank the features according to sparse and discriminative constraints. The second step is to establish a predictive model with the ranked features, and select the most significant features in light of their contribution to improving the predictive accuracy. To the best of our knowledge, JNFDC is the first work which employs the sparse representation theory to explore the synthesis mechanism of six kinds of pore rings. Numerical simulations demonstrate that our proposed method can select significant features affecting the specified structural property and improve the predictive accuracy. Moreover, comparison results show that JNFDC can obtain better predictive performances than some other state-of-the-art feature selection methods.
Affiliation(s)
- Miao Qi
- School of Computer Science and Information Technology, Northeast Normal University, Changchun, 130117, China
| | - Ting Wang
- School of Computer Science and Information Technology, Northeast Normal University, Changchun, 130117, China
| | - Yugen Yi
- School of Software, Jiangxi Normal University, Nanchang, 330022, China
| | - Na Gao
- State Key Laboratory of Inorganic Synthesis and Preparative Chemistry, College of Chemistry, Jilin University, Changchun, 130012, China
| | - Jun Kong
- Key Laboratory for Applied Statistics of MOE, Northeast Normal University, Changchun, China
| | - Jianzhong Wang
- School of Computer Science and Information Technology, Northeast Normal University, Changchun, 130117, China
| |
|
10
|
Abstract
INTRODUCTION Neural networks are becoming a very popular method for solving machine learning and artificial intelligence problems. The variety of neural network types and their application to drug discovery requires expert knowledge to choose the most appropriate approach. AREAS COVERED In this review, the authors discuss traditional and newly emerging neural network approaches to drug discovery. Their focus is on backpropagation neural networks and their variants, self-organizing maps and associated methods, and a relatively new technique, deep learning. The most important technical issues are discussed including overfitting and its prevention through regularization, ensemble and multitask modeling, model interpretation, and estimation of applicability domain. Different aspects of using neural networks in drug discovery are considered: building structure-activity models with respect to various targets; predicting drug selectivity, toxicity profiles, ADMET and physicochemical properties; characteristics of drug-delivery systems and virtual screening. EXPERT OPINION Neural networks continue to grow in importance for drug discovery. Recent developments in deep learning suggest further improvements may be gained in the analysis of large chemical data sets. It's anticipated that neural networks will be more widely used in drug discovery in the future, and applied in non-traditional areas such as drug delivery systems, biologically compatible materials, and regenerative medicine.
Affiliation(s)
- Igor I Baskin
- Faculty of Physics, M.V. Lomonosov Moscow State University, Moscow, Russia; A.M. Butlerov Institute of Chemistry, Kazan Federal University, Kazan, Russia
| | - David Winkler
- CSIRO Manufacturing, Clayton, VIC, Australia; Monash Institute for Pharmaceutical Sciences, Monash University, Parkville, VIC, Australia; Latrobe Institute for Molecular Science, Bundoora, VIC, Australia; School of Chemical and Physical Sciences, Flinders University, Bedford Park, SA, Australia
| | - Igor V Tetko
- Helmholtz Zentrum München - German Research Center for Environmental Health (GmbH), Institute of Structural Biology, Neuherberg, Germany; BigChem GmbH, Neuherberg, Germany
| |
|