1
Lei B, Zang Y, Xue Z, Ge Y, Li W, Zhai Q, Jiao L. [Ensemble hologram quantitative structure activity relationship model of the chromatographic retention index of aldehydes and ketones]. Se Pu 2021; 39:331-337. [PMID: 34227314 PMCID: PMC9403813 DOI: 10.3724/sp.j.1123.2020.06011] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/04/2020] [Indexed: 11/25/2022]
Abstract
Chromatographic retention index (RI) is an important parameter for describing the retention behavior of substances in chromatographic analysis. Experimentally determining the RI values of different aldehyde and ketone compounds on all kinds of polar stationary phases is expensive and time consuming. Quantitative structure-activity relationship (QSAR) modeling is an important chemometric technique that has been widely used to correlate the properties of chemicals with their molecular structures. Whether or not the properties of a molecule have been experimentally determined, they can be calculated using QSAR models. It is therefore necessary and advisable to establish a QSAR model for predicting the RI values of aldehydes and ketones. Hologram QSAR (HQSAR) is a highly efficient QSAR approach that can easily generate QSAR models with good statistics and high prediction accuracy. The HQSAR approach uses a specific fragment fingerprint, known as a molecular hologram, as the structural descriptor for building the QSAR model. In general, QSAR research builds individual HQSAR models. However, individual QSAR models are often affected by underfitting or overfitting. The ensemble modeling method, which integrates several individual models through a consensus strategy, can overcome these shortcomings. It is therefore worth studying whether ensemble modeling can improve the prediction ability of the HQSAR method and yield more accurate and reliable QSAR models. This study investigates QSAR models for the chromatographic RI of aldehydes and ketones using ensemble modeling and the HQSAR method. Two individual HQSAR models were established for 34 compounds on two stationary phases, DB-210 and HP-Innowax. The prediction ability of the two models was assessed by external test set validation and leave-one-out cross validation (LOO-CV).
The 34 investigated compounds were randomly assigned to two groups. Group Ⅰ comprised 26 compounds, and Group Ⅱ comprised 8 compounds. In the external test set validation, Group Ⅰ was used to manually optimize the two fragment parameters, fragment distinction (FD) and fragment size (FS), and to build the HQSAR models. Group Ⅱ was used as the test set to assess the predictive performance of the developed models. For the DB-210 stationary phase, the optimal individual HQSAR model was obtained by setting the FD and FS to "donor/acceptor atoms (DA)" and 1-9, respectively. For the HP-Innowax stationary phase, the optimal individual HQSAR model was obtained by setting the FD and FS to "DA" and 4-7, respectively. The squared correlation coefficient of cross validation ( [Formula: see text] for predicting the RI values of the DB-210 and HP-Innowax stationary phases were 0.927 and 0.919, 0.956 and 0.979, 0.929 and 0.963, 0.927 and 0.958, and 0.935 and 0.963, respectively. Compared to the individual HQSAR models, the established ensemble HQSAR models show better robustness and accuracy, indicating that ensemble modeling is an effective approach. The combination of HQSAR and the ensemble modeling method is a practicable and promising method for studying and predicting the RI values of aldehydes and ketones.
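The leave-one-out cross validation mentioned in this abstract can be illustrated with a minimal sketch. It assumes the common definition q² = 1 - PRESS / SS_tot and uses ordinary least squares as a stand-in regressor; HQSAR itself is built on molecular holograms and a different regression step, so this is not the authors' implementation. The function name `loo_q2` is hypothetical.

```python
import numpy as np

def loo_q2(X, y):
    """Leave-one-out cross-validated q^2 for an ordinary least-squares model.

    Assumption: q^2 = 1 - PRESS / SS_tot, with SS_tot taken around the
    mean of all observed responses.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    A = np.column_stack([np.ones(n), X])  # prepend an intercept column
    press = 0.0
    for i in range(n):
        keep = np.arange(n) != i  # hold out sample i
        coef, *_ = np.linalg.lstsq(A[keep], y[keep], rcond=None)
        press += (y[i] - A[i] @ coef) ** 2  # squared error on the held-out sample
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - press / ss_tot
```

With 26 training compounds, as in Group Ⅰ above, the loop would fit 26 models, each predicting the one compound left out.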
2
Xu Z, Chen X, Meng L, Yu M, Li L, Shi W. Sample Consensus Model and Unsupervised Variable Consensus Model for Improving the Accuracy of a Calibration Model. Applied Spectroscopy 2019; 73:747-758. [PMID: 31149831 DOI: 10.1177/0003702819852174] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 05/26/2023]
Abstract
In the quantitative analysis of spectral data, small sample size and high dimensionality of spectral variables often lead to poor accuracy of a calibration model. To address this problem, we proposed two methods, namely the sample consensus model and the unsupervised variable consensus model. Three public near-infrared (NIR) or infrared (IR) spectroscopy data sets (corn, wine, and soil) were used to build partial least squares regression (PLSR) models. Monte Carlo sampling and unsupervised variable clustering with a self-organizing map were then coupled with the consensus modeling strategy to establish multiple sub-models. Finally, the sample consensus and unsupervised variable consensus models were obtained by assigning a weight to each PLSR sub-model. The calculated results show that both consensus models can significantly improve the accuracy of the calibration model compared to a single PLSR model. The effectiveness of these two methods points to a new approach for obtaining more accurate results, one that takes full advantage of the sample information and the valid variable information.
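The sample consensus idea described above (Monte Carlo subsampling, one sub-model per subsample, weighted combination) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' method: ordinary least squares stands in for PLSR, and each sub-model is weighted by the inverse of its mean squared error on the samples left out of its subsample, since the abstract does not say how the weights are assigned. All function names are hypothetical.

```python
import numpy as np

def fit_ols(X, y):
    """Fit least squares with an intercept (stand-in for a PLSR sub-model)."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_ols(coef, X):
    return np.column_stack([np.ones(len(X)), X]) @ coef

def sample_consensus(X, y, n_models=50, frac=0.7, seed=0):
    """Build sub-models on random subsamples and weight them.

    Assumption: weight = 1 / MSE on the held-out part of each split,
    normalized so the weights sum to 1.
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    k = int(frac * n)
    coefs, weights = [], []
    for _ in range(n_models):
        idx = rng.permutation(n)
        train, held = idx[:k], idx[k:]  # Monte Carlo split
        coef = fit_ols(X[train], y[train])
        mse = np.mean((y[held] - predict_ols(coef, X[held])) ** 2)
        coefs.append(coef)
        weights.append(1.0 / (mse + 1e-12))  # small constant guards against division by zero
    w = np.array(weights)
    return np.array(coefs), w / w.sum()

def consensus_predict(coefs, w, X):
    """Weighted average of the sub-model predictions."""
    preds = np.array([predict_ols(c, X) for c in coefs])  # shape (n_models, n_samples)
    return w @ preds
```

A variable consensus model would follow the same pattern but split the columns of `X` into clusters instead of resampling the rows.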
Affiliation(s)
- Zhou Xu
- 1 National and Local Joint Engineering Research Center of Reliability Analysis and Testing for Mechanical and Electrical Products, Zhejiang Sci-Tech University, Hangzhou, China
- Xiaojing Chen
- 2 College of Mathematics, Physics and Electronic Information Engineering, Wenzhou University, Wenzhou, China
- Liuwei Meng
- 3 Research and Development Department, Hangzhou Goodhere Biotechnology Co., Ltd., Hangzhou, China
- Mingen Yu
- 3 Research and Development Department, Hangzhou Goodhere Biotechnology Co., Ltd., Hangzhou, China
- Limin Li
- 2 College of Mathematics, Physics and Electronic Information Engineering, Wenzhou University, Wenzhou, China
- Wen Shi
- 2 College of Mathematics, Physics and Electronic Information Engineering, Wenzhou University, Wenzhou, China
3
Li Q, Huang Y, Song X, Zhang J, Min S. Moving window smoothing on the ensemble of competitive adaptive reweighted sampling algorithm. Spectrochimica Acta Part A: Molecular and Biomolecular Spectroscopy 2019; 214:129-138. [PMID: 30776713 DOI: 10.1016/j.saa.2019.02.023] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.8] [Received: 10/23/2018] [Revised: 01/20/2019] [Accepted: 02/10/2019] [Indexed: 05/14/2023]
Abstract
A novel chemometric method named MWS-ECARS, which applies moving window smoothing (MWS) to an ensemble of competitive adaptive reweighted sampling (CARS) runs, is proposed in this study as a spectral variable selection approach for multivariate calibration. To eliminate uninformative variables, an ensemble of CARS is carried out first, and MWS is then performed to search for effective variables around the high frequency variables. The variable subset with the lowest standard error of cross-validation (SECV) is treated as the optimal threshold and the corresponding moving window width is regarded as the optimal window width. For variable selection, the method was applied to mid-infrared (MIR) spectra of an active ingredient in pesticide, near-infrared (NIR) spectra of soil organic matter, and NIR spectra of total nitrogen in Solanaceae plants. Overall, the results show that MWS-ECARS is a promising selection method with improved prediction performance over three other variable selection methods: variable importance projection (VIP), uninformative variables elimination (UVE), and genetic algorithms (GA).
Affiliation(s)
- Qianqian Li
- School of Marine Science, China University of Geosciences in Beijing, Beijing 100086, China; College of Science, China Agricultural University, Beijing 100193, China
- Yue Huang
- College of Food Science and Nutritional Engineering, China Agricultural University, Beijing 100193, China
- Xiangzhong Song
- College of Science, China Agricultural University, Beijing 100193, China
- Jixiong Zhang
- College of Science, China Agricultural University, Beijing 100193, China
- Shungeng Min
- College of Science, China Agricultural University, Beijing 100193, China
4
Liu K, Chen X, Li L, Chen H, Ruan X, Liu W. A consensus successive projections algorithm – multiple linear regression method for analyzing near infrared spectra. Anal Chim Acta 2015; 858:16-23. [DOI: 10.1016/j.aca.2014.12.033] [Citation(s) in RCA: 59] [Impact Index Per Article: 5.9] [Received: 09/22/2014] [Revised: 12/10/2014] [Accepted: 12/16/2014] [Indexed: 11/26/2022]
5
Segall MD, Barber C. Addressing toxicity risk when designing and selecting compounds in early drug discovery. Drug Discov Today 2014; 19:688-93. [DOI: 10.1016/j.drudis.2014.01.006] [Citation(s) in RCA: 82] [Impact Index Per Article: 7.5] [Received: 10/18/2013] [Revised: 12/17/2013] [Accepted: 01/14/2014] [Indexed: 12/15/2022]
6
Kabankin AS, Radkevich LA. Collective recognition strategy for estimating hepatoprotector activity of various chemical compounds in increasing liver repair potential. Pharm Chem J 2013. [DOI: 10.1007/s11094-013-0922-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/26/2022]
7
Liew CY, Yap CW. QSAR and Predictors of Eye and Skin Effects. Mol Inform 2013; 32:281-90. [PMID: 27481523 DOI: 10.1002/minf.201200119] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Received: 10/11/2012] [Accepted: 01/16/2013] [Indexed: 01/19/2023]
Abstract
In this study, an ensemble of features and training samples was examined with a collection of support vector machines. The effects of the data sampling method, the ratio of positive to negative compounds, and the type of combiner used to merge the base models into an ensemble were explored. The ensemble method was applied to produce four separate in silico models to classify the labels for eye/skin corrosion (H314), skin irritation (H315), serious eye damage (H318), and eye irritation (H319), which are defined in the "Globally Harmonized System of Classification and Labelling of Chemicals". To the best of our knowledge, the training set used in this work is one of the largest built from publicly available data, and the models show acceptable prediction performance. The models were distributed via PaDEL-DDPredictor (http://padel.nus.edu.sg/software/padelddpredictor), which can be downloaded freely for public use.
Affiliation(s)
- Chin Yee Liew
- Pharmaceutical Data Exploration Laboratory, Department of Pharmacy, National University of Singapore, 18 Science Drive 4, Singapore 117543 fax: +65-67791554
- Chun Wei Yap
- Pharmaceutical Data Exploration Laboratory, Department of Pharmacy, National University of Singapore, 18 Science Drive 4, Singapore 117543 fax: +65-67791554
8
Abstract
Frequent failure of drug candidates during development remains the major deterrent to the early introduction of new drug molecules. Drug toxicity is the major cause of expensive late-stage development failures. Early identification and optimization of the most favorable molecule will naturally save considerable cost, time, and human effort and will minimize animal sacrifice. (Quantitative) structure-activity relationships [(Q)SARs] are statistically derived predictive models that correlate the biological activity (including desirable therapeutic effects and undesirable side effects) of chemicals (drugs, toxicants, environmental pollutants) with molecular descriptors and/or properties. (Q)SAR models that categorize the available data into two or more groups or classes are known as classification models. Numerous techniques of diverse nature are presently employed for the development of classification models. Although classification models are increasingly used to predict either biological activity or toxicity, the future trend will naturally be towards classification models capable of simultaneously predicting biological activity, toxicity, and pharmacokinetic parameters, so as to accelerate the development of bioavailable, safe drug molecules.
9
Pilkington NCV, Trotter MWB, Holden SB. Multiple Kernel Learning for Drug Discovery. Mol Inform 2012; 31:313-22. [PMID: 27477100 DOI: 10.1002/minf.201100146] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 11/01/2011] [Accepted: 03/12/2012] [Indexed: 01/04/2023]
Abstract
The support vector machine (SVM) methodology has become a popular and well-used component of present chemometric analysis. We assess a relatively recent development of the algorithm, multiple kernel learning (MKL), on published structure-property relationship (SPR) data. The MKL algorithm learns a weighting across multiple kernel-based representations of the data during supervised classifier creation and, thereby, may be used to describe the influence of distinct groups of structural descriptors upon a single structure-property classifier without explicitly omitting any of them. We observe a statistically significant performance improvement over a conventional, single kernel SVM on all three SPR data sets analysed. Furthermore, MKL output is observed to provide useful information regarding the relative influence of five distinct descriptor subsets present in each data set.
Affiliation(s)
- Nicholas C V Pilkington
- University of Cambridge Computer Laboratory, 15 JJ Thomson Avenue, Cambridge, CB3 0FD, UK phone: +44 (0)1223 763725
- Matthew W B Trotter
- Anne McLaren Laboratory for Regenerative Medicine & Department of Surgery, University of Cambridge, UK; Celgene Institute for Translational Research Europe (CITRE), Sevilla, Spain
- Sean B Holden
- University of Cambridge Computer Laboratory, 15 JJ Thomson Avenue, Cambridge, CB3 0FD, UK phone: +44 (0)1223 763725
10
QSAR classification of metabolic activation of chemicals into covalently reactive species. Mol Divers 2012; 16:389-400. [PMID: 22370994 DOI: 10.1007/s11030-012-9364-3] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.6] [Received: 11/21/2011] [Accepted: 02/13/2012] [Indexed: 12/22/2022]
Abstract
Metabolic activation of chemicals into covalently reactive species might lead to toxicological consequences such as tissue necrosis, carcinogenicity, teratogenicity, or immune-mediated toxicities. Early prediction of this undesirable outcome can help in selecting candidates with an increased chance of success, thus reducing attrition at all stages of drug development. Ensemble modelling of mixed features was used to develop a model that classifies the metabolic activation of chemicals into covalently reactive species. The effects of the quality of the base classifiers and of the performance measure used for sorting were examined. An ensemble model of 13 naive Bayes classifiers was built from a diverse set of 1,479 compounds. The ensemble model was validated internally with five-fold cross validation and achieved a sensitivity of 67.4% and a specificity of 93.4% when tested on the training set. The final ensemble model was made available for public use.
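For reference, the two performance measures quoted in this abstract are computed from the confusion-matrix counts: sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP). The sketch below assumes binary labels coded as 1 (reactive) and 0 (non-reactive); the coding and the function name are illustrative assumptions, not taken from the paper.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels coded 1/0."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative sample")
    return tp / (tp + fn), tn / (tn + fp)
```

A high specificity paired with a moderate sensitivity, as reported above, means the model rarely flags an inactive compound but misses a share of the true positives.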
11
Liew CY, Lim YC, Yap CW. Mixed learning algorithms and features ensemble in hepatotoxicity prediction. J Comput Aided Mol Des 2011; 25:855-71. [DOI: 10.1007/s10822-011-9468-3] [Citation(s) in RCA: 49] [Impact Index Per Article: 3.5] [Received: 02/07/2011] [Accepted: 08/23/2011] [Indexed: 12/22/2022]
12
Neumann D, Merkwirth C, Lamprecht A. Nanoparticle design characterized by In Silico preparation parameter prediction using ensemble models. J Pharm Sci 2010; 99:1982-96. [DOI: 10.1002/jps.21941] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 11/10/2022]
13
Soto A, Cecchini R, Vazquez G, Ponzoni I. Multi-Objective Feature Selection in QSAR Using a Machine Learning Approach. QSAR Comb Sci 2009. [DOI: 10.1002/qsar.200960053] [Citation(s) in RCA: 37] [Impact Index Per Article: 2.3] [Indexed: 11/09/2022]
14
Chen X, Liang YZ, Yuan DL, Xu QS. A modified uncorrelated linear discriminant analysis model coupled with recursive feature elimination for the prediction of bioactivity. SAR and QSAR in Environmental Research 2009; 20:1-26. [PMID: 19343582 DOI: 10.1080/10629360902724127] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 05/27/2023]
Abstract
To meet the requirements of providing accurate, robust, and interpretable prediction of bioactivity, a modified uncorrelated linear discriminant analysis (M-ULDA) model was developed. In addition, a feature selection method called recursive feature elimination (RFE), originally used for support vector machines (SVM), was introduced and modified to fit the scheme of ULDA. In an evaluation on six pharmaceutical datasets, M-ULDA coupled with RFE showed better or comparable classification accuracy with respect to other well-studied methods such as SVM and decision trees. The RFE used for ULDA has the advantage of increasing the computational speed, and it provides useful insights into biochemical mechanisms related to pharmaceutical activity by significantly reducing the number of variables used in the final model.
Affiliation(s)
- X Chen
- College of Chemistry and Chemical Engineering, Central South University, Changsha, People's Republic of China
15
Nigsch F, Bender A, Jenkins JL, Mitchell JBO. Ligand-Target Prediction Using Winnow and Naive Bayesian Algorithms and the Implications of Overall Performance Statistics. J Chem Inf Model 2008; 48:2313-25. [DOI: 10.1021/ci800079x] [Citation(s) in RCA: 81] [Impact Index Per Article: 4.8] [Indexed: 11/29/2022]
Affiliation(s)
- Florian Nigsch, Andreas Bender, Jeremy L. Jenkins, John B. O. Mitchell
- Unilever Centre for Molecular Science Informatics, Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge CB2 1EW, United Kingdom; Lead Discovery Informatics, Center for Proteomic Chemistry, Novartis Institutes for BioMedical Research, 250 Massachusetts Avenue, Cambridge, Massachusetts 02139; and Division of Medicinal Chemistry, Leiden/Amsterdam Center for Drug Research, Leiden University, Einsteinweg 55, 2333 CC, Leiden, The Netherlands
16
Simmons K, Kinney J, Owens A, Kleier DA, Bloch K, Argentar D, Walsh A, Vaidyanathan G. Practical Outcomes of Applying Ensemble Machine Learning Classifiers to High-Throughput Screening (HTS) Data Analysis and Screening. J Chem Inf Model 2008; 48:2196-206. [DOI: 10.1021/ci800164u] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.8] [Indexed: 01/13/2023]
Affiliation(s)
- Kirk Simmons, John Kinney, Aaron Owens, Daniel A. Kleier, Karen Bloch, Dave Argentar, Alicia Walsh, Ganesh Vaidyanathan
- Simmons Consulting, 52 Windybush Way, Titusville, New Jersey 08560, DuPont Stine Haskell Research Laboratories, 1090 Elkton Road, Newark, Delaware 19711, DuPont Engineering Research and Technology, POB 80249, Wilmington, Delaware 19880, Drexel University, 3141 Chestnut Street, Philadelphia, Pennsylvania 19104, Sun Edge, LLC, 147 Tuckahoe Lane, Bear, Delaware 19701, and Quantum Leap Innovations, 3 Innovation Way, Suite 100, Newark, Delaware 19711
17
Nigsch F, Mitchell JBO. How to winnow actives from inactives: introducing molecular orthogonal sparse bigrams (MOSBs) and multiclass Winnow. J Chem Inf Model 2008; 48:306-18. [PMID: 18220378 DOI: 10.1021/ci700350n] [Citation(s) in RCA: 15] [Impact Index Per Article: 0.9] [Indexed: 11/30/2022]
Abstract
In the present paper we combine the Winnow algorithm and an advanced scheme for feature generation into a tool for multiclass classification. The Winnow algorithm, specifically designed in the late 1980s to work well with high-dimensional data, by design ignores most of the irrelevant features for the scoring of each single training/test case. To augment the pool of available molecular features we use the Winnow algorithm in conjunction with a process that creates additional features from a set of given ones. We adapt a technique formerly employed in text classification termed "orthogonal sparse bigrams" and extend the use of that method to the domain of cheminformatics. Using circular molecular fingerprints as initial features, we create "molecular orthogonal sparse bigrams" (MOSBs) and report their successful application to the task of classification of bioactive molecules. Additionally, we introduce a memory-efficient way of bagging individual classifiers, avoiding the need to hold the complete training data set in memory. To compare the performance of our method with published results, we use the Hert data set of 8293 active molecules in 11 classes. We compare our method to Random Forest and find that our method not only is comparable or better in classification accuracy (up to 50% higher in MCC [Matthews correlation coefficient], 98% higher in fraction of correct predictions) but also is quicker to train (by a factor between 2 and 18, depending on the feature generation), more memory efficient, and able to cope more easily with large data sets when we seeded the actives into a pool of 94290 inactive molecules. It is shown that this method can be used with different fingerprints.
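To illustrate why Winnow copes well with many irrelevant features, here is a minimal sketch of the basic binary Winnow update on binary feature vectors: multiplicative promotion of the active weights after a missed positive, multiplicative demotion after a false positive. The multiclass variant, the MOSB feature generation, and the bagging scheme described in the abstract are not reproduced. The promotion factor `alpha=2.0` and the threshold of half the feature count are common textbook choices assumed here, and the function names are hypothetical.

```python
import numpy as np

def train_winnow(X, y, alpha=2.0, epochs=10):
    """Train a binary Winnow classifier on 0/1 feature vectors.

    Assumptions: weights start at 1, the threshold is n_features / 2,
    and weights of active features are multiplied or divided by alpha.
    """
    X = np.asarray(X, dtype=bool)
    n_features = X.shape[1]
    w = np.ones(n_features)
    theta = n_features / 2.0
    for _ in range(epochs):
        mistakes = 0
        for x, target in zip(X, y):
            pred = int(w[x].sum() >= theta)  # only active features contribute
            if pred == target:
                continue
            mistakes += 1
            if target == 1:
                w[x] *= alpha  # promotion: missed a positive
            else:
                w[x] /= alpha  # demotion: false positive
        if mistakes == 0:
            break  # a full pass without errors
    return w, theta

def predict_winnow(w, theta, X):
    X = np.asarray(X, dtype=bool)
    return (X @ w >= theta).astype(int)
```

Because an update touches only the features that are active in the misclassified sample, weights of features that never co-occur with errors stay untouched, which keeps training cheap on sparse fingerprints.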
Affiliation(s)
- Florian Nigsch
- Unilever Centre for Molecular Science Informatics, Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge CB2 1EW, United Kingdom
18
Abstract
We present a comparative assessment of several state-of-the-art machine learning tools for mining drug data, including support vector machines (SVMs) and the ensemble decision tree methods boosting, bagging, and random forest, using eight data sets and two sets of descriptors. We demonstrate, by rigorous multiple comparison statistical tests, that these techniques can provide consistent improvements in predictive performance over single decision trees. However, within these methods, there is no clearly best-performing algorithm. This motivates a more in-depth investigation into the properties of random forests. We identify a set of parameters for the random forest that provide optimal performance across all the studied data sets. Additionally, the tree ensemble structure of the forest may provide an interpretable model, a considerable advantage over SVMs. We test this possibility and compare it with standard decision tree models.
Affiliation(s)
- Craig L Bruce
- School of Chemistry, University of Nottingham, University Park, Nottingham NG7 2RD, UK