26
|
Li ZR, McCormick TH. An Expectation Conditional Maximization approach for Gaussian graphical models. J Comput Graph Stat 2019; 28:767-777. [PMID: 33033426 PMCID: PMC7540244 DOI: 10.1080/10618600.2019.1609976] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 09/17/2017] [Revised: 04/02/2019] [Accepted: 04/09/2019] [Indexed: 10/26/2022]
Abstract
Bayesian graphical models are a useful tool for understanding dependence relationships among many variables, particularly in situations with external prior information. In high-dimensional settings, the space of possible graphs becomes enormous, rendering even state-of-the-art Bayesian stochastic search computationally infeasible. We propose a deterministic alternative to estimate Gaussian and Gaussian copula graphical models using an Expectation Conditional Maximization (ECM) algorithm, extending the EM approach from Bayesian variable selection to graphical model estimation. We show that the ECM approach enables fast posterior exploration under a sequence of mixture priors, and can incorporate multiple sources of information.
|
27
|
Li ZR, McCormick TH, Clark SJ. Bayesian Joint Spike-and-Slab Graphical Lasso. Proc Mach Learn Res 2019; 97:3877-3885. [PMID: 33521648 PMCID: PMC7845917] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/12/2023]
Abstract
In this article, we propose a new class of priors for Bayesian inference with multiple Gaussian graphical models. We introduce Bayesian treatments of two popular procedures, the group graphical lasso and the fused graphical lasso, and extend them to a continuous spike-and-slab framework to allow self-adaptive shrinkage and model selection simultaneously. We develop an EM algorithm that performs fast and dynamic explorations of posterior modes. Our approach selects sparse models efficiently and automatically with substantially smaller bias than would be induced by alternative regularization procedures. The performance of the proposed methods is demonstrated through simulation and two real data examples.
|
28
|
McCormick TH, Li ZR, Calvert C, Crampin AC, Kahn K, Clark SJ. Probabilistic Cause-of-death Assignment using Verbal Autopsies. J Am Stat Assoc 2016; 111:1036-1049. [PMID: 27990036 PMCID: PMC5154628 DOI: 10.1080/01621459.2016.1152191] [Citation(s) in RCA: 60] [Impact Index Per Article: 7.5] [Received: 11/01/2014] [Revised: 12/01/2015] [Indexed: 10/22/2022]
Abstract
In regions without complete-coverage civil registration and vital statistics systems there is uncertainty about even the most basic demographic indicators. In such regions the majority of deaths occur outside hospitals and are not recorded. Worldwide, fewer than one-third of deaths are assigned a cause, with the least information available from the most impoverished nations. In populations like this, verbal autopsy (VA) is a commonly used tool to assess cause of death and estimate cause-specific mortality rates and the distribution of deaths by cause. VA uses an interview with caregivers of the decedent to elicit data describing the signs and symptoms leading up to the death. This paper develops a new statistical tool known as InSilicoVA to classify cause of death using information acquired through VA. InSilicoVA shares uncertainty between cause of death assignments for specific individuals and the distribution of deaths by cause across the population. Using side-by-side comparisons with both observed and simulated data, we demonstrate that InSilicoVA has distinct advantages compared to currently available methods.
|
29
|
Dai LN, Chen CD, Lin XK, Wang YB, Xia LG, Liu P, Chen XM, Li ZR. Retroperitoneal laparoscopy management for ureteral fibroepithelial polyps causing hydronephrosis in children: a report of five cases. J Pediatr Urol 2015; 11:257.e1-5. [PMID: 25982337 DOI: 10.1016/j.jpurol.2015.02.019] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.0] [Received: 11/22/2014] [Accepted: 02/15/2015] [Indexed: 11/15/2022]
Abstract
INTRODUCTION Hydronephrosis is a common disease in children and may be caused by ureteral fibroepithelial polyps (UFP). Ureteral fibroepithelial polyps are rare in children and are difficult to precisely diagnose before surgery. Surgical treatment for symptomatic UFP is recommended. At the present institution, retroperitoneal laparoscopy has been used to treat five boys with UFP since 2006. OBJECTIVE To highlight the significance of UFP as an etiological factor of hydronephrosis in children and evaluate the applicative value of retroperitoneal laparoscopy in the treatment of children with UFP. METHODS Between 2006 and 2013 five boys underwent retroperitoneal laparoscopy at the present institution. They were identified with UFP by review of the clinical database. Detailed data were collected, including: radiographic studies, gross anatomical pathology, and pathology and radiology reports. All boys had been followed up at least every 6 months. RESULTS All of the boys were aged between 7 and 16 years (mean 9.8 years). The main symptoms were flank pain (all five) and hematuria (three). Radiographic examination showed that all of the boys presented with incomplete ureteral obstruction and hydronephrosis. The ureteral fibroepithelial polyps were located near the left UPJ or the left proximal ureter. All of the boys had the UFP removed: three underwent retroperitoneal laparoscopic dismembered Anderson-Hynes pyeloplasty and polypectomy, and two had retroperitoneal laparoscopic ureteral anastomosis. These polyps were all on the left side and between 15 and 35 mm in length (mean 22 mm) (Figure). All of the boys recovered well and were discharged from hospital. The postoperative histological report confirmed that the specimens were UFP. Hydronephrosis was periodically assessed by ultrasonography (using the same method as pre-surgical ultrasonography) after surgery. Mean follow-up was 33 months (range 6-58 months) and no complications were found afterwards. 
CONCLUSIONS Ureteral fibroepithelial polyps are rare but important, as they can cause UPJ obstruction, which often manifests as hydronephrosis. It is most important to confirm the site of ureteral obstruction before surgery, as this may affect the surgical management. Retroperitoneal laparoscopy is recommended for the management of UFP in children.
|
30
|
Rao HB, Zhu F, Yang GB, Li ZR, Chen YZ. Update of PROFEAT: a web server for computing structural and physicochemical features of proteins and peptides from amino acid sequence. Nucleic Acids Res 2011; 39:W385-90. [PMID: 21609959 PMCID: PMC3125735 DOI: 10.1093/nar/gkr284] [Citation(s) in RCA: 105] [Impact Index Per Article: 8.1] [Indexed: 12/04/2022]
Abstract
Sequence-derived structural and physicochemical features have been extensively used for analyzing and predicting structural, functional, expression and interaction profiles of proteins and peptides. PROFEAT has been developed as a web server for computing commonly used features of proteins and peptides from amino acid sequence. To facilitate more extensive studies of proteins and peptides, numerous improvements and updates have been made to PROFEAT. We added new functions for computing descriptors of protein–protein and protein–small molecule interactions, segment descriptors for local properties of protein sequences, and topological descriptors for peptide sequences and small molecule structures. We also added new feature groups for proteins and peptides (pseudo-amino acid composition, amphiphilic pseudo-amino acid composition, total amino acid properties and atomic-level topological descriptors) as well as for small molecules (atomic-level topological descriptors). Overall, PROFEAT computes 11 feature groups of descriptors for proteins and peptides, and a feature group of more than 400 descriptors for small molecules plus the derived features for protein–protein and protein–small molecule interactions. Our computational algorithms have been extensively tested and used in a number of published works for predicting proteins of specific structural or functional classes, protein–protein interactions, peptides of specific functions and quantitative structure activity relationships of small molecules. PROFEAT is accessible free of charge at http://bidd.cz3.nus.edu.sg/cgi-bin/prof/protein/profnew.cgi.
|
31
|
Tan NX, Rao HB, Li ZR, Li XY. Prediction of chemical carcinogenicity by machine learning approaches. SAR QSAR Environ Res 2009; 20:27-75. [PMID: 19343583 DOI: 10.1080/10629360902724085] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.1] [Indexed: 05/27/2023]
Abstract
In this paper we report a successful application of machine learning approaches to the prediction of chemical carcinogenicity. Two different approaches, namely a support vector machine (SVM) and an artificial neural network (ANN), were evaluated for predicting chemical carcinogenicity from molecular structure descriptors. A diverse set of 844 compounds, including 600 carcinogenic (CG+) and 244 noncarcinogenic (CG-) molecules, was used to estimate the accuracies of these approaches. The database was divided into two sets: the model construction set and the independent test set. Relevant molecular descriptors were selected by a hybrid feature selection method combining Fisher's score and Monte Carlo simulated annealing from a wide set of molecular descriptors, including physicochemical properties, constitutional, topological, and geometrical descriptors. The first model validation method was based on five-fold cross-validation, in which the model construction set is split into five subsets. The five-fold cross-validation was used to select descriptors and optimise the model parameters by maximising the averaged overall accuracy. The final SVM model gave an averaged prediction accuracy of 90.7% for CG+ compounds, 81.6% for CG- compounds and 88.1% for the overall accuracy, while the corresponding ANN model provided an averaged prediction accuracy of 86.1% for CG+ compounds, 79.3% for CG- compounds and 84.2% for the overall accuracy. These results indicate that the hybrid feature selection method is very efficient and the selected descriptors are truly relevant to the carcinogenicity of compounds. The second model validation method was a hold-out method, in which the whole model construction set was used to build the classification model with the selected descriptors and the optimised model parameters, and the independent test set was used to test the predictive ability of the model. The SVM model gave a prediction accuracy of 87.6% for CG+ compounds, 79.1% for CG- compounds and 85.0% for the overall accuracy. The ANN model gave a prediction accuracy of 85.6% for CG+ compounds, 79.1% for CG- compounds and 83.6% for the overall accuracy. The results indicate that the built models are potentially useful for facilitating the prediction of chemical carcinogenicity of untested compounds.
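The per-class and overall accuracies reported in this abstract are linked by a simple weighted average. The following is a minimal Python sketch of that relationship, assuming that "accuracy for CG+" means the fraction of carcinogens predicted correctly and "accuracy for CG-" the fraction of non-carcinogens predicted correctly. The function names are hypothetical and are not taken from the paper.

```python
def class_accuracies(y_true, y_pred, positive=1):
    """Return (positive-class accuracy, negative-class accuracy, overall accuracy).

    Assumption: labels equal to `positive` are CG+; everything else is CG-.
    """
    pairs = list(zip(y_true, y_pred))
    n_pos = sum(1 for t, _ in pairs if t == positive)
    n_neg = len(pairs) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp / n_pos, tn / n_neg, (tp + tn) / len(pairs)


def overall_from_class_rates(acc_pos, acc_neg, n_pos, n_neg):
    """Overall accuracy as the class-size-weighted mean of per-class accuracies."""
    return (acc_pos * n_pos + acc_neg * n_neg) / (n_pos + n_neg)


# Illustration only: uses the class sizes of the full data set (600 CG+, 244 CG-),
# which is an assumption because the abstract does not give the construction-set sizes.
overall = overall_from_class_rates(0.907, 0.816, 600, 244)
```

Under that assumption the weighted average comes to about 0.881, which agrees with the 88.1% overall accuracy quoted for the cross-validated SVM model.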
|
32
|
Zhang NF, Li ZR, Wei HY, Liu ZH, Hernigou P. Steroid-induced osteonecrosis: the number of lesions is related to the dosage. J Bone Joint Surg Br 2008; 90:1239-43. [PMID: 18757967 DOI: 10.1302/0301-620x.90b9.20056] [Citation(s) in RCA: 36] [Impact Index Per Article: 2.3] [Indexed: 11/05/2022]
Abstract
Severe acute respiratory syndrome (SARS) is a newly described infectious disease caused by the SARS coronavirus which attacks the immune system and pulmonary epithelium. It is treated with regular high doses of corticosteroids. Our aim was to determine the relationship between the dosage of steroids and the number and distribution of osteonecrotic lesions in patients treated with steroids during the SARS epidemic in Beijing, China in 2003. We identified 114 patients for inclusion in the study. Of these, 43 with osteonecrosis received a significantly higher cumulative and peak methylprednisolone-equivalent dose than 71 patients with no osteonecrosis identified by MRI. We confirmed that the number of osteonecrotic lesions was directly related to the dosage of steroids and that a very high dose, a peak dose of more than 200 mg or a cumulative methylprednisolone-equivalent dose of more than 4000 mg, is a significant risk factor for multifocal osteonecrosis with both epiphyseal and diaphyseal lesions. Patients with diaphyseal osteonecrosis received a significantly higher cumulative methylprednisolone-equivalent dose than those with epiphyseal osteonecrosis. Multifocal osteonecrosis should be suspected if a patient is diagnosed with osteonecrosis in the shaft of a long bone.
|
33
|
Ma XH, Wang R, Yang SY, Li ZR, Xue Y, Wei YC, Low BC, Chen YZ. Evaluation of virtual screening performance of support vector machines trained by sparsely distributed active compounds. J Chem Inf Model 2008; 48:1227-37. [PMID: 18533644 DOI: 10.1021/ci800022e] [Citation(s) in RCA: 32] [Impact Index Per Article: 2.0] [Indexed: 01/04/2023]
Abstract
Virtual screening performance of support vector machines (SVM) depends on the diversity of training active and inactive compounds. While diverse inactive compounds can be routinely generated, the number and diversity of known actives are typically low. We evaluated the performance of SVM trained by sparsely distributed actives in six MDDR biological target classes composed of a high number of known actives (983-1645) of high, intermediate, and low structural diversity (muscarinic M1 receptor agonists, NMDA receptor antagonists, thrombin inhibitors, HIV protease inhibitors, cephalosporins, and renin inhibitors). SVM trained by regularly sparse data sets of 100 actives show improved yields at substantially reduced false-hit rates compared to those of published studies and those of the Tanimoto-based similarity searching method based on the same data sets and molecular descriptors. SVM trained by very sparse data sets of 40 actives (2.4%-4.1% of the known actives) predicted 17.5-39.5%, 23.0-48.1%, and 70.2-92.4% of the remaining 943-1605 actives in the high, intermediate, and low diversity classes, respectively, 13.8-68.7% of which are outside the training compound families. SVM predicted 99.97% and 97.1% of the 9.997 M PUBCHEM and 167K remaining MDDR compounds as inactive and 2.6%-8.3% of the 19,495-38,483 MDDR compounds similar to the known actives as active. These results suggest that SVM has substantial capability in identifying novel active compounds from sparse active data sets at low false-hit rates.
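As background for the Tanimoto-based similarity searching used as a baseline in this abstract, here is a minimal Python sketch. It assumes compounds are represented as sets of "on" fingerprint bits, and the threshold value and function names are illustrative assumptions. The abstract itself uses molecular descriptors and does not state its similarity cut-off.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient |A ∩ B| / |A ∪ B| for two sets of 'on' bits."""
    a, b = set(fp_a), set(fp_b)
    union = len(a | b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / union


def similarity_search(known_actives, library, threshold=0.9):
    """Return library names whose best similarity to any known active meets the threshold.

    `known_actives` is a non-empty list of fingerprints; `library` maps name -> fingerprint.
    The default threshold of 0.9 is a hypothetical value for illustration.
    """
    hits = []
    for name, fp in library.items():
        best = max(tanimoto(fp, active) for active in known_actives)
        if best >= threshold:
            hits.append(name)
    return hits
```

A compound is flagged when its nearest known active is similar enough, so this approach can only retrieve compounds close to the training actives, which is the limitation the SVM comparison in the abstract addresses.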
|
34
|
Qin L, Zhang G, Sheng H, Wang XL, Wang YX, Yeung KW, Griffith JF, Li ZR, Leung KS, Yao XS. Phytoestrogenic compounds for prevention of steroid-associated osteonecrosis. J Musculoskelet Neuronal Interact 2008; 8:18-21. [PMID: 18398255] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/26/2023]
|
35
|
Han LY, Ma XH, Lin HH, Jia J, Zhu F, Xue Y, Li ZR, Cao ZW, Ji ZL, Chen YZ. A support vector machines approach for virtual screening of active compounds of single and multiple mechanisms from large libraries at an improved hit-rate and enrichment factor. J Mol Graph Model 2007; 26:1276-86. [PMID: 18218332 DOI: 10.1016/j.jmgm.2007.12.002] [Citation(s) in RCA: 65] [Impact Index Per Article: 3.8] [Received: 04/29/2007] [Revised: 12/05/2007] [Accepted: 12/05/2007] [Indexed: 01/04/2023]
Abstract
Support vector machines (SVM) and other machine-learning (ML) methods have been explored as ligand-based virtual screening (VS) tools for facilitating lead discovery. While exhibiting good hit selection performance, in screening large compound libraries these methods tend to produce lower hit-rates than those of the best performing VS tools, partly because their training-sets contain a limited spectrum of inactive compounds. We tested whether the performance of SVM can be improved by using training-sets of diverse inactive compounds. In retrospective database screening of active compounds of single mechanism (HIV protease inhibitors, DHFR inhibitors, dopamine antagonists) and multiple mechanisms (CNS active agents) from large libraries of 2.986 million compounds, the yields, hit-rates, and enrichment factors of our SVM models are 52.4-78.0%, 4.7-73.8%, and 214-10,543, respectively, compared to those of 62-95%, 0.65-35%, and 20-1200 by structure-based VS and 55-81%, 0.2-0.7%, and 110-795 by other ligand-based VS tools in screening libraries of ≥1 million compounds. The hit-rates are comparable and the enrichment factors are substantially better than the best results of other VS tools. 24.3-87.6% of the predicted hits are outside the known hit families. SVM appears to be potentially useful for facilitating lead discovery in VS of large compound libraries.
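The three screening metrics compared in this abstract can be computed from four counts. The sketch below uses definitions that are common in the virtual screening literature: yield as the fraction of known actives recovered, hit-rate as the fraction of predicted hits that are true actives, and enrichment factor as the hit-rate divided by the fraction of actives in the library. These definitions are an assumption, since the abstract does not spell them out, and the function name is hypothetical.

```python
def screening_metrics(n_actives, n_library, n_predicted_hits, n_true_hits):
    """Return (yield, hit_rate, enrichment_factor) for one screening run.

    n_actives        -- known actives present in the screened library
    n_library        -- total number of compounds screened
    n_predicted_hits -- compounds the model labels as active
    n_true_hits      -- predicted hits that are known actives
    """
    if min(n_actives, n_library, n_predicted_hits) <= 0:
        raise ValueError("counts must be positive")
    recovered = n_true_hits / n_actives          # yield
    hit_rate = n_true_hits / n_predicted_hits    # precision of the hit list
    base_rate = n_actives / n_library            # hit-rate of random selection
    return recovered, hit_rate, hit_rate / base_rate
```

For example, recovering 60 of 100 actives with 200 predicted hits from a library of one million compounds gives a yield of 0.6, a hit-rate of 0.3, and an enrichment factor of about 3000. The enrichment factor grows with library size for a fixed hit-rate, which is one reason very large values can appear in million-compound screens.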
|
36
|
Li H, Yap CW, Ung CY, Xue Y, Li ZR, Han LY, Lin HH, Chen YZ. Machine learning approaches for predicting compounds that interact with therapeutic and ADMET related proteins. J Pharm Sci 2007; 96:2838-60. [PMID: 17786989 DOI: 10.1002/jps.20985] [Citation(s) in RCA: 46] [Impact Index Per Article: 2.7] [Indexed: 02/06/2023]
Abstract
Computational methods for predicting compounds with specific pharmacodynamic and ADMET (absorption, distribution, metabolism, excretion and toxicity) properties are useful for facilitating drug discovery and evaluation. Recently, machine learning methods such as neural networks and support vector machines have been explored for predicting inhibitors, antagonists, blockers, agonists, activators and substrates of proteins related to specific therapeutic and ADMET properties. These methods are particularly useful for compounds of diverse structures to complement QSAR methods, and for cases of unavailable receptor 3D structure to complement structure-based methods. A number of studies have demonstrated the potential of these methods for predicting such compounds as substrates of P-glycoprotein and cytochrome P450 CYP isoenzymes, inhibitors of protein kinases and CYP isoenzymes, and agonists of serotonin receptor and estrogen receptor. This article reviews the strategies, current progress and underlying difficulties in using machine learning methods for predicting these protein binders and as potential virtual screening tools. Algorithms for proper representation of the structural and physicochemical properties of compounds are also evaluated.
|
37
|
Yap CW, Xue Y, Li ZR, Chen YZ. Application of support vector machines to in silico prediction of cytochrome p450 enzyme substrates and inhibitors. Curr Top Med Chem 2007; 6:1593-607. [PMID: 16918471 DOI: 10.2174/156802606778108942] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.3] [Indexed: 11/22/2022]
Abstract
Cytochrome P450 enzymes are responsible for phase I metabolism of the majority of drugs and xenobiotics. Identification of the substrates and inhibitors of these enzymes is important for the analysis of drug metabolism, prediction of drug-drug interactions and drug toxicity, and the design of drugs that modulate cytochrome P450 mediated metabolism. The substrates and inhibitors of these enzymes are structurally diverse. It is thus desirable to explore methods capable of predicting compounds of diverse structures without over-fitting. The support vector machine is an attractive method with these qualities and has been employed for predicting the substrates and inhibitors of several cytochrome P450 isoenzymes as well as compounds of various other pharmacodynamic, pharmacokinetic, and toxicological properties. This article introduces the methodology, evaluates the performance, and discusses the underlying difficulties and future prospects of the application of support vector machines to in silico prediction of cytochrome P450 substrates and inhibitors.
|
38
|
Li ZR, Han LY, Xue Y, Yap CW, Li H, Jiang L, Chen YZ. MODEL—molecular descriptor lab: A web-based server for computing structural and physicochemical features of compounds. Biotechnol Bioeng 2007; 97:389-96. [PMID: 17013940 DOI: 10.1002/bit.21214] [Citation(s) in RCA: 40] [Impact Index Per Article: 2.4] [Indexed: 11/10/2022]
Abstract
Molecular descriptors represent structural and physicochemical features of compounds. They have been extensively used for developing statistical models, such as quantitative structure activity relationship (QSAR) and artificial neural networks (NN), for computer prediction of the pharmacodynamic, pharmacokinetic, or toxicological properties of compounds from their structure. While computer programs have been developed for computing molecular descriptors, there is a lack of a freely accessible one. We have developed a web-based server, MODEL (Molecular Descriptor Lab), for computing a comprehensive set of 3,778 molecular descriptors, which is significantly more than the approximately 1,600 molecular descriptors computed by other software. Our computational algorithms have been extensively tested and the computed molecular descriptors have been used in a number of published works of statistical models for predicting a variety of pharmacodynamic, pharmacokinetic, and toxicological properties of compounds. Several testing studies on the computed molecular descriptors are discussed. MODEL is accessible at http://jing.cz3.nus.edu.sg/cgi-bin/model/model.cgi free of charge for academic use.
|
39
|
Li H, Ung CY, Yap CW, Xue Y, Li ZR, Chen YZ. Prediction of estrogen receptor agonists and characterization of associated molecular descriptors by statistical learning methods. J Mol Graph Model 2006; 25:313-23. [PMID: 16497524 DOI: 10.1016/j.jmgm.2006.01.007] [Citation(s) in RCA: 35] [Impact Index Per Article: 1.9] [Received: 10/19/2005] [Revised: 12/21/2005] [Accepted: 01/19/2006] [Indexed: 01/04/2023]
Abstract
Specific estrogen receptor (ER) agonists have been used for hormone replacement therapy, contraception, osteoporosis prevention, and prostate cancer treatment. Some ER agonists and partial-agonists induce cancer and endocrine function disruption. Methods for predicting ER agonists are useful for facilitating drug discovery and chemical safety evaluation. Structure-activity relationships and rule-based decision forest models have been derived for predicting ER binders at impressive accuracies of 87.1-97.6% for ER binders and 80.2-96.0% for ER non-binders. However, these are not designed for identifying ER agonists and they were developed from a subset of known ER binders. This work explored several statistical learning methods (support vector machines, k-nearest neighbor, probabilistic neural network and C4.5 decision tree) for predicting ER agonists from a comprehensive set of known ER agonists and other compounds. The corresponding prediction systems were developed and tested by using 243 ER agonists and 463 ER non-agonists, respectively, which are significantly larger in number and structural diversity than those in previous studies. A feature selection method was used for selecting molecular descriptors responsible for distinguishing ER agonists from non-agonists, some of which are consistent with those used in other studies and the findings from X-ray crystallography data. The prediction accuracies of these methods are comparable to those of earlier studies despite the use of a significantly more diverse range of compounds. SVM gives the best accuracy of 88.9% for ER agonists and 98.1% for non-agonists. Our study suggests that statistical learning methods such as SVM are potentially useful for facilitating the prediction of ER agonists and for characterizing the molecular descriptors associated with ER agonists.
|
40
|
Mi D, Liu GR, Wang JS, Li ZR. Relationships between the folding rate constant and the topological parameters of small two-state proteins based on general random walk model. J Theor Biol 2006; 241:152-7. [PMID: 16386276 DOI: 10.1016/j.jtbi.2005.11.011] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.3] [Received: 06/10/2005] [Revised: 09/27/2005] [Accepted: 11/10/2005] [Indexed: 11/19/2022]
Abstract
In this paper, we propose an analytically tractable model of protein folding based on a one-dimensional general random walk. A second-order differential equation for the mean folding time of a single protein is constructed, which can be used to derive the observed relationship between the folding rate constant and the number of native contacts. The parameters appearing in the model can be determined by fitting the theoretical prediction to the experimental result. In addition, taking into account the fact that the number of native contacts is almost proportional to the relative contact order, we can also explain the observed relationship between the folding rate constant and the relative contact order.
|
41
|
Yap CW, Xue Y, Li H, Li ZR, Ung CY, Han LY, Zheng CJ, Cao ZW, Chen YZ. Prediction of compounds with specific pharmacodynamic, pharmacokinetic or toxicological property by statistical learning methods. Mini Rev Med Chem 2006; 6:449-59. [PMID: 16613581 DOI: 10.2174/138955706776361501] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.2] [Indexed: 11/22/2022]
Abstract
Computational methods for predicting compounds with specific pharmacodynamic, pharmacokinetic, or toxicological properties are useful for facilitating drug discovery and drug safety evaluation. The quantitative structure-activity relationship (QSAR) and quantitative structure-property relationship (QSPR) methods are the most successfully used statistical learning methods for predicting compounds with specific properties. More recently, other statistical learning methods such as neural networks and support vector machines have been explored for predicting compounds of higher structural diversity than those covered by QSAR and QSPR. These methods have shown promising potential in a number of studies. This article reviews the strategies, current progress and underlying difficulties in using statistical learning methods for predicting compounds with specific properties. It also evaluates algorithms commonly used for representing structural and physicochemical properties of compounds.
|
42
|
Han LY, Lin HH, Li ZR, Zheng CJ, Cao ZW, Xie B, Chen YZ. PEARLS: Program for Energetic Analysis of Receptor−Ligand System. J Chem Inf Model 2006; 46:445-50. [PMID: 16426079 DOI: 10.1021/ci0502146] [Citation(s) in RCA: 51] [Impact Index Per Article: 2.8] [Indexed: 11/28/2022]
Abstract
Analysis of the energetics of small molecule ligand-protein, ligand-nucleic acid, and protein-nucleic acid interactions facilitates the quantitative understanding of molecular interactions that regulate the function and conformation of proteins. It has also been extensively used for ranking potential new ligands in virtual drug screening. We developed a Web-based program, PEARLS (Program for Energetic Analysis of Ligand-Receptor Systems), for computing interaction energies of ligand-protein, ligand-nucleic acid, protein-nucleic acid, and ligand-protein-nucleic acid complexes from their 3D structures. AMBER molecular force field, Morse potential, and empirical energy functions are used to compute the van der Waals, electrostatic, hydrogen bond, metal-ligand bonding, and water-mediated hydrogen bond energies between the binding molecules. The change in the solvation free energy of molecular binding is estimated by using an empirical solvation free energy model. The contribution from ligand conformational entropy change is also estimated by a simple model. The computed free energies for a number of PDB ligand-receptor complexes were studied and compared to experimental binding affinity. A substantial degree of correlation between the computed free energy and experimental binding affinity was found, which suggests that PEARLS may be useful in facilitating energetic analysis of ligand-protein, ligand-nucleic acid, and protein-nucleic acid interactions. PEARLS can be accessed at http://ang.cz3.nus.edu.sg/cgi-bin/prog/rune.pl.
|
43
|
Yap CW, Li ZR, Chen YZ. Quantitative structure-pharmacokinetic relationships for drug clearance by using statistical learning methods. J Mol Graph Model 2005; 24:383-95. [PMID: 16290201 DOI: 10.1016/j.jmgm.2005.10.004] [Citation(s) in RCA: 49] [Impact Index Per Article: 2.6] [Received: 07/27/2005] [Revised: 10/04/2005] [Accepted: 10/04/2005] [Indexed: 10/25/2022]
Abstract
Quantitative structure-pharmacokinetic relationships (QSPkR) have increasingly been used for the prediction of the pharmacokinetic properties of drug leads. Several QSPkR models have been developed to predict the total clearance (CL(tot)) of a compound. These models give good prediction accuracy but they are primarily based on a limited number of related compounds, which are significantly fewer and less diverse than the 503 compounds with known CL(tot) described in the literature. It is desirable to examine whether these and other statistical learning methods can be used for predicting the CL(tot) of a more diverse set of compounds. In this work, three statistical learning methods, general regression neural network (GRNN), support vector regression (SVR) and k-nearest neighbour (KNN), were explored for modeling the CL(tot) of all of the 503 known compounds. Six different sets of molecular descriptors, DS-MIXED, DS-3DMoRSE, DS-ATS, DS-GETAWAY, DS-RDF and DS-WHIM, were evaluated for their usefulness in the prediction of CL(tot). GRNN-, SVR- and KNN-developed models have average-fold errors in the range of 1.63-1.96, 1.66-1.95 and 1.90-2.23, respectively. For the best GRNN-, SVR- and KNN-developed models, the percentages of compounds with predicted CL(tot) within two-fold error of actual values are in the range of 61.9-74.3% and are comparable to or slightly better than those of earlier studies. QSPkR models developed by using DS-MIXED, which is a collection of constitutional, geometrical, topological and electrotopological descriptors, generally give better prediction accuracies than those developed by using other descriptor sets. These results suggest that GRNN, SVR, and their consensus model are potentially useful for predicting QSPkR properties of drug leads.
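The two error measures quoted in this abstract, the average-fold error and the percentage of predictions within two-fold of the actual value, can be sketched in a few lines of Python. The version below assumes the geometric-mean form of the average-fold error, 10 raised to the mean of |log10(predicted/observed)|, which is one common convention in pharmacokinetics. The abstract does not state which formula the authors used, and the function names are hypothetical.

```python
import math


def fold_error(predicted, observed):
    """Symmetric fold error: 2.0 means the prediction is off by a factor of two."""
    ratio = predicted / observed
    return max(ratio, 1.0 / ratio)


def average_fold_error(predicted, observed):
    """Geometric-mean fold error, assuming all values are strictly positive."""
    logs = [abs(math.log10(p / o)) for p, o in zip(predicted, observed)]
    return 10 ** (sum(logs) / len(logs))


def fraction_within_fold(predicted, observed, fold=2.0):
    """Fraction of predictions whose fold error does not exceed `fold`."""
    errors = [fold_error(p, o) for p, o in zip(predicted, observed)]
    return sum(1 for e in errors if e <= fold) / len(errors)
```

Both measures treat over- and under-prediction symmetrically on a ratio scale, which suits clearance values that span several orders of magnitude.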
|
44
|
Xue Y, Li ZR, Yap CW, Sun LZ, Chen X, Chen YZ. Effect of molecular descriptor feature selection in support vector machine classification of pharmacokinetic and toxicological properties of chemical agents. J Chem Inf Comput Sci 2005; 44:1630-8. [PMID: 15446820 DOI: 10.1021/ci049869h] [Citation(s) in RCA: 116] [Impact Index Per Article: 6.1] [Indexed: 02/07/2023]
Abstract
Statistical-learning methods have been developed for facilitating the prediction of pharmacokinetic and toxicological properties of chemical agents. These methods employ a variety of molecular descriptors to characterize structural and physicochemical properties of molecules. Some of these descriptors are specifically designed for the study of a particular type of properties or agents, and their use for other properties or agents might generate noise and affect the prediction accuracy of a statistical learning system. This work examines to what extent the reduction of this noise can improve the prediction accuracy of a statistical learning system. A feature selection method, recursive feature elimination (RFE), is used to automatically select molecular descriptors for support vector machines (SVM) prediction of P-glycoprotein substrates (P-gp), human intestinal absorption of molecules (HIA), and agents that cause torsades de pointes (TdP), a rare but serious side effect. RFE significantly reduces the number of descriptors for each of these properties thereby increasing the computational speed for their classification. The SVM prediction accuracies of P-gp and HIA are substantially increased and that of TdP remains unchanged by RFE. These prediction accuracies are comparable to those of earlier studies derived from a selective set of descriptors. Our study suggests that molecular feature selection is useful for improving the speed and, in some cases, the accuracy of statistical learning methods for the prediction of pharmacokinetic and toxicological properties of chemical agents.
|
45
|
Wang JF, Li ZR, Cai CZ, Chen YZ. Assessment of approximate string matching in a biomedical text retrieval problem. Comput Biol Med 2005; 35:717-24. [PMID: 16124992 DOI: 10.1016/j.compbiomed.2004.06.002] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.4] [Received: 02/11/2004] [Accepted: 06/02/2004] [Indexed: 11/19/2022]
Abstract
Text-based search is widely used for biomedical data mining and knowledge discovery. Character errors in the literature reduce the accuracy of data mining, and methods for solving this problem are being explored. This work tests the usefulness of the Smith-Waterman algorithm with affine gap penalty as a method for biomedical literature retrieval. Names of medicinal herbs collected from the herbal medicine literature are matched with those from the medicinal chemistry literature by using this algorithm at different string identity levels (80-100%). The optimum performance is at a string identity of 88%, at which the recall and precision are 96.9% and 97.3%, respectively. Our study suggests that the Smith-Waterman algorithm is useful for improving the success rate of biomedical text retrieval.
|
46
|
Li H, Ung CY, Yap CW, Xue Y, Li ZR, Cao ZW, Chen YZ. Prediction of Genotoxicity of Chemical Compounds by Statistical Learning Methods. Chem Res Toxicol 2005; 18:1071-80. [PMID: 15962942 DOI: 10.1021/tx049652h] [Citation(s) in RCA: 55] [Impact Index Per Article: 2.9] [Indexed: 11/29/2022]
Abstract
Various toxicological profiles, such as genotoxic potential, need to be studied in drug discovery processes and submitted to the drug regulatory authorities for drug safety evaluation. As part of the effort to develop low-cost and efficient adverse drug reaction testing tools, several statistical learning methods have been used for developing genotoxicity prediction systems with an accuracy of up to 73.8% for genotoxic (GT+) and 92.8% for nongenotoxic (GT-) agents. These systems have been developed and tested by using fewer than 400 known GT+ and GT- agents, which are significantly fewer and less diverse than the 860 GT+ and GT- agents known at present. There is a need to examine whether a similar level of accuracy can be achieved for the more diverse set of molecules and to evaluate other statistical learning methods not yet applied to genotoxicity prediction. This work tests several statistical learning methods, including support vector machines (SVM), probabilistic neural network (PNN), k-nearest neighbor (k-NN), and C4.5 decision tree (DT), by using 860 GT+ and GT- agents. A feature selection method, recursive feature elimination, is used for selecting molecular descriptors relevant to genotoxicity study. The overall accuracies of SVM, k-NN, and PNN are comparable to, and those of DT lower than, the results from earlier studies, with SVM giving the highest accuracies of 77.8% for GT+ and 92.7% for GT- agents. Our study suggests that statistical learning methods, particularly SVM, k-NN, and PNN, are useful for facilitating the prediction of genotoxic potential of a diverse set of molecules.
|
47
|
Li ZR, Han X, Liu GR. Protein designability analysis in sequence principal component space using 2D lattice model. COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE 2004; 76:21-29. [PMID: 15313539 DOI: 10.1016/j.cmpb.2004.04.001] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/21/2003] [Revised: 04/14/2004] [Accepted: 04/14/2004] [Indexed: 05/24/2023]
Abstract
The number of proteins that fold into a given structure differs drastically from one structure to another. The designability of a protein structure, which is defined as the number of sequences that have that structure as their unique lowest energy state, is studied in this paper using a simplified lattice model. The two-letter (HP) code and the pair-contact energy model are employed in the formulation of the relationship between the protein sequences and the compact structures. Because the different dimensions are correlated, principal component analysis (PCA) is carried out to remove these correlations and develop reliable approximations of the probability density functions of the protein sequences and the compact structures. An estimate of designability is derived using these probability density functions. Good correlation is achieved between the estimated designabilities and those obtained through enumerative calculations.
|
48
|
Balas EA, Su KC, Solem JF, Li ZR, Brown G. Upgrading clinical decision support with published evidence: what can make the biggest difference? Stud Health Technol Inform 1999; 52 Pt 2:845-8. [PMID: 10384580] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/13/2023]
Abstract
BACKGROUND: To enhance clinical decision support, presented messages are increasingly supplemented with information from the medical literature. The goal of this study was to identify types of evidence that can lead to the biggest difference. METHODS: Seven versions of a questionnaire were mailed to randomly selected active family practice physicians and internists across the United States. They were asked about the perceived values of evidence from randomized controlled trials, locally developed recommendations, no evidence, cost-effectiveness studies, expert opinion, epidemiologic studies, and clinical studies. Analysis of variance and pairwise comparisons were used for statistical testing. RESULTS: Seventy-six (52%) physicians responded. On a Likert scale from one to six, randomized controlled clinical trial was the highest rated evidence (mean 5.07, SD +/- 1.14). Such evidence was significantly superior to locally developed recommendations and no evidence at all (P < .05). The interaction was also strong between the types of evidence and clinical areas (P = .0001). CONCLUSION: While most health care organizations present data without interpretation or simply try to enforce locally developed recommendations, such approaches appear to be inferior to techniques of reporting data with pertinent controlled evidence from the literature. Investigating physicians' perceptions is likely to benefit the design of computer generated messages.
|
49
|
Li ZR, Tian AJ, Yang YY. Preparing for the third millennium: the views of life informatics. Stud Health Technol Inform 1999; 52 Pt 1:394-6. [PMID: 10384486] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/13/2023]
Abstract
This paper covers the conditions under which life informatics emerged, together with its tasks, basic concepts, principles, and structure. There are three phases of combining informatics with medicine: product, technological, and theoretical application, whose goals are, respectively, the informatization of numerical and word processing, of medical treatment data, and of medical knowledge. Having reached the third phase, we deal with two types of biological information, physical and nonphysical: body information (information about the body's components and structure) and life information (information about life codes and life programs). Life informatics is a main branch of bioinformatics. It is a new member of the medical informatics family, and as such is younger than health informatics, nursing informatics, and dental informatics. Its task is to help biologists and medical doctors recognize and intervene in the human life information process, just as they already do with the human body's matter and energy systems. Its basic concepts are life information, life information medicine, and life information therapy. Its most important principles are information materialism, general informatics, and information determinism. Its main branches are biomolecular, cellular, organic, individual, and social informatics. In the third millennium, life informatics will be a leading discipline in biology, medicine and informatics, and will gradually influence modern philosophy and other humanities.
|
50
|
Li ZR, Hromchak R, Mudipalli A, Bloch A. Tumor suppressor proteins as regulators of cell differentiation. Cancer Res 1998; 58:4282-7. [PMID: 9766653] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/09/2023]
Abstract
The products of the tumor suppressor genes are considered to function as specific inhibitors of tumor cell growth. In this communication, we present evidence to show that these proteins inhibit tumor cell proliferation by participating in the activation of tumor cell differentiation. The ML-1 human myeloblastic leukemia cells used in this study proliferate when treated with insulin-like growth factor I and transferrin but differentiate to monocytes when exposed to tumor necrosis factor alpha or transforming growth factor beta1, or to macrophage-like cells when treated with both these cytokines. Initiation of proliferation but not of differentiation was followed by a 20- to 25-fold increase in the nuclear level of the DNA polymerase-associated processivity factor PCNA and of the proliferation-specific transcription factor E2F1. In contrast, induction of differentiation but not of proliferation was followed by a 25- to 30-fold increase in the nuclear level of the tumor suppressor proteins p53 (wild type), pRb, and p130/Rb2 and of the p53-dependent cyclin kinase inhibitor p21/Cip1. p53 and p21/Cip1, respectively, inhibit the expression and activation of PCNA, whereas p130 and pRb, respectively, inhibit the expression and activation of E2F1. As a result, G1-S-associated DNA and mRNA synthesis is inhibited, growth uncoupled from differentiation, and maturation enabled to proceed. Where this function of the tumor suppressor proteins is impaired, the capacity for differentiation is lost, which leads to the sustained proliferation that is characteristic of the cancer cell.
|