1
Wu X, Zhou Q, Mu L, Hu X. Machine learning in the identification, prediction and exploration of environmental toxicology: Challenges and perspectives. Journal of Hazardous Materials 2022; 438:129487. [PMID: 35816807 DOI: 10.1016/j.jhazmat.2022.129487] [Citation(s) in RCA: 17] [Impact Index Per Article: 8.5] [Received: 05/25/2022] [Revised: 06/16/2022] [Accepted: 06/26/2022] [Indexed: 06/15/2023]
Abstract
Over the past few decades, data-driven machine learning (ML) has distinguished itself from hypothesis-driven studies and has recently received much attention in environmental toxicology. However, the use of ML in environmental toxicology remains in the early stages, with knowledge gaps, technical bottlenecks in data quality, high-dimensional/heterogeneous/small-sample data analysis and model interpretability, and a lack of an in-depth understanding of environmental toxicology. Given the above problems, we review the recent progress in the literature and highlight state-of-the-art toxicological studies using ML (such as learning and predicting toxicity in complicated biosystems and multiple-factor environmental scenarios of long-term and large-scale pollution). Beyond predicting simple biological endpoints by integrating untargeted omics and adverse outcome pathways, ML development should focus on revealing toxicological mechanisms. The integration of data-driven ML with other methods (e.g., omics analysis and adverse outcome pathway frameworks) endows ML with widely promising application in revealing toxicological mechanisms. High-quality databases and interpretable algorithms are urgently needed for toxicology and environmental science. Addressing the core issues and future challenges for ML in this review may narrow the knowledge gap between environmental toxicity and computational science and facilitate the control of environmental risk in the future.
Affiliation(s)
- Xiaotong Wu
- Key Laboratory of Pollution Processes and Environmental Criteria (Ministry of Education)/Tianjin Key Laboratory of Environmental Remediation and Pollution Control, College of Environmental Science and Engineering, Nankai University, Tianjin 300350, China
- Qixing Zhou
- Key Laboratory of Pollution Processes and Environmental Criteria (Ministry of Education)/Tianjin Key Laboratory of Environmental Remediation and Pollution Control, College of Environmental Science and Engineering, Nankai University, Tianjin 300350, China
- Li Mu
- Tianjin Key Laboratory of Agro-environment and Safe-product, Key Laboratory for Environmental Factors Control of Agro-product Quality Safety (Ministry of Agriculture and Rural Affairs), Institute of Agro-environmental Protection, Ministry of Agriculture and Rural Affairs, Tianjin 300191, China.
- Xiangang Hu
- Key Laboratory of Pollution Processes and Environmental Criteria (Ministry of Education)/Tianjin Key Laboratory of Environmental Remediation and Pollution Control, College of Environmental Science and Engineering, Nankai University, Tianjin 300350, China.
2
Ma H, Bian Y, Rong Y, Huang W, Xu T, Xie W, Ye G, Huang J. Cross-dependent graph neural networks for molecular property prediction. Bioinformatics 2022; 38:2003-2009. [PMID: 35094072 DOI: 10.1093/bioinformatics/btac039] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 10/01/2021] [Revised: 12/14/2021] [Accepted: 01/25/2022] [Indexed: 02/03/2023] Open
Abstract
MOTIVATION: The crux of molecular property prediction is to generate meaningful representations of the molecules. One promising route is to exploit the molecular graph structure through graph neural networks (GNNs). Both atoms and bonds significantly affect the chemical properties of a molecule, so an expressive model ought to exploit both node (atom) and edge (bond) information simultaneously. Inspired by this observation, we explore multi-view modeling with GNNs (MVGNN) to form a novel parallel framework, which considers atoms and bonds equally important when learning molecular representations. Specifically, one view is atom-central and the other is bond-central; the two views are then circulated via specifically designed components to enable more accurate predictions. To further enhance the expressive power of MVGNN, we propose a cross-dependent message-passing scheme to enhance information communication between the views. The overall framework is termed CD-MVGNN. RESULTS: We theoretically justify the expressiveness of the proposed model in terms of distinguishing non-isomorphic graphs. Extensive experiments demonstrate that CD-MVGNN achieves remarkably superior performance over state-of-the-art models on various challenging benchmarks. Meanwhile, visualizations of node importance are consistent with prior knowledge, which confirms the interpretability of CD-MVGNN. AVAILABILITY AND IMPLEMENTATION: The code and data underlying this work are available on GitHub at https://github.com/uta-smile/CD-MVGNN. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Hehuan Ma
- Department of Computer Science, University of Texas at Arlington, Arlington 76019, USA
- Yu Rong
- AI Lab, Tencent, Shenzhen 518057, China
- Wenbing Huang
- Institute for AI Industry Research, Tsinghua University, Beijing 100084, China
- Geyan Ye
- AI Lab, Tencent, Shenzhen 518057, China
- Junzhou Huang
- Department of Computer Science, University of Texas at Arlington, Arlington 76019, USA
3
Chen J, Si YW, Un CW, Siu SWI. Chemical toxicity prediction based on semi-supervised learning and graph convolutional neural network. J Cheminform 2021; 13:93. [PMID: 34838140 PMCID: PMC8627024 DOI: 10.1186/s13321-021-00570-8] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 07/20/2021] [Accepted: 11/11/2021] [Indexed: 12/28/2022] Open
Abstract
As safety is one of the most important properties of drugs, chemical toxicology prediction has received increasing attention in drug discovery research. Traditionally, researchers rely on in vitro and in vivo experiments to test the toxicity of chemical compounds. However, not only are these experiments time consuming and costly, but experiments that involve animal testing are increasingly subject to ethical concerns. While traditional machine learning (ML) methods have been used in the field with some success, the limited availability of annotated toxicity data is the major hurdle for further improving model performance. Inspired by the success of semi-supervised learning (SSL) algorithms, we propose a Graph Convolution Neural Network (GCN) to predict chemical toxicity and train the network with the Mean Teacher (MT) SSL algorithm. Using the Tox21 data, our optimal SSL-GCN models for predicting the twelve toxicological endpoints achieve an average ROC-AUC score of 0.757 in the test set, which is a 6% improvement over GCN models trained by supervised learning and conventional ML methods. Our SSL-GCN models also exhibit superior performance when compared to models constructed using the built-in DeepChem ML methods. This study demonstrates that SSL can increase the prediction power of models by learning from unannotated data. The optimal unannotated-to-annotated data ratio ranges between 1:1 and 4:1. This study demonstrates the success of SSL in chemical toxicity prediction; the same technique is expected to be beneficial to other chemical property prediction tasks by utilizing existing large chemical databases. Our optimal model SSL-GCN is hosted on an online server accessible through: https://app.cbbio.online/ssl-gcn/home.
Affiliation(s)
- Jiarui Chen
- Department of Computer and Information Science, University of Macau, Avenida da Universidade, Taipa, 999078, Macau, China
- Yain-Whar Si
- Department of Computer and Information Science, University of Macau, Avenida da Universidade, Taipa, 999078, Macau, China
- Chon-Wai Un
- Department of Computer and Information Science, University of Macau, Avenida da Universidade, Taipa, 999078, Macau, China
- Shirley W I Siu
- Department of Computer and Information Science, University of Macau, Avenida da Universidade, Taipa, 999078, Macau, China.
- Institute of Science and Environment, University of Saint Joseph, Rua de Londres 106, 999078, Macau, China.
- School of Pharmaceutical Sciences, Universiti Sains Malaysia, USM, 11800, Penang, Malaysia.
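Note on the Mean Teacher scheme named in the abstract above: in Mean Teacher training, a teacher model's weights track an exponential moving average (EMA) of the student's weights, and a consistency term penalizes disagreement between the two models on unannotated inputs. The following is a minimal sketch of those two ingredients only, not the authors' SSL-GCN implementation; the function names, the dictionary-of-floats weight format, and the value of `alpha` are assumptions made for illustration.

```python
from typing import Dict, Sequence


def ema_update(teacher: Dict[str, float],
               student: Dict[str, float],
               alpha: float = 0.99) -> Dict[str, float]:
    """Return teacher weights moved toward the student by an EMA step.

    alpha is a hypothetical smoothing factor; values close to 1 make the
    teacher change slowly.
    """
    return {name: alpha * teacher[name] + (1.0 - alpha) * student[name]
            for name in teacher}


def consistency_loss(student_probs: Sequence[float],
                     teacher_probs: Sequence[float]) -> float:
    """Mean squared difference between student and teacher predictions
    on the same (possibly unannotated) inputs."""
    pairs = list(zip(student_probs, teacher_probs))
    return sum((s - t) ** 2 for s, t in pairs) / len(pairs)


# Toy illustration with two scalar "weights" (made-up numbers).
teacher = {"w": 0.0, "b": 1.0}
student = {"w": 1.0, "b": 0.0}
teacher = ema_update(teacher, student, alpha=0.9)

# Disagreement between the two models on two unannotated samples.
loss = consistency_loss([0.8, 0.2], [0.6, 0.4])
```

In a full training loop the student would be updated by gradient descent on the supervised loss plus a weighted consistency term, and `ema_update` would be applied after each step; those parts are omitted here.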
4
Bo W, Chen L, Qin D, Geng S, Li J, Mei H, Li B, Liang G. Application of quantitative structure-activity relationship to food-derived peptides: Methods, situations, challenges and prospects. Trends Food Sci Technol 2021. [DOI: 10.1016/j.tifs.2021.05.031] [Citation(s) in RCA: 21] [Impact Index Per Article: 7.0] [Indexed: 11/26/2022]
5
Abstract
The discovery of new medications in a cost-effective manner has become the top priority for many pharmaceutical companies. Despite decades of innovation, many of their processes arguably remain relatively inefficient. One such process is the prediction of biological activity. This paper describes a new deep learning model capable of conducting a preliminary in-silico screening of chemical compounds. The model has been constructed using a variational autoencoder to generate chemical compound fingerprints, which have been used to create a regression model to predict their LogD property and a classification model to predict binding in selected assays from the ChEMBL dataset. The conducted experiments demonstrate accurate prediction of the properties of chemical compounds using only structural definitions and also provide several opportunities to improve upon this model in the future.
6
Shamsara J. Evaluation of the performance of various machine learning methods on the discrimination of the active compounds. Chem Biol Drug Des 2021; 97:930-943. [DOI: 10.1111/cbdd.13819] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 10/10/2020] [Revised: 12/10/2020] [Accepted: 12/21/2020] [Indexed: 12/12/2022]
Affiliation(s)
- Jamal Shamsara
- Pharmaceutical Research Center, Pharmaceutical Technology Institute, Mashhad University of Medical Sciences, Mashhad, Iran
7
Yang S, Ye Q, Ding J, Yin, Lu A, Chen X, Hou T, Cao D. Current advances in ligand‐based target prediction. WIREs Computational Molecular Science 2020. [DOI: 10.1002/wcms.1504] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Indexed: 12/25/2022]
Affiliation(s)
- Su‐Qing Yang
- Xiangya School of Pharmaceutical Sciences Central South University Changsha Hunan China
- Qing Ye
- College of Pharmaceutical Sciences Innovation Institute for Artificial Intelligence in Medicine, Zhejiang University Hangzhou, Zhejiang China
- Jun‐Jie Ding
- Beijing Institute of Pharmaceutical Chemistry Beijing China
- Yin
- Department of Dermatology, Hunan Engineering Research Center of Skin Health and Disease, Hunan Key Laboratory of Skin Cancer and Psoriasis, Xiangya Hospital Central South University Changsha Hunan China
- Ai‐Ping Lu
- Institute for Advancing Translational Medicine in Bone and Joint Diseases, School of Chinese Medicine Hong Kong Baptist University Hong Kong China
- Xiang Chen
- Department of Dermatology, Hunan Engineering Research Center of Skin Health and Disease, Hunan Key Laboratory of Skin Cancer and Psoriasis, Xiangya Hospital Central South University Changsha Hunan China
- Ting‐Jun Hou
- College of Pharmaceutical Sciences Innovation Institute for Artificial Intelligence in Medicine, Zhejiang University Hangzhou, Zhejiang China
- Dong‐Sheng Cao
- Xiangya School of Pharmaceutical Sciences Central South University Changsha Hunan China
- Institute for Advancing Translational Medicine in Bone and Joint Diseases, School of Chinese Medicine Hong Kong Baptist University Hong Kong China
8
Jablonka K, Ongari D, Moosavi SM, Smit B. Big-Data Science in Porous Materials: Materials Genomics and Machine Learning. Chem Rev 2020; 120:8066-8129. [PMID: 32520531 PMCID: PMC7453404 DOI: 10.1021/acs.chemrev.0c00004] [Citation(s) in RCA: 149] [Impact Index Per Article: 37.3] [Received: 01/03/2020] [Indexed: 12/16/2022]
Abstract
By combining metal nodes with organic linkers we can potentially synthesize millions of possible metal-organic frameworks (MOFs). The fact that we have so many materials opens many exciting avenues but also creates new challenges. We simply have too many materials to be processed using conventional brute-force methods. In this review, we show that having so many materials allows us to use big-data methods as a powerful technique to study these materials and to discover complex correlations. The first part of the review gives an introduction to the principles of big-data science. We show how to select appropriate training sets, survey approaches that are used to represent these materials in feature space, and review different learning architectures, as well as evaluation and interpretation strategies. In the second part, we review how the different approaches of machine learning have been applied to porous materials. In particular, we discuss applications in the field of gas storage and separation, the stability of these materials, their electronic properties, and their synthesis. Given the increasing interest of the scientific community in machine learning, we expect this list to rapidly expand in the coming years.
Affiliation(s)
- Kevin Maik Jablonka
- Laboratory of Molecular Simulation (LSMO), Institut des Sciences et Ingénierie Chimiques (ISIC), École Polytechnique Fédérale de Lausanne (EPFL), Sion, Switzerland
- Daniele Ongari
- Laboratory of Molecular Simulation (LSMO), Institut des Sciences et Ingénierie Chimiques (ISIC), École Polytechnique Fédérale de Lausanne (EPFL), Sion, Switzerland
- Seyed Mohamad Moosavi
- Laboratory of Molecular Simulation (LSMO), Institut des Sciences et Ingénierie Chimiques (ISIC), École Polytechnique Fédérale de Lausanne (EPFL), Sion, Switzerland
- Berend Smit
- Laboratory of Molecular Simulation (LSMO), Institut des Sciences et Ingénierie Chimiques (ISIC), École Polytechnique Fédérale de Lausanne (EPFL), Sion, Switzerland
9
TranScreen: Transfer Learning on Graph-Based Anti-Cancer Virtual Screening Model. Big Data and Cognitive Computing 2020. [DOI: 10.3390/bdcc4030016] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Indexed: 12/24/2022]
Abstract
Deep learning’s automatic feature extraction has proven its superior performance over traditional fingerprint-based features in the implementation of virtual screening models. However, these models face multiple challenges in the field of early drug discovery, such as over-training and generalization to unseen data, due to the inherently unbalanced and small datasets. In this work, the TranScreen pipeline is proposed, which utilizes transfer learning and a collection of weight initializations to overcome these challenges. A total of 182 graph convolutional neural networks are trained on molecular source datasets and the learned knowledge is transferred to the target task for fine-tuning. The target task of p53-based bioactivity prediction, an important factor for anti-cancer discovery, is chosen to showcase the capability of the pipeline. Having trained a collection of source models, three different approaches are implemented to compare and rank them for a given task before fine-tuning. The results show improvement in performance of the model in multiple cases, with the best model increasing the area under the receiver operating characteristic curve (ROC-AUC) from 0.75 to 0.91 and the recall from 0.25 to 1. This improvement is vital for practical virtual screening via lowering the false negatives and demonstrates the potential of transfer learning. The code and pre-trained models are made accessible online.
10
Abbasi K, Poso A, Ghasemi J, Amanlou M, Masoudi-Nejad A. Deep Transferable Compound Representation across Domains and Tasks for Low Data Drug Discovery. J Chem Inf Model 2019; 59:4528-4539. [PMID: 31661955 DOI: 10.1021/acs.jcim.9b00626] [Citation(s) in RCA: 24] [Impact Index Per Article: 4.8] [Indexed: 01/23/2023]
Abstract
The main problem of small molecule-based drug discovery is to find a candidate molecule with increased pharmacological activity, proper ADME, and low toxicity. Recently, machine learning has made a significant contribution to drug discovery. However, many machine learning methods, such as deep learning-based approaches, require a large amount of training data to form accurate predictions for unseen data. In the lead optimization step, the amount of available biological data on small molecule compounds is low, which makes it challenging to apply machine learning methods. The main goal of this study is to design a new approach to handle these situations. To this end, source assay (auxiliary assay) knowledge is utilized to learn a better model to predict the property of new compounds in the target assay. Until now, existing approaches have not considered that source and target assays are adapted to different target groups with different compound distributions. In this paper, we propose a new architecture that uses a graph convolutional network and an adversarial domain adaptation network to tackle this issue. To evaluate the proposed approach, we applied it to the Tox21, ToxCast, SIDER, HIV, and BACE collections. The results showed the effectiveness of the proposed approach in transferring the related knowledge from source to target data set.
Affiliation(s)
- Karim Abbasi
- Laboratory of Systems Biology and Bioinformatics (LBB), Institute of Biochemistry and Biophysics, University of Tehran, Tehran 1417614411, Iran
- Antti Poso
- School of Pharmacy, Faculty of Health Sciences, University of Eastern Finland, Kuopio 80100, Finland
- Jahanbakhsh Ghasemi
- Chemistry Department, Faculty of Sciences, University of Tehran, Tehran 1417614418, Iran
- Massoud Amanlou
- Drug Design and Development Research Center, Department of Medicinal Chemistry, Tehran University of Medical Sciences, Tehran 1416753955, Iran
- Ali Masoudi-Nejad
- Laboratory of Systems Biology and Bioinformatics (LBB), Institute of Biochemistry and Biophysics, University of Tehran, Tehran 1417614411, Iran
11
Yang K, Swanson K, Jin W, Coley C, Eiden P, Gao H, Guzman-Perez A, Hopper T, Kelley B, Mathea M, Palmer A, Settels V, Jaakkola T, Jensen K, Barzilay R. Analyzing Learned Molecular Representations for Property Prediction. J Chem Inf Model 2019; 59:3370-3388. [PMID: 31361484 PMCID: PMC6727618 DOI: 10.1021/acs.jcim.9b00237] [Citation(s) in RCA: 591] [Impact Index Per Article: 118.2] [Received: 03/18/2019] [Indexed: 12/23/2022]
Abstract
Advancements in neural machinery have led to a wide range of algorithmic solutions for molecular property prediction. Two classes of models in particular have yielded promising results: neural networks applied to computed molecular fingerprints or expert-crafted descriptors and graph convolutional neural networks that construct a learned molecular representation by operating on the graph structure of the molecule. However, recent literature has yet to clearly determine which of these two methods is superior when generalizing to new chemical space. Furthermore, prior research has rarely examined these new models in industry research settings in comparison to existing employed models. In this paper, we benchmark models extensively on 19 public and 16 proprietary industrial data sets spanning a wide variety of chemical end points. In addition, we introduce a graph convolutional model that consistently matches or outperforms models using fixed molecular descriptors as well as previous graph neural architectures on both public and proprietary data sets. Our empirical findings indicate that while approaches based on these representations have yet to reach the level of experimental reproducibility, our proposed model nevertheless offers significant improvements over models currently used in industrial workflows.
Affiliation(s)
- Kevin Yang
- Computer Science and Artificial Intelligence Laboratory, MIT, Cambridge, Massachusetts 02139, United States
- Kyle Swanson
- Computer Science and Artificial Intelligence Laboratory, MIT, Cambridge, Massachusetts 02139, United States
- Wengong Jin
- Computer Science and Artificial Intelligence Laboratory, MIT, Cambridge, Massachusetts 02139, United States
- Connor Coley
- Department of Chemical Engineering, MIT, Cambridge, Massachusetts 02139, United States
- Hua Gao
- Amgen Inc., Cambridge, Massachusetts 02141, United States
- Timothy Hopper
- Amgen Inc., Cambridge, Massachusetts 02141, United States
- Brian Kelley
- Novartis Institutes for BioMedical Research, Cambridge, Massachusetts 02139, United States
- Tommi Jaakkola
- Computer Science and Artificial Intelligence Laboratory, MIT, Cambridge, Massachusetts 02139, United States
- Klavs Jensen
- Department of Chemical Engineering, MIT, Cambridge, Massachusetts 02139, United States
- Regina Barzilay
- Computer Science and Artificial Intelligence Laboratory, MIT, Cambridge, Massachusetts 02139, United States
12
Schaub AJ, Moreno GO, Zhao S, Truong HV, Luo R, Tsai SC. Computational structural enzymology methodologies for the study and engineering of fatty acid synthases, polyketide synthases and nonribosomal peptide synthetases. Methods Enzymol 2019; 622:375-409. [PMID: 31155062 PMCID: PMC7197764 DOI: 10.1016/bs.mie.2019.03.001] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Indexed: 12/29/2022]
Abstract
Various computational methodologies can be applied to enzymological studies on enzymes in the fatty acid, polyketide, and non-ribosomal peptide biosynthetic pathways. These multi-domain complexes are called fatty acid synthases, polyketide synthases, and non-ribosomal peptide synthetases. These mega-synthases biosynthesize chemically diverse and complex bioactive molecules, with the intermediates being chauffeured between catalytic partners via a carrier protein. Recent efforts have been made to engineer these systems to expand their product diversity. A major stumbling block is our poor understanding of the transient protein-protein and protein-substrate interactions between the carrier protein and its many catalytic partner domains and product intermediates. The innate reactivity of pathway intermediates in two major classes of polyketide synthases has frustrated our mechanistic understanding of these interactions during the biosynthesis of these natural products, ultimately impeding the engineering of these systems for the generation of engineered natural products. Computational techniques described in this chapter can aid data interpretation or be used to generate testable models of these experimentally intractable transient interactions, thereby providing insight into key interactions that are difficult to capture otherwise, with the potential to expand the diversity in these systems.
Affiliation(s)
- Andrew J Schaub
- Department of Chemistry, University of California, Irvine, CA, United States
- Gabriel O Moreno
- Department of Molecular Biology and Biochemistry, University of California, Irvine, CA, United States
- Shiji Zhao
- Mathematical, Computational and Systems Biology Program, Center for Complex Biological Systems, University of California, Irvine, CA, United States
- Hau V Truong
- Department of Chemistry, University of California, Irvine, CA, United States
- Ray Luo
- Departments of Molecular Biology and Biochemistry, Chemical and Biomolecular Engineering, Materials Science and Engineering, and Biomedical Engineering, University of California, Irvine, CA, United States.
- Shiou-Chuan Tsai
- Department of Molecular Biology and Biochemistry, Chemistry, Pharmaceutical Sciences, University of California, Irvine, CA, United States.
13
Liu S, Alnammi M, Ericksen SS, Voter AF, Ananiev GE, Keck JL, Hoffmann FM, Wildman SA, Gitter A. Practical Model Selection for Prospective Virtual Screening. J Chem Inf Model 2018; 59:282-293. [PMID: 30500183 PMCID: PMC6351977 DOI: 10.1021/acs.jcim.8b00363] [Citation(s) in RCA: 40] [Impact Index Per Article: 6.7] [Indexed: 01/09/2023]
Abstract
Virtual (computational) high-throughput screening provides a strategy for prioritizing compounds for experimental screens, but the choice of virtual screening algorithm depends on the data set and evaluation strategy. We consider a wide range of ligand-based machine learning and docking-based approaches for virtual screening on two protein–protein interactions, PriA-SSB and RMI-FANCM, and present a strategy for choosing which algorithm is best for prospective compound prioritization. Our workflow identifies a random forest as the best algorithm for these targets over more sophisticated neural network-based models. The top 250 predictions from our selected random forest recover 37 of the 54 active compounds from a library of 22,434 new molecules assayed on PriA-SSB. We show that virtual screening methods that perform well on public data sets and synthetic benchmarks, like multi-task neural networks, may not always translate to prospective screening performance on a specific assay of interest.
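The figures quoted in the abstract above (37 of 54 actives recovered in the top 250 of 22,434 molecules) can be turned into standard screening metrics. What follows is a minimal sketch; the function name and the choice of metrics are illustrative assumptions, not the authors' evaluation code.

```python
from typing import Dict


def screening_metrics(hits_in_top_k: int, k: int,
                      total_actives: int, library_size: int) -> Dict[str, float]:
    """Summarize a ranked virtual screen from four counts."""
    recall = hits_in_top_k / total_actives    # fraction of all actives recovered
    hit_rate = hits_in_top_k / k              # fraction of the selected set that is active
    base_rate = total_actives / library_size  # hit rate expected from random picking
    return {
        "recall": recall,
        "hit_rate": hit_rate,
        "enrichment_factor": hit_rate / base_rate,
    }


# Counts taken from the abstract: 37 of 54 actives in the top 250 of 22,434.
metrics = screening_metrics(hits_in_top_k=37, k=250,
                            total_actives=54, library_size=22434)
# recall is about 0.685, hit_rate is 0.148, enrichment_factor is about 61.5
```

Read this way, the top-250 selection is roughly sixty times richer in actives than a random draw from the same library.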
Affiliation(s)
- Shengchao Liu
- Department of Computer Sciences, University of Wisconsin-Madison, Madison, Wisconsin 53706, United States; Morgridge Institute for Research, Madison, Wisconsin 53715, United States
- Moayad Alnammi
- Department of Computer Sciences, University of Wisconsin-Madison, Madison, Wisconsin 53706, United States; Morgridge Institute for Research, Madison, Wisconsin 53715, United States
- Spencer S Ericksen
- Small Molecule Screening Facility, University of Wisconsin Carbone Cancer Center, Madison, Wisconsin 53792, United States
- Andrew F Voter
- Department of Biomolecular Chemistry, University of Wisconsin School of Medicine and Public Health, Madison, Wisconsin 53706, United States
- Gene E Ananiev
- Small Molecule Screening Facility, University of Wisconsin Carbone Cancer Center, Madison, Wisconsin 53792, United States
- James L Keck
- Department of Biomolecular Chemistry, University of Wisconsin School of Medicine and Public Health, Madison, Wisconsin 53706, United States
- F Michael Hoffmann
- Small Molecule Screening Facility, University of Wisconsin Carbone Cancer Center, Madison, Wisconsin 53792, United States; McArdle Laboratory for Cancer Research, University of Wisconsin-Madison, Madison, Wisconsin 53705, United States
- Scott A Wildman
- Small Molecule Screening Facility, University of Wisconsin Carbone Cancer Center, Madison, Wisconsin 53792, United States
- Anthony Gitter
- Department of Computer Sciences, University of Wisconsin-Madison, Madison, Wisconsin 53706, United States; Morgridge Institute for Research, Madison, Wisconsin 53715, United States; Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, Wisconsin 53792, United States
14
Srinivas R, Klimovich PV, Larson EC. Implicit-descriptor ligand-based virtual screening by means of collaborative filtering. J Cheminform 2018; 10:56. [PMID: 30467684 PMCID: PMC6755561 DOI: 10.1186/s13321-018-0310-y] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Received: 09/26/2018] [Accepted: 11/13/2018] [Indexed: 12/20/2022] Open
Abstract
Current ligand-based machine learning methods in virtual screening rely heavily on molecular fingerprinting for preprocessing, i.e., explicit description of ligands’ structural and physicochemical properties in a vectorized form. Of particular importance to current methods are the extent to which molecular fingerprints describe a particular ligand and what metric sufficiently captures similarity among ligands. In this work, we propose and evaluate methods that do not require explicit feature vectorization through fingerprinting, but, instead, provide implicit descriptors based only on other known assays. Our methods are based upon well known collaborative filtering algorithms used in recommendation systems. Our implicit descriptor method does not require any fingerprint similarity search, which makes the method free of the bias arising from the empirical nature of the fingerprint models. We show that implicit methods significantly outperform traditional machine learning methods, and the main strengths of implicit methods are their resilience to target-ligand sparsity and high potential for spotting promiscuous ligands.
Affiliation(s)
- Raghuram Srinivas
- Department of Computer Science and Engineering, Bobby B. Lyle School of Engineering, Southern Methodist University, 3145 Dyer Street, Dallas, TX, 75205, USA; DataScience@SMU, Dallas, 75205, TX, USA
- Pavel V Klimovich
- Department of Computer Science and Engineering, Bobby B. Lyle School of Engineering, Southern Methodist University, 3145 Dyer Street, Dallas, TX, 75205, USA; The Dedman College Interdisciplinary Institute, 3225 Daniel Avenue, Dallas, TX, 75205, USA
- Eric C Larson
- Department of Computer Science and Engineering, Bobby B. Lyle School of Engineering, Southern Methodist University, 3145 Dyer Street, Dallas, TX, 75205, USA
15
Ghasemi F, Mehridehnavi A, Pérez-Garrido A, Pérez-Sánchez H. Neural network and deep-learning algorithms used in QSAR studies: merits and drawbacks. Drug Discov Today 2018; 23:1784-1790. [PMID: 29936244 DOI: 10.1016/j.drudis.2018.06.016] [Citation(s) in RCA: 91] [Impact Index Per Article: 15.2] [Received: 10/23/2017] [Revised: 06/05/2018] [Accepted: 06/14/2018] [Indexed: 10/28/2022]
Affiliation(s)
- Fahimeh Ghasemi
- Department of Bioinformatics and Systems Biology, School of Advanced Technologies in Medicine, Isfahan University of Medical Sciences, Isfahan, Hezar-Jerib Ave., 81746 73461, Islamic Republic of Iran.
- Alireza Mehridehnavi
- Department of Bioinformatics and Systems Biology, School of Advanced Technologies in Medicine, Isfahan University of Medical Sciences, Isfahan, Hezar-Jerib Ave., 81746 73461, Islamic Republic of Iran
| | - Alfonso Pérez-Garrido
- Bioinformatics and High Performance Computing Research Group (BIO-HPC), Universidad Católica de Murcia (UCAM), E30107 Murcia, Spain
| | - Horacio Pérez-Sánchez
- Bioinformatics and High Performance Computing Research Group (BIO-HPC), Universidad Católica de Murcia (UCAM), E30107 Murcia, Spain.
16
Ching T, Himmelstein DS, Beaulieu-Jones BK, Kalinin AA, Do BT, Way GP, Ferrero E, Agapow PM, Zietz M, Hoffman MM, Xie W, Rosen GL, Lengerich BJ, Israeli J, Lanchantin J, Woloszynek S, Carpenter AE, Shrikumar A, Xu J, Cofer EM, Lavender CA, Turaga SC, Alexandari AM, Lu Z, Harris DJ, DeCaprio D, Qi Y, Kundaje A, Peng Y, Wiley LK, Segler MHS, Boca SM, Swamidass SJ, Huang A, Gitter A, Greene CS. Opportunities and obstacles for deep learning in biology and medicine. J R Soc Interface 2018; 15:20170387. [PMID: 29618526 PMCID: PMC5938574 DOI: 10.1098/rsif.2017.0387] [Citation(s) in RCA: 790] [Impact Index Per Article: 131.7] [Received: 05/26/2017] [Accepted: 03/07/2018] [Indexed: 11/12/2022]
Abstract
Deep learning describes a class of machine learning algorithms that are capable of combining raw inputs into layers of intermediate features. These algorithms have recently shown impressive results across a variety of domains. Biology and medicine are data-rich disciplines, but the data are complex and often ill-understood. Hence, deep learning techniques may be particularly well suited to solve problems of these fields. We examine applications of deep learning to a variety of biomedical problems (patient classification, fundamental biological processes and treatment of patients) and discuss whether deep learning will be able to transform these tasks or if the biomedical sphere poses unique challenges. Following from an extensive literature review, we find that deep learning has yet to revolutionize biomedicine or definitively resolve any of the most pressing challenges in the field, but promising advances have been made on the prior state of the art. Even though improvements over previous baselines have been modest in general, the recent progress indicates that deep learning methods will provide valuable means for speeding up or aiding human investigation. Though progress has been made linking a specific neural network's prediction to input features, understanding how users should interpret these models to make testable hypotheses about the system under study remains an open challenge. Furthermore, the limited amount of labelled data for training presents problems in some domains, as do legal and privacy constraints on work with sensitive health records. Nonetheless, we foresee deep learning enabling changes at both bench and bedside with the potential to transform several areas of biology and medicine.
Affiliation(s)
- Travers Ching
- Molecular Biosciences and Bioengineering Graduate Program, University of Hawaii at Manoa, Honolulu, HI, USA
- Daniel S Himmelstein
- Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Brett K Beaulieu-Jones
- Genomics and Computational Biology Graduate Group, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Alexandr A Kalinin
- Department of Computational Medicine and Bioinformatics, University of Michigan Medical School, Ann Arbor, MI, USA
- Gregory P Way
- Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Enrico Ferrero
- Computational Biology and Stats, Target Sciences, GlaxoSmithKline, Stevenage, UK
- Michael Zietz
- Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Michael M Hoffman
- Princess Margaret Cancer Centre, Toronto, Ontario, Canada
- Department of Medical Biophysics, University of Toronto, Toronto, Ontario, Canada
- Department of Computer Science, University of Toronto, Toronto, Ontario, Canada
- Wei Xie
- Electrical Engineering and Computer Science, Vanderbilt University, Nashville, TN, USA
- Gail L Rosen
- Ecological and Evolutionary Signal-processing and Informatics Laboratory, Department of Electrical and Computer Engineering, Drexel University, Philadelphia, PA, USA
- Benjamin J Lengerich
- Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA
- Johnny Israeli
- Biophysics Program, Stanford University, Stanford, CA, USA
- Jack Lanchantin
- Department of Computer Science, University of Virginia, Charlottesville, VA, USA
- Stephen Woloszynek
- Ecological and Evolutionary Signal-processing and Informatics Laboratory, Department of Electrical and Computer Engineering, Drexel University, Philadelphia, PA, USA
- Anne E Carpenter
- Imaging Platform, Broad Institute of Harvard and MIT, Cambridge, MA, USA
- Avanti Shrikumar
- Department of Computer Science, Stanford University, Stanford, CA, USA
- Jinbo Xu
- Toyota Technological Institute at Chicago, Chicago, IL, USA
- Evan M Cofer
- Department of Computer Science, Trinity University, San Antonio, TX, USA
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ, USA
- Christopher A Lavender
- Integrative Bioinformatics, National Institute of Environmental Health Sciences, National Institutes of Health, Research Triangle Park, NC, USA
- Srinivas C Turaga
- Howard Hughes Medical Institute, Janelia Research Campus, Ashburn, VA, USA
- Amr M Alexandari
- Department of Computer Science, Stanford University, Stanford, CA, USA
- Zhiyong Lu
- National Center for Biotechnology Information and National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
- David J Harris
- Department of Wildlife Ecology and Conservation, University of Florida, Gainesville, FL, USA
- Yanjun Qi
- Department of Computer Science, University of Virginia, Charlottesville, VA, USA
- Anshul Kundaje
- Department of Computer Science, Stanford University, Stanford, CA, USA
- Department of Genetics, Stanford University, Stanford, CA, USA
- Yifan Peng
- National Center for Biotechnology Information and National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
- Laura K Wiley
- Division of Biomedical Informatics and Personalized Medicine, University of Colorado School of Medicine, Aurora, CO, USA
- Marwin H S Segler
- Institute of Organic Chemistry, Westfälische Wilhelms-Universität Münster, Münster, Germany
- Simina M Boca
- Innovation Center for Biomedical Informatics, Georgetown University Medical Center, Washington, DC, USA
- S Joshua Swamidass
- Department of Pathology and Immunology, Washington University in Saint Louis, St Louis, MO, USA
- Austin Huang
- Department of Medicine, Brown University, Providence, RI, USA
- Anthony Gitter
- Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI, USA
- Morgridge Institute for Research, Madison, WI, USA
- Casey S Greene
- Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
17
Wu Z, Ramsundar B, Feinberg EN, Gomes J, Geniesse C, Pappu AS, Leswing K, Pande V. MoleculeNet: a benchmark for molecular machine learning. Chem Sci 2017; 9:513-530. [PMID: 29629118 PMCID: PMC5868307 DOI: 10.1039/c7sc02664a] [Citation(s) in RCA: 884] [Impact Index Per Article: 126.3] [Received: 06/15/2017] [Accepted: 10/30/2017] [Indexed: 12/22/2022]
Abstract
A large scale benchmark for molecular machine learning consisting of multiple public datasets, metrics, featurizations and learning algorithms.
Molecular machine learning has been maturing rapidly over the last few years. Improved methods and the presence of larger datasets have enabled machine learning algorithms to make increasingly accurate predictions about molecular properties. However, algorithmic progress has been limited due to the lack of a standard benchmark to compare the efficacy of proposed methods; most new algorithms are benchmarked on different datasets making it challenging to gauge the quality of proposed methods. This work introduces MoleculeNet, a large scale benchmark for molecular machine learning. MoleculeNet curates multiple public datasets, establishes metrics for evaluation, and offers high quality open-source implementations of multiple previously proposed molecular featurization and learning algorithms (released as part of the DeepChem open source library). MoleculeNet benchmarks demonstrate that learnable representations are powerful tools for molecular machine learning and broadly offer the best performance. However, this result comes with caveats. Learnable representations still struggle to deal with complex tasks under data scarcity and highly imbalanced classification. For quantum mechanical and biophysical datasets, the use of physics-aware featurizations can be more important than choice of particular learning algorithm.
Affiliation(s)
- Zhenqin Wu
- Department of Chemistry, Stanford University, Stanford, CA 94305, USA.
- Bharath Ramsundar
- Department of Computer Science, Stanford University, Stanford, CA 94305, USA
- Evan N Feinberg
- Program in Biophysics, Stanford School of Medicine, Stanford, CA 94305, USA
- Joseph Gomes
- Department of Chemistry, Stanford University, Stanford, CA 94305, USA.
- Caleb Geniesse
- Program in Biophysics, Stanford School of Medicine, Stanford, CA 94305, USA
- Aneesh S Pappu
- Department of Computer Science, Stanford University, Stanford, CA 94305, USA
- Vijay Pande
- Department of Chemistry, Stanford University, Stanford, CA 94305, USA.
18
Lenselink EB, Ten Dijke N, Bongers B, Papadatos G, van Vlijmen HWT, Kowalczyk W, IJzerman AP, van Westen GJP. Beyond the hype: deep neural networks outperform established methods using a ChEMBL bioactivity benchmark set. J Cheminform 2017; 9:45. [PMID: 29086168 PMCID: PMC5555960 DOI: 10.1186/s13321-017-0232-0] [Citation(s) in RCA: 165] [Impact Index Per Article: 23.6] [Received: 04/18/2017] [Accepted: 07/31/2017] [Indexed: 11/10/2022]
Abstract
The increase of publicly available bioactivity data in recent years has fueled and catalyzed research in chemogenomics, data mining, and modeling approaches. As a direct result, over the past few years a multitude of different methods have been reported and evaluated, such as target fishing, nearest neighbor similarity-based methods, and Quantitative Structure Activity Relationship (QSAR)-based protocols. However, such studies are typically conducted on different datasets, using different validation strategies, and different metrics. In this study, different methods were compared using one single standardized dataset obtained from ChEMBL, which is made available to the public, using standardized metrics (BEDROC and Matthews Correlation Coefficient). Specifically, the performance of Naïve Bayes, Random Forests, Support Vector Machines, Logistic Regression, and Deep Neural Networks was assessed using QSAR and proteochemometric (PCM) methods. All methods were validated using both a random split validation and a temporal validation, with the latter being a more realistic benchmark of expected prospective execution. Deep Neural Networks are the top performing classifiers, highlighting the added value of Deep Neural Networks over other more conventional methods. Moreover, the best method ('DNN_PCM') performed significantly better at almost one standard deviation higher than the mean performance. Furthermore, Multi-task and PCM implementations were shown to improve performance over single task Deep Neural Networks. Conversely, target prediction performed almost two standard deviations under the mean performance. Random Forests, Support Vector Machines, and Logistic Regression performed around mean performance. Finally, using an ensemble of DNNs, alongside additional tuning, enhanced the relative performance by another 27% (compared with unoptimized 'DNN_PCM'). 
Here, a standardized set to test and evaluate different machine learning algorithms in the context of multi-task learning is offered by providing the data and the protocols.
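Two ingredients of this benchmark, the Matthews Correlation Coefficient and a temporal validation split, are simple enough to sketch. The snippet below is a minimal illustration, not the study's protocol: the `year` field, the cutoff value, and the function names are assumptions, and the abstract does not state which cutoff the authors used.

```python
import math


def matthews_corrcoef(y_true, y_pred):
    """MCC for binary labels (1 = active, 0 = inactive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention the score is 0 when any marginal count is empty.
    return (tp * tn - fp * fn) / denom if denom else 0.0


def temporal_split(records, cutoff_year):
    """Train on records published before the cutoff, test on the rest."""
    train = [r for r in records if r["year"] < cutoff_year]
    test = [r for r in records if r["year"] >= cutoff_year]
    return train, test


# Hypothetical bioactivity records; only the "year" key matters for the split.
records = [
    {"compound": "A", "year": 2010, "active": 1},
    {"compound": "B", "year": 2011, "active": 0},
    {"compound": "C", "year": 2013, "active": 1},
    {"compound": "D", "year": 2014, "active": 0},
]
train, test = temporal_split(records, cutoff_year=2013)
```

A random split lets close analogues from the same publication land on both sides, whereas a temporal split forces the model to predict compounds reported later, which is why the abstract calls it the more realistic benchmark.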
Affiliation(s)
- Eelke B Lenselink
- Division of Medicinal Chemistry, Drug Discovery and Safety, Leiden Academic Centre for Drug Research, Leiden University, P.O. Box 9502, 2300 RA, Leiden, The Netherlands
- Niels Ten Dijke
- Leiden Institute of Advanced Computer Science, Leiden University, P.O. Box 9512, 2300 RA, Leiden, The Netherlands
- Brandon Bongers
- Division of Medicinal Chemistry, Drug Discovery and Safety, Leiden Academic Centre for Drug Research, Leiden University, P.O. Box 9502, 2300 RA, Leiden, The Netherlands
- George Papadatos
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge, UK; GlaxoSmithKline, Medicines Research Centre, Gunnels Wood Road, Stevenage, Herts, SG1 2NY, UK
- Herman W T van Vlijmen
- Division of Medicinal Chemistry, Drug Discovery and Safety, Leiden Academic Centre for Drug Research, Leiden University, P.O. Box 9502, 2300 RA, Leiden, The Netherlands
- Wojtek Kowalczyk
- Leiden Institute of Advanced Computer Science, Leiden University, P.O. Box 9512, 2300 RA, Leiden, The Netherlands
- Adriaan P IJzerman
- Division of Medicinal Chemistry, Drug Discovery and Safety, Leiden Academic Centre for Drug Research, Leiden University, P.O. Box 9502, 2300 RA, Leiden, The Netherlands
- Gerard J P van Westen
- Division of Medicinal Chemistry, Drug Discovery and Safety, Leiden Academic Centre for Drug Research, Leiden University, P.O. Box 9502, 2300 RA, Leiden, The Netherlands.
19
Goh GB, Hodas NO, Vishnu A. Deep learning for computational chemistry. J Comput Chem 2017; 38:1291-1307. [PMID: 28272810 DOI: 10.1002/jcc.24764] [Citation(s) in RCA: 297] [Impact Index Per Article: 42.4] [Received: 09/14/2016] [Revised: 01/09/2017] [Accepted: 01/18/2017] [Indexed: 02/06/2023]
Abstract
The rise and fall of artificial neural networks is well documented in the scientific literature of both computer science and computational chemistry. Yet almost two decades later, we are now seeing a resurgence of interest in deep learning, a machine learning algorithm based on multilayer neural networks. Within the last few years, we have seen the transformative impact of deep learning in many domains, particularly in speech recognition and computer vision, to the extent that the majority of expert practitioners in those fields are now regularly eschewing prior established models in favor of deep learning models. In this review, we provide an introductory overview into the theory of deep neural networks and their unique properties that distinguish them from traditional machine learning algorithms used in cheminformatics. By providing an overview of the variety of emerging applications of deep neural networks, we highlight their ubiquity and broad applicability to a wide range of challenges in the field, including quantitative structure activity relationship, virtual screening, protein structure prediction, quantum chemistry, materials design, and property prediction. In reviewing the performance of deep neural networks, we observed a consistent outperformance against non-neural network state-of-the-art models across disparate research topics, and deep neural network-based models often exceeded the "glass ceiling" expectations of their respective tasks. Coupled with the maturity of GPU-accelerated computing for training deep neural networks and the exponential growth of chemical data on which to train these networks, we anticipate that deep learning algorithms will be a valuable tool for computational chemistry. © 2017 Wiley Periodicals, Inc.
Affiliation(s)
- Garrett B Goh
- Advanced Computing, Mathematics, and Data Division, Pacific Northwest National Laboratory, 902 Battelle Blvd, Richland, Washington, 99354
- Nathan O Hodas
- Advanced Computing, Mathematics, and Data Division, Pacific Northwest National Laboratory, 902 Battelle Blvd, Richland, Washington, 99354
- Abhinav Vishnu
- Advanced Computing, Mathematics, and Data Division, Pacific Northwest National Laboratory, 902 Battelle Blvd, Richland, Washington, 99354
20
Lima AN, Philot EA, Trossini GHG, Scott LPB, Maltarollo VG, Honorio KM. Use of machine learning approaches for novel drug discovery. Expert Opin Drug Discov 2016; 11:225-39. [PMID: 26814169 DOI: 10.1517/17460441.2016.1146250] [Citation(s) in RCA: 138] [Impact Index Per Article: 17.3] [Indexed: 12/17/2022]
Abstract
INTRODUCTION: The use of computational tools in the early stages of drug development has increased in recent decades. Machine learning (ML) approaches have been of special interest, since they can be applied in several steps of the drug discovery methodology, such as prediction of target structure, prediction of biological activity of new ligands through model construction, discovery or optimization of hits, and construction of models that predict the pharmacokinetic and toxicological (ADMET) profile of compounds. AREAS COVERED: This article presents an overview on some applications of ML techniques in drug design. These techniques can be employed in ligand-based drug design (LBDD) and structure-based drug design (SBDD) studies, such as similarity searches, construction of classification and/or prediction models of biological activity, prediction of secondary structures and binding sites, docking and virtual screening. EXPERT OPINION: Successful cases have been reported in the literature, demonstrating the efficiency of ML techniques combined with traditional approaches to study medicinal chemistry problems. Some ML techniques used in drug design are: support vector machine, random forest, decision trees and artificial neural networks. Currently, an important application of ML techniques is related to the calculation of scoring functions used in docking and virtual screening assays from a consensus, combining traditional and ML techniques in order to improve the prediction of binding sites and docking solutions.
Affiliation(s)
- Angélica Nakagawa Lima
- Centro de Ciências Naturais e Humanas, Universidade Federal do ABC, São Paulo, Brazil
- Eric Allison Philot
- Centro de Ciências Naturais e Humanas, Universidade Federal do ABC, São Paulo, Brazil
- Luis Paulo Barbour Scott
- Centro de Matemática, Computação e Cognição, Universidade Federal do ABC, São Paulo, Brazil
- Kathia Maria Honorio
- Centro de Ciências Naturais e Humanas, Universidade Federal do ABC, São Paulo, Brazil; Escola de Artes, Ciências e Humanidades, Universidade de São Paulo, São Paulo, Brazil
21
Molecular graph convolutions: moving beyond fingerprints. J Comput Aided Mol Des 2016; 30:595-608. [PMID: 27558503 DOI: 10.1007/s10822-016-9938-8] [Citation(s) in RCA: 597] [Impact Index Per Article: 74.6] [Received: 03/04/2016] [Accepted: 08/11/2016] [Indexed: 10/21/2022]
Abstract
Molecular "fingerprints" encoding structural information are the workhorse of cheminformatics and machine learning in drug discovery applications. However, fingerprint representations necessarily emphasize particular aspects of the molecular structure while ignoring others, rather than allowing the model to make data-driven decisions. We describe molecular graph convolutions, a machine learning architecture for learning from undirected graphs, specifically small molecules. Graph convolutions use a simple encoding of the molecular graph (atoms, bonds, distances, etc.), which allows the model to take greater advantage of information in the graph structure. Although graph convolutions do not outperform all fingerprint-based methods, they (along with other graph-based methods) represent a new paradigm in ligand-based virtual screening with exciting opportunities for future improvement.
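To make the "simple encoding of the molecular graph" concrete, here is a heavily simplified sketch. It encodes the heavy atoms of ethanol as nodes with one-hot element features, performs one unweighted neighbor-aggregation step, and sums over atoms to get a molecule-level vector. It is not the architecture described in the paper, which uses learned weights and richer atom and atom-pair features; every name below (`ELEMENTS`, `aggregate`, `readout`) is hypothetical.

```python
ELEMENTS = ["C", "O"]          # assumed element vocabulary for the toy example

# Heavy-atom graph of ethanol: C-C-O.
ATOMS = ["C", "C", "O"]
BONDS = [(0, 1), (1, 2)]


def one_hot(symbol):
    """Initial atom feature: one-hot over the element vocabulary."""
    return [1.0 if symbol == e else 0.0 for e in ELEMENTS]


def neighbors(n_atoms, bonds):
    """Undirected adjacency list built from the bond list."""
    adj = {i: [] for i in range(n_atoms)}
    for a, b in bonds:
        adj[a].append(b)
        adj[b].append(a)
    return adj


def aggregate(features, adj):
    """One message-passing step: each atom adds its neighbors' features to its own."""
    updated = []
    for i, feat in enumerate(features):
        total = list(feat)
        for j in adj[i]:
            total = [x + y for x, y in zip(total, features[j])]
        updated.append(total)
    return updated


def readout(features):
    """Order-independent molecule vector: sum the features over all atoms."""
    return [sum(column) for column in zip(*features)]


features = [one_hot(a) for a in ATOMS]
adj = neighbors(len(ATOMS), BONDS)
atom_vectors = aggregate(features, adj)   # [[2.0, 0.0], [2.0, 1.0], [1.0, 1.0]]
molecule_vector = readout(atom_vectors)   # [5.0, 2.0]
```

Because both the aggregation and the readout are sums, renumbering the atoms leaves the molecule vector unchanged, a property that a learned graph model also needs.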
22
Accurate and efficient target prediction using a potency-sensitive influence-relevance voter. J Cheminform 2015; 7:63. [PMID: 26719774 PMCID: PMC4696267 DOI: 10.1186/s13321-015-0110-6] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.3] [Received: 08/11/2015] [Accepted: 12/02/2015] [Indexed: 11/10/2022]
Abstract
BACKGROUND: A number of algorithms have been proposed to predict the biological targets of diverse molecules. Some are structure-based, but the most common are ligand-based and use chemical fingerprints and the notion of chemical similarity. These methods tend to be computationally faster than others, making them particularly attractive tools as the amount of available data grows. RESULTS: Using a ChEMBL-derived database covering 490,760 molecule-protein interactions and 3236 protein targets, we conduct a large-scale assessment of the performance of several target-prediction algorithms at predicting drug-target activity. We assess algorithm performance using three validation procedures: standard tenfold cross-validation, tenfold cross-validation in a simulated screen that includes random inactive molecules, and validation on an external test set composed of molecules not present in our database. CONCLUSIONS: We present two improvements over current practice. First, using a modified version of the influence-relevance voter (IRV), we show that using molecule potency data can improve target prediction. Second, we demonstrate that random inactive molecules added during training can boost the accuracy of several algorithms in realistic target-prediction experiments. Our potency-sensitive version of the IRV (PS-IRV) obtains the best results on large test sets in most of the experiments. Models and software are publicly accessible through the chemoinformatics portal at http://chemdb.ics.uci.edu/.
23
Dörr A, Rosenbaum L, Zell A. A ranking method for the concurrent learning of compounds with various activity profiles. J Cheminform 2015; 7:2. [PMID: 25643067 PMCID: PMC4306736 DOI: 10.1186/s13321-014-0050-6] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.0] [Received: 04/23/2014] [Accepted: 12/11/2014] [Indexed: 11/30/2022]
Abstract
Background: In this study, we present an SVM-based ranking algorithm for the concurrent learning of compounds with different activity profiles and their varying prioritization. To this end, a specific labeling of each compound was elaborated in order to infer virtual screening models against multiple targets. We compared the method with several state-of-the-art SVM classification techniques that are capable of inferring multi-target screening models on three chemical data sets (cytochrome P450s, dehydrogenases, and a trypsin-like protease data set) containing three different biological targets each. Results: The experiments show that ranking-based algorithms achieve increased performance for single- and multi-target virtual screening. Moreover, compounds that do not completely fulfill the desired activity profile are still ranked higher than decoys or compounds with an entirely undesired profile, compared to other multi-target SVM methods. Conclusions: SVM-based ranking methods constitute a valuable approach for virtual screening in multi-target drug design. The utilization of such methods is most helpful when dealing with compounds with various activity profiles and when finding many ligands with an already perfectly matching activity profile is not to be expected.
Affiliation(s)
- Alexander Dörr
- Center for Bioinformatics Tübingen (ZBIT), University of Tuebingen, Sand 1, Tübingen, 72076 Germany
- Lars Rosenbaum
- Center for Bioinformatics Tübingen (ZBIT), University of Tuebingen, Sand 1, Tübingen, 72076 Germany
- Andreas Zell
- Center for Bioinformatics Tübingen (ZBIT), University of Tuebingen, Sand 1, Tübingen, 72076 Germany
24
Lavecchia A. Machine-learning approaches in drug discovery: methods and applications. Drug Discov Today 2014; 20:318-31. [PMID: 25448759 DOI: 10.1016/j.drudis.2014.10.012] [Citation(s) in RCA: 353] [Impact Index Per Article: 35.3] [Received: 07/18/2014] [Revised: 09/27/2014] [Accepted: 10/24/2014] [Indexed: 12/19/2022]
Abstract
During the past decade, virtual screening (VS) has evolved from traditional similarity searching, which utilizes single reference compounds, into an advanced application domain for data mining and machine-learning approaches, which require large and representative training-set compounds to learn robust decision rules. The explosive growth in the amount of public domain-available chemical and biological data has generated huge effort to design, analyze, and apply novel learning methodologies. Here, I focus on machine-learning techniques within the context of ligand-based VS (LBVS). In addition, I analyze several relevant VS studies from recent publications, providing a detailed view of the current state-of-the-art in this field and highlighting not only the problematic issues, but also the successes and opportunities for further advances.
Affiliation(s)
- Antonio Lavecchia
- Department of Pharmacy, Drug Discovery Laboratory, University of Napoli 'Federico II', via D. Montesano 49, I-80131 Napoli, Italy.
25
Xie L, Ge X, Tan H, Xie L, Zhang Y, Hart T, Yang X, Bourne PE. Towards structural systems pharmacology to study complex diseases and personalized medicine. PLoS Comput Biol 2014; 10:e1003554. [PMID: 24830652 PMCID: PMC4022462 DOI: 10.1371/journal.pcbi.1003554] [Citation(s) in RCA: 44] [Impact Index Per Article: 4.4] [Indexed: 12/11/2022]
Abstract
Genome-Wide Association Studies (GWAS), whole genome sequencing, and high-throughput omics techniques have generated vast amounts of genotypic and molecular phenotypic data. However, these data have not yet been fully explored to improve the effectiveness and efficiency of drug discovery, which continues along a one-drug-one-target-one-disease paradigm. As a partial consequence, both the cost to launch a new drug and the attrition rate are increasing. Systems pharmacology and pharmacogenomics are emerging to exploit the available data and potentially reverse this trend, but, as we argue here, more is needed. To understand the impact of genetic, epigenetic, and environmental factors on drug action, we must study the structural energetics and dynamics of molecular interactions in the context of the whole human genome and interactome. Such an approach requires an integrative modeling framework for drug action that leverages advances in data-driven statistical modeling and mechanism-based multiscale modeling and transforms heterogeneous data from GWAS, high-throughput sequencing, structural genomics, functional genomics, and chemical genomics into unified knowledge. This is not a small task, but, as reviewed here, progress is being made towards the final goal of personalized medicines for the treatment of complex diseases.
Affiliation(s)
- Lei Xie
- Department of Computer Science, Hunter College, The City University of New York, New York, New York, United States of America
- Ph.D. Program in Computer Science, Biology, and Biochemistry, The Graduate Center, The City University of New York, New York, New York, United States of America
- Xiaoxia Ge
- Department of Computer Science, Hunter College, The City University of New York, New York, New York, United States of America
- Hepan Tan
- Department of Computer Science, Hunter College, The City University of New York, New York, New York, United States of America
- Li Xie
- Skaggs School of Pharmacy and Pharmaceutical Sciences, University of California San Diego, La Jolla, California, United States of America
- Yinliang Zhang
- Skaggs School of Pharmacy and Pharmaceutical Sciences, University of California San Diego, La Jolla, California, United States of America
- Thomas Hart
- Department of Biological Sciences, Hunter College, The City University of New York, New York, New York, United States of America
- Xiaowei Yang
- School of Public Health, Hunter College, The City University of New York, New York, New York, United States of America
- Philip E. Bourne
- Skaggs School of Pharmacy and Pharmaceutical Sciences, University of California San Diego, La Jolla, California, United States of America
26
Ng C, Hauptman R, Zhang Y, Bourne PE, Xie L. Anti-infectious drug repurposing using an integrated chemical genomics and structural systems biology approach. Pacific Symposium on Biocomputing 2014:136-47. [PMID: 24297541 PMCID: PMC6322395] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/02/2023]
Abstract
The emergence of multi-drug and extensive drug resistance of microbes to antibiotics poses a great threat to human health. Although drug repurposing is a promising solution for accelerating the drug development process, its application to anti-infectious drug discovery is limited by the scope of existing phenotype-, ligand-, or target-based methods. In this paper we introduce a new computational strategy to determine the genome-wide molecular targets of bioactive compounds in both human and bacterial genomes. Our method is based on the use of a novel algorithm, ligand Enrichment of Network Topological Similarity (ligENTS), to map the chemical universe to its global pharmacological space. ligENTS outperforms the state-of-the-art algorithms in identifying novel drug-target relationships. Furthermore, we integrate ligENTS with our structural systems biology platform to identify drug repurposing opportunities via target similarity profiling. Using this integrated strategy, we have identified novel P. falciparum targets of drug-like active compounds from the Malaria Box, and suggest that a number of approved drugs may be active against malaria. This study demonstrates the potential of an integrative chemical genomics and structural systems biology approach to drug repurposing.
Affiliation(s)
- Clara Ng
- Department of Computer Science, Hunter College, The City University of New York, 695 Park Avenue, New York City, NY 10065, U.S.A.
|
27
|
Zaretzki J, Matlock M, Swamidass SJ. XenoSite: Accurately Predicting CYP-Mediated Sites of Metabolism with Neural Networks. J Chem Inf Model 2013; 53:3373-83. [DOI: 10.1021/ci400518g] [Citation(s) in RCA: 135] [Impact Index Per Article: 12.3] [Indexed: 12/22/2022]
Affiliation(s)
- Jed Zaretzki
- Department of Pathology and Immunology, Washington University School of Medicine, St. Louis, Missouri 63130, United States
- Matthew Matlock
- Department of Pathology and Immunology, Washington University School of Medicine, St. Louis, Missouri 63130, United States
- S. Joshua Swamidass
- Department of Pathology and Immunology, Washington University School of Medicine, St. Louis, Missouri 63130, United States
|
28
|
Browning MR, Calhoun BT, Swamidass SJ. Managing missing measurements in small-molecule screens. J Comput Aided Mol Des 2013; 27:469-78. [DOI: 10.1007/s10822-013-9642-x] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 01/19/2013] [Accepted: 03/29/2013] [Indexed: 12/22/2022]
|
29
|
Varnek A, Baskin I. Machine learning methods for property prediction in chemoinformatics: Quo Vadis? J Chem Inf Model 2012; 52:1413-37. [PMID: 22582859 DOI: 10.1021/ci200409x] [Citation(s) in RCA: 148] [Impact Index Per Article: 12.3] [Indexed: 01/19/2023]
Abstract
This paper is focused on modern approaches to machine learning, most of which are as yet used infrequently or not at all in chemoinformatics. Machine learning methods are characterized in terms of the "modes of statistical inference" and "modeling levels" nomenclature and by considering different facets of the modeling with respect to input/output matching, data types, models duality, and models inference. Particular attention is paid to new approaches and concepts that may provide efficient solutions to common problems in chemoinformatics: improvement of predictive performance of structure-property (activity) models, generation of structures possessing desirable properties, model applicability domain, modeling of properties with functional endpoints (e.g., phase diagrams and dose-response curves), and accounting for multiple molecular species (e.g., conformers or tautomers).
Affiliation(s)
- Alexandre Varnek
- Laboratoire d'Infochimie, UMR 7177 CNRS, Université de Strasbourg, 4, rue B. Pascal, Strasbourg 67000, France.
|
30
|
Ranu S, Calhoun BT, Singh AK, Swamidass SJ. Probabilistic Substructure Mining From Small-Molecule Screens. Mol Inform 2011; 30:809-15. [DOI: 10.1002/minf.201100058] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.8] [Received: 04/11/2011] [Accepted: 06/04/2011] [Indexed: 12/20/2022]
|
31
|
Abstract
Repurposing and repositioning drugs, that is, discovering new uses for existing and experimental medicines, is an attractive strategy for rescuing stalled pharmaceutical projects, finding treatments for neglected diseases, and reducing the time, cost and risk of drug development. As this strategy emerged, academic researchers began performing high-throughput screens (HTS) of small molecules, the type of experiment once conducted exclusively in industry, and making the data from these screens available to all. Several methods can mine these data to inform repurposing and repositioning efforts. Despite these methods' limitations, it is hoped that they will accelerate the discovery of new uses for known drugs, but this hope has not yet been realized.
Affiliation(s)
- S Joshua Swamidass
- Division of Laboratory and Genomic Medicine, Department of Pathology and Immunology, Washington University School of Medicine, St Louis, MO, USA.
|
32
|
Rosenbaum L, Hinselmann G, Jahn A, Zell A. Interpreting linear support vector machine models with heat map molecule coloring. J Cheminform 2011; 3:11. [PMID: 21439031 PMCID: PMC3076244 DOI: 10.1186/1758-2946-3-11] [Citation(s) in RCA: 43] [Impact Index Per Article: 3.3] [Received: 12/31/2010] [Accepted: 03/25/2011] [Indexed: 11/17/2022]
Abstract
Background: Model-based virtual screening plays an important role in the early drug discovery stage. The outcomes of high-throughput screenings are a valuable source for machine learning algorithms to infer such models. Besides strong performance, interpretability is a desired property of a machine learning model, because it can guide the optimization of a compound in later drug discovery stages. Linear support vector machines have been shown to perform convincingly on large-scale data sets. The goal of this study is to present a heat map molecule coloring technique to interpret linear support vector machine models. Based on the weights of a linear model, the visualization approach colors each atom and bond of a compound according to its importance for activity. Results: We evaluated our approach on a toxicity data set, a chromosome aberration data set, and the maximum unbiased validation data sets. The experiments show that our method sensibly visualizes structure-property and structure-activity relationships of a linear support vector machine model. The coloring of ligands in the binding pocket of several crystal structures of a maximum unbiased validation data set target indicates that our approach helps to determine the correct ligand orientation in the binding pocket. Additionally, the heat map coloring enables the identification of substructures important for the binding of an inhibitor. Conclusions: In combination with heat map coloring, linear support vector machine models can help to guide the modification of a compound in later stages of drug discovery. In particular, substructures identified as important by our method might be a starting point for optimization of a lead compound. The heat map coloring should be considered complementary to structure-based modeling approaches. As such, it helps to get a better understanding of the binding mode of an inhibitor.
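The weight-to-atom mapping described in this abstract can be illustrated with a minimal sketch. It assumes that each model feature is a substructure covering a known set of atom indices and that an atom's importance is the sum of the weights of all features covering it; the function names, the feature names, and this particular aggregation rule are illustrative assumptions, not the authors' published procedure.

```python
def atom_scores(feature_weights, feature_atoms, n_atoms):
    """Sum linear-model weights onto the atoms each feature covers.

    feature_weights: dict mapping feature name -> weight of the linear model
    feature_atoms:   dict mapping feature name -> atom indices covered (assumed known)
    """
    scores = [0.0] * n_atoms
    for feature, atoms in feature_atoms.items():
        weight = feature_weights.get(feature, 0.0)
        for atom in atoms:
            scores[atom] += weight
    return scores


def normalize(scores):
    """Scale scores into [-1, 1] so they can be mapped onto a diverging color scale."""
    largest = max((abs(s) for s in scores), default=0.0)
    if largest == 0.0:
        return [0.0] * len(scores)
    return [s / largest for s in scores]


# Hypothetical four-atom molecule with two weighted substructure features.
weights = {"nitro": 2.0, "ring": -1.0}
covered = {"nitro": [0, 1, 2], "ring": [2, 3]}
heat = normalize(atom_scores(weights, covered, n_atoms=4))
```

Positive values would be drawn in one color and negative values in the other, with intensity proportional to the magnitude.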
Affiliation(s)
- Lars Rosenbaum
- University of Tübingen, Center for Bioinformatics (ZBIT), Sand 1, 72076 Tübingen, Germany.
|
33
|
Gago G, Diacovich L, Arabolaza A, Tsai SC, Gramajo H. Fatty acid biosynthesis in actinomycetes. FEMS Microbiol Rev 2011; 35:475-97. [PMID: 21204864 DOI: 10.1111/j.1574-6976.2010.00259.x] [Citation(s) in RCA: 116] [Impact Index Per Article: 8.9] [Indexed: 11/28/2022]
Abstract
All organisms that produce fatty acids do so via a repeated cycle of reactions. In mammals and other animals, these reactions are catalyzed by a type I fatty acid synthase (FAS), a large multifunctional protein to which the growing chain is covalently attached. In contrast, most bacteria (and plants) contain a type II system in which each reaction is catalyzed by a discrete protein. The pathway of fatty acid biosynthesis in Escherichia coli is well established and has provided a foundation for elucidating the type II FAS pathways in other bacteria (White et al., 2005). However, fatty acid biosynthesis is more diverse in the phylum Actinobacteria: Mycobacterium species possess both FAS systems, while Streptomyces species have only the multienzyme FAS II system and Corynebacterium species exclusively FAS I. In this review, we present an overview of the genome organization, biochemical properties and physiological relevance of the two FAS systems in the three genera of actinomycetes mentioned above. We also address in detail the biochemical and structural properties of the acyl-CoA carboxylases (ACCases) that catalyze the first committed step of fatty acid synthesis in actinomycetes, and discuss the molecular bases of their substrate specificity and the structure-based identification of new ACCase inhibitors with antimycobacterial properties.
Affiliation(s)
- Gabriela Gago
- Microbiology Division, IBR (Instituto de Biología Molecular y Celular de Rosario), Consejo Nacional de Investigaciones Científicas y Técnicas, Facultad de Ciencias Bioquímicas y Farmacéuticas, Universidad Nacional de Rosario, Rosario, Argentina
|
34
|
De Grave K, Costa F. Molecular graph augmentation with rings and functional groups. J Chem Inf Model 2010; 50:1660-8. [PMID: 20795702 DOI: 10.1021/ci9005035] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Indexed: 11/28/2022]
Abstract
Molecular graphs are a compact representation of molecules but may be too concise to obtain optimal generalization performance from graph-based machine learning algorithms. Over centuries, chemists have learned which functional groups in molecules are important. This knowledge is normally not manifest in molecular graphs. In this paper, we introduce a simple method to incorporate this type of background knowledge: we insert additional vertices with corresponding edges for each functional group and ring structure identified in the molecule. We present experimental evidence that, on a wide range of ligand-based tasks and data sets, the proposed augmentation method improves the predictive performance over several graph kernel-based quantitative structure-activity relationship models. When the augmentation technique is used with the recent pairwise maximal common subgraphs kernel, we achieve a significant improvement over the current state-of-the-art on the NCI-60 cancer data set in 28 out of 60 cell lines, with the other 32 cell lines showing no significant difference in accuracy. Finally, on the Bursi mutagenicity data set, we obtain near-optimal predictions.
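The augmentation step in this abstract (one extra vertex per identified ring or functional group, with edges to its member atoms) can be sketched as follows. The sketch assumes the rings and functional groups have already been identified by some other tool; the function name, the vertex labels, and the edge-list representation are illustrative assumptions rather than the authors' implementation.

```python
def augment_graph(atom_labels, bonds, groups):
    """Add one vertex per ring or functional group, linked to its member atoms.

    atom_labels: list of vertex labels, one per atom
    bonds:       list of (i, j) atom index pairs
    groups:      list of (group_label, member_atom_indices), assumed precomputed
    """
    labels = list(atom_labels)
    edges = list(bonds)
    for group_label, members in groups:
        new_vertex = len(labels)          # index of the inserted vertex
        labels.append(group_label)
        edges.extend((new_vertex, atom) for atom in members)
    return labels, edges


# Hypothetical six-membered carbon ring: six atoms, six ring bonds.
ring_atoms = ["C"] * 6
ring_bonds = [(i, (i + 1) % 6) for i in range(6)]
labels, edges = augment_graph(ring_atoms, ring_bonds, [("ring6", range(6))])
```

The augmented graph can then be passed to any graph kernel unchanged, since the background knowledge is encoded as ordinary labeled vertices and edges.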
Affiliation(s)
- Kurt De Grave
- Katholieke Universiteit Leuven, Department of Computer Science, Celestijnenlaan 200A, 3001 Heverlee, Belgium.
|
35
|
Hinselmann G, Rosenbaum L, Jahn A, Fechner N, Ostermann C, Zell A. Large-Scale Learning of Structure-Activity Relationships Using a Linear Support Vector Machine and Problem-Specific Metrics. J Chem Inf Model 2011; 51:203-13. [DOI: 10.1021/ci100073w] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.0] [Indexed: 02/07/2023]
Affiliation(s)
- Georg Hinselmann
- Center for Bioinformatics (ZBIT), University of Tübingen, Tübingen, Germany
- Lars Rosenbaum
- Center for Bioinformatics (ZBIT), University of Tübingen, Tübingen, Germany
- Andreas Jahn
- Center for Bioinformatics (ZBIT), University of Tübingen, Tübingen, Germany
- Nikolas Fechner
- Center for Bioinformatics (ZBIT), University of Tübingen, Tübingen, Germany
- Andreas Zell
- Center for Bioinformatics (ZBIT), University of Tübingen, Tübingen, Germany
|
36
|
Swamidass SJ, Bittker JA, Bodycombe NE, Ryder SP, Clemons PA. An economic framework to prioritize confirmatory tests after a high-throughput screen. J Biomol Screen 2010; 15:680-6. [PMID: 20547534 DOI: 10.1177/1087057110372803] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.9] [Indexed: 11/16/2022]
Abstract
How many hits from a high-throughput screen should be sent for confirmatory experiments? Analytical answers to this question are derived from statistics alone and aim to fix, for example, the false discovery rate at a predetermined tolerance. These methods, however, neglect local economic context and consequently lead to irrational experimental strategies. In contrast, the authors argue that this question is essentially economic, not statistical, and is amenable to an economic analysis that admits an optimal solution. This solution, in turn, suggests a novel tool for deciding the number of hits to confirm and the marginal cost of discovery, which meaningfully quantifies the local economic trade-off between true and false positives, yielding an economically optimal experimental strategy. Validated with retrospective simulations and prospective experiments, this strategy identified 157 additional actives that had been erroneously labeled inactive in at least one real-world screening experiment.
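One plausible reading of the economic stopping rule in this abstract can be sketched in a few lines. The sketch assumes each hit comes with a predicted probability of being a true active, that every confirmatory test has the same cost, and that the marginal cost of discovery for a hit is the test cost divided by that probability; these definitions, the function names, and the numbers are illustrative assumptions, not the formulation published by the authors.

```python
def marginal_cost_of_discovery(p_active, cost_per_test):
    """Expected spend needed to confirm one more true active at this hit."""
    if p_active <= 0:
        return float("inf")
    return cost_per_test / p_active


def hits_to_confirm(probabilities, cost_per_test, value_per_active):
    """Count hits worth confirming: stop once the marginal cost exceeds the value."""
    ranked = sorted(probabilities, reverse=True)   # most promising hits first
    count = 0
    for p in ranked:
        if marginal_cost_of_discovery(p, cost_per_test) > value_per_active:
            break
        count += 1
    return count


# Hypothetical screen: four hits, a test costs 10 units, an active is worth 40.
n_tests = hits_to_confirm([0.9, 0.5, 0.2, 0.05], cost_per_test=10, value_per_active=40)
```

Unlike a fixed false discovery rate, the cutoff here moves with the local cost of a test and the value placed on a confirmed active.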
Affiliation(s)
- S Joshua Swamidass
- Division of Laboratory and Genomic Medicine, Department of Pathology and Immunology, Washington University School of Medicine, St. Louis, MO 63110, USA.
|
37
|
Geppert H, Vogt M, Bajorath J. Current trends in ligand-based virtual screening: molecular representations, data mining methods, new application areas, and performance evaluation. J Chem Inf Model 2010; 50:205-16. [PMID: 20088575 DOI: 10.1021/ci900419k] [Citation(s) in RCA: 231] [Impact Index Per Article: 16.5] [Indexed: 12/19/2022]
Affiliation(s)
- Hanna Geppert
- Department of Life Science Informatics, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Dahlmannstrasse 2, D-53113 Bonn, Germany
|
38
|
Swamidass SJ, Azencott CA, Daily K, Baldi P. A CROC stronger than ROC: measuring, visualizing and optimizing early retrieval. Bioinformatics 2010; 26:1348-56. [PMID: 20378557 DOI: 10.1093/bioinformatics/btq140] [Citation(s) in RCA: 74] [Impact Index Per Article: 5.3] [Indexed: 11/13/2022]
Abstract
MOTIVATION: The performance of classifiers is often assessed using Receiver Operating Characteristic (ROC) curves [or accumulation curves (AC), also called enrichment curves] and the corresponding areas under the curves (AUCs). However, in many fundamental problems ranging from information retrieval to drug discovery, only the very top of the ranked list of predictions is of any interest and ROCs and AUCs are not very useful. New metrics, visualizations and optimization tools are needed to address this 'early retrieval' problem. RESULTS: To address the early retrieval problem, we develop the general concentrated ROC (CROC) framework. In this framework, any relevant portion of the ROC (or AC) curve is magnified smoothly by an appropriate continuous transformation of the coordinates with a corresponding magnification factor. Appropriate families of magnification functions confined to the unit square are derived and their properties are analyzed together with the resulting CROC curves. The area under the CROC curve (AUC[CROC]) can be used to assess early retrieval. The general framework is demonstrated on a drug discovery problem and used to discriminate more accurately the early retrieval performance of five different predictors. From this framework, we propose a novel metric and visualization, the CROC(exp), an exponential transform of the ROC curve, as an alternative to other methods. The CROC(exp) provides a principled, flexible and effective way for measuring and visualizing early retrieval performance with excellent statistical power. Corresponding methods for optimizing early retrieval are also described in the Appendix. AVAILABILITY: Datasets are publicly available. Python code and command-line utilities implementing CROC curves and metrics are available at http://pypi.python.org/pypi/CROC/. CONTACT: pfbaldi@ics.uci.edu
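The exponential magnification described in this abstract can be illustrated with a short, self-contained sketch. It assumes the magnification function f(x) = (1 - exp(-αx)) / (1 - exp(-α)) applied to the false positive rate, an illustrative setting of α = 7, and no tied scores; the function names are hypothetical and do not reproduce the interface of the CROC package mentioned above.

```python
import math


def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) points, best score first."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, label in ranked:            # assumes no tied scores
        if label:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def exp_magnify(x, alpha=7.0):
    """Map [0, 1] onto [0, 1], stretching the region near x = 0."""
    return (1.0 - math.exp(-alpha * x)) / (1.0 - math.exp(-alpha))


def area_under(points):
    """Trapezoidal area under a piecewise-linear curve."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


def auc_croc(scores, labels, alpha=7.0):
    """Area under the ROC curve after magnifying the false positive rate axis."""
    magnified = [(exp_magnify(x, alpha), y) for x, y in roc_points(scores, labels)]
    return area_under(magnified)


# Two rankings of one active among four inactives: found first versus found third.
early = auc_croc([5, 4, 3, 2, 1], [1, 0, 0, 0, 0])
late = auc_croc([5, 4, 3, 2, 1], [0, 0, 1, 0, 0])
```

Because the transform stretches the low false-positive-rate region, a ranking that places the active at the very top is rewarded far more strongly than under the plain ROC area.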
Affiliation(s)
- S Joshua Swamidass
- Division of Laboratory and Genomic Medicine, Department of Pathology and Immunology, Washington University, St. Louis, MO 63110, USA
|