1
Schuffenhauer A, Schneider N, Hintermann S, Auld D, Blank J, Cotesta S, Engeloch C, Fechner N, Gaul C, Giovannoni J, Jansen J, Joslin J, Krastel P, Lounkine E, Manchester J, Monovich LG, Pelliccioli AP, Schwarze M, Shultz MD, Stiefl N, Baeschlin DK. Evolution of Novartis' Small Molecule Screening Deck Design. J Med Chem 2020; 63:14425-14447. [PMID: 33140646 DOI: 10.1021/acs.jmedchem.0c01332] [Citation(s) in RCA: 25] [Impact Index Per Article: 6.3] [Indexed: 12/22/2022]
Abstract
This article summarizes the evolution of the screening deck at the Novartis Institutes for BioMedical Research (NIBR). Historically, the screening deck was an assembly of all available compounds. In 2015, we designed a first deck to facilitate access to diverse subsets with optimized properties. We allocated the compounds as plated subsets on a 2D grid with property based ranking in one dimension and increasing structural redundancy in the other. The learnings from the 2015 screening deck were applied to the design of a next generation in 2019. We found that using traditional leadlikeness criteria (mainly MW, clogP) reduces the hit rates of attractive chemical starting points in subset screening. Consequently, the 2019 deck relies on solubility and permeability to select preferred compounds. The 2019 design also uses NIBR's experimental assay data and inferred biological activity profiles in addition to structural diversity to define redundancy across the compound sets.
Affiliation(s)
- Ansgar Schuffenhauer
- Novartis Institutes for BioMedical Research, Novartis Campus, CH-4002 Basel, Switzerland
- Nadine Schneider
- Novartis Institutes for BioMedical Research, Novartis Campus, CH-4002 Basel, Switzerland
- Samuel Hintermann
- Novartis Institutes for BioMedical Research, Novartis Campus, CH-4002 Basel, Switzerland
- Douglas Auld
- Novartis Institutes for BioMedical Research Inc., 181 Massachusetts Avenue, Cambridge, Massachusetts 02139, United States
- Jutta Blank
- Novartis Institutes for BioMedical Research, Novartis Campus, CH-4002 Basel, Switzerland
- Simona Cotesta
- Novartis Institutes for BioMedical Research, Novartis Campus, CH-4002 Basel, Switzerland
- Caroline Engeloch
- Novartis Institutes for BioMedical Research, Novartis Campus, CH-4002 Basel, Switzerland
- Nikolas Fechner
- Novartis Institutes for BioMedical Research, Novartis Campus, CH-4002 Basel, Switzerland
- Christoph Gaul
- Novartis Institutes for BioMedical Research, Novartis Campus, CH-4002 Basel, Switzerland
- Jerome Giovannoni
- Novartis Institutes for BioMedical Research, Novartis Campus, CH-4002 Basel, Switzerland
- Johanna Jansen
- Novartis Institutes for BioMedical Research-Emeryville, 5300 Chiron Way, Emeryville, California 94608-2916, United States
- John Joslin
- Genomics Institute of the Novartis Foundation, San Diego, California 92121, United States
- Philipp Krastel
- Novartis Institutes for BioMedical Research, Novartis Campus, CH-4002 Basel, Switzerland
- Eugen Lounkine
- Novartis Institutes for BioMedical Research Inc., 181 Massachusetts Avenue, Cambridge, Massachusetts 02139, United States
- John Manchester
- Novartis Institutes for BioMedical Research Inc., 181 Massachusetts Avenue, Cambridge, Massachusetts 02139, United States
- Lauren G Monovich
- Novartis Institutes for BioMedical Research Inc., 181 Massachusetts Avenue, Cambridge, Massachusetts 02139, United States
- Anna Paola Pelliccioli
- Novartis Institutes for BioMedical Research, Novartis Campus, CH-4002 Basel, Switzerland
- Manuel Schwarze
- Novartis Institutes for BioMedical Research, Novartis Campus, CH-4002 Basel, Switzerland
- Michael D Shultz
- Novartis Institutes for BioMedical Research Inc., 181 Massachusetts Avenue, Cambridge, Massachusetts 02139, United States
- Nikolaus Stiefl
- Novartis Institutes for BioMedical Research, Novartis Campus, CH-4002 Basel, Switzerland
- Daniel K Baeschlin
- Novartis Institutes for BioMedical Research, Novartis Campus, CH-4002 Basel, Switzerland
2
Vachery J, Ranu S. RISC: Rapid Inverted-Index Based Search of Chemical Fingerprints. J Chem Inf Model 2019; 59:2702-2713. [PMID: 30908028 DOI: 10.1021/acs.jcim.9b00069] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Indexed: 11/28/2022]
Abstract
The ability to search for a query molecule on massive molecular repositories is a fundamental task in chemoinformatics and drug-discovery. Chemical fingerprints are commonly used to characterize the structure and properties of molecules. Some fingerprints, particularly unfolded fingerprints, are often of extreme high dimension and sparse where only few features have a positive value. In this work, we propose a new searching algorithm, RISC, which exploits sparsity in high-dimensional fingerprints to derive effective pruning mechanisms and dramatically speed-up searching efficiency. RISC is robust enough to work on both binary and nonbinary chemical fingerprints. Extensive experiments on Range Queries and Top-k Queries across several molecular repositories demonstrate that at fingerprints of dimension 2048 and above, which is often the case with unfolded fingerprints, RISC is consistently faster than the state-of-the-art techniques. The source code of our implementation is available at http://www.cse.iitd.ac.in/~sayan/software.html .
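The abstract above relies on the fact that sparse fingerprints let a search skip most of the database. As a rough illustration only, and not the RISC algorithm itself (its pruning rules are not given in the abstract), the following Python sketch builds a plain inverted index over sparse binary fingerprints and answers a Tanimoto range query by visiting only molecules that share at least one feature with the query. The function names, the toy feature ids, and the molecule ids are all hypothetical.

```python
from collections import defaultdict


def build_inverted_index(fingerprints):
    """Map each 'on' feature id to the set of molecule ids that contain it."""
    index = defaultdict(set)
    for mol_id, features in fingerprints.items():
        for feature in features:
            index[feature].add(mol_id)
    return index


def range_query(query, fingerprints, index, threshold):
    """Return (mol_id, tanimoto) pairs with similarity >= threshold.

    Assumes threshold > 0, so molecules sharing no feature with the
    query (similarity 0) never need to be visited.
    """
    shared = defaultdict(int)
    # Walk only the posting lists of features present in the query.
    for feature in query:
        for mol_id in index.get(feature, ()):
            shared[mol_id] += 1
    hits = []
    for mol_id, common in shared.items():
        # |A or B| = |A| + |B| - |A and B| for binary fingerprints.
        union = len(query) + len(fingerprints[mol_id]) - common
        similarity = common / union
        if similarity >= threshold:
            hits.append((mol_id, similarity))
    return sorted(hits, key=lambda pair: (-pair[1], pair[0]))


# Toy data: sets of 'on' feature ids (hypothetical values).
fingerprints = {"m1": {1, 5, 9}, "m2": {1, 5, 7, 11}, "m3": {2, 4}}
index = build_inverted_index(fingerprints)
query = {1, 5, 9, 11}
print(range_query(query, fingerprints, index, 0.7))
```

In this toy case "m3" is never touched because none of its features appear in the query, which is the basic saving an inverted index offers on sparse data.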
Affiliation(s)
- Jithin Vachery
- Department of Computer Science, IIT-Madras, Chennai, 600036, India
- Sayan Ranu
- Department of Computer Science, IIT-Delhi, New Delhi, 110016, India
3
Probst D, Reymond JL. A probabilistic molecular fingerprint for big data settings. J Cheminform 2018; 10:66. [PMID: 30564943 PMCID: PMC6755601 DOI: 10.1186/s13321-018-0321-8] [Citation(s) in RCA: 49] [Impact Index Per Article: 8.2] [Received: 10/06/2018] [Accepted: 12/13/2018] [Indexed: 11/10/2022]
Abstract
Background Among the various molecular fingerprints available to describe small organic molecules, extended connectivity fingerprint, up to four bonds (ECFP4) performs best in benchmarking drug analog recovery studies as it encodes substructures with a high level of detail. Unfortunately, ECFP4 requires high dimensional representations (≥ 1024D) to perform well, resulting in ECFP4 nearest neighbor searches in very large databases such as GDB, PubChem or ZINC to perform very slowly due to the curse of dimensionality. Results Herein we report a new fingerprint, called MinHash fingerprint, up to six bonds (MHFP6), which encodes detailed substructures using the extended connectivity principle of ECFP in a fundamentally different manner, increasing the performance of exact nearest neighbor searches in benchmarking studies and enabling the application of locality sensitive hashing (LSH) approximate nearest neighbor search algorithms. To describe a molecule, MHFP6 extracts the SMILES of all circular substructures around each atom up to a diameter of six bonds and applies the MinHash method to the resulting set. MHFP6 outperforms ECFP4 in benchmarking analog recovery studies. By leveraging locality sensitive hashing, LSH approximate nearest neighbor search methods perform as well on unfolded MHFP6 as comparable methods do on folded ECFP4 fingerprints in terms of speed and relative recovery rate, while operating in very sparse and high-dimensional binary chemical space. Conclusion MHFP6 is a new molecular fingerprint, encoding circular substructures, which outperforms ECFP4 for analog searches while allowing the direct application of locality sensitive hashing algorithms. It should be well suited for the analysis of large databases. 
The source code for MHFP6 is available on GitHub (https://github.com/reymond-group/mhfp). Electronic supplementary material: The online version of this article (10.1186/s13321-018-0321-8) contains supplementary material, which is available to authorized users.
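The abstract above describes applying the MinHash method to a set of substructure SMILES. The Python sketch below shows only the generic MinHash step under simplifying assumptions: it is not the code from the mhfp repository, the seeded BLAKE2b hash is an arbitrary choice made here, and the token sets are hand-written placeholders rather than real circular substructures extracted from a molecule.

```python
import hashlib


def _seeded_hash(token, seed):
    """64-bit hash of a string token under a given seed (illustrative choice)."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(tokens, num_hashes=128):
    """Keep, for each seeded hash function, the minimum hash over the token set."""
    tokens = set(tokens)
    if not tokens:
        raise ValueError("cannot build a signature from an empty set")
    return [min(_seeded_hash(t, seed) for t in tokens) for seed in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature positions estimates the Jaccard similarity."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


# Placeholder token sets standing in for substructure SMILES (hypothetical).
mol_a = {"C", "CC", "CCO", "CO", "O"}
mol_b = {"C", "CC", "CCN", "CN", "N"}
sig_a = minhash_signature(mol_a)
sig_b = minhash_signature(mol_b)
print(estimated_jaccard(sig_a, sig_b))  # estimate of |A and B| / |A or B|
```

Because a signature has a fixed length regardless of how many distinct substructures exist, it can be compared position by position, which is what makes locality sensitive hashing schemes applicable.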
Affiliation(s)
- Daniel Probst
- Department of Chemistry and Biochemistry, National Center for Competence in Research NCCR TransCure, University of Berne, Freiestrasse 3, 3012, Bern, Switzerland
- Jean-Louis Reymond
- Department of Chemistry and Biochemistry, National Center for Competence in Research NCCR TransCure, University of Berne, Freiestrasse 3, 3012, Bern, Switzerland
4
Fraaije JGEM, van Male J, Becherer P, Serral Gracià R. Coarse-Grained Models for Automated Fragmentation and Parametrization of Molecular Databases. J Chem Inf Model 2016; 56:2361-2377. [DOI: 10.1021/acs.jcim.6b00003] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.3] [Indexed: 11/28/2022]
Affiliation(s)
- Johannes G. E. M. Fraaije
- Leiden Institute of Chemistry, Leiden University, Einsteinweg 55, 2300 RA Leiden, The Netherlands
- Culgi BV, Galileiweg 8, 2333 BD Leiden, The Netherlands
- Jan van Male
- Culgi BV, Galileiweg 8, 2333 BD Leiden, The Netherlands
- Paul Becherer
- Culgi BV, Galileiweg 8, 2333 BD Leiden, The Netherlands
5
Sukumar N, Krein MP, Prabhu G, Bhattacharya S, Sen S. Network measures for chemical library design. Drug Dev Res 2015; 75:402-11. [PMID: 25195584 DOI: 10.1002/ddr.21218] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 01/13/2023]
Abstract
In this overview, we examine recent developments in network approaches to drug design. A brief overview of networks is followed by a discussion of how chemical similarity networks and their properties address challenges in drug design. Multiple methods used to assess or enhance chemical diversity for early-stage drug discovery are discussed, as well as methods that can be used for drug repositioning and ligand polypharmacology.
Affiliation(s)
- Nagamani Sukumar
- Department of Chemistry, Shiv Nadar University, Dadri, Gautam Budh Nagar, U.P., 201314, India; Center for Informatics, Shiv Nadar University, Dadri, Gautam Budh Nagar, U.P., 201314, India
6
7
Chemical space networks: a powerful new paradigm for the description of chemical space. J Comput Aided Mol Des 2014; 28:795-802. [PMID: 24925682 DOI: 10.1007/s10822-014-9760-0] [Citation(s) in RCA: 50] [Impact Index Per Article: 5.0] [Received: 04/14/2014] [Accepted: 06/04/2014] [Indexed: 01/26/2023]
Abstract
The concept of chemical space is playing an increasingly important role in many areas of chemical research, especially medicinal chemistry and chemical biology. It is generally conceived as consisting of numerous compound clusters of varying sizes scattered throughout the space in much the same way as galaxies of stars inhabit our universe. A number of issues associated with this coordinate-based representation are discussed. Not the least of which is the continuous nature of the space, a feature not entirely compatible with the inherently discrete nature of chemical space. Cell-based representations, which are derived from coordinate-based spaces, have also been developed that facilitate a number of chemical informatic activities (e.g., diverse subset selection, filling 'diversity voids', and comparing compound collections). These representations generally suffer the 'curse of dimensionality'. In this work, networks are proposed as an attractive paradigm for representing chemical space since they circumvent many of the issues associated with coordinate- and cell-based representations, including the curse of dimensionality. In addition, their relational structure is entirely compatible with the intrinsic nature of chemical space. A description of the features of these chemical space networks is presented that emphasizes their statistical characteristics and indicates how they are related to various types of network topologies that exhibit random, scale-free, and/or 'small world' properties.
8
9
Csermely P, Korcsmáros T, Kiss HJM, London G, Nussinov R. Structure and dynamics of molecular networks: a novel paradigm of drug discovery: a comprehensive review. Pharmacol Ther 2013; 138:333-408. [PMID: 23384594 PMCID: PMC3647006 DOI: 10.1016/j.pharmthera.2013.01.016] [Citation(s) in RCA: 512] [Impact Index Per Article: 46.5] [Received: 01/22/2013] [Accepted: 01/22/2013] [Indexed: 02/02/2023]
Abstract
Despite considerable progress in genome- and proteome-based high-throughput screening methods and in rational drug design, the increase in approved drugs in the past decade did not match the increase of drug development costs. Network description and analysis not only give a systems-level understanding of drug action and disease complexity, but can also help to improve the efficiency of drug design. We give a comprehensive assessment of the analytical tools of network topology and dynamics. The state-of-the-art use of chemical similarity, protein structure, protein-protein interaction, signaling, genetic interaction and metabolic networks in the discovery of drug targets is summarized. We propose that network targeting follows two basic strategies. The "central hit strategy" selectively targets central nodes/edges of the flexible networks of infectious agents or cancer cells to kill them. The "network influence strategy" works against other diseases, where an efficient reconfiguration of rigid networks needs to be achieved by targeting the neighbors of central nodes/edges. It is shown how network techniques can help in the identification of single-target, edgetic, multi-target and allo-network drug target candidates. We review the recent boom in network methods helping hit identification, lead selection optimizing drug efficacy, as well as minimizing side-effects and drug toxicity. Successful network-based drug development strategies are shown through the examples of infections, cancer, metabolic diseases, neurodegenerative diseases and aging. Summarizing >1200 references we suggest an optimized protocol of network-aided drug development, and provide a list of systems-level hallmarks of drug quality. Finally, we highlight network-related drug development trends helping to achieve these hallmarks by a cohesive, global approach.
Affiliation(s)
- Peter Csermely
- Department of Medical Chemistry, Semmelweis University, P.O. Box 260, H-1444 Budapest 8, Hungary.
10
Willett P. Fusing similarity rankings in ligand-based virtual screening. Comput Struct Biotechnol J 2013; 5:e201302002. [PMID: 24688695 PMCID: PMC3962232 DOI: 10.5936/csbj.201302002] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.2] [Received: 08/20/2012] [Revised: 10/04/2012] [Accepted: 10/12/2012] [Indexed: 11/22/2022]
Abstract
Data fusion is the name given to a range of methods for combining multiple sources of evidence. This mini-review summarizes the use of one such class of methods for combining the rankings obtained when similarity searching is used for ligand-based virtual screening. Two main approaches are described: similarity fusion involves combining rankings from single searches based on multiple similarity measures; and group fusion involves combining rankings from multiple searches based on a single similarity measure. The review then focuses on the rules that are available for combining similarity rankings, and on the evidence that exists for the superiority of fusion-based methods over conventional similarity searching.
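The abstract above mentions rules for combining similarity rankings without naming them. As a hedged illustration, the Python sketch below applies two simple rules that are assumptions chosen here for clarity (sum of ranks, and maximum similarity score) to the results of two searches. The function names, molecule ids, and scores are hypothetical, and every search is assumed to score the same set of molecules. The same code applies whether the lists come from different similarity measures (similarity fusion) or from different reference structures (group fusion).

```python
def ranks_from_scores(scores):
    """Rank 1 is the most similar molecule; ties are broken by molecule id."""
    ordered = sorted(scores, key=lambda mol: (-scores[mol], mol))
    return {mol: position + 1 for position, mol in enumerate(ordered)}


def fuse_sum_of_ranks(score_lists):
    """Order molecules by the sum of their ranks across searches (lower is better)."""
    rank_lists = [ranks_from_scores(scores) for scores in score_lists]
    fused = {mol: sum(ranks[mol] for ranks in rank_lists) for mol in rank_lists[0]}
    return sorted(fused, key=lambda mol: (fused[mol], mol))


def fuse_max_score(score_lists):
    """Order molecules by their best similarity in any search (higher is better)."""
    fused = {mol: max(scores[mol] for scores in score_lists) for mol in score_lists[0]}
    return sorted(fused, key=lambda mol: (-fused[mol], mol))


# Two searches over the same three molecules (made-up similarity scores).
search_a = {"m1": 0.9, "m2": 0.4, "m3": 0.6}
search_b = {"m1": 0.2, "m2": 0.7, "m3": 0.8}

print(fuse_sum_of_ranks([search_a, search_b]))
print(fuse_max_score([search_a, search_b]))
```

On this toy input the two rules disagree about the top molecule: "m3" is consistently near the top and wins on summed ranks, while "m1" has the single highest score and wins under the maximum rule.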
Affiliation(s)
- Peter Willett
- Information School, University of Sheffield, 211 Portobello Street, Sheffield S1 4DP, United Kingdom
11
Affiliation(s)
- Peter Willett
- Information School, University of Sheffield, 211 Portobello Street, Sheffield S1 4DP, United Kingdom.
12
Graphs and networks in chemical and biological informatics: past, present and future. Future Med Chem 2012; 4:2039-47. [DOI: 10.4155/fmc.12.128] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 12/13/2022]
Abstract
Chemical and biological network analysis has recently garnered intense interest from the perspective of drug design and discovery. While graph theoretic concepts have a long history in chemistry – predating quantum mechanics – and graphical measures of chemical structures date back to the 1970s, it is only recently with the advent of public repositories of information and availability of high-throughput assays and computational resources that network analysis of large-scale chemical networks, such as protein–protein interaction networks, has become possible. Drug design and discovery are undergoing a paradigm shift, from the notion of ‘one target, one drug’ to a much more nuanced view that relies on multiple sources of information: genomic, proteomic, metabolomic and so on. This holistic view of drug design is an incredibly daunting undertaking still very much in its infancy. Here, we focus on current developments in graph- and network-centric approaches in chemical and biological informatics, with particular reference to applications in the fields of SAR modeling and drug design. Key insights from the past suggest a path forward via visualization and fusion of multiple sources of chemical network data.
13
Nasr R, Vernica R, Li C, Baldi P. Speeding up chemical searches using the inverted index: the convergence of chemoinformatics and text search methods. J Chem Inf Model 2012; 52:891-900. [PMID: 22462644 DOI: 10.1021/ci200552r] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.5] [Indexed: 11/28/2022]
Abstract
In ligand-based screening, retrosynthesis, and other chemoinformatics applications, one often seeks to search large databases of molecules in order to retrieve molecules that are similar to a given query. With the expanding size of molecular databases, the efficiency and scalability of data structures and algorithms for chemical searches are becoming increasingly important. Remarkably, both the chemoinformatics and information retrieval communities have converged on similar solutions whereby molecules or documents are represented by binary vectors, or fingerprints, indexing their substructures such as labeled paths for molecules and n-grams for text, with the same Jaccard-Tanimoto similarity measure. As a result, similarity search methods from one field can be adapted to the other. Here we adapt recent, state-of-the-art, inverted index methods from information retrieval to speed up similarity searches in chemoinformatics. Our results show a several-fold speed-up improvement over previous methods for both threshold searches and top-K searches. We also provide a mathematical analysis that allows one to predict the level of pruning achieved by the inverted index approach and validate the quality of these predictions through simulation experiments. All results can be replicated using data freely downloadable from http://cdb.ics.uci.edu/ .
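The abstract above refers to the Jaccard-Tanimoto measure and to pruning in threshold searches. The Python sketch below illustrates one elementary pruning idea, which is not necessarily the one analysed in the paper: for binary fingerprints A and B the intersection is at most min(|A|, |B|) and the union at least max(|A|, |B|), so the similarity cannot exceed min(|A|, |B|) / max(|A|, |B|). Candidates whose bit counts already rule them out are skipped without computing an intersection. All names and data are hypothetical.

```python
def tanimoto(a, b):
    """Jaccard-Tanimoto similarity of two sets of 'on' bit positions."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def threshold_search(query, database, threshold):
    """Return (hits, evaluated): hits above threshold and number of full comparisons.

    Assumes threshold > 0.
    """
    q = len(query)
    hits, evaluated = [], 0
    for mol_id, fingerprint in database.items():
        n = len(fingerprint)
        largest = max(q, n)
        # Size bound: similarity <= min(q, n) / max(q, n); skip if that is too low.
        if largest == 0 or min(q, n) / largest < threshold:
            continue
        evaluated += 1
        similarity = tanimoto(query, fingerprint)
        if similarity >= threshold:
            hits.append((mol_id, similarity))
    return hits, evaluated


# Toy database of 'on' bit positions (hypothetical values).
database = {
    "a": {1, 2, 3, 4, 5},     # similarity 4/5 to the query
    "b": {1, 2},              # pruned: bound 2/4 is below 0.7
    "c": {1, 2, 3, 9},        # evaluated, similarity 3/5, rejected
    "d": set(range(1, 11)),   # pruned: bound 4/10 is below 0.7
}
query = {1, 2, 3, 4}
hits, evaluated = threshold_search(query, database, 0.7)
print(hits, evaluated)
```

Only two of the four candidates reach the full comparison here; in a real system the database would typically be bucketed by bit count so that pruned candidates are never even iterated over.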
Affiliation(s)
- Ramzi Nasr
- Departments of Computer Science, University of California, Irvine, Irvine, California 92697-3435, United States
14
Prescriptions of traditional Chinese medicine are specific to cancer types and adjustable to temperature changes. PLoS One 2012; 7:e31648. [PMID: 22359613 PMCID: PMC3280982 DOI: 10.1371/journal.pone.0031648] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.0] [Received: 09/20/2011] [Accepted: 01/10/2012] [Indexed: 02/05/2023]
Abstract
Targeted cancer therapies, with specific molecular targets, ameliorate the side effect issue of radiation and chemotherapy and also point to the development of personalized medicine. Combination of drugs targeting multiple pathways of carcinogenesis is potentially more fruitful. Traditional Chinese medicine (TCM) has been tailoring herbal mixtures for individualized healthcare for two thousand years. A systematic study of the patterns of TCM formulas and herbs prescribed to cancers is valuable. We analysed a total of 187,230 TCM prescriptions to 30 types of cancer in Taiwan in 2007, a year's worth of collection from the National Health Insurance reimbursement database (Taiwan). We found that a TCM cancer prescription consists on average of two formulas and four herbs. We show that the percentage weights of TCM formulas and herbs in a TCM prescription follow Zipf's law with an exponent around 0.6. TCM prescriptions to benign neoplasms have a larger Zipf's exponent than those to malignant cancers. Furthermore, we show that TCM prescriptions, via weighted combination of formulas and herbs, are specific to not only the malignancy of neoplasms but also the sites of origins of malignant cancers. From the effects of formulas and natures of herbs that were heavily prescribed to cancers, that cancers are a ‘warm and stagnant’ syndrome in TCM can be proposed, suggesting anti-inflammatory regimens for better prevention and treatment of cancers. We show that TCM incorporated relevant formulas to the prescriptions to cancer patients with a secondary morbidity. We compared TCM prescriptions made in different seasons and identified temperatures as the environmental factor that correlates with changes in TCM prescriptions in Taiwan. Lung cancer patients were among the patients whose prescriptions were adjusted when temperatures drop. The findings of our study provide insight to TCM cancer treatment, helping dialogue between modern western medicine and TCM for better cancer care.
15
Swamidass SJ, Calhoun BT, Bittker JA, Bodycombe NE, Clemons PA. Utility-aware screening with clique-oriented prioritization. J Chem Inf Model 2011; 52:29-37. [PMID: 22117901 DOI: 10.1021/ci2003285] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Indexed: 11/29/2022]
Abstract
Most methods of deciding which hits from a screen to send for confirmatory testing assume that all confirmed actives are equally valuable and aim only to maximize the number of confirmed hits. In contrast, "utility-aware" methods are informed by models of screeners' preferences and can increase the rate at which the useful information is discovered. Clique-oriented prioritization (COP) extends a recently proposed economic framework and aims, by changing which hits are sent for confirmatory testing, to maximize the number of scaffolds with at least two confirmed active examples. In both retrospective and prospective experiments, COP enables accurate predictions of the number of clique discoveries in a batch of confirmatory experiments and improves the rate of clique discovery by more than 3-fold. In contrast, other similarity-based methods like ontology-based pattern identification (OPI) and local hit-rate analysis (LHR) reduce the rate of scaffold discovery by about half. The utility-aware algorithm used to implement COP is general enough to implement several other important models of screener preferences.
Affiliation(s)
- S Joshua Swamidass
- Division of Laboratory and Genomic Medicine, Department of Pathology and Immunology, Washington University School of Medicine, St. Louis, Missouri, USA.
16
Krein MP, Sukumar N. Exploration of the Topology of Chemical Spaces with Network Measures. J Phys Chem A 2011; 115:12905-18. [DOI: 10.1021/jp204022u] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.4] [Indexed: 01/17/2023]
Affiliation(s)
- Michael P. Krein
- Rensselaer Exploratory Center for Cheminformatics Research, and Department of Chemistry & Chemical Biology, Rensselaer Polytechnic Institute, 110 Eighth Street, Troy, New York 12180, United States
- N. Sukumar
- Rensselaer Exploratory Center for Cheminformatics Research, and Department of Chemistry & Chemical Biology, Rensselaer Polytechnic Institute, 110 Eighth Street, Troy, New York 12180, United States
17
Multiple search methods for similarity-based virtual screening: analysis of search overlap and precision. J Cheminform 2011; 3:29. [PMID: 21824430 PMCID: PMC3195112 DOI: 10.1186/1758-2946-3-29] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.8] [Received: 06/29/2011] [Accepted: 08/08/2011] [Indexed: 11/28/2022]
Abstract
Background Data fusion methods are widely used in virtual screening, and make the implicit assumption that the more often a molecule is retrieved in multiple similarity searches, the more likely it is to be active. This paper tests the correctness of this assumption. Results Sets of 25 searches using either the same reference structure and 25 different similarity measures (similarity fusion) or 25 different reference structures and the same similarity measure (group fusion) show that large numbers of unique molecules are retrieved by just a single search, but that the numbers of unique molecules decrease very rapidly as more searches are considered. This rapid decrease is accompanied by a rapid increase in the fraction of those retrieved molecules that are active. There is an approximately log-log relationship between the numbers of different molecules retrieved and the number of searches carried out, and a rationale for this power-law behaviour is provided. Conclusions Using multiple searches provides a simple way of increasing the precision of a similarity search, and thus provides a justification for the use of data fusion methods in virtual screening.
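The abstract above reports an approximately log-log (power-law) relationship between the number of molecules retrieved and the number of searches. A minimal Python sketch of how such a relationship is commonly checked follows: fit a straight line to the logarithms of both quantities by ordinary least squares, so that the slope is the power-law exponent. The counts below are synthetic values generated from an exact power law purely for illustration; they are not data from the paper, and the function name is hypothetical.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares fit of y = c * x**k on log-log axes; returns (k, c)."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    s_xx = sum((x - mean_x) ** 2 for x in log_x)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(log_x, log_y))
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, math.exp(intercept)


# Synthetic example: number of searches retrieving a molecule vs. molecule count.
searches = [1, 2, 4, 8, 16]
molecules = [1000.0 * s ** -1.5 for s in searches]  # exact power law by construction

exponent, prefactor = fit_power_law(searches, molecules)
print(round(exponent, 3), round(prefactor, 1))
```

With real counts the points would scatter around the line, and the quality of the fit (for example the residuals on the log scale) would indicate how well the power-law description holds.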
18
Wang L, Li X, Zhang YQ, Zhang Y, Zhang K. Evolution of scaling emergence in large-scale spatial epidemic spreading. PLoS One 2011; 6:e21197. [PMID: 21747932 PMCID: PMC3128583 DOI: 10.1371/journal.pone.0021197] [Citation(s) in RCA: 61] [Impact Index Per Article: 4.7] [Received: 01/05/2011] [Accepted: 05/22/2011] [Indexed: 12/01/2022]
Abstract
Background Zipf's law and Heaps' law are two representatives of the scaling concepts, which play a significant role in the study of complexity science. The coexistence of the Zipf's law and the Heaps' law motivates different understandings on the dependence between these two scalings, which has still hardly been clarified. Methodology/Principal Findings In this article, we observe an evolution process of the scalings: the Zipf's law and the Heaps' law are naturally shaped to coexist at the initial time, while the crossover comes with the emergence of their inconsistency at the larger time before reaching a stable state, where the Heaps' law still exists with the disappearance of strict Zipf's law. Such findings are illustrated with a scenario of large-scale spatial epidemic spreading, and the empirical results of pandemic disease support a universal analysis of the relation between the two laws regardless of the biological details of disease. Employing the United States domestic air transportation and demographic data to construct a metapopulation model for simulating the pandemic spread at the U.S. country level, we uncover that the broad heterogeneity of the infrastructure plays a key role in the evolution of scaling emergence. Conclusions/Significance The analyses of large-scale spatial epidemic spreading help understand the temporal evolution of scalings, indicating the coexistence of the Zipf's law and the Heaps' law depends on the collective dynamics of epidemic processes, and the heterogeneity of epidemic spread indicates the significance of performing targeted containment strategies at the early time of a pandemic disease.
Affiliation(s)
- Lin Wang
- Adaptive Networks and Control Lab, Department of Electronic Engineering, Fudan University, Shanghai, People's Republic of China
- Xiang Li
- Adaptive Networks and Control Lab, Department of Electronic Engineering, Fudan University, Shanghai, People's Republic of China
- Yi-Qing Zhang
- Adaptive Networks and Control Lab, Department of Electronic Engineering, Fudan University, Shanghai, People's Republic of China
- Yan Zhang
- Adaptive Networks and Control Lab, Department of Electronic Engineering, Fudan University, Shanghai, People's Republic of China
- Kan Zhang
- Adaptive Networks and Control Lab, Department of Electronic Engineering, Fudan University, Shanghai, People's Republic of China
19
Swamidass SJ, Calhoun BT, Bittker JA, Bodycombe NE, Clemons PA. Enhancing the rate of scaffold discovery with diversity-oriented prioritization. Bioinformatics 2011; 27:2271-8. [PMID: 21685049 DOI: 10.1093/bioinformatics/btr369] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.7] [Indexed: 11/14/2022]
Abstract
MOTIVATION: In high-throughput screens (HTS) of small molecules for activity in an in vitro assay, it is common to search for active scaffolds, with at least one example successfully confirmed as an active. The number of active scaffolds better reflects the success of the screen than the number of active molecules. Many existing algorithms for deciding which hits should be sent for confirmatory testing neglect this concern. RESULTS: We derived a new extension of a recently proposed economic framework, diversity-oriented prioritization (DOP), that aims, by changing which hits are sent for confirmatory testing, to maximize the number of scaffolds with at least one confirmed active. In both retrospective and prospective experiments, DOP accurately predicted the number of scaffold discoveries in a batch of confirmatory experiments, improved the rate of scaffold discovery by 8-17%, and was surprisingly robust to the size of the confirmatory test batches. As an extension of our previously reported economic framework, DOP can be used to decide the optimal number of hits to send for confirmatory testing by iteratively computing the cost of discovering an additional scaffold, the marginal cost of discovery. CONTACT: swamidass@wustl.edu SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- S Joshua Swamidass
- Division of Laboratory and Genomic Medicine, Department of Pathology and Immunology, Washington University School of Medicine, St Louis, MO, USA.
20
Andronico A, Randall A, Benz RW, Baldi P. Data-driven high-throughput prediction of the 3-D structure of small molecules: review and progress. J Chem Inf Model 2011; 51:760-76. [PMID: 21417267 DOI: 10.1021/ci100223t] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.8] [Indexed: 02/05/2023]
Abstract
Accurate prediction of the 3-D structure of small molecules is essential in order to understand their physical, chemical, and biological properties, including how they interact with other molecules. Here, we survey the field of high-throughput methods for 3-D structure prediction and set up new target specifications for the next generation of methods. We then introduce COSMOS, a novel data-driven prediction method that utilizes libraries of fragment and torsion angle parameters. We illustrate COSMOS using parameters extracted from the Cambridge Structural Database (CSD) by analyzing their distribution and then evaluating the system's performance in terms of speed, coverage, and accuracy. Results show that COSMOS represents a significant improvement when compared to state-of-the-art prediction methods, particularly in terms of coverage of complex molecular structures, including metal-organics. COSMOS can predict structures for 96.4% of the molecules in the CSD (99.6% organic, 94.6% metal-organic), whereas the widely used commercial method CORINA predicts structures for 68.5% (98.5% organic, 51.6% metal-organic). On the common subset of molecules predicted by both methods, COSMOS makes predictions with an average speed per molecule of 0.15 s (0.10 s organic, 0.21 s metal-organic) and an average rmsd of 1.57 Å (1.26 Å organic, 1.90 Å metal-organic), and CORINA makes predictions with an average speed per molecule of 0.13 s (0.18 s organic, 0.08 s metal-organic) and an average rmsd of 1.60 Å (1.13 Å organic, 2.11 Å metal-organic). COSMOS is available through the ChemDB chemoinformatics Web portal at http://cdb.ics.uci.edu/.
Affiliation(s)
- Alessio Andronico
- School of Information and Computer Sciences, Institute for Genomics and Bioinformatics, University of California, Irvine, Irvine, California 92697-3435, USA
21
Nasr R, Hirschberg DS, Baldi P. Hashing algorithms and data structures for rapid searches of fingerprint vectors. J Chem Inf Model 2010; 50:1358-68. [PMID: 20681581 DOI: 10.1021/ci100132g]
Abstract
In many large chemoinformatics database systems, molecules are represented by long binary fingerprint vectors whose components record the presence or absence of particular functional groups or combinatorial features. To speed up database searches, we propose to add to each fingerprint a short signature integer vector of length M. For a given fingerprint, the i-th component of the signature vector counts the number of 1-bits in the fingerprint that fall on components congruent to i modulo M. Given two signatures, we show how one can rapidly compute a bound on the Jaccard-Tanimoto similarity measure of the two corresponding fingerprints, using the intersection bound. Thus, these signatures allow one to significantly prune the search space by discarding molecules associated with unfavorable bounds. Analytical methods are developed to predict the resulting amount of pruning as a function of M. Data structures combining different values of M are also developed together with methods for predicting the optimal values of M for a given implementation. Simulations using a particular implementation show that the proposed approach leads to a 1 order of magnitude speedup over a linear search and a 3-fold speedup over a previous implementation. All theoretical results and predictions are corroborated by large-scale simulations using molecules from the ChemDB. Several possible algorithmic extensions are discussed.
Affiliation(s)
- Ramzi Nasr
- School of Information and Computer Sciences, Institute for Genomics and Bioinformatics, University of California, Irvine, Irvine, California 92697-3435, USA
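The signature scheme summarized in the abstract above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`signature`, `tanimoto_upper_bound`, `search`), the representation of a fingerprint as a set of on-bit positions, and the default `m=8` are all assumptions made for illustration. The bound used here follows from one observation: within each residue class modulo M, the two fingerprints cannot share more 1-bits than the smaller of the two counts, and the Tanimoto ratio grows with the size of the intersection.

```python
from typing import Dict, Iterable, List, Set, Tuple


def signature(on_bits: Iterable[int], m: int) -> List[int]:
    """Component i counts the 1-bits whose position is congruent to i modulo m."""
    sig = [0] * m
    for pos in on_bits:
        sig[pos % m] += 1
    return sig


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Exact Jaccard-Tanimoto similarity of two fingerprints given as on-bit sets."""
    inter = len(a & b)
    union = len(a) + len(b) - inter
    return inter / union if union else 1.0


def tanimoto_upper_bound(sig_a: List[int], sig_b: List[int]) -> float:
    """Upper bound on the similarity computed from the signatures alone."""
    # Shared bits in residue class i cannot exceed min(sig_a[i], sig_b[i]).
    cap = sum(min(x, y) for x, y in zip(sig_a, sig_b))
    denom = sum(sig_a) + sum(sig_b) - cap
    return cap / denom if denom else 1.0


def search(query: Set[int], database: Dict[str, Set[int]],
           threshold: float, m: int = 8) -> List[Tuple[str, float]]:
    """Return entries with similarity >= threshold, skipping pruned candidates."""
    q_sig = signature(query, m)
    hits = []
    for name, fp in database.items():
        # A real system would store each signature once rather than recompute it.
        if tanimoto_upper_bound(q_sig, signature(fp, m)) < threshold:
            continue  # the bound rules this molecule out without a full comparison
        score = tanimoto(query, fp)
        if score >= threshold:
            hits.append((name, score))
    return hits
```

Because the bound never underestimates the true similarity, pruning on it cannot discard a molecule that would have passed the threshold; larger values of `m` tighten the bound at the cost of longer signatures.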
22
Lü L, Zhang ZK, Zhou T. Zipf's law leads to Heaps' law: analyzing their relation in finite-size systems. PLoS One 2010; 5:e14139. [PMID: 21152034 PMCID: PMC2996287 DOI: 10.1371/journal.pone.0014139] [Received: 02/21/2010] [Accepted: 10/20/2010]
Abstract
Background: Zipf's law and Heaps' law are observed in disparate complex systems. Of particular interest, these two laws often appear together. Many theoretical models and analyses have been developed to understand their co-occurrence in real systems, but a clear picture of their relation is still lacking. Methodology/Principal Findings: We show that Heaps' law can be considered a derivative phenomenon if the system obeys Zipf's law. Furthermore, we refine the known approximate solution of the Heaps' exponent given the Zipf's exponent. We show that the approximate solution is indeed an asymptotic solution for infinite systems, while in a finite-size system the Heaps' exponent is sensitive to the system size. Extensive empirical analysis on tens of disparate systems demonstrates that our refined results can better capture the relation between the Zipf's and Heaps' exponents. Conclusions/Significance: The present analysis provides a clear picture of the relation between Zipf's law and Heaps' law without the help of any specific stochastic model; namely, Heaps' law is indeed a derivative phenomenon of Zipf's law. The presented numerical method gives a considerably better estimate of the Heaps' exponent given the Zipf's exponent and the system size. Our analysis provides some insights into, and implications for, real complex systems. For example, one can naturally obtain a better explanation of the accelerated growth of scale-free networks.
Affiliation(s)
- Linyuan Lü
- Web Sciences Center, University of Electronic Science and Technology of China, Chengdu, People's Republic of China
- Department of Physics, University of Fribourg, Fribourg, Switzerland
- Zi-Ke Zhang
- Department of Physics, University of Fribourg, Fribourg, Switzerland
- Tao Zhou
- Web Sciences Center, University of Electronic Science and Technology of China, Chengdu, People's Republic of China
- Department of Physics, University of Fribourg, Fribourg, Switzerland
- Department of Modern Physics, University of Science and Technology of China, Hefei, People's Republic of China
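The relation described in the abstract above, where the number of distinct items grows sublinearly when items are drawn with Zipf-distributed frequencies, can be observed in a small simulation. The sketch below is only an illustration and is not the authors' numerical method: the function names, the finite vocabulary, and the two-point log-log slope used as a crude estimate of the Heaps' exponent are all assumptions.

```python
import math
import random
from itertools import accumulate
from typing import List


def heaps_curve(alpha: float, vocab_size: int, n_tokens: int, seed: int = 0) -> List[int]:
    """Draw tokens with probability proportional to rank**(-alpha) and
    return the number of distinct tokens seen after each draw."""
    rng = random.Random(seed)
    # Zipf weights over a finite vocabulary: the item of rank r has weight r**(-alpha).
    cum_weights = list(accumulate(r ** -alpha for r in range(1, vocab_size + 1)))
    tokens = rng.choices(range(vocab_size), cum_weights=cum_weights, k=n_tokens)
    seen = set()
    curve = []
    for tok in tokens:
        seen.add(tok)
        curve.append(len(seen))
    return curve


def loglog_slope(curve: List[int], t1: int, t2: int) -> float:
    """Crude Heaps' exponent estimate: slope of log N(t) against log t between t1 and t2."""
    return math.log(curve[t2 - 1] / curve[t1 - 1]) / math.log(t2 / t1)


if __name__ == "__main__":
    curve = heaps_curve(alpha=1.2, vocab_size=50_000, n_tokens=100_000)
    print("distinct items after all draws:", curve[-1])
    print("estimated Heaps' exponent:", loglog_slope(curve, 1_000, 100_000))
```

Rerunning the script with different values of `n_tokens` or `vocab_size` gives a hands-on view of how the estimated exponent shifts with system size, which is the finite-size sensitivity the abstract refers to.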
23
Tanaka N, Ohno K, Niimi T, Moritomo A, Mori K, Orita M. Small-World Phenomena in Chemical Library Networks: Application to Fragment-Based Drug Discovery. J Chem Inf Model 2009; 49:2677-86. [DOI: 10.1021/ci900123v]
Affiliation(s)
- Naoki Tanaka
- Chemistry Research Laboratories, Drug Discovery Research, Astellas Pharma Inc., 21 Miyukigaoka, Tsukuba-shi, Ibaraki 305-8585, Japan
- Kazuki Ohno
- Chemistry Research Laboratories, Drug Discovery Research, Astellas Pharma Inc., 21 Miyukigaoka, Tsukuba-shi, Ibaraki 305-8585, Japan
- Tatsuya Niimi
- Chemistry Research Laboratories, Drug Discovery Research, Astellas Pharma Inc., 21 Miyukigaoka, Tsukuba-shi, Ibaraki 305-8585, Japan
- Ayako Moritomo
- Chemistry Research Laboratories, Drug Discovery Research, Astellas Pharma Inc., 21 Miyukigaoka, Tsukuba-shi, Ibaraki 305-8585, Japan
- Kenichi Mori
- Chemistry Research Laboratories, Drug Discovery Research, Astellas Pharma Inc., 21 Miyukigaoka, Tsukuba-shi, Ibaraki 305-8585, Japan
- Masaya Orita
- Chemistry Research Laboratories, Drug Discovery Research, Astellas Pharma Inc., 21 Miyukigaoka, Tsukuba-shi, Ibaraki 305-8585, Japan
24
Abstract
Small aromatic ring systems are of central importance in the development of novel synthetic protein ligands. Here we generate a complete list of 24,847 such ring systems. We call this list and associated annotations VEHICLe, which stands for virtual exploratory heterocyclic library. Searches of literature and compound databases, using this list as substructure queries, identified only 1701 as synthesized. Using a carefully validated machine learning approach, we were able to estimate that the number of unpublished, but synthetically tractable, VEHICLe rings could be over 3000. However, analysis also shows that the rate of publication of novel examples to be as low as 5-10 per year. With this work, we aim to provide fresh stimulus to creative organic chemists by highlighting a small set of apparently simple ring systems that are predicted to be tractable but are, to the best of our knowledge, unconquered.
Affiliation(s)
- William R Pitt
- UCB Celltech, Granta Park, Great Abington, Cambridge CB15 6GS, United Kingdom.