Reference Citation Analysis: Find an Article, Find a Category, Find a Journal, Find a Scholar

For: Tetko IV, M. Lowe D, Williams AJ. The development of models to predict melting and pyrolysis point data associated with several hundred thousand compounds mined from PATENTS. J Cheminform 2016;8:2. [PMID: 26807157 PMCID: PMC4724158 DOI: 10.1186/s13321-016-0113-y] [Citation(s) in RCA: 50] [Impact Index Per Article: 6.3] [Reference Citation Analysis] [What about the content of this article? (0)] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/31/2015] [Accepted: 01/08/2016] [Indexed: 11/18/2022] Open

For:	Tetko IV, M. Lowe D, Williams AJ. The development of models to predict melting and pyrolysis point data associated with several hundred thousand compounds mined from PATENTS. J Cheminform 2016;8:2. [PMID: 26807157 PMCID: PMC4724158 DOI: 10.1186/s13321-016-0113-y] [Citation(s) in RCA: 50] [Impact Index Per Article: 6.3] [Reference Citation Analysis] [What about the content of this article? (0)] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/31/2015] [Accepted: 01/08/2016] [Indexed: 11/18/2022] Open

Number

Cited by Other Article(s)

Liang T, Liu W, Tan K, Wu A, Lu X. Advancing Ionic Liquid Research with pSCNN: A Novel Approach for Accurate Normal Melting Temperature Predictions. ACS OMEGA 2024;9:31694-31702. [PMID: 39072063 PMCID: PMC11270577 DOI: 10.1021/acsomega.4c02393] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 03/11/2024] [Revised: 04/12/2024] [Accepted: 06/25/2024] [Indexed: 07/30/2024]

Machine learning models for phase transition and decomposition temperature of ionic liquids. J Mol Liq 2022. [DOI: 10.1016/j.molliq.2022.120247] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/21/2023]

Ksenofontov AA, Lukanov MM, Bocharov PS. Can machine learning methods accurately predict the molar absorption coefficient of different classes of dyes? SPECTROCHIMICA ACTA. PART A, MOLECULAR AND BIOMOLECULAR SPECTROSCOPY 2022;279:121442. [PMID: 35660154 DOI: 10.1016/j.saa.2022.121442] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 04/07/2022] [Revised: 05/25/2022] [Accepted: 05/26/2022] [Indexed: 06/15/2023]

Parastar H, Tauler R. Big (Bio)Chemical Data Mining Using Chemometric Methods: A Need for Chemists. Angew Chem Int Ed Engl 2022. [DOI: 10.1002/ange.201801134] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/09/2022]

Tinkov OV, Grigorev VY, Grigoreva LD, Osipov VN, Kolotaev AV, Khachatryan DS. QSAR analysis and experimental evaluation of new quinazoline-containing hydroxamic acids as histone deacetylase 6 inhibitors. SAR AND QSAR IN ENVIRONMENTAL RESEARCH 2022;33:513-532. [PMID: 35786151 DOI: 10.1080/1062936x.2022.2092210] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 05/02/2022] [Accepted: 06/14/2022] [Indexed: 06/15/2023]

Bujak M, Podsiadło M, Katrusiak A. Response to comment on Properties and interactions - melting point of tribromobenzene isomers. ACTA CRYSTALLOGRAPHICA SECTION B, STRUCTURAL SCIENCE, CRYSTAL ENGINEERING AND MATERIALS 2022;78:276-278. [PMID: 35411867 DOI: 10.1107/s2052520622003067] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/14/2023]

What Features of Ligands Are Relevant to the Opening of Cryptic Pockets in Drug Targets? INFORMATICS 2022. [DOI: 10.3390/informatics9010008] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/17/2022] Open

Shin HK. Topological Distance-Based Electron Interaction Tensor to Apply a Convolutional Neural Network on Drug-like Compounds. ACS OMEGA 2021;6:35757-35768. [PMID: 34984306 PMCID: PMC8717557 DOI: 10.1021/acsomega.1c05693] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 10/12/2021] [Accepted: 12/08/2021] [Indexed: 05/15/2023]

Makarov D, Fadeeva Y, Shmukler L, Tetko I. Beware of proper validation of models for ionic Liquids! J Mol Liq 2021. [DOI: 10.1016/j.molliq.2021.117722] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/03/2023]

On prediction of melting points without computer simulation: a focus on energetic molecular crystals. FIREPHYSCHEM 2021. [DOI: 10.1016/j.fpc.2021.11.001] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022] Open

Ghosh D, Koch U, Hadian K, Sattler M, Tetko IV. Highly Accurate Filters to Flag Frequent Hitters in AlphaScreen Assays by Suggesting their Mechanism. Mol Inform 2021;41:e2100151. [PMID: 34676998 DOI: 10.1002/minf.202100151] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/05/2021] [Accepted: 09/29/2021] [Indexed: 11/06/2022]

Tinkov OV, Grigorev VY, Grigoreva LD. Prediction of an Organic Compound’s Biotransformation Time: A Study Using Avermectins. MOSCOW UNIVERSITY CHEMISTRY BULLETIN 2021. [PMCID: PMC8382113 DOI: 10.3103/s0027131421040088] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 11/30/2022]

Tinkov OV, Grigorev VY, Grigoreva LD. QSAR analysis of the acute toxicity of avermectins towards Tetrahymena pyriformis. SAR AND QSAR IN ENVIRONMENTAL RESEARCH 2021;32:541-571. [PMID: 34157880 DOI: 10.1080/1062936x.2021.1932583] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 03/24/2021] [Accepted: 05/17/2021] [Indexed: 06/13/2023]

Xiang Y, Tang YH, Liu H, Lin G, Sun H. Predicting Single-Substance Phase Diagrams: A Kernel Approach on Graph Representations of Molecules. J Phys Chem A 2021;125:4488-4497. [PMID: 33999627 DOI: 10.1021/acs.jpca.1c02391] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/28/2023]

Mi W, Chen H, Zhu DA, Zhang T, Qian F. Melting point prediction of organic molecules by deciphering the chemical structure into a natural language. Chem Commun (Camb) 2021;57:2633-2636. [PMID: 33587048 DOI: 10.1039/d0cc07384a] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/21/2022]

Lai J, Li X, Wang Y, Yin S, Zhou J, Liu Z. AIScaffold: A Web-Based Tool for Scaffold Diversification Using Deep Learning. J Chem Inf Model 2020;61:1-6. [PMID: 33356237 DOI: 10.1021/acs.jcim.0c00867] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/30/2022]

Tinkov O, Polishchuk P, Matveieva M, Grigorev V, Grigoreva L, Porozov Y. The Influence of Structural Patterns on Acute Aquatic Toxicity of Organic Compounds. Mol Inform 2020;40:e2000209. [PMID: 33029954 DOI: 10.1002/minf.202000209] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/14/2020] [Accepted: 10/01/2020] [Indexed: 12/28/2022]

Mohammed AI, Ahmed AM, Bhadbhade MM, Ho J, Read RW. Sugar-substituted fluorous 1,2,3-triazoles: Helical twists in fluoroalkyl chains and their molecular association in the solid state and correlations with physicochemical properties. J Fluor Chem 2020. [DOI: 10.1016/j.jfluchem.2020.109536] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/24/2022]

Sivaraman G, Jackson NE, Sanchez-Lengeling B, Vázquez-Mayagoitia Á, Aspuru-Guzik A, Vishwanath V, de Pablo JJ. A machine learning workflow for molecular analysis: application to melting points. MACHINE LEARNING: SCIENCE AND TECHNOLOGY 2020. [DOI: 10.1088/2632-2153/ab8aa3] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/17/2022] Open

Cui X, Yang R, Li S, Liu J, Wu Q, Li X. Modeling and insights into molecular basis of low molecular weight respiratory sensitizers. Mol Divers 2020;25:847-859. [PMID: 32166484 DOI: 10.1007/s11030-020-10069-3] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/25/2019] [Accepted: 03/03/2020] [Indexed: 01/10/2023]

Shin HK. Electron configuration-based neural network model to predict physicochemical properties of inorganic compounds. RSC Adv 2020;10:33268-33278. [PMID: 35515036 PMCID: PMC9056678 DOI: 10.1039/d0ra05873d] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/06/2020] [Accepted: 09/01/2020] [Indexed: 11/21/2022] Open

Modeling Physico-Chemical ADMET Endpoints with Multitask Graph Convolutional Networks. Molecules 2019;25:molecules25010044. [PMID: 31877719 PMCID: PMC6982787 DOI: 10.3390/molecules25010044] [Citation(s) in RCA: 45] [Impact Index Per Article: 9.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/01/2019] [Revised: 12/19/2019] [Accepted: 12/20/2019] [Indexed: 11/19/2022] Open

Tarasova OA, Biziukova NY, Filimonov DA, Poroikov VV, Nicklaus MC. Data Mining Approach for Extraction of Useful Information About Biologically Active Compounds from Publications. J Chem Inf Model 2019;59:3635-3644. [PMID: 31453694 DOI: 10.1021/acs.jcim.9b00164] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/07/2023]

Dalavitsou A, Vasiliadis A, Mordos MD, Kouskoura MG, Markopoulou CK. Analytes’ Structure and Signal Response in Evaporating Light Scattering Detector (ELSD). CURR ANAL CHEM 2019. [DOI: 10.2174/1573411014666180330161557] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/22/2022]

Marcou G, Flamme B, Beck G, Chagnes A, Mokshyna O, Horvath D, Varnek A. In silico Design, Virtual Screening and Synthesis of Novel Electrolytic Solvents. Mol Inform 2019;38:e1900014. [DOI: 10.1002/minf.201900014] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/28/2019] [Accepted: 05/07/2019] [Indexed: 11/11/2022]

Cardoso‐Silva J, Papadatos G, Papageorgiou LG, Tsoka S. Optimal Piecewise Linear Regression Algorithm for QSAR Modelling. Mol Inform 2018;38:e1800028. [DOI: 10.1002/minf.201800028] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/06/2018] [Accepted: 08/02/2018] [Indexed: 12/20/2022]

Ghosh D, Koch U, Hadian K, Sattler M, Tetko IV. Luciferase Advisor: High-Accuracy Model To Flag False Positive Hits in Luciferase HTS Assays. J Chem Inf Model 2018;58:933-942. [PMID: 29667823 DOI: 10.1021/acs.jcim.7b00574] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/26/2023]

Bergström CAS, Larsson P. Computational prediction of drug solubility in water-based systems: Qualitative and quantitative approaches used in the current drug discovery and development setting. Int J Pharm 2018;540:185-193. [PMID: 29421301 PMCID: PMC5861307 DOI: 10.1016/j.ijpharm.2018.01.044] [Citation(s) in RCA: 108] [Impact Index Per Article: 18.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/10/2017] [Revised: 01/20/2018] [Accepted: 01/22/2018] [Indexed: 01/18/2023]

Tauler R, Parastar H. Big (Bio)Chemical Data Mining Using Chemometric Methods: A Need for Chemists. Angew Chem Int Ed Engl 2018;61:e201801134. [DOI: 10.1002/anie.201801134] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/27/2018] [Indexed: 11/08/2022]

Withnall M, Chen H, Tetko IV. Matched Molecular Pair Analysis on Large Melting Point Datasets: A Big Data Perspective. ChemMedChem 2018;13:599-606. [PMID: 28650584 PMCID: PMC5900986 DOI: 10.1002/cmdc.201700303] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/18/2017] [Revised: 06/26/2017] [Indexed: 11/11/2022]

Mansouri K, Grulke CM, Judson RS, Williams AJ. OPERA models for predicting physicochemical properties and environmental fate endpoints. J Cheminform 2018. [PMID: 29520515 PMCID: PMC5843579 DOI: 10.1186/s13321-018-0263-1] [Citation(s) in RCA: 267] [Impact Index Per Article: 44.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/24/2022] Open

Abstract

The collection of chemical structure information and associated experimental data for quantitative structure–activity/property relationship (QSAR/QSPR) modeling is facilitated by an increasing number of public databases containing large amounts of useful data. However, the performance of QSAR models highly depends on the quality of the data and modeling methodology used. This study aims to develop robust QSAR/QSPR models for chemical properties of environmental interest that can be used for regulatory purposes. This study primarily uses data from the publicly available PHYSPROP database consisting of a set of 13 common physicochemical and environmental fate properties. These datasets have undergone extensive curation using an automated workflow to select only high-quality data, and the chemical structures were standardized prior to calculation of the molecular descriptors. The modeling procedure was developed based on the five Organization for Economic Cooperation and Development (OECD) principles for QSAR models. A weighted k-nearest neighbor approach was adopted using a minimum number of required descriptors calculated using PaDEL, an open-source software. The genetic algorithms selected only the most pertinent and mechanistically interpretable descriptors (2–15, with an average of 11 descriptors). The sizes of the modeled datasets varied from 150 chemicals for biodegradability half-life to 14,050 chemicals for logP, with an average of 3222 chemicals across all endpoints. The optimal models were built on randomly selected training sets (75%) and validated using fivefold cross-validation (CV) and test sets (25%). The CV Q² of the models varied from 0.72 to 0.95, with an average of 0.86 and an R² test value from 0.71 to 0.96, with an average of 0.82. Modeling and performance details are described in QSAR model reporting format and were validated by the European Commission’s Joint Research Center to be OECD compliant. All models are freely available as an open-source, command-line application called OPEn structure–activity/property Relationship App (OPERA). OPERA models were applied to more than 750,000 chemicals to produce freely available predicted data on the U.S. Environmental Protection Agency’s CompTox Chemistry Dashboard.

Collapse

Tebes-Stevens C, Patel JM, Koopmans M, Olmstead J, Hilal SH, Pope N, Weber EJ, Wolfe K. Demonstration of a consensus approach for the calculation of physicochemical properties required for environmental fate assessments. CHEMOSPHERE 2018;194:94-106. [PMID: 29197820 PMCID: PMC6146973 DOI: 10.1016/j.chemosphere.2017.11.137] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 10/05/2017] [Revised: 11/21/2017] [Accepted: 11/22/2017] [Indexed: 05/21/2023]

Coley CW, Barzilay R, Green WH, Jaakkola TS, Jensen KF. Convolutional Embedding of Attributed Molecular Graphs for Physical Property Prediction. J Chem Inf Model 2017;57:1757-1772. [PMID: 28696688 DOI: 10.1021/acs.jcim.6b00601] [Citation(s) in RCA: 220] [Impact Index Per Article: 31.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/29/2023]

Zang Q, Mansouri K, Williams AJ, Judson RS, Allen DG, Casey WM, Kleinstreuer NC. In Silico Prediction of Physicochemical Properties of Environmental Chemicals Using Molecular Fingerprints and Machine Learning. J Chem Inf Model 2017;57:36-49. [PMID: 28006899 PMCID: PMC6131700 DOI: 10.1021/acs.jcim.6b00625] [Citation(s) in RCA: 77] [Impact Index Per Article: 11.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/20/2022]

Abstract

There are little available toxicity data on the vast majority of chemicals in commerce. High-throughput screening (HTS) studies, such as those being carried out by the U.S. Environmental Protection Agency (EPA) ToxCast program in partnership with the federal Tox21 research program, can generate biological data to inform models for predicting potential toxicity. However, physicochemical properties are also needed to model environmental fate and transport, as well as exposure potential. The purpose of the present study was to generate an open-source quantitative structure-property relationship (QSPR) workflow to predict a variety of physicochemical properties that would have cross-platform compatibility to integrate into existing cheminformatics workflows. In this effort, decades-old experimental property data sets available within the EPA EPI Suite were reanalyzed using modern cheminformatics workflows to develop updated QSPR models capable of supplying computationally efficient, open, and transparent HTS property predictions in support of environmental modeling efforts. Models were built using updated EPI Suite data sets for the prediction of six physicochemical properties: octanol-water partition coefficient (logP), water solubility (logS), boiling point (BP), melting point (MP), vapor pressure (logVP), and bioconcentration factor (logBCF). The coefficient of determination (R2) between the estimated values and experimental data for the six predicted properties ranged from 0.826 (MP) to 0.965 (BP), with model performance for five of the six properties exceeding those from the original EPI Suite models. The newly derived models can be employed for rapid estimation of physicochemical properties within an open-source HTS workflow to inform fate and toxicity prediction models of environmental chemicals.

Collapse

Mansouri K, Grulke CM, Richard AM, Judson RS, Williams AJ. An automated curation procedure for addressing chemical errors and inconsistencies in public datasets used in QSAR modelling. SAR AND QSAR IN ENVIRONMENTAL RESEARCH 2016;27:939-965. [PMID: 27885862 DOI: 10.1080/1062936x.2016.1253611] [Citation(s) in RCA: 81] [Impact Index Per Article: 10.1] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 09/03/2016] [Accepted: 10/24/2016] [Indexed: 05/18/2023]

Tetko IV, Maran U, Tropsha A. Public (Q)SAR Services, Integrated Modeling Environments, and Model Repositories on the Web: State of the Art and Perspectives for Future Development. Mol Inform 2016;36. [PMID: 27778468 DOI: 10.1002/minf.201600082] [Citation(s) in RCA: 29] [Impact Index Per Article: 3.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/08/2016] [Accepted: 10/03/2016] [Indexed: 01/08/2023]

Swain MC, Cole JM. ChemDataExtractor: A Toolkit for Automated Extraction of Chemical Information from the Scientific Literature. J Chem Inf Model 2016;56:1894-1904. [PMID: 27669338 DOI: 10.1021/acs.jcim.6b00207] [Citation(s) in RCA: 158] [Impact Index Per Article: 19.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/29/2022]

Abstract

The emergence of "big data" initiatives has led to the need for tools that can automatically extract valuable chemical information from large volumes of unstructured data, such as the scientific literature. Since chemical information can be present in figures, tables, and textual paragraphs, successful information extraction often depends on the ability to interpret all of these domains simultaneously. We present a complete toolkit for the automated extraction of chemical entities and their associated properties, measurements, and relationships from scientific documents that can be used to populate structured chemical databases. Our system provides an extensible, chemistry-aware, natural language processing pipeline for tokenization, part-of-speech tagging, named entity recognition, and phrase parsing. Within this scope, we report improved performance for chemical named entity recognition through the use of unsupervised word clustering based on a massive corpus of chemistry articles. For phrase parsing and information extraction, we present the novel use of multiple rule-based grammars that are tailored for interpreting specific document domains such as textual paragraphs, captions, and tables. We also describe document-level processing to resolve data interdependencies and show that this is particularly necessary for the autogeneration of chemical databases since captions and tables commonly contain chemical identifiers and references that are defined elsewhere in the text. The performance of the toolkit to correctly extract various types of data was evaluated, affording an F-score of 93.4%, 86.8%, and 91.5% for extracting chemical identifiers, spectroscopic attributes, and chemical property attributes, respectively; set against the CHEMDNER chemical name extraction challenge, ChemDataExtractor yields a competitive F-score of 87.8%. All tools have been released under the MIT license and are available to download from http://www.chemdataextractor.org .

Collapse

Does ‘Big Data’ exist in medicinal chemistry, and if so, how can it be harnessed? Future Med Chem 2016;8:1801-1806. [DOI: 10.4155/fmc-2016-0163] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/18/2023] Open

Ekins S. The Next Era: Deep Learning in Pharmaceutical Research. Pharm Res 2016;33:2594-603. [PMID: 27599991 DOI: 10.1007/s11095-016-2029-7] [Citation(s) in RCA: 99] [Impact Index Per Article: 12.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/10/2016] [Accepted: 08/23/2016] [Indexed: 01/22/2023]

Tetko IV, Engkvist O, Koch U, Reymond JL, Chen H. BIGCHEM: Challenges and Opportunities for Big Data Analysis in Chemistry. Mol Inform 2016;35:615-621. [PMID: 27464907 PMCID: PMC5129546 DOI: 10.1002/minf.201600073] [Citation(s) in RCA: 68] [Impact Index Per Article: 8.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/19/2016] [Accepted: 07/06/2016] [Indexed: 01/19/2023]

Salmina ES, Haider N, Tetko IV. Extended Functional Groups (EFG): An Efficient Set for Chemical Characterization and Structure-Activity Relationship Studies of Chemical Compounds. Molecules 2015;21:E1. [PMID: 26703557 PMCID: PMC6273096 DOI: 10.3390/molecules21010001] [Citation(s) in RCA: 31] [Impact Index Per Article: 3.4] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/29/2015] [Revised: 12/09/2015] [Accepted: 12/15/2015] [Indexed: 11/16/2022] Open