Reference Citation Analysis: Find an Article, Find a Category, Find a Journal, Find a Scholar

For: Krallinger M, Rabal O, Lourenço A, Oyarzabal J, Valencia A. Information Retrieval and Text Mining Technologies for Chemistry. Chem Rev 2017;117:7673-7761. [PMID: 28475312 DOI: 10.1021/acs.chemrev.6b00851] [Citation(s) in RCA: 111] [Impact Index Per Article: 13.9] [Reference Citation Analysis] [What about the content of this article? (0)] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/03/2023]

Cited by Other Article(s)

Schilling-Wilhelmi M, Ríos-García M, Shabih S, Gil MV, Miret S, Koch CT, Márquez JA, Jablonka KM. From text to insight: large language models for chemical data extraction. Chem Soc Rev 2025;54:1125-1150. [PMID: 39703015 DOI: 10.1039/d4cs00913d] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/21/2024]

Fu W, Qi M, Rong Y, Lin C, Guo W, Su B. Remote On-Paper Electrochemiluminescence-Based High-Safety and Multilevel Information Encryption. Angew Chem Int Ed Engl 2025;64:e202420184. [PMID: 39659206 DOI: 10.1002/anie.202420184] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/18/2024] [Revised: 11/27/2024] [Accepted: 12/11/2024] [Indexed: 12/12/2024]

Biziukova NY, Rudik AV, Dmitriev AV, Tarasova OA, Filimonov DA, Poroikov VV. XenoMet: A Corpus of Texts to Extract Data on Metabolites of Xenobiotics. ACS OMEGA 2025;10:2459-2471. [PMID: 39895765 PMCID: PMC11780559 DOI: 10.1021/acsomega.4c05723] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 06/19/2024] [Revised: 10/25/2024] [Accepted: 11/21/2024] [Indexed: 02/04/2025]

Abstract

Understanding the biotransformation of xenobiotics in the human body is critical for a comprehensive assessment of drug effects, since pharmacologically active drug metabolites may exhibit a range of biological effects that often differ from those of the original pharmaceutical agent. Studies of the biotransformation mechanisms of xenobiotics have resulted in numerous publications. Extracting information about the parent compounds (substrates) and their metabolites from the texts allows retrieval of information on their biological activities, molecular mechanisms of action, and toxicity. Manual curation of the names of xenobiotics, their metabolites, and biotransformation reactions in the text is a challenging task because of the large number of publications on the metabolism of pharmaceutical agents. Our aim is to create an annotated corpus of texts that can be used for automated extraction of the names of xenobiotics, including pharmaceutical agents that undergo biotransformation and their metabolites. Prior to manual annotation of the corpus, semiautomatic annotation was carried out using a previously developed rule-based method for extracting parent compounds and their metabolites. To create XenoMet, we automatically extracted relevant texts from PubMed using a query based on MeSH terms. The names of biotransformation reactions were recognized by using an in-house-developed dictionary. Then, we manually verified the extracted data by correcting errors in the named entity annotation and identified the associations between substrates and metabolites. We tested the applicability of XenoMet for the reconstruction of a metabolic tree and for the automated extraction of the chemical names of substrates, metabolites, and reactions of biotransformation. Classification of the named entities of metabolites, substrates, and biotransformation reactions by a conditional random fields approach using XenoMet as the training set provides an F1-score of 0.79.
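
To illustrate the kind of F1-score reported above, the following minimal Python sketch computes entity-level precision, recall, and F1 from gold and predicted annotations. It assumes exact-match scoring over (start, end, label) tuples; the function name, the tuple layout, and the toy spans are hypothetical, and the abstract does not state which matching criterion was used for XenoMet.

```python
def entity_f1(gold, predicted):
    """Exact-match entity-level precision, recall, and F1.

    Each annotation is a (start, end, label) tuple; this layout is an
    assumption made for the example, not the XenoMet format.
    """
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)  # annotations matching in span and label
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1


# Toy data (hypothetical): three gold entities, four predictions, two correct.
gold = [(0, 5, "SUBSTRATE"), (10, 15, "METABOLITE"), (20, 25, "REACTION")]
pred = [(0, 5, "SUBSTRATE"), (10, 15, "SUBSTRATE"),
        (20, 25, "REACTION"), (30, 35, "METABOLITE")]
precision, recall, f1 = entity_f1(gold, pred)
```

With these toy spans, two of four predictions are correct and two of three gold entities are found, so F1 is the harmonic mean of 1/2 and 2/3.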

Jin D, Liang Y, Xiong Z, Yang X, Wang H, Zeng J, Gu S. Application of Transformers to Chemical Synthesis. Molecules 2025;30:493. [PMID: 39942600 PMCID: PMC11821105 DOI: 10.3390/molecules30030493] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/08/2024] [Revised: 01/09/2025] [Accepted: 01/10/2025] [Indexed: 02/16/2025] Open

Kevlishvili I, St Michel RG, Garrison AG, Toney JW, Adamji H, Jia H, Román-Leshkov Y, Kulik HJ. Leveraging natural language processing to curate the tmCAT, tmPHOTO, tmBIO, and tmSCO datasets of functional transition metal complexes. Faraday Discuss 2025;256:275-303. [PMID: 39301698 DOI: 10.1039/d4fd00087k] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 09/22/2024]

Khalid A, Kaleem A, Qazi W, Abdullah R, Iqtedar M, Naz S. Site-specific prediction of O-GlcNAc modification in proteins using evolutionary scale model. PLoS One 2024;19:e0316215. [PMID: 39739642 DOI: 10.1371/journal.pone.0316215] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/22/2024] [Accepted: 12/07/2024] [Indexed: 01/02/2025] Open

Harigua-Souiai E, Masmoudi O, Makni S, Oualha R, Abdelkrim YZ, Hamdi S, Souiai O, Guizani I. cidalsDB: an AI-empowered platform for anti-pathogen therapeutics research. J Cheminform 2024;16:134. [PMID: 39609715 PMCID: PMC11605991 DOI: 10.1186/s13321-024-00929-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/24/2024] [Accepted: 11/11/2024] [Indexed: 11/30/2024] Open

Liang W, Su W, Zhong L, Yang Z, Li T, Liang Y, Ruan T, Jiang G. Comprehensive Characterization of Oxidative Stress-Modulating Chemicals Using GPT-Based Text Mining. ENVIRONMENTAL SCIENCE & TECHNOLOGY 2024;58:20540-20552. [PMID: 39513989 DOI: 10.1021/acs.est.4c07390] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/16/2024]

Affiliation(s)

Wenqing Liang State Key Laboratory of Environmental Chemistry and Ecotoxicology, Research Center for Eco-Environmental Sciences, Chinese Academy of Sciences, Beijing 100085, China; University of Chinese Academy of Sciences, Beijing 100049, China
Wenyuan Su State Key Laboratory of Environmental Chemistry and Ecotoxicology, Research Center for Eco-Environmental Sciences, Chinese Academy of Sciences, Beijing 100085, China; University of Chinese Academy of Sciences, Beijing 100049, China
Laijin Zhong State Key Laboratory of Environmental Chemistry and Ecotoxicology, Research Center for Eco-Environmental Sciences, Chinese Academy of Sciences, Beijing 100085, China; University of Chinese Academy of Sciences, Beijing 100049, China
Zhendong Yang State Key Laboratory of Environmental Chemistry and Ecotoxicology, Research Center for Eco-Environmental Sciences, Chinese Academy of Sciences, Beijing 100085, China; University of Chinese Academy of Sciences, Beijing 100049, China
Tingyu Li State Key Laboratory of Environmental Chemistry and Ecotoxicology, Research Center for Eco-Environmental Sciences, Chinese Academy of Sciences, Beijing 100085, China; University of Chinese Academy of Sciences, Beijing 100049, China
Yong Liang Hubei Key Laboratory of Environmental and Health Effects of Persistent Toxic Substances, School of Environment and Health, Jianghan University, Wuhan 430056, China
Ting Ruan State Key Laboratory of Environmental Chemistry and Ecotoxicology, Research Center for Eco-Environmental Sciences, Chinese Academy of Sciences, Beijing 100085, China; University of Chinese Academy of Sciences, Beijing 100049, China; Hubei Key Laboratory of Environmental and Health Effects of Persistent Toxic Substances, School of Environment and Health, Jianghan University, Wuhan 430056, China
Guibin Jiang State Key Laboratory of Environmental Chemistry and Ecotoxicology, Research Center for Eco-Environmental Sciences, Chinese Academy of Sciences, Beijing 100085, China; University of Chinese Academy of Sciences, Beijing 100049, China

Ai Q, Meng F, Shi J, Pelkie B, Coley CW. Extracting structured data from organic synthesis procedures using a fine-tuned large language model. DIGITAL DISCOVERY 2024;3:1822-1831. [PMID: 39157760 PMCID: PMC11322921 DOI: 10.1039/d4dd00091a] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 04/06/2024] [Accepted: 07/30/2024] [Indexed: 08/20/2024]

Huang Z, Li X, Li A, Yang Y, He L, Zhang Z, Wu S, Wang Y, Cai S, He Y, Liu X. MPNTEXT: An Interactive Platform for Automatically Extracting Metal-Polyphenol Networks and Their Applications from Scientific Literature. J Chem Inf Model 2024. [PMID: 39258795 DOI: 10.1021/acs.jcim.4c01093] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 09/12/2024]

Liu W, Chen J, Wang H, Fu Z, Peijnenburg WJGM, Hong H. Perspectives on Advancing Multimodal Learning in Environmental Science and Engineering Studies. ENVIRONMENTAL SCIENCE & TECHNOLOGY 2024. [PMID: 39226136 DOI: 10.1021/acs.est.4c03088] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 09/05/2024]

Ahmad F, Muhmood T. Clinical translation of nanomedicine with integrated digital medicine and machine learning interventions. Colloids Surf B Biointerfaces 2024;241:114041. [PMID: 38897022 DOI: 10.1016/j.colsurfb.2024.114041] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/01/2024] [Revised: 06/11/2024] [Accepted: 06/13/2024] [Indexed: 06/21/2024]

Chen C, Li SL, Xu YY, Liu J, Graham DW, Zhu YG. Characterising global antimicrobial resistance research explains why One Health solutions are slow in development: An application of AI-based gap analysis. ENVIRONMENT INTERNATIONAL 2024;187:108680. [PMID: 38723455 DOI: 10.1016/j.envint.2024.108680] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 03/31/2024] [Revised: 04/16/2024] [Accepted: 04/19/2024] [Indexed: 05/19/2024]

Blakey M, Pearman-Kanza S, Frey JG. Zombie cheminformatics: extraction and conversion of Wiswesser Line Notation (WLN) from chemical documents. J Cheminform 2024;16:42. [PMID: 38622746 PMCID: PMC11017645 DOI: 10.1186/s13321-024-00831-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/08/2023] [Accepted: 03/23/2024] [Indexed: 04/17/2024] Open

Arora S, Chettri S, Percha V, Kumar D, Latwal M. Artifical intelligence: a virtual chemist for natural product drug discovery. J Biomol Struct Dyn 2024;42:3826-3835. [PMID: 37232451 DOI: 10.1080/07391102.2023.2216295] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/16/2023] [Accepted: 05/12/2023] [Indexed: 05/27/2023]

Zhang X, Zhou Z, Ming C, Sun YY. GPT-Assisted Learning of Structure-Property Relationships by Graph Neural Networks: Application to Rare-Earth-Doped Phosphors. J Phys Chem Lett 2023;14:11342-11349. [PMID: 38064589 DOI: 10.1021/acs.jpclett.3c02848] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/22/2023]

Zhang K, Zhou X, Li S, Zhao L, Hu W, Cai A, Zeng Y, Wang Q, Wu M, Li G, Liu J, Ji H, Qin Y, Wu L. A General Strategy for Developing Ultrasensitive "Transistor-Like" Thermochromic Fluorescent Materials for Multilevel Information Encryption. ADVANCED MATERIALS (DEERFIELD BEACH, FLA.) 2023;35:e2305472. [PMID: 37437082 DOI: 10.1002/adma.202305472] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 06/07/2023] [Accepted: 07/10/2023] [Indexed: 07/14/2023]

Affiliation(s)

Ke Zhang Nantong Key Laboratory of Public Health and Medical Analysis, School of Public Health, Nantong University, Nantong, Jiangsu, 226019, China
Xiaobo Zhou Nantong Key Laboratory of Public Health and Medical Analysis, School of Public Health, Nantong University, Nantong, Jiangsu, 226019, China
Shijie Li Nantong Key Laboratory of Public Health and Medical Analysis, School of Public Health, Nantong University, Nantong, Jiangsu, 226019, China
Lingfeng Zhao Nantong Key Laboratory of Public Health and Medical Analysis, School of Public Health, Nantong University, Nantong, Jiangsu, 226019, China
Wenqi Hu Nantong Key Laboratory of Public Health and Medical Analysis, School of Public Health, Nantong University, Nantong, Jiangsu, 226019, China
Aiting Cai Nantong Key Laboratory of Public Health and Medical Analysis, School of Public Health, Nantong University, Nantong, Jiangsu, 226019, China
Yuhan Zeng Nantong Key Laboratory of Public Health and Medical Analysis, School of Public Health, Nantong University, Nantong, Jiangsu, 226019, China
Qi Wang Nantong Key Laboratory of Public Health and Medical Analysis, School of Public Health, Nantong University, Nantong, Jiangsu, 226019, China
Mingmin Wu Nantong Key Laboratory of Public Health and Medical Analysis, School of Public Health, Nantong University, Nantong, Jiangsu, 226019, China
Guo Li Nantong Key Laboratory of Public Health and Medical Analysis, School of Public Health, Nantong University, Nantong, Jiangsu, 226019, China
Jinxia Liu Nantong Key Laboratory of Public Health and Medical Analysis, School of Public Health, Nantong University, Nantong, Jiangsu, 226019, China
Haiwei Ji Nantong Key Laboratory of Public Health and Medical Analysis, School of Public Health, Nantong University, Nantong, Jiangsu, 226019, China
Yuling Qin Nantong Key Laboratory of Public Health and Medical Analysis, School of Public Health, Nantong University, Nantong, Jiangsu, 226019, China
Li Wu Nantong Key Laboratory of Public Health and Medical Analysis, School of Public Health, Nantong University, Nantong, Jiangsu, 226019, China

Li S, Zhang Y, Fang Z, Meng K, Tian R, He H, Sun S. Extracting the Synthetic Route of Pd-Based Catalysts in Methanol Steam Reforming from the Scientific Literature. J Chem Inf Model 2023;63:6249-6260. [PMID: 37807535 DOI: 10.1021/acs.jcim.3c01442] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/10/2023]

Kiseleva OI, Kurbatov IY, Arzumanian VA, Ilgisonis EV, Zakharov SV, Poverennaya EV. The Expectation and Reality of the HepG2 Core Metabolic Profile. Metabolites 2023;13:908. [PMID: 37623852 PMCID: PMC10456947 DOI: 10.3390/metabo13080908] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/25/2023] [Revised: 07/29/2023] [Accepted: 08/01/2023] [Indexed: 08/26/2023] Open

Xie W, Fan K, Zhang S, Li L. Multiple sampling schemes and deep learning improve active learning performance in drug-drug interaction information retrieval analysis from the literature. J Biomed Semantics 2023;14:5. [PMID: 37248476 PMCID: PMC10228061 DOI: 10.1186/s13326-023-00287-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/09/2022] [Accepted: 04/29/2023] [Indexed: 05/31/2023] Open

Abstract

BACKGROUND

Drug-drug interaction (DDI) information retrieval (IR) from the PubMed literature is an important natural language processing (NLP) task. For the first time, active learning (AL) is studied in DDI IR analysis. DDI IR analysis from PubMed abstracts faces the challenge of relatively few positive DDI samples among an overwhelmingly large number of negative samples. Random negative sampling and positive sampling are purposely designed to improve the efficiency of AL analysis. The consistency of random negative sampling and positive sampling is shown in the paper.

RESULTS

PubMed abstracts are divided into two pools. The screened pool contains all abstracts that pass the DDI keywords query in PubMed, while the unscreened pool includes all the other abstracts. At a prespecified recall rate of 0.95, DDI IR analysis precision is evaluated and compared. In the screened pool IR analysis using a support vector machine (SVM), similarity sampling plus uncertainty sampling improves the precision over uncertainty sampling alone, from 0.89 to 0.92. In the unscreened pool IR analysis, the integrated random negative sampling, positive sampling, and similarity sampling improve the precision over uncertainty sampling alone, from 0.72 to 0.81. When we change the SVM to a deep learning method, all sampling schemes consistently improve DDI AL analysis in both the screened pool and the unscreened pool. Deep learning significantly improves precision over SVM: 0.96 vs. 0.92 in the screened pool and 0.90 vs. 0.81 in the unscreened pool.

CONCLUSIONS

By integrating various sampling schemes and deep learning algorithms into AL, the DDI IR analysis from literature is significantly improved. The random negative sampling and positive sampling are highly effective methods in improving AL analysis where the positive and negative samples are extremely imbalanced.
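
The evaluation protocol described in the Results, precision measured at a prespecified recall of 0.95, can be sketched in a few lines of Python. This is a minimal illustration only: the function name, the ranking-by-score approach, and the toy scores are assumptions for this example and are not taken from the paper, and tied scores are not treated specially.

```python
def precision_at_recall(scores, labels, target_recall=0.95):
    """Precision at the smallest ranked cutoff whose recall reaches the target.

    scores: classifier scores, higher means more likely positive.
    labels: 1 for a positive (relevant) abstract, 0 for a negative one.
    """
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("at least one positive label is required")
    # Rank documents from the highest score to the lowest.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    tp = 0
    for k, (_, label) in enumerate(ranked, start=1):
        tp += label
        if tp / total_pos >= target_recall:
            return tp / k  # precision among the top-k retrieved documents
    return total_pos / len(ranked)  # only reached if target_recall > 1


# Toy data (hypothetical): five abstracts, three of them true DDI abstracts.
scores = [0.9, 0.8, 0.7, 0.6, 0.5]
labels = [1, 0, 1, 1, 0]
p = precision_at_recall(scores, labels, target_recall=0.95)
```

In the toy data all three positives are recovered only after the top four abstracts are retrieved, so the precision at that cutoff is 3/4.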

Wang L, Gao Y, Chen X, Cui W, Zhou Y, Luo X, Xu S, Du Y, Wang B. A corpus of CO₂ electrocatalytic reduction process extracted from the scientific literature. Sci Data 2023;10:175. [PMID: 36991006 PMCID: PMC10060421 DOI: 10.1038/s41597-023-02089-z] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/05/2023] [Accepted: 03/19/2023] [Indexed: 03/31/2023] Open

He K, Mao R, Gong T, Cambria E, Li C. JCBIE: a joint continual learning neural network for biomedical information extraction. BMC Bioinformatics 2022;23:549. [PMID: 36536280 PMCID: PMC9761970 DOI: 10.1186/s12859-022-05096-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/11/2022] [Accepted: 12/05/2022] [Indexed: 12/23/2022] Open

Duan Y, Rosaleny LE, Coutinho JT, Giménez-Santamarina S, Scheie A, Baldoví JJ, Cardona-Serra S, Gaita-Ariño A. Data-driven design of molecular nanomagnets. Nat Commun 2022;13:7626. [PMID: 36494346 PMCID: PMC9734471 DOI: 10.1038/s41467-022-35336-9] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/22/2022] [Accepted: 11/29/2022] [Indexed: 12/13/2022] Open

Dinh TT, Vo-Chanh TP, Nguyen C, Huynh VQ, Vo N, Nguyen HD. Extract antibody and antigen names from biomedical literature. BMC Bioinformatics 2022;23:524. [PMID: 36474140 PMCID: PMC9727932 DOI: 10.1186/s12859-022-04993-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/25/2022] [Accepted: 10/18/2022] [Indexed: 12/12/2022] Open

Mai H, Le TC, Chen D, Winkler DA, Caruso RA. Machine Learning in the Development of Adsorbents for Clean Energy Application and Greenhouse Gas Capture. ADVANCED SCIENCE (WEINHEIM, BADEN-WURTTEMBERG, GERMANY) 2022;9:e2203899. [PMID: 36285802 PMCID: PMC9798988 DOI: 10.1002/advs.202203899] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 07/07/2022] [Revised: 09/27/2022] [Indexed: 06/04/2023]

Islamaj R, Leaman R, Cissel D, Coss C, Denicola J, Fisher C, Guzman R, Kochar PG, Miliaras N, Punske Z, Sekiya K, Trinh D, Whitman D, Schmidt S, Lu Z. NLM-Chem-BC7: manually annotated full-text resources for chemical entity annotation and indexing in biomedical articles. Database (Oxford) 2022;2022:baac102. [PMID: 36458799 PMCID: PMC9716560 DOI: 10.1093/database/baac102] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/29/2022] [Revised: 10/17/2022] [Accepted: 11/28/2022] [Indexed: 12/03/2022]

Abstract

The automatic recognition of chemical names and their corresponding database identifiers in biomedical text is an important first step for many downstream text-mining applications. The task is even more challenging when considering the identification of these entities in the article's full text and, furthermore, the identification of candidate substances for that article's metadata [Medical Subject Heading (MeSH) article indexing]. The National Library of Medicine (NLM)-Chem track at BioCreative VII aimed to foster the development of algorithms that can predict with high quality the chemical entities in the biomedical literature and further identify the chemical substances that are candidates for article indexing. As a result of this challenge, the NLM-Chem track produced two comprehensive, manually curated corpora annotated with chemical entities and indexed with chemical substances: the chemical identification corpus and the chemical indexing corpus. The NLM-Chem BioCreative VII (NLM-Chem-BC7) Chemical Identification corpus consists of 204 full-text PubMed Central (PMC) articles, fully annotated for chemical entities by 12 NLM indexers for both span (i.e. named entity recognition) and normalization (i.e. entity linking) using MeSH. This resource was used for the training and testing of the Chemical Identification task to evaluate the accuracy of algorithms in predicting chemicals mentioned in recently published full-text articles. The NLM-Chem-BC7 Chemical Indexing corpus consists of 1333 recently published PMC articles, equipped with chemical substance indexing by manual experts at the NLM. This resource was used for the evaluation of the Chemical Indexing task, which evaluated the accuracy of algorithms in predicting the chemicals that should be indexed, i.e. appear in the listing of MeSH terms for the document. 
This set was further enriched after the challenge in two ways: (i) 11 NLM indexers manually verified each of the candidate terms appearing in the prediction results of the challenge participants, but not in the MeSH indexing, and the chemical indexing terms appearing in the MeSH indexing list, but not in the prediction results, and (ii) the challenge organizers algorithmically merged the chemical entity annotations in the full text for all predicted chemical entities and used a statistical approach to keep those with the highest degree of confidence. As a result, the NLM-Chem-BC7 Chemical Indexing corpus is a gold-standard corpus for chemical indexing of journal articles and a silver-standard corpus for chemical entity identification in full-text journal articles. Together, these resources are currently the most comprehensive resources for chemical entity recognition, and we demonstrate improvements in the chemical entity recognition algorithms. We detail the characteristics of these novel resources and make them available for the community. Database URL: https://ftp.ncbi.nlm.nih.gov/pub/lu/NLM-Chem-BC7-corpus/.

Zhang X, Mao R, Cambria E. A survey on syntactic processing techniques. Artif Intell Rev 2022. [DOI: 10.1007/s10462-022-10300-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022]

Bojar D, Lisacek F. Glycoinformatics in the Artificial Intelligence Era. Chem Rev 2022;122:15971-15988. [PMID: 35961636 PMCID: PMC9615983 DOI: 10.1021/acs.chemrev.2c00110] [Citation(s) in RCA: 20] [Impact Index Per Article: 6.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/10/2022] [Indexed: 11/29/2022]

Mroz A, Posligua V, Tarzia A, Wolpert EH, Jelfs KE. Into the Unknown: How Computation Can Help Explore Uncharted Material Space. J Am Chem Soc 2022;144:18730-18743. [PMID: 36206484 PMCID: PMC9585593 DOI: 10.1021/jacs.2c06833] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/29/2022] [Indexed: 11/28/2022]

Parastar H, Tauler R. Big (Bio)Chemical Data Mining Using Chemometric Methods: A Need for Chemists. Angew Chem Int Ed Engl 2022. [DOI: 10.1002/ange.201801134] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/09/2022]

Ohms J. Validity of PubChem compounds supplied by Patentscope or SureChEMBL. WORLD PATENT INFORMATION 2022. [DOI: 10.1016/j.wpi.2022.102134] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/30/2022]

Ghosh S, Lu K. Band gap information extraction from materials science literature – a pilot study. ASLIB J INFORM MANAG 2022. [DOI: 10.1108/ajim-03-2022-0141] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/17/2022]

Tarasova OA, Rudik AV, Biziukova NY, Filimonov DA, Poroikov VV. Chemical named entity recognition in the texts of scientific publications using the naïve Bayes classifier approach. J Cheminform 2022;14:55. [PMID: 35964150 PMCID: PMC9375066 DOI: 10.1186/s13321-022-00633-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/04/2022] [Accepted: 07/12/2022] [Indexed: 11/24/2022] Open

Abstract

Motivation

Application of chemical named entity recognition (CNER) algorithms allows retrieval of information from texts about chemical compound identifiers and creates associations with physical–chemical properties and biological activities. Scientific texts represent low-formalized sources of information. Most methods aimed at CNER are based on machine learning approaches, including conditional random fields and deep neural networks. In general, most machine learning approaches require either vector or sparse word representation of texts. Chemical named entities (CNEs) constitute only a small fraction of the whole text, and the datasets used for training are highly imbalanced.

Methods and results

We propose a new method for extracting CNEs from texts based on the naïve Bayes classifier combined with specially developed filters. In contrast to the earlier developed CNER methods, our approach uses the representation of the data as a set of fragments of text (FoTs) with the subsequent preparation of a set of multi-n-grams (sequences from one to n symbols) for each FoT. Our approach may enable the recognition of novel CNEs. For the CHEMDNER corpus, sensitivity (recall) was 0.95, precision was 0.74, specificity was 0.88, and balanced accuracy was 0.92 based on five-fold cross-validation. We applied the developed algorithm to the extracted CNEs of potential Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) main protease (Mpro) inhibitors. A set of CNEs corresponding to the chemical substances evaluated in the biochemical assays used for the discovery of Mpro inhibitors was retrieved. Manual analysis of the appropriate texts showed that CNEs of potential SARS-CoV-2 Mpro inhibitors were successfully identified by our method.

Conclusion

The obtained results show that the proposed method can be used for filtering out words that are not related to CNEs; therefore, it can be successfully applied to the extraction of CNEs for the purposes of cheminformatics and medicinal chemistry.

Supplementary Information

The online version contains supplementary material available at 10.1186/s13321-022-00633-4.
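
Two ideas from the abstract above can be made concrete with a short Python sketch: building multi-n-grams (all sequences of one to n symbols) for a fragment of text, and relating the reported balanced accuracy to sensitivity and specificity. The sketch assumes that "symbols" means characters and that a fragment is a plain string; the function names are hypothetical and this is not the authors' implementation.

```python
def multi_n_grams(fragment, n):
    """All contiguous character sequences of length 1..n in the fragment.

    Assumes "symbols" in the abstract means characters.
    """
    return [
        fragment[i:i + k]
        for k in range(1, n + 1)
        for i in range(len(fragment) - k + 1)
    ]


def balanced_accuracy(sensitivity, specificity):
    """Balanced accuracy is the mean of sensitivity and specificity."""
    return (sensitivity + specificity) / 2


grams = multi_n_grams("NaCl", 2)
# grams == ['N', 'a', 'C', 'l', 'Na', 'aC', 'Cl']

# Values reported in the abstract for the CHEMDNER corpus.
ba = balanced_accuracy(0.95, 0.88)
```

The mean of the reported sensitivity (0.95) and specificity (0.88) is 0.915, which agrees with the reported balanced accuracy of 0.92 after rounding.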

Mai H, Le TC, Chen D, Winkler DA, Caruso RA. Machine Learning for Electrocatalyst and Photocatalyst Design and Discovery. Chem Rev 2022;122:13478-13515. [PMID: 35862246 DOI: 10.1021/acs.chemrev.2c00061] [Citation(s) in RCA: 85] [Impact Index Per Article: 28.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/17/2022]

Chang YC, Chiu YW, Chuang TW. Linguistic Pattern-Infused Dual-Channel Bidirectional Long Short-term Memory With Attention for Dengue Case Summary Generation From the Program for Monitoring Emerging Diseases-Mail Database: Algorithm Development Study. JMIR Public Health Surveill 2022;8:e34583. [PMID: 35830225 PMCID: PMC9491834 DOI: 10.2196/34583] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/30/2021] [Revised: 04/15/2022] [Accepted: 05/27/2022] [Indexed: 11/13/2022] Open

Abstract

BACKGROUND

Globalization and environmental changes have intensified the emergence or re-emergence of infectious diseases worldwide, such as outbreaks of dengue fever in Southeast Asia. Collaboration on region-wide infectious disease surveillance systems is therefore critical but difficult to achieve because of the different transparency levels of health information systems in different countries. Although the Program for Monitoring Emerging Diseases (ProMED)-mail is the most comprehensive international expert-curated platform providing rich disease outbreak information on humans, animals, and plants, the unstructured text content of the reports makes analysis for further application difficult.

OBJECTIVE

To make monitoring the epidemic situation in Southeast Asia more efficient, this study aims to develop an automatic summary of the alert articles from ProMED-mail, a huge textual data source. In this paper, we proposed a text summarization method that uses natural language processing technology to automatically extract important sentences from alert articles in ProMED-mail emails to generate summaries. Using our method, we can quickly capture crucial information to help make important decisions regarding epidemic surveillance.

METHODS

Our data, which span a period from 1994 to 2019, come from the ProMED-mail website. We analyzed the collected data to establish a unique Taiwan dengue corpus that was validated with professionals' annotations to achieve almost perfect agreement (Cohen κ=90%). To generate a ProMED-mail summary, we developed a dual-channel bidirectional long short-term memory network with an attention mechanism and infused latent syntactic features to identify key sentences from the alerting article.

RESULTS

Our method is superior to many well-known machine learning and neural network approaches in identifying important sentences, achieving a macroaverage F1 score of 93%. Moreover, it can successfully extract the relevant correct information on dengue fever from a ProMED-mail alerting article, which can help researchers or general users to quickly understand the essence of the alerting article at first glance. In addition to verifying the model, we also recruited 3 professional experts and 2 students from related fields to participate in a satisfaction survey on the generated summaries, and the results show that 84% (63/75) of the summaries received high satisfaction ratings.

CONCLUSIONS

The proposed approach successfully fuses latent syntactic features into a deep neural network to analyze the syntactic, semantic, and contextual information in the text. It then exploits the derived information to identify crucial sentences in the ProMED-mail alerting article. The experimental results show that the proposed method is not only effective but also outperforms the compared methods. Our approach also demonstrates the potential for case summary generation from ProMED-mail alerting articles. In terms of practical application, when a new alerting article arrives, our method can quickly identify the relevant case information, which is the most critical part, to use as a reference or for further analysis.

Yan R, Jiang X, Wang W, Dang D, Su Y. Materials information extraction via automatically generated corpus. Sci Data 2022;9:401. [PMID: 35831367 PMCID: PMC9279422 DOI: 10.1038/s41597-022-01492-2] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/09/2022] [Accepted: 06/28/2022] [Indexed: 11/12/2022] Open

Auto-generating databases of Yield Strength and Grain Size using ChemDataExtractor. Sci Data 2022. [PMCID: PMC9184532 DOI: 10.1038/s41597-022-01301-w] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/08/2022] Open

Li M, Tian S, Meng F, Yin M, Yue Q, Wang S, Bu W, Luo L. Continuously Multiplexed Ultrastrong Raman Probes by Precise Isotopic Polymer Backbone Doping for Multidimensional Information Storage and Encryption. NANO LETTERS 2022;22:4544-4551. [PMID: 35604007 DOI: 10.1021/acs.nanolett.2c01443] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/15/2023]

Text-mined dataset of gold nanoparticle synthesis procedures, morphologies, and size entities. Sci Data 2022;9:234. [PMID: 35618761 PMCID: PMC9135747 DOI: 10.1038/s41597-022-01321-6] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/30/2021] [Accepted: 04/08/2022] [Indexed: 12/13/2022] Open

Dataset of solution-based inorganic materials synthesis procedures extracted from the scientific literature. Sci Data 2022;9:231. [PMID: 35614129 PMCID: PMC9132903 DOI: 10.1038/s41597-022-01317-2] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/07/2021] [Accepted: 04/05/2022] [Indexed: 11/10/2022] Open

Serov N, Vinogradov V. Inverse Material Search and Synthesis Verification by Hand Drawings via Transfer Learning and Contour Detection. SMALL METHODS 2022;6:e2101619. [PMID: 35285181 DOI: 10.1002/smtd.202101619] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 12/28/2021] [Revised: 02/12/2022] [Indexed: 06/14/2023]

Trewartha A, Walker N, Huo H, Lee S, Cruse K, Dagdelen J, Dunn A, Persson KA, Ceder G, Jain A. Quantifying the advantage of domain-specific pre-training on named entity recognition tasks in materials science. PATTERNS (NEW YORK, N.Y.) 2022;3:100488. [PMID: 35465225 PMCID: PMC9024010 DOI: 10.1016/j.patter.2022.100488] [Citation(s) in RCA: 20] [Impact Index Per Article: 6.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 10/24/2021] [Revised: 01/21/2022] [Accepted: 03/15/2022] [Indexed: 11/03/2022]

Abstract

A bottleneck in efficiently connecting new materials discoveries to established literature has arisen due to an increase in publications. This problem may be addressed by using named entity recognition (NER) to extract structured summary-level data from unstructured materials science text. We compare the performance of four NER models on three materials science datasets. The four models include a bidirectional long short-term memory (BiLSTM) and three transformer models (BERT, SciBERT, and MatBERT) with increasing degrees of domain-specific materials science pre-training. MatBERT improves over the other two BERT_BASE-based models by 1%∼12%, implying that domain-specific pre-training provides measurable advantages. Despite relative architectural simplicity, the BiLSTM model consistently outperforms BERT, perhaps due to its domain-specific pre-trained word embeddings. Furthermore, MatBERT and SciBERT models outperform the original BERT model to a greater extent in the small data limit. MatBERT’s higher-quality predictions should accelerate the extraction of structured data from materials science literature.

• Efficient extraction of information from materials science literature is needed
• Domain-specific materials science pre-training improves results
• Even simpler domain-specific models can outperform more complex general models

A bottleneck in efficiently connecting new materials discoveries to established literature has arisen due to a massive increase in publications. Four different language models are trained to automatically collect important information from materials science articles. We compare a simple model (BiLSTM) with materials science knowledge to three variants of a more complex model: one with general knowledge (BERT), one with general scientific knowledge (SciBERT), and one with materials science knowledge (MatBERT). We find that MatBERT performs the best overall. This implies that language models with greater extents of materials science knowledge will perform better on materials science-related tasks. The simpler model even consistently outperforms BERT. Furthermore, the performance gaps grow when the models are given fewer examples of information extraction to learn from. MatBERT’s higher-quality results should accelerate the collection of information from materials science literature.

Affiliation(s)

Amalie Trewartha: Materials Sciences Division, Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, CA 94720, USA
Nicholas Walker: Energy Technologies Area, Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, CA 94720, USA
Haoyan Huo: Materials Sciences Division, Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, CA 94720, USA; Department of Materials Science and Engineering, University of California, Berkeley, 210 Hearst Memorial Mining Building, Berkeley, CA 94720, USA
Sanghoon Lee: Energy Technologies Area, Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, CA 94720, USA; Department of Materials Science and Engineering, University of California, Berkeley, 210 Hearst Memorial Mining Building, Berkeley, CA 94720, USA
Kevin Cruse: Materials Sciences Division, Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, CA 94720, USA; Department of Materials Science and Engineering, University of California, Berkeley, 210 Hearst Memorial Mining Building, Berkeley, CA 94720, USA
John Dagdelen: Energy Technologies Area, Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, CA 94720, USA; Department of Materials Science and Engineering, University of California, Berkeley, 210 Hearst Memorial Mining Building, Berkeley, CA 94720, USA
Alexander Dunn: Energy Technologies Area, Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, CA 94720, USA; Department of Materials Science and Engineering, University of California, Berkeley, 210 Hearst Memorial Mining Building, Berkeley, CA 94720, USA
Kristin A Persson: Molecular Foundry, Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, CA 94720, USA; Department of Materials Science and Engineering, University of California, Berkeley, 210 Hearst Memorial Mining Building, Berkeley, CA 94720, USA
Gerbrand Ceder: Materials Sciences Division, Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, CA 94720, USA; Department of Materials Science and Engineering, University of California, Berkeley, 210 Hearst Memorial Mining Building, Berkeley, CA 94720, USA
Anubhav Jain: Energy Technologies Area, Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, CA 94720, USA

Zhu M, Cole JM. PDFDataExtractor: A Tool for Reading Scientific Text and Interpreting Metadata from the Typeset Literature in the Portable Document Format. J Chem Inf Model 2022;62:1633-1643. [PMID: 35349259 PMCID: PMC9049592 DOI: 10.1021/acs.jcim.1c01198] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/28/2022]

Abstract

The layout of portable document format (PDF) files is constant on any screen, and the metadata therein are latent, compared to mark-up languages such as HTML and XML. No semantic tags are usually provided, and a PDF file is not designed to be edited or its data interpreted by software. However, data held in PDF files need to be extracted in order to comply with open-source data requirements that are now government-regulated. In the chemical domain, related chemical and property data also need to be found, and their correlations need to be exploited to enable data science in areas such as data-driven materials discovery. Such relationships may be realized using text-mining software such as the “chemistry-aware” natural-language-processing tool, ChemDataExtractor; however, this tool has limited data-extraction capabilities from PDF files. This study presents the PDFDataExtractor tool, which can act as a plug-in to ChemDataExtractor. It outperforms other PDF-extraction tools for the chemical literature by coupling its functionalities to the chemical-named entity-recognition capabilities of ChemDataExtractor. The intrinsic PDF-reading abilities of ChemDataExtractor are much improved. The system features a template-based architecture. This enables semantic information to be extracted from the PDF files of scientific articles in order to reconstruct the logical structure of articles. While other existing PDF-extracting tools focus on quantity mining, this template-based system is more focused on quality mining on different layouts. PDFDataExtractor outputs information in JSON and plain text, including the metadata of a PDF file, such as paper title, authors, affiliation, email, abstract, keywords, journal, year, digital object identifier (DOI), reference, and issue number. With a self-created evaluation article set, PDFDataExtractor achieved promising precision for all key assessed metadata areas of the document text.

Nandy A, Terrones G, Arunachalam N, Duan C, Kastner DW, Kulik HJ. MOFSimplify, machine learning models with extracted stability data of three thousand metal-organic frameworks. Sci Data 2022;9:74. [PMID: 35277533 PMCID: PMC8917177 DOI: 10.1038/s41597-022-01181-0] [Citation(s) in RCA: 30] [Impact Index Per Article: 10.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/30/2021] [Accepted: 01/17/2022] [Indexed: 11/09/2022] Open

Chiang LH, Braun B, Wang Z, Castillo I. Towards AI at Scale in the Chemical Industry. AIChE J 2022. [DOI: 10.1002/aic.17644] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/08/2023]

Saldívar-González FI, Aldas-Bulos VD, Medina-Franco JL, Plisson F. Natural product drug discovery in the artificial intelligence era. Chem Sci 2022;13:1526-1546. [PMID: 35282622 PMCID: PMC8827052 DOI: 10.1039/d1sc04471k] [Citation(s) in RCA: 70] [Impact Index Per Article: 23.3] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/13/2021] [Accepted: 12/10/2021] [Indexed: 12/19/2022] Open

Li Y, Yu L, Liu J, Guo L, Wu Y, Wu X. NetDPO: (delta, gamma)-approximate pattern matching with gap constraints under one-off condition. APPL INTELL 2022. [DOI: 10.1007/s10489-021-03000-2] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/17/2023]

Designing a multilayer film via machine learning of scientific literature. Sci Rep 2022;12:930. [PMID: 35042971 PMCID: PMC8766440 DOI: 10.1038/s41598-022-05010-7] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/25/2021] [Accepted: 01/04/2022] [Indexed: 12/23/2022] Open

Cai X, Wang N, Yang L, Mei X. Global-local neighborhood based network representation for citation recommendation. APPL INTELL 2022. [DOI: 10.1007/s10489-021-02964-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/02/2022]

Chen K, Tian H, Li B, Rangarajan S. A chemistry‐inspired neural network kinetic model for oxidative coupling of methane from high‐throughput data. AIChE J 2022. [DOI: 10.1002/aic.17584] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/05/2023]