1
Schilling-Wilhelmi M, Ríos-García M, Shabih S, Gil MV, Miret S, Koch CT, Márquez JA, Jablonka KM. From text to insight: large language models for chemical data extraction. Chem Soc Rev 2025; 54:1125-1150. [PMID: 39703015 DOI: 10.1039/d4cs00913d] [Indexed: 12/21/2024]
Abstract
The vast majority of chemical knowledge exists in unstructured natural language, yet structured data is crucial for innovative and systematic materials design. Traditionally, the field has relied on manual curation and partial automation for data extraction for specific use cases. The advent of large language models (LLMs) represents a significant shift, potentially enabling non-experts to extract structured, actionable data from unstructured text efficiently. While applying LLMs to chemical and materials science data extraction presents unique challenges, domain knowledge offers opportunities to guide and validate LLM outputs. This tutorial review provides a comprehensive overview of LLM-based structured data extraction in chemistry, synthesizing current knowledge and outlining future directions. We address the lack of standardized guidelines and present frameworks for leveraging the synergy between LLMs and chemical expertise. This work serves as a foundational resource for researchers aiming to harness LLMs for data-driven chemical research. The insights presented here could significantly enhance how researchers across chemical disciplines access and utilize scientific information, potentially accelerating the development of novel compounds and materials for critical societal needs.
Affiliation(s)
- Mara Schilling-Wilhelmi
  - Laboratory of Organic and Macromolecular Chemistry (IOMC), Friedrich Schiller University Jena, Humboldtstrasse 10, 07743 Jena, Germany
- Martiño Ríos-García
  - Laboratory of Organic and Macromolecular Chemistry (IOMC), Friedrich Schiller University Jena, Humboldtstrasse 10, 07743 Jena, Germany
  - Institute of Carbon Science and Technology (INCAR), CSIC, Francisco Pintado Fe 26, 33011 Oviedo, Spain
- Sherjeel Shabih
  - Department of Physics and CSMB, Humboldt-Universität zu Berlin, Berlin, Germany
- María Victoria Gil
  - Institute of Carbon Science and Technology (INCAR), CSIC, Francisco Pintado Fe 26, 33011 Oviedo, Spain
- Christoph T Koch
  - Department of Physics and CSMB, Humboldt-Universität zu Berlin, Berlin, Germany
- José A Márquez
  - Department of Physics and CSMB, Humboldt-Universität zu Berlin, Berlin, Germany
- Kevin Maik Jablonka
  - Laboratory of Organic and Macromolecular Chemistry (IOMC), Friedrich Schiller University Jena, Humboldtstrasse 10, 07743 Jena, Germany
  - Center for Energy and Environmental Chemistry Jena (CEEC Jena), Friedrich Schiller University Jena, Philosophenweg 7a, 07743 Jena, Germany
  - Helmholtz Institute for Polymers in Energy Applications Jena (HIPOLE Jena), Lessingstrasse 12-14, 07743 Jena, Germany
2
Fu W, Qi M, Rong Y, Lin C, Guo W, Su B. Remote On-Paper Electrochemiluminescence-Based High-Safety and Multilevel Information Encryption. Angew Chem Int Ed Engl 2025; 64:e202420184. [PMID: 39659206 DOI: 10.1002/anie.202420184] [Received: 10/18/2024] [Revised: 11/27/2024] [Accepted: 12/11/2024] [Indexed: 12/12/2024]
Abstract
The escalating need for information protection underscores the urgency of developing advanced encryption strategies. Herein we report a novel chemical approach that enables information encryption by on-paper electrochemiluminescence (ECL). Dendritic porous silica nanospheres modified with polyetherimide and bovine serum albumin were prepared as the chemical ink to write the secret message on paper. When the paper is attached to an electrode, immersed in a solution containing tris(2,2'-bipyridyl)ruthenium (Ru(bpy)₃²⁺), and subjected to a suitable voltage, a remote "catalytic route" electrochemical reaction produces ECL that functions as the key to decrypt and visualize the message by imaging. In addition, proteins can also be used as the biological ink to write the secret message, which is then decrypted by the combined use of immunochemistry and ECL imaging as two keys. We believe the ECL-based strategy holds great promise in high-safety and multilevel information encryption, as it is protected not only by encoding, like conventional invisible inks, but also by the unique ECL decoding approach.
Affiliation(s)
- Wenxuan Fu
  - Institute of Analytical Chemistry, Department of Chemistry, Zhejiang University, Hangzhou, 310058, China
- Min Qi
  - Institute of Analytical Chemistry, Department of Chemistry, Zhejiang University, Hangzhou, 310058, China
- Yidan Rong
  - Institute of Analytical Chemistry, Department of Chemistry, Zhejiang University, Hangzhou, 310058, China
- Chukai Lin
  - Institute of Analytical Chemistry, Department of Chemistry, Zhejiang University, Hangzhou, 310058, China
- Weiliang Guo
  - Collaborative Innovation Center of Biomedical Functional Materials and Key Laboratory of Biofunctional Materials of Jiangsu Province, School of Chemistry and Materials Science, Nanjing Normal University, Nanjing, 210023, China
- Bin Su
  - Institute of Analytical Chemistry, Department of Chemistry, Zhejiang University, Hangzhou, 310058, China
  - General Surgery Department, Children's Hospital, Zhejiang University School of Medicine, National Clinical Research Center for Child Health, Hangzhou, 310052, China
3
Biziukova NY, Rudik AV, Dmitriev AV, Tarasova OA, Filimonov DA, Poroikov VV. XenoMet: A Corpus of Texts to Extract Data on Metabolites of Xenobiotics. ACS Omega 2025; 10:2459-2471. [PMID: 39895765 PMCID: PMC11780559 DOI: 10.1021/acsomega.4c05723] [Received: 06/19/2024] [Revised: 10/25/2024] [Accepted: 11/21/2024] [Indexed: 02/04/2025]
Abstract
Understanding the biotransformation of xenobiotics in the human body is critical for a comprehensive assessment of drug effects, since pharmacologically active drug metabolites may exhibit a range of biological effects that often differ from those of the original pharmaceutical agent. Studies of the biotransformation mechanisms of xenobiotics have resulted in numerous publications. Extracting information about the parent compounds (substrates) and their metabolites from the texts allows retrieval of information on their biological activities, molecular mechanisms of action, and toxicity. Manual curation of the names of xenobiotics, their metabolites, and biotransformation reactions in the text is a challenging task due to the large number of publications on the metabolism of pharmaceutical agents. Our aim is to create an annotated corpus of texts that can be used for automated extraction of the names of xenobiotics, including pharmaceutical agents that undergo biotransformation and their metabolites. Prior to manual annotation of the corpus, semiautomatic annotation was carried out using a previously developed rule-based method for extracting parent compounds and their metabolites. To create XenoMet, we automatically extracted relevant texts from PubMed using a query based on MeSH terms. The names of biotransformation reactions were recognized by using an in-house-developed dictionary. Then, we manually verified the extracted data by correcting errors in the named entity annotation and identified the associations between substrates and metabolites. We tested the applicability of XenoMet for the reconstruction of a metabolic tree and for the automated extraction of the chemical names of substrates, metabolites, and biotransformation reactions. Classification of the named entities of metabolites, substrates, and biotransformation reactions by a conditional random fields approach using XenoMet as the training set provides an F1-score of 0.79.
Affiliation(s)
- Nadezhda Yu. Biziukova
  - Institute of Biomedical Chemistry, 10-8, Pogodinskaya Str., Moscow 119121, Russian Federation
- Anastasia V. Rudik
  - Institute of Biomedical Chemistry, 10-8, Pogodinskaya Str., Moscow 119121, Russian Federation
- Alexander V. Dmitriev
  - Institute of Biomedical Chemistry, 10-8, Pogodinskaya Str., Moscow 119121, Russian Federation
- Olga A. Tarasova
  - Institute of Biomedical Chemistry, 10-8, Pogodinskaya Str., Moscow 119121, Russian Federation
- Dmitry A. Filimonov
  - Institute of Biomedical Chemistry, 10-8, Pogodinskaya Str., Moscow 119121, Russian Federation
- Vladimir V. Poroikov
  - Institute of Biomedical Chemistry, 10-8, Pogodinskaya Str., Moscow 119121, Russian Federation
4
Jin D, Liang Y, Xiong Z, Yang X, Wang H, Zeng J, Gu S. Application of Transformers to Chemical Synthesis. Molecules 2025; 30:493. [PMID: 39942600 PMCID: PMC11821105 DOI: 10.3390/molecules30030493] [Received: 12/08/2024] [Revised: 01/09/2025] [Accepted: 01/10/2025] [Indexed: 02/16/2025] Open
Abstract
Efficient chemical synthesis is critical for the production of organic chemicals, particularly in the pharmaceutical industry. Leveraging machine learning to predict chemical synthesis and improve the development efficiency has become a significant research focus in modern chemistry. Among various machine learning models, the Transformer, a leading model in natural language processing, has revolutionized numerous fields due to its powerful feature-extraction and representation-learning capabilities. Recent applications demonstrated that Transformer models can also significantly enhance the performance in chemical synthesis tasks, particularly in reaction prediction and retrosynthetic planning. This article provides a comprehensive review of the applications and innovations of Transformer models in the qualitative prediction tasks of chemical synthesis, with a focus on technical approaches, performance advantages, and the challenges associated with applying the Transformer architecture to chemical reactions. Furthermore, we discuss the future directions for improving the applications of Transformer models in chemical synthesis.
Affiliation(s)
- Dong Jin
  - School of Chemical Engineering & Pharmacy, Pharmaceutical Research Institute, Wuhan Institute of Technology, Wuhan 430205, China
- Yuli Liang
  - School of Chemical Engineering & Pharmacy, Pharmaceutical Research Institute, Wuhan Institute of Technology, Wuhan 430205, China
- Zihao Xiong
  - School of Chemical Engineering & Pharmacy, Pharmaceutical Research Institute, Wuhan Institute of Technology, Wuhan 430205, China
- Xiaojie Yang
  - Hubei Key Laboratory of Radiation Chemistry and Functional Materials, School of Nuclear Technology and Chemistry & Biology, Hubei University of Science and Technology, Xianning 437100, China
- Haifeng Wang
  - School of Chemical Engineering & Pharmacy, Pharmaceutical Research Institute, Wuhan Institute of Technology, Wuhan 430205, China
- Jie Zeng
  - School of Chemical Engineering & Pharmacy, Pharmaceutical Research Institute, Wuhan Institute of Technology, Wuhan 430205, China
- Shuangxi Gu
  - School of Chemical Engineering & Pharmacy, Pharmaceutical Research Institute, Wuhan Institute of Technology, Wuhan 430205, China
5
Kevlishvili I, St Michel RG, Garrison AG, Toney JW, Adamji H, Jia H, Román-Leshkov Y, Kulik HJ. Leveraging natural language processing to curate the tmCAT, tmPHOTO, tmBIO, and tmSCO datasets of functional transition metal complexes. Faraday Discuss 2025; 256:275-303. [PMID: 39301698 DOI: 10.1039/d4fd00087k] [Indexed: 09/22/2024]
Abstract
The breadth of transition metal chemical space covered by databases such as the Cambridge Structural Database and the derived computational database tmQM is not conducive to application-specific modeling and the development of structure-property relationships. Here, we employ both supervised and unsupervised natural language processing (NLP) techniques to link experimentally synthesized compounds in the tmQM database to their respective applications. Leveraging NLP models, we curate four distinct datasets: tmCAT for catalysis, tmPHOTO for photophysical activity, tmBIO for biological relevance, and tmSCO for magnetism. Analyzing the chemical substructures within each dataset reveals common chemical motifs in each of the designated applications. We then use these common chemical structures to augment our initial datasets for each application, yielding a total of 21 631 compounds in tmCAT, 4599 in tmPHOTO, 2782 in tmBIO, and 983 in tmSCO. These datasets are expected to accelerate the more targeted computational screening and development of refined structure-property relationships with machine learning.
Affiliation(s)
- Ilia Kevlishvili
  - Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
- Roland G St Michel
  - Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
  - Department of Materials Science and Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
- Aaron G Garrison
  - Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
- Jacob W Toney
  - Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
- Husain Adamji
  - Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
- Haojun Jia
  - Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
  - Department of Chemistry, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
- Yuriy Román-Leshkov
  - Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
  - Department of Chemistry, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
- Heather J Kulik
  - Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
  - Department of Chemistry, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
6
Khalid A, Kaleem A, Qazi W, Abdullah R, Iqtedar M, Naz S. Site-specific prediction of O-GlcNAc modification in proteins using evolutionary scale model. PLoS One 2024; 19:e0316215. [PMID: 39739642 DOI: 10.1371/journal.pone.0316215] [Received: 04/22/2024] [Accepted: 12/07/2024] [Indexed: 01/02/2025] Open
Abstract
Protein glycosylation, a vital post-translational modification, is pivotal in various biological processes and disease pathogenesis. Computational approaches, including protein language models and machine learning algorithms, have emerged as valuable tools for predicting O-GlcNAc sites, reducing experimental costs, and enhancing efficiency. However, the literature has not reported the prediction of O-GlcNAc sites through the evolutionary scale model (ESM). Therefore, this study employed the ESM-2 model for O-GlcNAc site prediction in humans. Approximately 1100 O-linked glycoprotein sequences retrieved from the O-GlcNAc database were utilized for model training. The ESM-2 model exhibited consistent improvement over epochs, achieving an accuracy of 78.30%, recall of 78.30%, precision of 61.31%, and F1-score of 68.74%. By comparison, traditional models overfit on the same data (up to 99%), whereas the ESM-2 model performs better in terms of optimal training and testing predictions. These findings underscore the effectiveness of the ESM-2 model in accurately predicting O-GlcNAc sites within human proteins. Accurately predicting O-GlcNAc sites within human proteins can significantly advance glycoproteomic research by enhancing our understanding of protein function and disease mechanisms, aiding in developing targeted therapies, and facilitating biomarker discovery for improved diagnosis and treatment. Furthermore, future studies should focus on more diverse data types, longer protein sequence lengths, and greater computational resources to evaluate various parameters. Accurate prediction of O-GlcNAc sites might enhance the investigation of the site-specific functions of proteins in physiology and diseases.
Affiliation(s)
- Ayesha Khalid
  - Department of Biotechnology, Lahore College for Women University, Lahore, Pakistan
- Afshan Kaleem
  - Department of Biotechnology, Lahore College for Women University, Lahore, Pakistan
- Wajahat Qazi
  - Department of Computer Science, COMSATS University, Islamabad, Pakistan
- Roheena Abdullah
  - Department of Biotechnology, Lahore College for Women University, Lahore, Pakistan
- Mehwish Iqtedar
  - Department of Biotechnology, Lahore College for Women University, Lahore, Pakistan
- Shagufta Naz
  - Department of Zoology, Lahore College for Women University, Lahore, Pakistan
7
Harigua-Souiai E, Masmoudi O, Makni S, Oualha R, Abdelkrim YZ, Hamdi S, Souiai O, Guizani I. cidalsDB: an AI-empowered platform for anti-pathogen therapeutics research. J Cheminform 2024; 16:134. [PMID: 39609715 PMCID: PMC11605991 DOI: 10.1186/s13321-024-00929-7] [Received: 07/24/2024] [Accepted: 11/11/2024] [Indexed: 11/30/2024] Open
Abstract
Computer-aided drug discovery (CADD) is nurtured by recent advances in big data analytics and Artificial Intelligence (AI) towards enhanced drug discovery (DD) outcomes. In this context, reliable datasets are of utmost importance. We herein present CidalsDB, a novel web server for AI-assisted DD against infectious pathogens, namely Leishmania parasites and Coronaviruses. We performed a literature search on molecules with validated anti-pathogen effects. Then, we consolidated these data with bioassays from PubChem. Finally, we constructed a database to store these datasets and make them accessible and ready-to-use for the scientific community through CidalsDB, a web-based interface. In a second step, we implemented and optimized four machine learning (ML) and three deep learning (DL) algorithms that optimally predicted the biological activity of molecules. Random Forests (RF), Multi-Layer Perceptron (MLP) and ChemBERTa were the best classifiers of anti-Leishmania molecules, while Gradient Boosting (GB), Graph-Convolutional Network (GCN) and ChemBERTa achieved the best performances on the Coronaviruses dataset. All six models were optimized and deployed through CidalsDB as anti-pathogen activity prediction models. Scientific contribution: CidalsDB is an open access web-based tool that allows browsing and access to ready-to-use datasets of anti-pathogen molecules, alongside best performing AI models for biological activity prediction. It offers a democratized no-code platform for AI-based CADD, which shall foster innovation and collaboration within the DD community. CidalsDB is accessible through https://cidalsdb.streamlit.app/.
Affiliation(s)
- Emna Harigua-Souiai
  - Laboratory of Molecular Epidemiology and Experimental Pathology - LR16IPT04, Institut Pasteur de Tunis, Université de Tunis El Manar, 13, Place Pasteur, 1002, Tunis, Tunisia
- Ons Masmoudi
  - Laboratory of Molecular Epidemiology and Experimental Pathology - LR16IPT04, Institut Pasteur de Tunis, Université de Tunis El Manar, 13, Place Pasteur, 1002, Tunis, Tunisia
- Samer Makni
  - Laboratory of Molecular Epidemiology and Experimental Pathology - LR16IPT04, Institut Pasteur de Tunis, Université de Tunis El Manar, 13, Place Pasteur, 1002, Tunis, Tunisia
- Rafeh Oualha
  - Laboratory of Molecular Epidemiology and Experimental Pathology - LR16IPT04, Institut Pasteur de Tunis, Université de Tunis El Manar, 13, Place Pasteur, 1002, Tunis, Tunisia
- Yosser Z Abdelkrim
  - Laboratory of Molecular Epidemiology and Experimental Pathology - LR16IPT04, Institut Pasteur de Tunis, Université de Tunis El Manar, 13, Place Pasteur, 1002, Tunis, Tunisia
- Sara Hamdi
  - Laboratory of Molecular Epidemiology and Experimental Pathology - LR16IPT04, Institut Pasteur de Tunis, Université de Tunis El Manar, 13, Place Pasteur, 1002, Tunis, Tunisia
- Oussama Souiai
  - Laboratory of BioInformatics, BioMathematics and BioStatistics - LR20IPT09, Institut Pasteur de Tunis, Université de Tunis El Manar, 13, Place Pasteur, 1002, Tunis, Tunisia
  - Institut Supérieur des technologies médicales de Tunis, ISTMT, Université de Tunis El Manar, 9, Rue Docteur Zouheïr Safi, 1006, Tunis, Tunisia
- Ikram Guizani
  - Laboratory of Molecular Epidemiology and Experimental Pathology - LR16IPT04, Institut Pasteur de Tunis, Université de Tunis El Manar, 13, Place Pasteur, 1002, Tunis, Tunisia
8
Liang W, Su W, Zhong L, Yang Z, Li T, Liang Y, Ruan T, Jiang G. Comprehensive Characterization of Oxidative Stress-Modulating Chemicals Using GPT-Based Text Mining. Environmental Science & Technology 2024; 58:20540-20552. [PMID: 39513989 DOI: 10.1021/acs.est.4c07390] [Indexed: 11/16/2024]
Abstract
The screening of hazardous environmental pollutants is hindered by the limited availability of toxicological databases. Large language model (LLM)-based text mining holds the potential to automatically extract complex toxicological information from the literature. Due to its relevance to diseases and the challenge of comprehensive characterization, oxidative stress serves as a suitable case for research by text mining. In this study, a robust workflow utilizing an LLM (i.e., GPT-4) was developed to extract information on oxidative stress tests, including data collection, text preprocessing, prompt engineering, and performance evaluation procedures. A total of 17,780 relevant records were extracted from 7166 articles, covering 2558 unique compounds. A rising interest in oxidative stress was observed over the past two decades. A list of known prooxidants (n = 1416) and antioxidants (n = 1102) was established, with the leading chemical categories being pharmaceuticals, pesticides, and metals for prooxidants and pharmaceuticals and flavonoids for antioxidants. Structural alert analysis identified potential prooxidant (e.g., chlorobenzene, nitrobenzene, and tertiary amines) and antioxidant (e.g., flavonoid and thiol) substructures. These findings illustrate the feasibility of building toxicological databases through LLM-based text mining in a cost-efficient manner, and the information obtained from the technique holds significant promise for future applications in environmental and health research.
Affiliation(s)
- Wenqing Liang
  - State Key Laboratory of Environmental Chemistry and Ecotoxicology, Research Center for Eco-Environmental Sciences, Chinese Academy of Sciences, Beijing 100085, China
  - University of Chinese Academy of Sciences, Beijing 100049, China
- Wenyuan Su
  - State Key Laboratory of Environmental Chemistry and Ecotoxicology, Research Center for Eco-Environmental Sciences, Chinese Academy of Sciences, Beijing 100085, China
  - University of Chinese Academy of Sciences, Beijing 100049, China
- Laijin Zhong
  - State Key Laboratory of Environmental Chemistry and Ecotoxicology, Research Center for Eco-Environmental Sciences, Chinese Academy of Sciences, Beijing 100085, China
  - University of Chinese Academy of Sciences, Beijing 100049, China
- Zhendong Yang
  - State Key Laboratory of Environmental Chemistry and Ecotoxicology, Research Center for Eco-Environmental Sciences, Chinese Academy of Sciences, Beijing 100085, China
  - University of Chinese Academy of Sciences, Beijing 100049, China
- Tingyu Li
  - State Key Laboratory of Environmental Chemistry and Ecotoxicology, Research Center for Eco-Environmental Sciences, Chinese Academy of Sciences, Beijing 100085, China
  - University of Chinese Academy of Sciences, Beijing 100049, China
- Yong Liang
  - Hubei Key Laboratory of Environmental and Health Effects of Persistent Toxic Substances, School of Environment and Health, Jianghan University, Wuhan 430056, China
- Ting Ruan
  - State Key Laboratory of Environmental Chemistry and Ecotoxicology, Research Center for Eco-Environmental Sciences, Chinese Academy of Sciences, Beijing 100085, China
  - University of Chinese Academy of Sciences, Beijing 100049, China
  - Hubei Key Laboratory of Environmental and Health Effects of Persistent Toxic Substances, School of Environment and Health, Jianghan University, Wuhan 430056, China
- Guibin Jiang
  - State Key Laboratory of Environmental Chemistry and Ecotoxicology, Research Center for Eco-Environmental Sciences, Chinese Academy of Sciences, Beijing 100085, China
  - University of Chinese Academy of Sciences, Beijing 100049, China
9
Ai Q, Meng F, Shi J, Pelkie B, Coley CW. Extracting structured data from organic synthesis procedures using a fine-tuned large language model. Digital Discovery 2024; 3:1822-1831. [PMID: 39157760 PMCID: PMC11322921 DOI: 10.1039/d4dd00091a] [Received: 04/06/2024] [Accepted: 07/30/2024] [Indexed: 08/20/2024]
Abstract
The popularity of data-driven approaches and machine learning (ML) techniques in the field of organic chemistry and its various subfields has increased the value of structured reaction data. Most data in chemistry is represented by unstructured text, and despite the vastness of the organic chemistry literature (papers, patents), conversion from unstructured text to structured data remains a largely manual endeavor. Software tools for this task would facilitate downstream applications such as reaction prediction and condition recommendation. In this study, we fine-tune a large language model (LLM) to extract reaction information from organic synthesis procedure text into structured data following the Open Reaction Database (ORD) schema, a comprehensive data structure designed for organic reactions. The fine-tuned model produces syntactically correct ORD records with an average accuracy of 91.25% for ORD "messages" (e.g., full compound, workups, or condition definitions) and 92.25% for individual data fields (e.g., compound identifiers, mass quantities), with the ability to recognize compound-referencing tokens and to infer reaction roles. We investigate its failure modes and evaluate performance on specific subtasks such as reaction role classification.
Affiliation(s)
- Qianxiang Ai
  - Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA, USA
- Fanwang Meng
  - Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA, USA
- Jiale Shi
  - Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA, USA
- Brenden Pelkie
  - Department of Chemical Engineering, University of Washington, Seattle, WA, USA
- Connor W Coley
  - Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA, USA
10
Huang Z, Li X, Li A, Yang Y, He L, Zhang Z, Wu S, Wang Y, Cai S, He Y, Liu X. MPNTEXT: An Interactive Platform for Automatically Extracting Metal-Polyphenol Networks and Their Applications from Scientific Literature. J Chem Inf Model 2024. [PMID: 39258795 DOI: 10.1021/acs.jcim.4c01093] [Indexed: 09/12/2024]
Abstract
In recent years, metal-polyphenol networks (MPNs) have gained significant attention due to their unique properties and broad applications across various fields. However, the burgeoning volume of MPN literature necessitates the automation of chemical information extraction from the extensive corpus of unstructured data, including scientific publications. To address this challenge, we proposed a platform named MPNTEXT, which utilized natural language processing techniques and machine learning algorithms to efficiently identify and extract pertinent information, thereby assisting users in comprehending complex MPNs and their textual descriptions of applications. Users can enter keywords, such as "Fe", "drug delivery", or "tannic acid", to retrieve relevant information, which is then presented in a structured format. This study aims to provide a user-friendly tool for collecting and retrieving MPN data and promotes data-driven material design. The platform offers researchers a more convenient and efficient way to design versatile MPNs and explore their applications.
Affiliation(s)
- Zihui Huang
  - School of Biomedical and Pharmaceutical Sciences, Guangdong University of Technology, Guangzhou 510006, China
- Xinyi Li
  - School of Biomedical and Pharmaceutical Sciences, Guangdong University of Technology, Guangzhou 510006, China
- Andi Li
  - School of Biomedical and Pharmaceutical Sciences, Guangdong University of Technology, Guangzhou 510006, China
- Yuhang Yang
  - School of Biomedical and Pharmaceutical Sciences, Guangdong University of Technology, Guangzhou 510006, China
- Liqiang He
  - School of Biomedical and Pharmaceutical Sciences, Guangdong University of Technology, Guangzhou 510006, China
- Zhiwen Zhang
  - School of Biomedical and Pharmaceutical Sciences, Guangdong University of Technology, Guangzhou 510006, China
- Siwei Wu
  - School of Biomedical and Pharmaceutical Sciences, Guangdong University of Technology, Guangzhou 510006, China
- Yang Wang
  - School of Biomedical and Pharmaceutical Sciences, Guangdong University of Technology, Guangzhou 510006, China
- Shuting Cai
  - School of Integrated Circuits, Guangdong University of Technology, Guangzhou 510006, China
- Yan He
  - School of Biomedical and Pharmaceutical Sciences, Guangdong University of Technology, Guangzhou 510006, China
- Xujie Liu
  - School of Biomedical and Pharmaceutical Sciences, Guangdong University of Technology, Guangzhou 510006, China
11
Liu W, Chen J, Wang H, Fu Z, Peijnenburg WJGM, Hong H. Perspectives on Advancing Multimodal Learning in Environmental Science and Engineering Studies. Environmental Science & Technology 2024. [PMID: 39226136 DOI: 10.1021/acs.est.4c03088] [Indexed: 09/05/2024]
Abstract
The environment faces increasing anthropogenic impacts, resulting in a rapid increase in environmental issues that undermine the natural capital essential for human wellbeing. These issues are complex and often influenced by various factors represented by data with different modalities. While machine learning (ML) provides data-driven tools for addressing the environmental issues, the current ML models in environmental science and engineering (ES&E) often neglect the utilization of multimodal data. With the advancement in deep learning, multimodal learning (MML) holds promise for comprehensive descriptions of the environmental issues by harnessing data from diverse modalities. This advancement has the potential to significantly elevate the accuracy and robustness of prediction models in ES&E studies, providing enhanced solutions for various environmental modeling tasks. This perspective summarizes MML methodologies and proposes potential applications of MML models in ES&E studies, including environmental quality assessment, prediction of chemical hazards, and optimization of pollution control techniques. Additionally, we discuss the challenges associated with implementing MML in ES&E and propose future research directions in this domain.
Affiliation(s)
- Wenjia Liu
- Key Laboratory of Industrial Ecology and Environmental Engineering (Ministry of Education), Dalian Key Laboratory on Chemicals Risk Control and Pollution Prevention Technology, School of Environmental Science and Technology, Dalian University of Technology, Dalian 116024, China
| | - Jingwen Chen
- Key Laboratory of Industrial Ecology and Environmental Engineering (Ministry of Education), Dalian Key Laboratory on Chemicals Risk Control and Pollution Prevention Technology, School of Environmental Science and Technology, Dalian University of Technology, Dalian 116024, China
| | - Haobo Wang
- Key Laboratory of Industrial Ecology and Environmental Engineering (Ministry of Education), Dalian Key Laboratory on Chemicals Risk Control and Pollution Prevention Technology, School of Environmental Science and Technology, Dalian University of Technology, Dalian 116024, China
| | - Zhiqiang Fu
- Key Laboratory of Industrial Ecology and Environmental Engineering (Ministry of Education), Dalian Key Laboratory on Chemicals Risk Control and Pollution Prevention Technology, School of Environmental Science and Technology, Dalian University of Technology, Dalian 116024, China
| | - Willie J G M Peijnenburg
- Institute of Environmental Sciences (CML), Leiden University, Leiden 2300 RA, The Netherlands
- Centre for Safety of Substances and Products, National Institute of Public Health and the Environment (RIVM), Bilthoven 3720 BA, The Netherlands
| | - Huixiao Hong
- Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, U.S. Food and Drug Administration, Jefferson, Arkansas 72079, United States
| |
|
12
|
Ahmad F, Muhmood T. Clinical translation of nanomedicine with integrated digital medicine and machine learning interventions. Colloids Surf B Biointerfaces 2024; 241:114041. [PMID: 38897022 DOI: 10.1016/j.colsurfb.2024.114041] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/01/2024] [Revised: 06/11/2024] [Accepted: 06/13/2024] [Indexed: 06/21/2024]
Abstract
Nanomaterial-based therapeutics are transforming disease prevention, diagnosis, and treatment as nanotechnology grows more sophisticated at a breakneck pace, but very few reach the clinic because of inconsistencies in preclinical studies followed by regulatory hindrances. To tackle this, integrating nanomedicine discovery with digital medicine provides technologies as tools for measuring specific biological activity. Redundancies in nanomedicine discovery can thus be overcome by on-site data acquisition and analytics through the integration of intelligent sensors with artificial intelligence (AI) or machine learning (ML). Integrated AI/ML wearable sensors directly gather clinically relevant biochemical information from the subject's body and process the data so that physicians can make the right clinical decision(s) in a time- and cost-effective way. This review summarizes insights and recommends the infusion of actionable, big-data-computation-enabled sensors into the burgeoning field of nanomedicine in academia, research institutes, and the pharmaceutical industry, with potential for clinical translation. Furthermore, many blind spots are present in modern clinically relevant computation; one that could prevent ML-guided, low-cost new nanomedicine development from being successfully translated into the clinic is also discussed.
Affiliation(s)
- Farooq Ahmad
- State Key Laboratory of Chemistry and Utilization of Carbon Based Energy Resources, College of Chemistry, Xinjiang University, Urumqi 830017, China.
| | - Tahir Muhmood
- International Iberian Nanotechnology Laboratory (INL), Avenida Mestre José Veiga, Braga 4715-330, Portugal.
| |
|
13
|
Chen C, Li SL, Xu YY, Liu J, Graham DW, Zhu YG. Characterising global antimicrobial resistance research explains why One Health solutions are slow in development: An application of AI-based gap analysis. ENVIRONMENT INTERNATIONAL 2024; 187:108680. [PMID: 38723455 DOI: 10.1016/j.envint.2024.108680] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/31/2024] [Revised: 04/16/2024] [Accepted: 04/19/2024] [Indexed: 05/19/2024]
Abstract
The global health crisis posed by increasing antimicrobial resistance (AMR) implicitly requires solutions based on a One Health approach, yet multisectoral, multidisciplinary research on AMR is rare and huge gaps exist in the knowledge needed to guide integrated action. This is partly because a comprehensive survey of past research activity has never been performed due to the massive scale and diversity of published information. Here we compiled 254,738 articles on AMR using Artificial Intelligence (AI; i.e., Natural Language Processing, NLP) methods to create a database and information retrieval system for knowledge extraction on research performed over the last 20 years. Global maps were created that describe regional, methodological, and sectoral AMR research activities; they confirm that limited intersectoral research has been performed, even though such research is key to guiding science-informed policy solutions to AMR, especially in low-income countries (LICs). Further, we show that greater harmonisation in research methods across sectors and regions is urgently needed. For example, differences in the analytical methods used among sectors in AMR research, such as employing culture-based versus genomic methods, result in poor communication between sectors and partially explain why One Health-based solutions are not ensuing. Therefore, our analysis suggests that performing culture-based and genomic AMR analysis in tandem in all sectors is crucial for data integration and holistic One Health solutions. Finally, increased investment in capacity development in LICs should be prioritised, as they are places where the AMR burden is often greatest. Our open-access database and AI methodology can be used to further develop, disseminate, and create new tools and practices for AMR knowledge and information sharing.
Affiliation(s)
- Cai Chen
- Key Laboratory of Urban Environment and Health, Ningbo Observation and Research Station, Institute of Urban Environment, Chinese Academy of Sciences, Xiamen 361021, China; University of Chinese Academy of Sciences, Beijing 100049, China
| | - Shu-Le Li
- Key Laboratory of Urban Environment and Health, Ningbo Observation and Research Station, Institute of Urban Environment, Chinese Academy of Sciences, Xiamen 361021, China; University of Chinese Academy of Sciences, Beijing 100049, China
| | - Yao-Yang Xu
- Key Laboratory of Urban Environment and Health, Ningbo Observation and Research Station, Institute of Urban Environment, Chinese Academy of Sciences, Xiamen 361021, China; Zhejiang Key Laboratory of Urban Environmental Processes and Pollution Control, CAS Haixi Industrial Technology Innovation Center in Beilun, Ningbo 315830, China
| | - Jue Liu
- Department of Epidemiology and Biostatistics, School of Public Health, Peking University, Beijing 100191, China; Institute for Global Health and Development, Peking University, Beijing 100191, China
| | - David W Graham
- School of Engineering, Newcastle University, Newcastle, UK.
| | - Yong-Guan Zhu
- Key Laboratory of Urban Environment and Health, Ningbo Observation and Research Station, Institute of Urban Environment, Chinese Academy of Sciences, Xiamen 361021, China; Zhejiang Key Laboratory of Urban Environmental Processes and Pollution Control, CAS Haixi Industrial Technology Innovation Center in Beilun, Ningbo 315830, China; State Key Laboratory of Urban and Regional Ecology, Research Center for Eco-Environmental Sciences, Chinese Academy of Sciences, Beijing 100085, China.
| |
|
14
|
Blakey M, Pearman-Kanza S, Frey JG. Zombie cheminformatics: extraction and conversion of Wiswesser Line Notation (WLN) from chemical documents. J Cheminform 2024; 16:42. [PMID: 38622746 PMCID: PMC11017645 DOI: 10.1186/s13321-024-00831-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/08/2023] [Accepted: 03/23/2024] [Indexed: 04/17/2024] Open
Abstract
PURPOSE Wiswesser Line Notation (WLN) is an old line notation for encoding chemical compounds for storage and processing by computers. Whilst the notation itself has long since been surpassed by SMILES and InChI, distribution of WLN during its active years was extensive. In the context of modernising chemical data, we present a comprehensive WLN parser developed using the OpenBabel toolkit, capable of translating WLN strings into various formats supported by the library. Furthermore, we have devised a specialised Finite State Machine, constructed from the rules of WLN, enabling the recognition and extraction of chemical strings out of large bodies of text. Available open-access WLN data with corresponding SMILES or InChI notation is rare; however, ChEMBL, ChemSpider and PubChem all contain WLN records, which were used for conversion scoring. Our investigation revealed a notable proportion of inaccuracies within the database entries, and we have taken steps to rectify these errors whenever feasible. SCIENTIFIC CONTRIBUTION Tools for both the extraction and conversion of WLN from chemical documents have been successfully developed. Both the Deterministic Finite Automaton (DFA) and parser handle the majority of WLN rules officially endorsed in the three major WLN manuals, with the parser showing a clear jump in accuracy and chemical coverage over previous submissions. The GitHub repository can be found here: https://github.com/Mblakey/wiswesser.
Affiliation(s)
- Michael Blakey
- Department of Chemistry, University of Southampton, University Road, Southampton, Hampshire, SO17 1BJ, UK.
| | - Samantha Pearman-Kanza
- Department of Chemistry, University of Southampton, University Road, Southampton, Hampshire, SO17 1BJ, UK
| | - Jeremy G Frey
- Department of Chemistry, University of Southampton, University Road, Southampton, Hampshire, SO17 1BJ, UK
| |
|
15
|
Arora S, Chettri S, Percha V, Kumar D, Latwal M. Artifical intelligence: a virtual chemist for natural product drug discovery. J Biomol Struct Dyn 2024; 42:3826-3835. [PMID: 37232451 DOI: 10.1080/07391102.2023.2216295] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/16/2023] [Accepted: 05/12/2023] [Indexed: 05/27/2023]
Abstract
Nature is rich in medicinal substances, and its products are regarded as privileged structures for interacting with protein drug targets. The structural heterogeneity and unusual characteristics of natural products (NPs) have inspired scientists to work on natural product-inspired medicine. Artificial intelligence (AI) can gear up NP drug discovery to confront and excavate unexplored opportunities. AI-based natural product-inspired drug discovery acts as an innovative tool for molecular design and lead discovery. Various machine learning models produce readily synthesizable mimetics of natural product templates. The invention of novel natural product mimetics by computer-assisted technology provides a feasible strategy to obtain natural products with defined bioactivities. AI's hit rate underlines its importance, improving trial patterns such as dose selection, trial life span, efficacy parameters, and biomarkers. Along these lines, AI methods can be a successful, targeted tool for formulating advanced medicinal applications of natural products. 'Prediction of future of natural product based drug discovery is not magic, actually its artificial intelligence.' Communicated by Ramaswamy H. Sarma.
Affiliation(s)
- Shefali Arora
- Department of Chemistry, University of Petroleum and Energy Studies, Dehradun, Uttarakhand, India
| | - Sukanya Chettri
- Department of Chemistry, University of Petroleum and Energy Studies, Dehradun, Uttarakhand, India
| | - Versha Percha
- Department of Pharmaceutical Chemistry, Dolphin(PG) Institute of Biomedical and Natural Sciences, Dehradun, Uttarakhand, India
| | - Deepak Kumar
- Department of Pharmaceutical Chemistry, Dolphin(PG) Institute of Biomedical and Natural Sciences, Dehradun, Uttarakhand, India
| | - Mamta Latwal
- Department of Chemistry, University of Petroleum and Energy Studies, Dehradun, Uttarakhand, India
| |
|
16
|
Zhang X, Zhou Z, Ming C, Sun YY. GPT-Assisted Learning of Structure-Property Relationships by Graph Neural Networks: Application to Rare-Earth-Doped Phosphors. J Phys Chem Lett 2023; 14:11342-11349. [PMID: 38064589 DOI: 10.1021/acs.jpclett.3c02848] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/22/2023]
Abstract
Two challenges facing machine learning tasks in materials science are data set construction and descriptor design. Graph neural networks circumvent the need for empirical descriptors by encoding geometric information in graphs. Large language models have shown promise for database construction via text extraction. Here, we apply OpenAI's Generative Pre-trained Transformer 4 (GPT-4) and the Crystal Graph Convolutional Neural Network (CGCNN) to the problem of discovering rare-earth-doped phosphors for solid-state lighting. We used GPT-4 to datamine the chemical formulas and emission wavelengths of 264 Eu²⁺-doped phosphors from 274 articles. A CGCNN model was trained on the acquired data set, achieving a test R² of 0.77. Using this model, we predicted the emission wavelengths of over 40 000 inorganic materials. We also used transfer learning to fine-tune a bandgap-predicting CGCNN model for emission wavelength prediction. The workflow requires minimal human supervision and is generalizable to other fields.
Affiliation(s)
- Xiang Zhang
- State Key Laboratory of High Performance Ceramics and Superfine Microstructure, Shanghai Institute of Ceramics, Chinese Academy of Sciences, Shanghai 201899, People's Republic of China
| | - Zichun Zhou
- State Key Laboratory of High Performance Ceramics and Superfine Microstructure, Shanghai Institute of Ceramics, Chinese Academy of Sciences, Shanghai 201899, People's Republic of China
| | - Chen Ming
- State Key Laboratory of High Performance Ceramics and Superfine Microstructure, Shanghai Institute of Ceramics, Chinese Academy of Sciences, Shanghai 201899, People's Republic of China
| | - Yi-Yang Sun
- State Key Laboratory of High Performance Ceramics and Superfine Microstructure, Shanghai Institute of Ceramics, Chinese Academy of Sciences, Shanghai 201899, People's Republic of China
| |
|
17
|
Zhang K, Zhou X, Li S, Zhao L, Hu W, Cai A, Zeng Y, Wang Q, Wu M, Li G, Liu J, Ji H, Qin Y, Wu L. A General Strategy for Developing Ultrasensitive "Transistor-Like" Thermochromic Fluorescent Materials for Multilevel Information Encryption. ADVANCED MATERIALS (DEERFIELD BEACH, FLA.) 2023; 35:e2305472. [PMID: 37437082 DOI: 10.1002/adma.202305472] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/07/2023] [Accepted: 07/10/2023] [Indexed: 07/14/2023]
Abstract
Thermochromic fluorescent materials (TFMs) exhibit great potential in information encryption applications but are limited by low thermosensitivity, poor color tunability, and a wide temperature-responsive range. Herein, a novel strategy for constructing highly sensitive TFMs with tunable emission (450-650 nm) toward multilevel information encryption is proposed, which employs polarity-sensitive fluorophores with donor-acceptor-donor (D-A-D) type structures as emitters and long-chain alkanes as thermosensitive loading matrixes. The structure-function relationships between the performance of TFMs and the structures of both fluorescent emitters and phase-change molecules are systematically studied. Benefiting from the above design, the obtained TFMs exhibit over 9500-fold fluorescence enhancement in response to the temperature change, as well as ultrahigh relative temperature sensitivity of up to 80% K⁻¹, both of which are confirmed for the first time. Thanks to the superior transducing performance, the above-prepared TFMs can be further developed as information-storage platforms within a relatively narrow interval of temperature variation, including temperature-dominated multicolored information display and multilevel information encryption. This work will not only provide a novel perspective for designing superior TFMs for information encryption but also bring inspiration to the design and preparation of other response-switching-type fluorescent probes with ultrahigh conversion efficiency.
Affiliation(s)
- Ke Zhang
- Nantong Key Laboratory of Public Health and Medical Analysis, School of Public Health, Nantong University, Nantong, Jiangsu, 226019, China
| | - Xiaobo Zhou
- Nantong Key Laboratory of Public Health and Medical Analysis, School of Public Health, Nantong University, Nantong, Jiangsu, 226019, China
| | - Shijie Li
- Nantong Key Laboratory of Public Health and Medical Analysis, School of Public Health, Nantong University, Nantong, Jiangsu, 226019, China
| | - Lingfeng Zhao
- Nantong Key Laboratory of Public Health and Medical Analysis, School of Public Health, Nantong University, Nantong, Jiangsu, 226019, China
| | - Wenqi Hu
- Nantong Key Laboratory of Public Health and Medical Analysis, School of Public Health, Nantong University, Nantong, Jiangsu, 226019, China
| | - Aiting Cai
- Nantong Key Laboratory of Public Health and Medical Analysis, School of Public Health, Nantong University, Nantong, Jiangsu, 226019, China
| | - Yuhan Zeng
- Nantong Key Laboratory of Public Health and Medical Analysis, School of Public Health, Nantong University, Nantong, Jiangsu, 226019, China
| | - Qi Wang
- Nantong Key Laboratory of Public Health and Medical Analysis, School of Public Health, Nantong University, Nantong, Jiangsu, 226019, China
| | - Mingmin Wu
- Nantong Key Laboratory of Public Health and Medical Analysis, School of Public Health, Nantong University, Nantong, Jiangsu, 226019, China
| | - Guo Li
- Nantong Key Laboratory of Public Health and Medical Analysis, School of Public Health, Nantong University, Nantong, Jiangsu, 226019, China
| | - Jinxia Liu
- Nantong Key Laboratory of Public Health and Medical Analysis, School of Public Health, Nantong University, Nantong, Jiangsu, 226019, China
| | - Haiwei Ji
- Nantong Key Laboratory of Public Health and Medical Analysis, School of Public Health, Nantong University, Nantong, Jiangsu, 226019, China
| | - Yuling Qin
- Nantong Key Laboratory of Public Health and Medical Analysis, School of Public Health, Nantong University, Nantong, Jiangsu, 226019, China
| | - Li Wu
- Nantong Key Laboratory of Public Health and Medical Analysis, School of Public Health, Nantong University, Nantong, Jiangsu, 226019, China
| |
|
18
|
Li S, Zhang Y, Fang Z, Meng K, Tian R, He H, Sun S. Extracting the Synthetic Route of Pd-Based Catalysts in Methanol Steam Reforming from the Scientific Literature. J Chem Inf Model 2023; 63:6249-6260. [PMID: 37807535 DOI: 10.1021/acs.jcim.3c01442] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/10/2023]
Abstract
The structured material synthesis route is crucial for chemists in performing experiments and modern applications such as machine learning material design. With the exponential growth of the chemical literature in recent years, manual extraction from the published literature is time-consuming and labor-intensive. This study focuses on developing an automated method for extracting Pd-based catalyst synthesis routes from the chemical literature. First, a paragraph classification model based on regular expressions is employed to identify paragraphs that contain material synthesis processes. The identified paragraphs are verified using machine learning techniques. Second, natural language processing techniques are applied to automatically parse the material synthesis routes from the identified paragraphs, generate regularized flowcharts, and output structured data. Lastly, we utilized the structured data of the synthesis routes to train machine learning models and predict the performance of the materials. The extracted material entities include the product, preparation method, precursor, support, loading, synthesis operation, and operation condition. This method avoids extensive manual data annotation and improves the scientific literature information acquisition efficiency. The accuracy of the 11 material entities exceeds 80%, and the accuracy of the method, support, precursor, drying time, and reduction time exceeds 90%.
Affiliation(s)
- Shuyuan Li
- Beijing Key Laboratory for Green Catalysis and Separation, Faculty of Environment and Life, Beijing University of Technology, Beijing 100124, China
| | - Yunjiang Zhang
- Beijing Key Laboratory for Green Catalysis and Separation, Faculty of Environment and Life, Beijing University of Technology, Beijing 100124, China
| | - Zhaolin Fang
- Beijing Key Laboratory for Green Catalysis and Separation, Faculty of Environment and Life, Beijing University of Technology, Beijing 100124, China
| | - Kong Meng
- Beijing Key Laboratory for Green Catalysis and Separation, Faculty of Environment and Life, Beijing University of Technology, Beijing 100124, China
| | - Rui Tian
- Beijing Engineering Research Center for IoT Software and Systems, Faculty of Information Technology, Beijing University of Technology, Beijing 100124, China
| | - Hong He
- Beijing Key Laboratory for Green Catalysis and Separation, Faculty of Environment and Life, Beijing University of Technology, Beijing 100124, China
| | - Shaorui Sun
- Beijing Key Laboratory for Green Catalysis and Separation, Faculty of Environment and Life, Beijing University of Technology, Beijing 100124, China
| |
|
19
|
Kiseleva OI, Kurbatov IY, Arzumanian VA, Ilgisonis EV, Zakharov SV, Poverennaya EV. The Expectation and Reality of the HepG2 Core Metabolic Profile. Metabolites 2023; 13:908. [PMID: 37623852 PMCID: PMC10456947 DOI: 10.3390/metabo13080908] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/25/2023] [Revised: 07/29/2023] [Accepted: 08/01/2023] [Indexed: 08/26/2023] Open
Abstract
To represent the composition of small molecules circulating in HepG2 cells and the formation of the "core" of characteristic metabolites that often attract researchers' attention, we conducted a meta-analysis of 56 datasets obtained through metabolomic profiling via mass spectrometry and NMR. We highlighted the 288 most commonly studied compounds of diverse chemical nature and analyzed metabolic processes involving these small molecules. Building a complete map of the metabolome of a cell, which encompasses the diversity of possible impacts on it, is a severe challenge for the scientific community, which is faced not only with natural limitations of experimental technologies, but also with the absence of transparent and widely accepted standards for processing and presenting the obtained metabolomic data. In formulating our research design, we aimed to reveal metabolites crucial to the HepG2 cell line, regardless of all chemical and/or physical impact factors. Unfortunately, the existing paradigm of data policy leads to a streetlight effect. When analyzing and reporting only target metabolites of interest, the community ignores the changes in the metabolomic landscape that hide many molecular secrets.
Affiliation(s)
- Olga I. Kiseleva
- Institute of Biomedical Chemistry, Pogodinskaya Street, 10, 119121 Moscow, Russia
| | - Ilya Y. Kurbatov
- Institute of Biomedical Chemistry, Pogodinskaya Street, 10, 119121 Moscow, Russia
| | - Viktoriia A. Arzumanian
- Institute of Biomedical Chemistry, Pogodinskaya Street, 10, 119121 Moscow, Russia
| | - Ekaterina V. Ilgisonis
- Institute of Biomedical Chemistry, Pogodinskaya Street, 10, 119121 Moscow, Russia
| | - Svyatoslav V. Zakharov
- Chemistry Department, Lomonosov Moscow State University, Leninskie gory Street, 1/3, 119991 Moscow, Russia
| | - Ekaterina V. Poverennaya
- Institute of Biomedical Chemistry, Pogodinskaya Street, 10, 119121 Moscow, Russia
| |
|
20
|
Xie W, Fan K, Zhang S, Li L. Multiple sampling schemes and deep learning improve active learning performance in drug-drug interaction information retrieval analysis from the literature. J Biomed Semantics 2023; 14:5. [PMID: 37248476 PMCID: PMC10228061 DOI: 10.1186/s13326-023-00287-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/09/2022] [Accepted: 04/29/2023] [Indexed: 05/31/2023] Open
Abstract
BACKGROUND Drug-drug interaction (DDI) information retrieval (IR) from the PubMed literature is an important natural language processing (NLP) task. For the first time, active learning (AL) is studied in DDI IR analysis. DDI IR analysis from PubMed abstracts faces the challenge of relatively few positive DDI samples among an overwhelmingly large number of negative samples. Random negative sampling and positive sampling are purposely designed to improve the efficiency of AL analysis. The consistency of random negative sampling and positive sampling is shown in the paper. RESULTS PubMed abstracts are divided into two pools. The screened pool contains all abstracts that pass the DDI keywords query in PubMed, while the unscreened pool includes all the other abstracts. At a prespecified recall rate of 0.95, DDI IR analysis precision is evaluated and compared. In the screened pool IR analysis using a support vector machine (SVM), similarity sampling plus uncertainty sampling improves the precision over uncertainty sampling alone, from 0.89 to 0.92. In the unscreened pool IR analysis, the integrated random negative sampling, positive sampling, and similarity sampling improve the precision over uncertainty sampling alone, from 0.72 to 0.81. When we change the SVM to a deep learning method, all sampling schemes consistently improve DDI AL analysis in both the screened pool and the unscreened pool. Deep learning significantly improves precision over SVM: 0.96 vs. 0.92 in the screened pool, and 0.90 vs. 0.81 in the unscreened pool. CONCLUSIONS By integrating various sampling schemes and deep learning algorithms into AL, DDI IR analysis from the literature is significantly improved. Random negative sampling and positive sampling are highly effective methods for improving AL analysis where the positive and negative samples are extremely imbalanced.
Affiliation(s)
- Weixin Xie
- Department of Biomedical Informatics, Ohio State University, Columbus, OH 43210 USA
| | - Kunjie Fan
- Department of Biomedical Informatics, Ohio State University, Columbus, OH 43210 USA
| | - Shijun Zhang
- Department of Biomedical Informatics, Ohio State University, Columbus, OH 43210 USA
| | - Lang Li
- Department of Biomedical Informatics, Ohio State University, Columbus, OH 43210 USA
| |
|
21
|
Wang L, Gao Y, Chen X, Cui W, Zhou Y, Luo X, Xu S, Du Y, Wang B. A corpus of CO2 electrocatalytic reduction process extracted from the scientific literature. Sci Data 2023; 10:175. [PMID: 36991006 PMCID: PMC10060421 DOI: 10.1038/s41597-023-02089-z] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 02/05/2023] [Accepted: 03/19/2023] [Indexed: 03/31/2023] Open
Abstract
The electrocatalytic CO2 reduction process has gained enormous attention for both environmental protection and chemical production. In this context, the design of new electrocatalysts with high activity and selectivity can draw inspiration from the abundant scientific literature. An annotated and verified corpus built from this massive literature can assist the development of natural language processing (NLP) models, which can offer insight to help guide the understanding of the underlying mechanisms. To facilitate data mining in this direction, we present in this article a benchmark corpus of 6,086 records manually extracted from 835 electrocatalytic publications, along with an extended corpus of 145,179 records. In this corpus, nine types of knowledge (material, regulation method, product, faradaic efficiency, cell setup, electrolyte, synthesis method, current density, and voltage) are provided by either annotation or extraction. Machine learning algorithms can be applied to the corpus to help scientists find new and effective electrocatalysts. Furthermore, researchers familiar with NLP can use this corpus to design domain-specific named entity recognition (NER) models.
Affiliation(s)
- Ludi Wang
- Laboratory of Big Data Knowledge, Computer Network Information Center, Chinese Academy of Sciences, Beijing, 100083, China
| | - Yang Gao
- CAS Key Laboratory of Nanosystem and Hierarchical Fabrication, National Center for Nanoscience and Technology (NCNST), Beijing, 100190, China
| | - Xueqing Chen
- Laboratory of Big Data Knowledge, Computer Network Information Center, Chinese Academy of Sciences, Beijing, 100083, China
- University of Chinese Academy of Sciences, Beijing, 100049, China
| | - Wenjuan Cui
- Laboratory of Big Data Knowledge, Computer Network Information Center, Chinese Academy of Sciences, Beijing, 100083, China
- University of Chinese Academy of Sciences, Beijing, 100049, China
| | - Yuanchun Zhou
- Laboratory of Big Data Knowledge, Computer Network Information Center, Chinese Academy of Sciences, Beijing, 100083, China
- University of Chinese Academy of Sciences, Beijing, 100049, China
| | - Xinying Luo
- CAS Key Laboratory of Nanosystem and Hierarchical Fabrication, National Center for Nanoscience and Technology (NCNST), Beijing, 100190, China
| | - Shuaishuai Xu
- CAS Key Laboratory of Nanosystem and Hierarchical Fabrication, National Center for Nanoscience and Technology (NCNST), Beijing, 100190, China
| | - Yi Du
- Laboratory of Big Data Knowledge, Computer Network Information Center, Chinese Academy of Sciences, Beijing, 100083, China.
- University of Chinese Academy of Sciences, Beijing, 100049, China.
| | - Bin Wang
- CAS Key Laboratory of Nanosystem and Hierarchical Fabrication, National Center for Nanoscience and Technology (NCNST), Beijing, 100190, China.
| |
|
22
|
He K, Mao R, Gong T, Cambria E, Li C. JCBIE: a joint continual learning neural network for biomedical information extraction. BMC Bioinformatics 2022; 23:549. [PMID: 36536280 PMCID: PMC9761970 DOI: 10.1186/s12859-022-05096-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/11/2022] [Accepted: 12/05/2022] [Indexed: 12/23/2022] Open
Abstract
Extracting knowledge from heterogeneous data sources is fundamental for the construction of structured biomedical knowledge graphs (BKGs), where entities and relations are represented as nodes and edges in the graphs, respectively. Previous biomedical knowledge extraction methods considered only limited entity types and relations by using a task-specific training set, which is insufficient for large-scale BKG development and downstream task applications in different scenarios. To alleviate this issue, we propose a joint continual learning biomedical information extraction (JCBIE) network to extract entities and relations from different biomedical information datasets. By empirically studying different joint learning and continual learning strategies, the proposed JCBIE can learn and expand different types of entities and relations from different datasets. JCBIE uses two separate encoders in joint-feature extraction and can therefore effectively avoid the feature confusion problem, compared with using one hard-parameter-sharing encoder. Specifically, it allows us to adopt entity-augmented inputs to establish the interaction between named entity recognition and relation extraction. Finally, a novel evaluation mechanism is proposed for measuring cross-corpus generalization errors, which were ignored by traditional evaluation methods. Our empirical studies show that JCBIE achieves promising performance when a continual learning strategy is adopted with multiple corpora.
Affiliation(s)
- Kai He
- School of Computer Science and Technology, Xi’an Jiaotong University, Xi’an, Shaanxi, China
- Shaanxi Provincial Key Laboratory of Big Data Knowledge Engineering, Xi’an Jiaotong University, Xi’an, Shaanxi, China
- National Engineering Lab for Big Data Analytics, Xi’an Jiaotong University, Xi’an, Shaanxi, China
| | - Rui Mao
- School of Computer Science and Engineering, Nanyang Technological University, Singapore, Singapore
| | - Tieliang Gong
- School of Computer Science and Technology, Xi’an Jiaotong University, Xi’an, Shaanxi, China
- Shaanxi Provincial Key Laboratory of Big Data Knowledge Engineering, Xi’an Jiaotong University, Xi’an, Shaanxi, China
- National Engineering Lab for Big Data Analytics, Xi’an Jiaotong University, Xi’an, Shaanxi, China
| | - Erik Cambria
- School of Computer Science and Engineering, Nanyang Technological University, Singapore, Singapore
| | - Chen Li
- School of Computer Science and Technology, Xi’an Jiaotong University, Xi’an, Shaanxi, China
- Shaanxi Provincial Key Laboratory of Big Data Knowledge Engineering, Xi’an Jiaotong University, Xi’an, Shaanxi, China
- National Engineering Lab for Big Data Analytics, Xi’an Jiaotong University, Xi’an, Shaanxi, China
|
23
|
Duan Y, Rosaleny LE, Coutinho JT, Giménez-Santamarina S, Scheie A, Baldoví JJ, Cardona-Serra S, Gaita-Ariño A. Data-driven design of molecular nanomagnets. Nat Commun 2022; 13:7626. [PMID: 36494346 PMCID: PMC9734471 DOI: 10.1038/s41467-022-35336-9] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/22/2022] [Accepted: 11/29/2022] [Indexed: 12/13/2022] Open
Abstract
Three decades of research in molecular nanomagnets have raised their magnetic memories from liquid helium to liquid nitrogen temperature thanks to a wise choice of the magnetic ion and coordination environment. Still, serendipity and chemical intuition played a main role. In order to establish a powerful framework for statistically driven chemical design, here we collected chemical and physical data for lanthanide-based nanomagnets, catalogued over 1400 published experiments, developed an interactive dashboard (SIMDAVIS) to visualise the dataset, and applied inferential statistical analysis. Our analysis shows that the Arrhenius energy barrier correlates unexpectedly well with the magnetic memory. Furthermore, as both Orbach and Raman processes can be affected by vibronic coupling, chemical design of the coordination scheme may be used to reduce the relaxation rates. Indeed, only bis-phthalocyaninato sandwiches and metallocenes, with rigid ligands, consistently present magnetic memory up to high temperature. Analysing magnetostructural correlations, we offer promising strategies for improvement, in particular for the preparation of pentagonal bipyramids, where even softer complexes are protected against molecular vibrations.
Affiliation(s)
- Yan Duan
- Instituto de Ciencia Molecular (ICMol), Universitat de València, C/Catedrático José Beltrán 2, 46980, Paterna, Spain
- Spin-X Institute, South China University of Technology, 510641, Guangzhou, People's Republic of China
| | - Lorena E Rosaleny
- Instituto de Ciencia Molecular (ICMol), Universitat de València, C/Catedrático José Beltrán 2, 46980, Paterna, Spain.
| | - Joana T Coutinho
- Instituto de Ciencia Molecular (ICMol), Universitat de València, C/Catedrático José Beltrán 2, 46980, Paterna, Spain.
- Centre for Rapid and Sustainable Product Development, Polytechnic of Leiria, 2430-028, Marinha Grande, Portugal.
| | - Silvia Giménez-Santamarina
- Instituto de Ciencia Molecular (ICMol), Universitat de València, C/Catedrático José Beltrán 2, 46980, Paterna, Spain
| | - Allen Scheie
- Neutron Scattering Division, Oak Ridge National Laboratory, Oak Ridge, TN, 37831, USA
| | - José J Baldoví
- Instituto de Ciencia Molecular (ICMol), Universitat de València, C/Catedrático José Beltrán 2, 46980, Paterna, Spain
| | - Salvador Cardona-Serra
- Instituto de Ciencia Molecular (ICMol), Universitat de València, C/Catedrático José Beltrán 2, 46980, Paterna, Spain
| | - Alejandro Gaita-Ariño
- Instituto de Ciencia Molecular (ICMol), Universitat de València, C/Catedrático José Beltrán 2, 46980, Paterna, Spain.
|
24
|
Dinh TT, Vo-Chanh TP, Nguyen C, Huynh VQ, Vo N, Nguyen HD. Extract antibody and antigen names from biomedical literature. BMC Bioinformatics 2022; 23:524. [PMID: 36474140 PMCID: PMC9727932 DOI: 10.1186/s12859-022-04993-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/25/2022] [Accepted: 10/18/2022] [Indexed: 12/12/2022] Open
Abstract
BACKGROUND The roles of antibody and antigen are indispensable in targeted diagnosis, therapy, and biomedical discovery. On top of that, massive numbers of new scientific articles about antibodies and/or antigens are published each year, which is a precious knowledge resource but has yet to be exploited to its full potential. We, therefore, aim to develop a biomedical natural language processing tool that can automatically identify antibody and antigen entities from articles. RESULTS We first annotated an antibody-antigen corpus including 3210 relevant PubMed abstracts using a semi-automatic approach. The Inter-Annotator Agreement score of 3 annotators ranges from 91.46 to 94.31%, indicating that the annotations are consistent and the corpus is reliable. We then used the corpus to develop and optimize BiLSTM-CRF-based and BioBERT-based models. The models achieved overall F1 scores of 62.49% and 81.44%, respectively, which showed potential for newly studied entities. The two models served as the foundation for development of a named entity recognition (NER) tool that automatically recognizes antibody and antigen names from biomedical literature. CONCLUSIONS Our antibody-antigen NER models enable users to automatically extract antibody and antigen names from scientific articles without manually scanning through vast amounts of data and information in the literature. The output of NER can be used to automatically populate antibody-antigen databases, support antibody validation, and facilitate researchers with the most appropriate antibodies of interest. The packaged NER model is available at https://github.com/TrangDinh44/ABAG_BioBERT.git.
Affiliation(s)
- Thuy Trang Dinh
- Center for Bioscience and Biotechnology, University of Science, Ho Chi Minh City, Vietnam
- Vietnam National University, Ho Chi Minh City, Vietnam
| | - Trang Phuong Vo-Chanh
- Center for Bioscience and Biotechnology, University of Science, Ho Chi Minh City, Vietnam
- Vietnam National University, Ho Chi Minh City, Vietnam
| | - Chau Nguyen
- Center for Bioscience and Biotechnology, University of Science, Ho Chi Minh City, Vietnam
- Vietnam National University, Ho Chi Minh City, Vietnam
| | - Viet Quoc Huynh
- Center for Bioscience and Biotechnology, University of Science, Ho Chi Minh City, Vietnam
- Vietnam National University, Ho Chi Minh City, Vietnam
| | - Nam Vo
- Center for Bioscience and Biotechnology, University of Science, Ho Chi Minh City, Vietnam
- Vietnam National University, Ho Chi Minh City, Vietnam
- Laboratory of Molecular Biotechnology, University of Science, Ho Chi Minh City, Vietnam
| | - Hoang Duc Nguyen
- Center for Bioscience and Biotechnology, University of Science, Ho Chi Minh City, Vietnam
- Vietnam National University, Ho Chi Minh City, Vietnam
|
25
|
Mai H, Le TC, Chen D, Winkler DA, Caruso RA. Machine Learning in the Development of Adsorbents for Clean Energy Application and Greenhouse Gas Capture. ADVANCED SCIENCE (WEINHEIM, BADEN-WURTTEMBERG, GERMANY) 2022; 9:e2203899. [PMID: 36285802 PMCID: PMC9798988 DOI: 10.1002/advs.202203899] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 07/07/2022] [Revised: 09/27/2022] [Indexed: 06/04/2023]
Abstract
Addressing climate change challenges by reducing greenhouse gas levels requires innovative adsorbent materials for clean energy applications. Recent progress in machine learning has stimulated technological breakthroughs in the discovery, design, and deployment of materials with potential for high-performance and low-cost clean energy applications. This review summarizes basic machine learning methods (data collection, featurization, model generation, and model evaluation) and reviews their use in the development of robust adsorbent materials. Key case studies are provided where these methods are used to accelerate adsorbent materials design and discovery, optimize synthesis conditions, and understand complex feature-property relationships. The review provides a concise resource for researchers wishing to use machine learning methods to rapidly develop effective adsorbent materials with a positive impact on the environment.
Affiliation(s)
- Haoxin Mai
- Applied Chemistry and Environmental Science, School of Science, STEM College, RMIT University, Melbourne, Victoria 3001, Australia
| | - Tu C. Le
- School of Engineering, STEM College, RMIT University, GPO Box 2476, Melbourne, Victoria 3001, Australia
| | - Dehong Chen
- Applied Chemistry and Environmental Science, School of Science, STEM College, RMIT University, Melbourne, Victoria 3001, Australia
| | - David A. Winkler
- Monash Institute of Pharmaceutical Sciences, Monash University, Parkville, VIC 3052, Australia
- School of Biochemistry and Chemistry, La Trobe University, Kingsbury Drive, Bundoora 3042, Australia
- School of Pharmacy, University of Nottingham, Nottingham NG7 2RD, UK
| | - Rachel A. Caruso
- Applied Chemistry and Environmental Science, School of Science, STEM College, RMIT University, Melbourne, Victoria 3001, Australia
|
26
|
Islamaj R, Leaman R, Cissel D, Coss C, Denicola J, Fisher C, Guzman R, Kochar PG, Miliaras N, Punske Z, Sekiya K, Trinh D, Whitman D, Schmidt S, Lu Z. NLM-Chem-BC7: manually annotated full-text resources for chemical entity annotation and indexing in biomedical articles. Database (Oxford) 2022; 2022:baac102. [PMID: 36458799 PMCID: PMC9716560 DOI: 10.1093/database/baac102] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/29/2022] [Revised: 10/17/2022] [Accepted: 11/28/2022] [Indexed: 12/03/2022]
Abstract
The automatic recognition of chemical names and their corresponding database identifiers in biomedical text is an important first step for many downstream text-mining applications. The task is even more challenging when considering the identification of these entities in the article's full text and, furthermore, the identification of candidate substances for that article's metadata [Medical Subject Heading (MeSH) article indexing]. The National Library of Medicine (NLM)-Chem track at BioCreative VII aimed to foster the development of algorithms that can predict with high quality the chemical entities in the biomedical literature and further identify the chemical substances that are candidates for article indexing. As a result of this challenge, the NLM-Chem track produced two comprehensive, manually curated corpora annotated with chemical entities and indexed with chemical substances: the chemical identification corpus and the chemical indexing corpus. The NLM-Chem BioCreative VII (NLM-Chem-BC7) Chemical Identification corpus consists of 204 full-text PubMed Central (PMC) articles, fully annotated for chemical entities by 12 NLM indexers for both span (i.e. named entity recognition) and normalization (i.e. entity linking) using MeSH. This resource was used for the training and testing of the Chemical Identification task to evaluate the accuracy of algorithms in predicting chemicals mentioned in recently published full-text articles. The NLM-Chem-BC7 Chemical Indexing corpus consists of 1333 recently published PMC articles, equipped with chemical substance indexing by manual experts at the NLM. This resource was used for the evaluation of the Chemical Indexing task, which evaluated the accuracy of algorithms in predicting the chemicals that should be indexed, i.e. appear in the listing of MeSH terms for the document. 
This set was further enriched after the challenge in two ways: (i) 11 NLM indexers manually verified each of the candidate terms appearing in the prediction results of the challenge participants, but not in the MeSH indexing, and the chemical indexing terms appearing in the MeSH indexing list, but not in the prediction results, and (ii) the challenge organizers algorithmically merged the chemical entity annotations in the full text for all predicted chemical entities and used a statistical approach to keep those with the highest degree of confidence. As a result, the NLM-Chem-BC7 Chemical Indexing corpus is a gold-standard corpus for chemical indexing of journal articles and a silver-standard corpus for chemical entity identification in full-text journal articles. Together, these resources are currently the most comprehensive resources for chemical entity recognition, and we demonstrate improvements in the chemical entity recognition algorithms. We detail the characteristics of these novel resources and make them available for the community. Database URL: https://ftp.ncbi.nlm.nih.gov/pub/lu/NLM-Chem-BC7-corpus/.
Affiliation(s)
- Rezarta Islamaj
- National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD 20894, USA
| | - Robert Leaman
- National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD 20894, USA
| | - David Cissel
- National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD 20894, USA
| | - Cathleen Coss
- National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD 20894, USA
| | - Joseph Denicola
- National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD 20894, USA
| | - Carol Fisher
- National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD 20894, USA
| | - Rob Guzman
- National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD 20894, USA
| | - Preeti Gokal Kochar
- National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD 20894, USA
| | - Nicholas Miliaras
- National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD 20894, USA
| | - Zoe Punske
- National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD 20894, USA
| | - Keiko Sekiya
- National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD 20894, USA
| | - Dorothy Trinh
- National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD 20894, USA
| | - Deborah Whitman
- National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD 20894, USA
| | - Susan Schmidt
- National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD 20894, USA
| | - Zhiyong Lu
- National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD 20894, USA
|
27
|
Zhang X, Mao R, Cambria E. A survey on syntactic processing techniques. Artif Intell Rev 2022. [DOI: 10.1007/s10462-022-10300-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022]
|
28
|
Abstract
Artificial intelligence (AI) methods have been and are now being increasingly integrated in prediction software implemented in bioinformatics and its glycoscience branch known as glycoinformatics. AI techniques have evolved in the past decades, and their applications in glycoscience are not yet widespread. This limited use is partly explained by the peculiarities of glyco-data that are notoriously hard to produce and analyze. Nonetheless, as time goes on, the accumulation of glycomics, glycoproteomics, and glycan-binding data has reached a point where even the most recent deep learning methods can provide predictors with good performance. We discuss the historical development of the application of various AI methods in the broader field of glycoinformatics. A particular focus is placed on shining a light on challenges in glyco-data handling, contextualized by lessons learnt from related disciplines. Ending on the discussion of state-of-the-art deep learning approaches in glycoinformatics, we also envision the future of glycoinformatics, including developments that need to occur in order to truly unleash the capabilities of glycoscience in the systems biology era.
Affiliation(s)
- Daniel Bojar
- Department of Chemistry and Molecular Biology, University of Gothenburg, Gothenburg 41390, Sweden
- Wallenberg Centre for Molecular and Translational Medicine, University of Gothenburg, Gothenburg 41390, Sweden
| | - Frederique Lisacek
- Proteome Informatics Group, Swiss Institute of Bioinformatics, CH-1227 Geneva, Switzerland
- Computer Science Department & Section of Biology, University of Geneva, route de Drize 7, CH-1227, Geneva, Switzerland
|
29
|
Mroz A, Posligua V, Tarzia A, Wolpert EH, Jelfs KE. Into the Unknown: How Computation Can Help Explore Uncharted Material Space. J Am Chem Soc 2022; 144:18730-18743. [PMID: 36206484 PMCID: PMC9585593 DOI: 10.1021/jacs.2c06833] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/29/2022] [Indexed: 11/28/2022]
Abstract
Novel functional materials are urgently needed to help combat the major global challenges facing humanity, such as climate change and resource scarcity. Yet, the traditional experimental materials discovery process is slow and the material space at our disposal is too vast to effectively explore using intuition-guided experimentation alone. Most experimental materials discovery programs necessarily focus on exploring the local space of known materials, so we are not fully exploiting the enormous potential material space, where more novel materials with unique properties may exist. Computation, facilitated by improvements in open-source software and databases, as well as computer hardware has the potential to significantly accelerate the rational development of materials, but all too often is only used to postrationalize experimental observations. Thus, the true predictive power of computation, where theory leads experimentation, is not fully utilized. Here, we discuss the challenges to successful implementation of computation-driven materials discovery workflows, and then focus on the progress of the field, with a particular emphasis on the challenges to reaching novel materials.
Affiliation(s)
- Austin M. Mroz
- Department of Chemistry, Molecular Sciences Research Hub, Imperial College London, White City Campus, Wood Lane, London, W12 0BZ, U.K.
| | - Victor Posligua
- Department of Chemistry, Molecular Sciences Research Hub, Imperial College London, White City Campus, Wood Lane, London, W12 0BZ, U.K.
| | - Andrew Tarzia
- Department of Chemistry, Molecular Sciences Research Hub, Imperial College London, White City Campus, Wood Lane, London, W12 0BZ, U.K.
| | - Emma H. Wolpert
- Department of Chemistry, Molecular Sciences Research Hub, Imperial College London, White City Campus, Wood Lane, London, W12 0BZ, U.K.
| | - Kim E. Jelfs
- Department of Chemistry, Molecular Sciences Research Hub, Imperial College London, White City Campus, Wood Lane, London, W12 0BZ, U.K.
|
30
|
Parastar H, Tauler R. Big (Bio)Chemical Data Mining Using Chemometric Methods: A Need for Chemists. Angew Chem Int Ed Engl 2022. [DOI: 10.1002/ange.201801134] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/09/2022]
Affiliation(s)
- Hadi Parastar
- Department of Chemistry Sharif University of Technology Tehran Iran
| | - Roma Tauler
- Department of Environmental Chemistry IDAEA-CSIC 08034 Barcelona Spain
|
31
|
Ohms J. Validity of PubChem compounds supplied by Patentscope or SureChEMBL. WORLD PATENT INFORMATION 2022. [DOI: 10.1016/j.wpi.2022.102134] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/30/2022]
|
32
|
Ghosh S, Lu K. Band gap information extraction from materials science literature – a pilot study. ASLIB J INFORM MANAG 2022. [DOI: 10.1108/ajim-03-2022-0141] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/17/2022]
Abstract
Purpose The purpose of this paper is to present a preliminary work on extracting band gap information of materials from academic papers. With increasing demand for renewable energy, band gap information will help material scientists design and implement novel photovoltaic (PV) cells. Design/methodology/approach The authors collected 1.44 million titles and abstracts of scholarly articles related to materials science, and then filtered the collection to 11,939 articles that potentially contain relevant information about materials and their band gap values. ChemDataExtractor was extended to extract information about PV materials and their band gap information. Evaluation was performed on randomly sampled information records of 415 papers. Findings The findings of this study show that the current system is able to correctly extract information for 51.32% articles, with partially correct extraction for 36.62% articles and incorrect for 12.04%. The authors have also identified the errors belonging to three main categories pertaining to chemical entity identification, band gap information and interdependency resolution. Future work will focus on addressing these errors to improve the performance of the system. Originality/value The authors did not find any literature to date on band gap information extraction from academic text using automated methods. This work is unique and original. Band gap information is of importance to materials scientists in applications such as solar cells, light emitting diodes and laser diodes.
|
33
|
Tarasova OA, Rudik AV, Biziukova NY, Filimonov DA, Poroikov VV. Chemical named entity recognition in the texts of scientific publications using the naïve Bayes classifier approach. J Cheminform 2022; 14:55. [PMID: 35964150 PMCID: PMC9375066 DOI: 10.1186/s13321-022-00633-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/04/2022] [Accepted: 07/12/2022] [Indexed: 11/24/2022] Open
Abstract
Motivation Application of chemical named entity recognition (CNER) algorithms allows retrieval of information from texts about chemical compound identifiers and creates associations with physical–chemical properties and biological activities. Scientific texts represent low-formalized sources of information. Most methods aimed at CNER are based on machine learning approaches, including conditional random fields and deep neural networks. In general, most machine learning approaches require either vector or sparse word representation of texts. Chemical named entities (CNEs) constitute only a small fraction of the whole text, and the datasets used for training are highly imbalanced. Methods and results We propose a new method for extracting CNEs from texts based on the naïve Bayes classifier combined with specially developed filters. In contrast to the earlier developed CNER methods, our approach uses the representation of the data as a set of fragments of text (FoTs) with the subsequent preparation of a set of multi-n-grams (sequences from one to n symbols) for each FoT. Our approach may provide the recognition of novel CNEs. For the CHEMDNER corpus, sensitivity (recall) was 0.95, precision was 0.74, specificity was 0.88, and balanced accuracy was 0.92 based on five-fold cross validation. We applied the developed algorithm to the extracted CNEs of potential Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) main protease (Mpro) inhibitors. A set of CNEs corresponding to the chemical substances evaluated in the biochemical assays used for the discovery of Mpro inhibitors was retrieved. Manual analysis of the appropriate texts showed that CNEs of potential SARS-CoV-2 Mpro inhibitors were successfully identified by our method.
Conclusion The obtained results show that the proposed method can be used for filtering out words that are not related to CNEs; therefore, it can be successfully applied to the extraction of CNEs for the purposes of cheminformatics and medicinal chemistry. Supplementary Information The online version contains supplementary material available at 10.1186/s13321-022-00633-4.
Affiliation(s)
- O A Tarasova
- Laboratory of Structure-Function Based Drug Design, Institute of Biomedical Chemistry, 10 bldg. 8, Pogodinskaya Str., Moscow, 119121, Russia.
| | - A V Rudik
- Laboratory of Structure-Function Based Drug Design, Institute of Biomedical Chemistry, 10 bldg. 8, Pogodinskaya Str., Moscow, 119121, Russia
| | - N Yu Biziukova
- Laboratory of Structure-Function Based Drug Design, Institute of Biomedical Chemistry, 10 bldg. 8, Pogodinskaya Str., Moscow, 119121, Russia
| | - D A Filimonov
- Laboratory of Structure-Function Based Drug Design, Institute of Biomedical Chemistry, 10 bldg. 8, Pogodinskaya Str., Moscow, 119121, Russia
| | - V V Poroikov
- Laboratory of Structure-Function Based Drug Design, Institute of Biomedical Chemistry, 10 bldg. 8, Pogodinskaya Str., Moscow, 119121, Russia
|
34
|
Mai H, Le TC, Chen D, Winkler DA, Caruso RA. Machine Learning for Electrocatalyst and Photocatalyst Design and Discovery. Chem Rev 2022; 122:13478-13515. [PMID: 35862246 DOI: 10.1021/acs.chemrev.2c00061] [Citation(s) in RCA: 85] [Impact Index Per Article: 28.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/17/2022]
Abstract
Electrocatalysts and photocatalysts are key to a sustainable future, generating clean fuels, reducing the impact of global warming, and providing solutions to environmental pollution. Improved processes for catalyst design and a better understanding of electro/photocatalytic processes are essential for improving catalyst effectiveness. Recent advances in data science and artificial intelligence have great potential to accelerate electrocatalysis and photocatalysis research, particularly the rapid exploration of large materials chemistry spaces through machine learning. Here a comprehensive introduction to, and critical review of, machine learning techniques used in electrocatalysis and photocatalysis research are provided. Sources of electro/photocatalyst data and current approaches to representing these materials by mathematical features are described, the most commonly used machine learning methods summarized, and the quality and utility of electro/photocatalyst models evaluated. Illustrations of how machine learning models are applied to novel electro/photocatalyst discovery and used to elucidate electrocatalytic or photocatalytic reaction mechanisms are provided. The review offers a guide for materials scientists on the selection of machine learning methods for electrocatalysis and photocatalysis research. The application of machine learning to catalysis science represents a paradigm shift in the way advanced, next-generation catalysts will be designed and synthesized.
Affiliation(s)
- Haoxin Mai
- Applied Chemistry and Environmental Science, School of Science, STEM College, RMIT University, GPO Box 2476, Melbourne, Victoria 3001, Australia
| | - Tu C Le
- School of Engineering, STEM College, RMIT University, GPO Box 2476, Melbourne, Victoria 3001, Australia
| | - Dehong Chen
- Applied Chemistry and Environmental Science, School of Science, STEM College, RMIT University, GPO Box 2476, Melbourne, Victoria 3001, Australia
| | - David A Winkler
- Monash Institute of Pharmaceutical Sciences, Monash University, Parkville, Victoria 3052, Australia
- Biochemistry and Chemistry, La Trobe University, Kingsbury Drive, Bundoora, Victoria 3042, Australia
- School of Pharmacy, University of Nottingham, Nottingham NG7 2RD, United Kingdom
| | - Rachel A Caruso
- Applied Chemistry and Environmental Science, School of Science, STEM College, RMIT University, GPO Box 2476, Melbourne, Victoria 3001, Australia
|
35
|
Chang YC, Chiu YW, Chuang TW. Linguistic Pattern-Infused Dual-Channel Bidirectional Long Short-term Memory With Attention for Dengue Case Summary Generation From the Program for Monitoring Emerging Diseases-Mail Database: Algorithm Development Study. JMIR Public Health Surveill 2022; 8:e34583. [PMID: 35830225 PMCID: PMC9491834 DOI: 10.2196/34583] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/30/2021] [Revised: 04/15/2022] [Accepted: 05/27/2022] [Indexed: 11/13/2022] Open
Abstract
BACKGROUND: Globalization and environmental changes have intensified the emergence or re-emergence of infectious diseases worldwide, such as outbreaks of dengue fever in Southeast Asia. Collaboration on region-wide infectious disease surveillance systems is therefore critical but difficult to achieve because of the different transparency levels of health information systems in different countries. Although the Program for Monitoring Emerging Diseases (ProMED)-mail is the most comprehensive international expert-curated platform providing rich disease outbreak information on humans, animals, and plants, the unstructured text content of the reports makes analysis for further application difficult.
OBJECTIVE: To make monitoring the epidemic situation in Southeast Asia more efficient, this study aims to develop an automatic summary of the alert articles from ProMED-mail, a huge textual data source. In this paper, we proposed a text summarization method that uses natural language processing technology to automatically extract important sentences from alert articles in ProMED-mail emails to generate summaries. Using our method, we can quickly capture crucial information to help make important decisions regarding epidemic surveillance.
METHODS: Our data, which span a period from 1994 to 2019, come from the ProMED-mail website. We analyzed the collected data to establish a unique Taiwan dengue corpus that was validated with professionals' annotations to achieve almost perfect agreement (Cohen κ=90%). To generate a ProMED-mail summary, we developed a dual-channel bidirectional long short-term memory with attention mechanism with infused latent syntactic features to identify key sentences from the alerting article.
RESULTS: Our method is superior to many well-known machine learning and neural network approaches in identifying important sentences, achieving a macroaverage F1 score of 93%. Moreover, it can successfully extract the relevant correct information on dengue fever from a ProMED-mail alerting article, which can help researchers or general users to quickly understand the essence of the alerting article at first glance. In addition to verifying the model, we also recruited 3 professional experts and 2 students from related fields to participate in a satisfaction survey on the generated summaries, and the results show that 84% (63/75) of the summaries received high satisfaction ratings.
CONCLUSIONS: The proposed approach successfully fuses latent syntactic features into a deep neural network to analyze the syntactic, semantic, and contextual information in the text. It then exploits the derived information to identify crucial sentences in the ProMED-mail alerting article. The experiment results show that the proposed method is not only effective but also outperforms the compared methods. Our approach also demonstrates the potential for case summary generation from ProMED-mail alerting articles. In terms of practical application, when a new alerting article arrives, our method can quickly identify the relevant case information, which is the most critical part, to use as a reference or for further analysis.
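The macroaverage F1 score reported in this abstract is the unweighted mean of the per-class F1 scores. A minimal sketch of that calculation follows, assuming binary sentence labels (1 = key sentence, 0 = other sentence). The function name and the label lists are hypothetical illustrations, not data or code from the study.

```python
def macro_f1(y_true, y_pred):
    """Return the unweighted mean of per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        # Count true positives, false positives and false negatives for class c.
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical gold and predicted labels for six sentences.
gold = [1, 0, 0, 1, 0, 1]
pred = [1, 0, 1, 1, 0, 0]
print(round(macro_f1(gold, pred), 3))
```

Because every class contributes equally, a rare class (here, key sentences) weighs as much as a frequent one, which is presumably why a macroaverage suits sentence-selection tasks with imbalanced labels.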
Affiliation(s)
- Yung-Chun Chang
- Graduate Institute of Data Science, Taipei Medical University, Taipei, Taiwan
- Clinical Big Data Research Center, Taipei Medical University Hospital, Taipei, Taiwan
| | - Yu-Wen Chiu
- Graduate Institute of Data Science, Taipei Medical University, Taipei, Taiwan
- Department of Molecular Parasitology and Tropical Diseases, School of Medicine, College of Medicine, Taipei Medical University, Taipei, Taiwan
| | - Ting-Wu Chuang
- Department of Molecular Parasitology and Tropical Diseases, School of Medicine, College of Medicine, Taipei Medical University, Taipei, Taiwan
|
36
|
Yan R, Jiang X, Wang W, Dang D, Su Y. Materials information extraction via automatically generated corpus. Sci Data 2022; 9:401. [PMID: 35831367 PMCID: PMC9279422 DOI: 10.1038/s41597-022-01492-2] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 03/09/2022] [Accepted: 06/28/2022] [Indexed: 11/12/2022] Open
Abstract
Information Extraction (IE) in Natural Language Processing (NLP) aims to extract structured information from unstructured text to assist a computer in understanding natural language. Machine learning-based IE methods bring more intelligence and possibilities but require an extensive and accurately labeled corpus. In the materials science domain, producing reliable labels is a laborious task that requires the efforts of many professionals. To reduce manual intervention and automatically generate a materials corpus during IE, in this work we propose a semi-supervised IE framework for materials via an automatically generated corpus. Taking the superalloy data extraction in our previous work as an example, the proposed framework uses Snorkel to automatically label the corpus containing property values. An Ordered Neurons-Long Short-Term Memory (ON-LSTM) network is then adopted to train an information extraction model on the generated corpus. The experimental results show that the F1-scores for the γ′ solvus temperature, density, and solidus temperature of superalloys are 83.90%, 94.02%, and 89.27%, respectively. Furthermore, we conduct similar experiments on other materials; the results show that the proposed framework is universal in the field of materials.
Affiliation(s)
- Rongen Yan
- School of Artificial Intelligence, Beijing Normal University, Beijing, 100875, China
| | - Xue Jiang
- Beijing Advanced Innovation Center for Materials Genome Engineering, Institute for Advanced Materials and Technology, University of Science and Technology Beijing, Beijing, 100083, China.,Collaborative Innovation Center of Steel Technology, University of Science and Technology Beijing, Beijing, 100083, China
| | - Weiren Wang
- Beijing Advanced Innovation Center for Materials Genome Engineering, Institute for Advanced Materials and Technology, University of Science and Technology Beijing, Beijing, 100083, China
| | - Depeng Dang
- School of Artificial Intelligence, Beijing Normal University, Beijing, 100875, China.
| | - Yanjing Su
- Beijing Advanced Innovation Center for Materials Genome Engineering, Institute for Advanced Materials and Technology, University of Science and Technology Beijing, Beijing, 100083, China.
|
37
|
Abstract
The emerging field of material-based data science requires information-rich databases to generate useful results, and such databases are currently sparse in the stress engineering domain. To this end, this study uses the 'materials-aware' text-mining toolkit, ChemDataExtractor, to auto-generate databases of yield-strength and grain-size values by extracting such information from the literature. The precision of the extracted data is 83.0% for yield strength and 78.8% for grain size. The automatically extracted data were organised into four databases: a Yield Strength, a Grain Size, an Engineering-Ready Yield Strength, and a Combined database. For further validation of the databases, the Combined database was used to plot the Hall-Petch relationship for the alloy AZ31, and results similar to those in the literature were found, demonstrating how one can make use of these automatically extracted datasets.
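The Hall-Petch relationship mentioned in this abstract is commonly written as σ_y = σ_0 + k·d^(-1/2), where σ_y is the yield strength and d the grain size. The sketch below shows one way such a plot could be fitted by ordinary least squares. The function name and the data points are synthetic assumptions for illustration, not values from the extracted databases or for AZ31.

```python
def fit_hall_petch(grain_sizes_um, yield_strengths_mpa):
    """Least-squares fit of sigma_y = sigma_0 + k * d**-0.5.

    Returns (sigma_0, k): sigma_0 in MPa, k in MPa * um**0.5.
    """
    # Hall-Petch is linear in the inverse square root of the grain size.
    x = [d ** -0.5 for d in grain_sizes_um]
    y = list(yield_strengths_mpa)
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    k = sxy / sxx                   # slope: Hall-Petch coefficient
    sigma_0 = mean_y - k * mean_x   # intercept: friction stress
    return sigma_0, k


# Synthetic points generated from sigma_0 = 100 MPa and k = 200 MPa*um^0.5.
grain_sizes = [1.0, 4.0, 16.0, 25.0]          # micrometres
strengths = [300.0, 200.0, 150.0, 140.0]      # MPa
sigma_0, k = fit_hall_petch(grain_sizes, strengths)
print(f"sigma_0 = {sigma_0:.1f} MPa, k = {k:.1f} MPa*um^0.5")
```

With text-mined data, the fit would need records in which yield strength and grain size were reported for the same sample, which is presumably the role of the Combined database.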
|
38
|
Li M, Tian S, Meng F, Yin M, Yue Q, Wang S, Bu W, Luo L. Continuously Multiplexed Ultrastrong Raman Probes by Precise Isotopic Polymer Backbone Doping for Multidimensional Information Storage and Encryption. Nano Lett 2022; 22:4544-4551. [PMID: 35604007 DOI: 10.1021/acs.nanolett.2c01443] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 06/15/2023]
Abstract
Raman-based super multiplexing has attracted great interest in imaging, biological analysis, identity security, and information storage. It still remains a great challenge to synthesize a large number of different Raman-active molecules to fulfill the Raman color palette. Here, we report a facile and systematic strategy to construct continuously multiplexed ultrastrong Raman probes. By precisely incorporating different ratios of 13C isotope into the backbone of poly(deca-4,6-diynedioic acid) (PDDA), we can obtain a library of PDDAs with tunable double-bond Raman frequencies and adjustable intensity ratios of two triple-bond (13C≡13C and 12C≡12C) Raman peaks, while retaining the ultrastrong Raman signals and physicochemical properties of the polymer. We also demonstrate the successful application of 13C-doped PDDAs as security inks to generate a novel 3D matrix barcode system for information encryption and high-density data storage. The isotopically doped PDDA series herein pave a new way to advance Raman-based super multiplexing for diverse applications.
Affiliation(s)
- Mengyang Li
- National Engineering Research Center for Nanomedicine, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, P. R. China
| | - Sidan Tian
- National Engineering Research Center for Nanomedicine, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, P. R. China
| | - Fanling Meng
- National Engineering Research Center for Nanomedicine, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, P. R. China
- Key Laboratory of Molecular Biophysics of Minister of Education, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, P. R. China
| | - Mingming Yin
- National Engineering Research Center for Nanomedicine, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, P. R. China
| | - Qiang Yue
- National Engineering Research Center for Nanomedicine, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, P. R. China
| | - Shun Wang
- MOE Key Laboratory of Fundamental Physical Quantities Measurement & Hubei Key Laboratory of Gravitation and Quantum Physics, PGMF and School of Physics, Huazhong University of Science and Technology, Wuhan 430074, P. R. China
| | - Wenting Bu
- Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, P. R. China
| | - Liang Luo
- National Engineering Research Center for Nanomedicine, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, P. R. China
- Key Laboratory of Molecular Biophysics of Minister of Education, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, P. R. China
|
39
|
Text-mined dataset of gold nanoparticle synthesis procedures, morphologies, and size entities. Sci Data 2022; 9:234. [PMID: 35618761 PMCID: PMC9135747 DOI: 10.1038/s41597-022-01321-6] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Received: 09/30/2021] [Accepted: 04/08/2022] [Indexed: 12/13/2022] Open
Abstract
Gold nanoparticles are highly desired for a range of technological applications due to their tunable properties, which are dictated by the size and shape of the constituent particles. Many heuristic methods for controlling the morphological characteristics of gold nanoparticles are well known. However, the underlying mechanisms controlling their size and shape remain poorly understood, partly due to the immense range of possible combinations of synthesis parameters. Data-driven methods can offer insight to help guide understanding of these underlying mechanisms, so long as sufficient synthesis data are available. To facilitate data mining in this direction, we have constructed and made publicly available a dataset of codified gold nanoparticle synthesis protocols and outcomes extracted directly from the nanoparticle materials science literature using natural language processing and text-mining techniques. This dataset contains 5,154 data records, each representing a single gold nanoparticle synthesis article, filtered from a database of 4,973,165 publications. Each record contains codified synthesis protocols and extracted morphological information from a total of 7,608 experimental and 12,519 characterization paragraphs.
Measurement(s): gold nanoparticle morphology • gold nanoparticle size • gold nanoparticle synthesis data
Technology Type(s): natural language processing
|
40
|
Dataset of solution-based inorganic materials synthesis procedures extracted from the scientific literature. Sci Data 2022; 9:231. [PMID: 35614129 PMCID: PMC9132903 DOI: 10.1038/s41597-022-01317-2] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 10/07/2021] [Accepted: 04/05/2022] [Indexed: 11/10/2022] Open
Abstract
The development of a materials synthesis route is usually based on heuristics and experience. A possible new approach would be to apply data-driven approaches to learn the patterns of synthesis from past experience and use them to predict the syntheses of novel materials. However, this route is impeded by the lack of a large-scale database of synthesis formulations. In this work, we applied advanced machine learning and natural language processing techniques to construct a dataset of 35,675 solution-based synthesis procedures extracted from the scientific literature. Each procedure contains essential synthesis information including the precursors and target materials, their quantities, and the synthesis actions and corresponding attributes. Every procedure is also augmented with the reaction formula. Through this work, we are making freely available the first large dataset of solution-based inorganic materials synthesis procedures.
Measurement(s): solution-based inorganic synthesis data
Technology Type(s): natural language processing
|
41
|
Serov N, Vinogradov V. Inverse Material Search and Synthesis Verification by Hand Drawings via Transfer Learning and Contour Detection. Small Methods 2022; 6:e2101619. [PMID: 35285181 DOI: 10.1002/smtd.202101619] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 12/28/2021] [Revised: 02/12/2022] [Indexed: 06/14/2023]
Abstract
Nano- and micromaterials of various morphologies and compositions have extensive use in many different areas. However, the search for procedures giving custom nanomaterials with the desired structure, shape, and size remains a challenge and is often implemented by manual article screening. Here, for the first time, scanning and transmission electron microscopy inverse image search and hand drawing-based search via transfer learning are developed, namely, by repurposing the VGG16 convolutional neural network for image feature extraction and image similarity determination. Moreover, the use of this platform is demonstrated on the calcium carbonate system, where the data are acquired by random high-throughput experimental synthesis, and on Au nanoparticle data extracted from articles. This approach can be used for advanced nanomaterials search and synthesis procedure verification, and it can be further combined with machine learning solutions to provide data-driven nanomaterials discovery.
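Once a network has turned each image into a feature vector, an inverse image search reduces to ranking stored vectors by their similarity to the query vector. The sketch below uses cosine similarity, which is an assumption here because the abstract does not name the similarity measure. The three-element vectors and morphology labels are hypothetical stand-ins for real network features.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction; treat as dissimilar
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, library):
    """Return library keys ordered from most to least similar to the query."""
    return sorted(
        library,
        key=lambda name: cosine_similarity(query, library[name]),
        reverse=True,
    )


# Hypothetical feature vectors for three stored micrographs.
library = {
    "rods": [0.9, 0.1, 0.0],
    "spheres": [0.1, 0.9, 0.2],
    "cubes": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]  # features of a hand drawing or query micrograph
print(rank_by_similarity(query, library))
```

In practice the vectors would have thousands of dimensions, but the ranking step is the same.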
Affiliation(s)
- Nikita Serov
- International Institute "Solution Chemistry of Advanced Materials and Technologies", ITMO University, Saint Petersburg, 191002, Russian Federation
| | - Vladimir Vinogradov
- International Institute "Solution Chemistry of Advanced Materials and Technologies", ITMO University, Saint Petersburg, 191002, Russian Federation
|
42
|
Trewartha A, Walker N, Huo H, Lee S, Cruse K, Dagdelen J, Dunn A, Persson KA, Ceder G, Jain A. Quantifying the advantage of domain-specific pre-training on named entity recognition tasks in materials science. Patterns (N Y) 2022; 3:100488. [PMID: 35465225 PMCID: PMC9024010 DOI: 10.1016/j.patter.2022.100488] [Citation(s) in RCA: 20] [Impact Index Per Article: 6.7] [Received: 10/24/2021] [Revised: 01/21/2022] [Accepted: 03/15/2022] [Indexed: 11/03/2022]
Abstract
A bottleneck in efficiently connecting new materials discoveries to established literature has arisen due to an increase in publications. This problem may be addressed by using named entity recognition (NER) to extract structured summary-level data from unstructured materials science text. We compare the performance of four NER models on three materials science datasets. The four models include a bidirectional long short-term memory (BiLSTM) and three transformer models (BERT, SciBERT, and MatBERT) with increasing degrees of domain-specific materials science pre-training. MatBERT improves over the other two BERT_BASE-based models by 1% to 12%, implying that domain-specific pre-training provides measurable advantages. Despite relative architectural simplicity, the BiLSTM model consistently outperforms BERT, perhaps due to its domain-specific pre-trained word embeddings. Furthermore, MatBERT and SciBERT models outperform the original BERT model to a greater extent in the small data limit. MatBERT’s higher-quality predictions should accelerate the extraction of structured data from materials science literature.
Highlights:
- Efficient extraction of information from materials science literature is needed
- Domain-specific materials science pre-training improves results
- Even simpler domain-specific models can outperform more complex general models
A bottleneck in efficiently connecting new materials discoveries to established literature has arisen due to a massive increase in publications. Four different language models are trained to automatically collect important information from materials science articles. We compare a simple model (BiLSTM) with materials science knowledge to three variants of a more complex model: one with general knowledge (BERT), one with general scientific knowledge (SciBERT), and one with materials science knowledge (MatBERT). We find that MatBERT performs the best overall. This implies that language models with greater extents of materials science knowledge will perform better on materials science-related tasks. The simpler model even consistently outperforms BERT. Furthermore, the performance gaps grow when the models are given fewer examples of information extraction to learn from. MatBERT’s higher-quality results should accelerate the collection of information from materials science literature.
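NER models of the kind compared here typically emit one tag per token, and those tags are then grouped into entity spans to obtain structured data. The sketch below decodes the common BIO tagging scheme. The scheme, the label names (`MAT`, `PRO`), and the example sentence are assumptions for illustration and are not taken from the paper.

```python
def decode_bio(tokens, tags):
    """Group BIO-tagged tokens into (entity_text, label) pairs."""
    entities = []
    current_tokens, current_label = [], None

    def flush():
        # Store the entity collected so far, if there is one.
        if current_tokens:
            entities.append((" ".join(current_tokens), current_label))

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new entity.
            flush()
            current_tokens, current_label = [token], tag[2:]
        elif tag.startswith("I-") and current_label == tag[2:]:
            # An I- tag continues the open entity of the same label.
            current_tokens.append(token)
        else:
            # "O" tags and stray I- tags close any open entity.
            flush()
            current_tokens, current_label = [], None
    flush()
    return entities


# Hypothetical model output for one sentence fragment.
tokens = ["The", "band", "gap", "of", "TiO2", "thin", "films"]
tags = ["O", "B-PRO", "I-PRO", "O", "B-MAT", "O", "O"]
print(decode_bio(tokens, tags))
```

This sketch discards an `I-` tag that does not continue an open entity of the same label; other decoders treat such a tag as the start of a new entity.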
Affiliation(s)
- Amalie Trewartha
- Materials Sciences Division, Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, CA 94720, USA
| | - Nicholas Walker
- Energy Technologies Area, Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, CA 94720, USA
| | - Haoyan Huo
- Materials Sciences Division, Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, CA 94720, USA.,Department of Materials Science and Engineering, University of California, Berkeley, 210 Hearst Memorial Mining Building, Berkeley, CA 94720, USA
| | - Sanghoon Lee
- Energy Technologies Area, Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, CA 94720, USA.,Department of Materials Science and Engineering, University of California, Berkeley, 210 Hearst Memorial Mining Building, Berkeley, CA 94720, USA
| | - Kevin Cruse
- Materials Sciences Division, Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, CA 94720, USA.,Department of Materials Science and Engineering, University of California, Berkeley, 210 Hearst Memorial Mining Building, Berkeley, CA 94720, USA
| | - John Dagdelen
- Energy Technologies Area, Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, CA 94720, USA.,Department of Materials Science and Engineering, University of California, Berkeley, 210 Hearst Memorial Mining Building, Berkeley, CA 94720, USA
| | - Alexander Dunn
- Energy Technologies Area, Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, CA 94720, USA.,Department of Materials Science and Engineering, University of California, Berkeley, 210 Hearst Memorial Mining Building, Berkeley, CA 94720, USA
| | - Kristin A Persson
- Molecular Foundry, Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, CA 94720, USA.,Department of Materials Science and Engineering, University of California, Berkeley, 210 Hearst Memorial Mining Building, Berkeley, CA 94720, USA
| | - Gerbrand Ceder
- Materials Sciences Division, Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, CA 94720, USA.,Department of Materials Science and Engineering, University of California, Berkeley, 210 Hearst Memorial Mining Building, Berkeley, CA 94720, USA
| | - Anubhav Jain
- Energy Technologies Area, Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, CA 94720, USA
|
43
|
Zhu M, Cole JM. PDFDataExtractor: A Tool for Reading Scientific Text and Interpreting Metadata from the Typeset Literature in the Portable Document Format. J Chem Inf Model 2022; 62:1633-1643. [PMID: 35349259 PMCID: PMC9049592 DOI: 10.1021/acs.jcim.1c01198] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/28/2022]
Abstract
The layout of portable document format (PDF) files is constant to any screen, and the metadata therein are latent, compared to mark-up languages such as HTML and XML. No semantic tags are usually provided, and a PDF file is not designed to be edited or its data interpreted by software. However, data held in PDF files need to be extracted in order to comply with open-source data requirements that are now government-regulated. In the chemical domain, related chemical and property data also need to be found, and their correlations need to be exploited to enable data science in areas such as data-driven materials discovery. Such relationships may be realized using text-mining software such as the “chemistry-aware” natural-language-processing tool, ChemDataExtractor; however, this tool has limited data-extraction capabilities from PDF files. This study presents the PDFDataExtractor tool, which can act as a plug-in to ChemDataExtractor. It outperforms other PDF-extraction tools for the chemical literature by coupling its functionalities to the chemical-named entity-recognition capabilities of ChemDataExtractor. The intrinsic PDF-reading abilities of ChemDataExtractor are much improved. The system features a template-based architecture. This enables semantic information to be extracted from the PDF files of scientific articles in order to reconstruct the logical structure of articles. While other existing PDF-extracting tools focus on quantity mining, this template-based system is more focused on quality mining on different layouts. PDFDataExtractor outputs information in JSON and plain text, including the metadata of a PDF file, such as paper title, authors, affiliation, email, abstract, keywords, journal, year, document object identifier (DOI), reference, and issue number. With a self-created evaluation article set, PDFDataExtractor achieved promising precision for all key assessed metadata areas of the document text.
Affiliation(s)
- Miao Zhu
- Cavendish Laboratory, Department of Physics, University of Cambridge, J. J. Thomson Avenue, Cambridge CB3 0HE, U.K
| | - Jacqueline M Cole
- Cavendish Laboratory, Department of Physics, University of Cambridge, J. J. Thomson Avenue, Cambridge CB3 0HE, U.K.,ISIS Neutron and Muon Source, STFC Rutherford Appleton Laboratory, Harwell Science and Innovation Campus, Didcot, Oxfordshire OX11 0QX, U.K.,Department of Chemical Engineering and Biotechnology, University of Cambridge, West Cambridge Site, Philippa Fawcett Drive, Cambridge CB3 0AS, U.K
|
44
|
Nandy A, Terrones G, Arunachalam N, Duan C, Kastner DW, Kulik HJ. MOFSimplify, machine learning models with extracted stability data of three thousand metal-organic frameworks. Sci Data 2022; 9:74. [PMID: 35277533 PMCID: PMC8917177 DOI: 10.1038/s41597-022-01181-0] [Citation(s) in RCA: 30] [Impact Index Per Article: 10.0] [Received: 09/30/2021] [Accepted: 01/17/2022] [Indexed: 11/09/2022] Open
Abstract
We report a workflow and the output of a natural language processing (NLP)-based procedure to mine the extant metal–organic framework (MOF) literature describing structurally characterized MOFs and their solvent removal and thermal stabilities. We obtain over 2,000 solvent removal stability measures from text mining and 3,000 thermal decomposition temperatures from thermogravimetric analysis data. We assess the validity of our NLP methods and the accuracy of our extracted data by comparing to a hand-labeled subset. Machine learning (ML, i.e. artificial neural network) models trained on this data using graph- and pore-geometry-based representations enable prediction of stability on new MOFs with quantified uncertainty. Our web interface, MOFSimplify, provides users access to our curated data and enables them to harness that data for predictions on new MOFs. MOFSimplify also encourages community feedback on existing data and on ML model predictions for community-based active learning for improved MOF stability models.
Measurement(s): thermal decomposition
Technology Type(s): thermogravimetry
Affiliation(s)
- Aditya Nandy
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA, 02139, USA.,Department of Chemistry, Massachusetts Institute of Technology, Cambridge, MA, 02139, USA
| | - Gianmarco Terrones
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA, 02139, USA
| | - Naveen Arunachalam
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA, 02139, USA
| | - Chenru Duan
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA, 02139, USA.,Department of Chemistry, Massachusetts Institute of Technology, Cambridge, MA, 02139, USA
| | - David W Kastner
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA, 02139, USA.,Department of Biological Engineering, Massachusetts Institute of Technology, Cambridge, MA, 02139, USA
| | - Heather J Kulik
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA, 02139, USA.
|
45
|
Affiliation(s)
- Leo H. Chiang
- Core R&D The Dow Chemical Company Lake Jackson Texas 77566 USA
| | - Birgit Braun
- Core R&D The Dow Chemical Company Lake Jackson Texas 77566 USA
| | - Zhenyu Wang
- Chemometrics, AI & Statistics The Dow Chemical Company Lake Jackson Texas 77566 USA
| | - Ivan Castillo
- Chemometrics, AI & Statistics The Dow Chemical Company Lake Jackson Texas 77566 USA
|
46
|
Saldívar-González FI, Aldas-Bulos VD, Medina-Franco JL, Plisson F. Natural product drug discovery in the artificial intelligence era. Chem Sci 2022; 13:1526-1546. [PMID: 35282622 PMCID: PMC8827052 DOI: 10.1039/d1sc04471k] [Citation(s) in RCA: 70] [Impact Index Per Article: 23.3] [Received: 08/13/2021] [Accepted: 12/10/2021] [Indexed: 12/19/2022] Open
Abstract
Natural products (NPs) are primarily recognized as privileged structures to interact with protein drug targets. Their unique characteristics and structural diversity continue to amaze scientists developing NP-inspired medicines, even though the pharmaceutical industry has largely given up. High-performance computer hardware, extensive storage, accessible software, and affordable online education have democratized the use of artificial intelligence (AI) in many sectors and research areas. The last decades have introduced natural language processing and machine learning algorithms, two subfields of AI, to tackle NP drug discovery challenges and open up opportunities. In this article, we review and discuss the rational applications of AI approaches developed to assist in discovering bioactive NPs and capturing the molecular "patterns" of these privileged structures for combinatorial design or target selectivity.
Affiliation(s)
- F I Saldívar-González
- DIFACQUIM Research Group, School of Chemistry, Department of Pharmacy, Universidad Nacional Autónoma de México Avenida Universidad 3000 04510 Mexico Mexico
| | - V D Aldas-Bulos
- Unidad de Genómica Avanzada, Laboratorio Nacional de Genómica para la Biodiversidad (Langebio), Centro de Investigación y de Estudios Avanzados del IPN Irapuato Guanajuato Mexico
| | - J L Medina-Franco
- DIFACQUIM Research Group, School of Chemistry, Department of Pharmacy, Universidad Nacional Autónoma de México Avenida Universidad 3000 04510 Mexico Mexico
| | - F Plisson
- CONACYT - Unidad de Genómica Avanzada, Laboratorio Nacional de Genómica para la Biodiversidad (Langebio), Centro de Investigación y de Estudios Avanzados del IPN Irapuato Guanajuato Mexico
|
47
|
Li Y, Yu L, Liu J, Guo L, Wu Y, Wu X. NetDPO: (delta, gamma)-approximate pattern matching with gap constraints under one-off condition. Appl Intell 2022. [DOI: 10.1007/s10489-021-03000-2] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 01/17/2023]
|
48
|
Designing a multilayer film via machine learning of scientific literature. Sci Rep 2022; 12:930. [PMID: 35042971 PMCID: PMC8766440 DOI: 10.1038/s41598-022-05010-7] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 03/25/2021] [Accepted: 01/04/2022] [Indexed: 12/23/2022] Open
Abstract
Scientists who design chemical substances often use materials informatics (MI), a data-driven approach with either computer simulation or artificial intelligence (AI). MI is a valuable technique, but applying it to layered structures is difficult. Most of the proposed computer-aided material search techniques use atomic or molecular simulations, which are limited to small areas. Some AI approaches have planned layered structures, but they require a physical theory or abundant experimental results. There is no universal design tool for multilayer films in MI. Here, we show that a multilayer film can be designed through machine learning (ML) of experimental procedures extracted from chemical-coating articles. We converted material names according to International Union of Pure and Applied Chemistry rules and stored them in databases for each fabrication step without any physicochemical theory. Compared with experimental results, which depend on the authors, experimental protocols have the advantage of being almost uniform and suffering less data loss. Connecting scientific knowledge through ML enables us to predict untrained film structures. This suggests that AI imitates research activity, which is normally inspired by other scientific achievements, and can thus be used as a general design technique.
|
49
|
Cai X, Wang N, Yang L, Mei X. Global-local neighborhood based network representation for citation recommendation. Appl Intell 2022. [DOI: 10.1007/s10489-021-02964-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/02/2022]
|
50
|
Chen K, Tian H, Li B, Rangarajan S. A chemistry‐inspired neural network kinetic model for oxidative coupling of methane from high‐throughput data. AIChE J 2022. [DOI: 10.1002/aic.17584] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 01/05/2023]
Affiliation(s)
- Kexin Chen
- Department of Chemical and Biomolecular Engineering Lehigh University Bethlehem Pennsylvania USA
| | - Huijie Tian
- Department of Chemical and Biomolecular Engineering Lehigh University Bethlehem Pennsylvania USA
| | - Bowen Li
- Department of Chemical and Biomolecular Engineering Lehigh University Bethlehem Pennsylvania USA
| | - Srinivas Rangarajan
- Department of Chemical and Biomolecular Engineering Lehigh University Bethlehem Pennsylvania USA
|