1
Jeong J, Park T, Song J, Kang S, Won J, Han J, Min K. Integrating Data Mining and Natural Language Processing to Construct a Melting Point Database for Organometallic Compounds. J Chem Inf Model 2024; 64:7432-7446. [PMID: 39352375 DOI: 10.1021/acs.jcim.4c01254] [Indexed: 10/03/2024]
Abstract
As semiconductor devices are miniaturized, the importance of atomic layer deposition (ALD) technology is growing. When designing ALD precursors, it is important to consider the melting point, because the precursors should have melting points lower than the process temperature. However, obtaining melting point data is challenging due to experimental sensitivity and high computational costs. As a result, a comprehensive and well-organized database of melting points for organometallic compounds (OMCs) has not yet been fully reported. Therefore, in this study, we constructed a database of melting points for 1,845 OMCs, including 58 metal and 6 metalloid elements. The database contains CAS numbers, molecular formulas, and structural information and was constructed through automatic extraction and systematic curation. The melting point information was extracted using two methods: 1) 1,434 materials from 11 chemical vendor databases and 2) 411 materials identified through natural language processing (NLP) techniques with an accuracy of 86.3%, based on 2,096 scientific papers published over the past 29 years. In our database, the OMCs contain up to around 250 atoms and have melting points that range from -170 to 1610 °C. The main source is the Chemsrc database, accounting for 607 materials (32.9%), and Fe is the most common central metal or metalloid element (15.0%), followed by Si (11.6%) and B (6.7%). To validate the utility of the constructed database, a multimodal neural network model integrating graph-based and feature-based information as descriptors was developed to predict the melting points of the OMCs, but it achieved only moderate performance. We believe the current approach reduces the time and cost associated with manual data collection and processing, contributing to effective screening of potentially promising ALD precursors and providing crucial information for the advancement of the semiconductor industry.
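The counts quoted in this abstract can be cross-checked with a few lines of arithmetic. The following is a minimal sketch and not part of the paper: the variable and function names are illustrative, and only the numbers are taken from the abstract.

```python
# Counts quoted in the abstract; all names below are illustrative.
vendor_entries = 1434    # materials from 11 chemical vendor databases
nlp_entries = 411        # materials identified through NLP
total_entries = 1845     # reported size of the database
chemsrc_entries = 607    # materials attributed to the Chemsrc database


def share_percent(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


# The two extraction routes should add up to the reported total.
sources_add_up = vendor_entries + nlp_entries == total_entries

# Chemsrc's share of the whole database, to compare with the quoted 32.9%.
chemsrc_share = share_percent(chemsrc_entries, total_entries)

print(sources_add_up, chemsrc_share)
```

Both checks agree with the figures as quoted: the two sources sum to 1,845 and the Chemsrc share rounds to 32.9%.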
Affiliation(s)
- Jinyoung Jeong
- School of Mechanical Engineering, Soongsil University, 369 Sangdo-ro, Dongjak-gu, Seoul 06978, Republic of Korea
- Taehyun Park
- School of Mechanical Engineering, Soongsil University, 369 Sangdo-ro, Dongjak-gu, Seoul 06978, Republic of Korea
- JunHo Song
- School of Mechanical Engineering, Soongsil University, 369 Sangdo-ro, Dongjak-gu, Seoul 06978, Republic of Korea
- Seungpyo Kang
- School of Mechanical Engineering, Soongsil University, 369 Sangdo-ro, Dongjak-gu, Seoul 06978, Republic of Korea
- Joonghee Won
- POC TU, Samsung Advanced Institute of Technology, Suwon, Gyeonggi-do 16678, Republic of Korea
- Jungim Han
- POC TU, Samsung Advanced Institute of Technology, Suwon, Gyeonggi-do 16678, Republic of Korea
- Kyoungmin Min
- School of Mechanical Engineering, Soongsil University, 369 Sangdo-ro, Dongjak-gu, Seoul 06978, Republic of Korea
2
Chen LY, Li YP. Machine learning-guided strategies for reaction conditions design and optimization. Beilstein J Org Chem 2024; 20:2476-2492. [PMID: 39376489 PMCID: PMC11457048 DOI: 10.3762/bjoc.20.212] [Received: 06/28/2024] [Accepted: 09/19/2024] [Indexed: 10/09/2024]
Abstract
This review surveys the recent advances and challenges in predicting and optimizing reaction conditions using machine learning techniques. The paper emphasizes the importance of acquiring and processing large and diverse datasets of chemical reactions, and the use of both global and local models to guide the design of synthetic processes. Global models exploit the information from comprehensive databases to suggest general reaction conditions for new reactions, while local models fine-tune the specific parameters for a given reaction family to improve yield and selectivity. The paper also identifies the current limitations and opportunities in this field, such as the data quality and availability, and the integration of high-throughput experimentation. The paper demonstrates how the combination of chemical engineering, data science, and ML algorithms can enhance the efficiency and effectiveness of reaction conditions design, and enable novel discoveries in synthetic chemistry.
Affiliation(s)
- Lung-Yi Chen
- Department of Chemical Engineering, National Taiwan University, No. 1, Sec. 4, Roosevelt Road, Taipei 10617, Taiwan
- Yi-Pei Li
- Department of Chemical Engineering, National Taiwan University, No. 1, Sec. 4, Roosevelt Road, Taipei 10617, Taiwan
- Taiwan International Graduate Program on Sustainable Chemical Science and Technology (TIGP-SCST), No. 128, Sec. 2, Academia Road, Taipei 11529, Taiwan
3
Ai Q, Meng F, Shi J, Pelkie B, Coley CW. Extracting structured data from organic synthesis procedures using a fine-tuned large language model. Digital Discovery 2024; 3:1822-1831. [PMID: 39157760 PMCID: PMC11322921 DOI: 10.1039/d4dd00091a] [Received: 04/06/2024] [Accepted: 07/30/2024] [Indexed: 08/20/2024]
Abstract
The popularity of data-driven approaches and machine learning (ML) techniques in the field of organic chemistry and its various subfields has increased the value of structured reaction data. Most data in chemistry is represented by unstructured text, and despite the vastness of the organic chemistry literature (papers, patents), conversion from unstructured text to structured data remains a largely manual endeavor. Software tools for this task would facilitate downstream applications such as reaction prediction and condition recommendation. In this study, we fine-tune a large language model (LLM) to extract reaction information from organic synthesis procedure text into structured data following the Open Reaction Database (ORD) schema, a comprehensive data structure designed for organic reactions. The fine-tuned model produces syntactically correct ORD records with an average accuracy of 91.25% for ORD "messages" (e.g., full compound, workups, or condition definitions) and 92.25% for individual data fields (e.g., compound identifiers, mass quantities), with the ability to recognize compound-referencing tokens and to infer reaction roles. We investigate its failure modes and evaluate performance on specific subtasks such as reaction role classification.
Affiliation(s)
- Qianxiang Ai
- Department of Chemical Engineering, Massachusetts Institute of Technology Cambridge MA USA
- Fanwang Meng
- Department of Chemical Engineering, Massachusetts Institute of Technology Cambridge MA USA
- Jiale Shi
- Department of Chemical Engineering, Massachusetts Institute of Technology Cambridge MA USA
- Brenden Pelkie
- Department of Chemical Engineering, University of Washington Seattle WA USA
- Connor W Coley
- Department of Chemical Engineering, Massachusetts Institute of Technology Cambridge MA USA
4
Huang Z, Li X, Li A, Yang Y, He L, Zhang Z, Wu S, Wang Y, Cai S, He Y, Liu X. MPNTEXT: An Interactive Platform for Automatically Extracting Metal-Polyphenol Networks and Their Applications from Scientific Literature. J Chem Inf Model 2024. [PMID: 39258795 DOI: 10.1021/acs.jcim.4c01093] [Indexed: 09/12/2024]
Abstract
In recent years, metal-polyphenol networks (MPNs) have gained significant attention due to their unique properties and broad applications across various fields. However, the burgeoning volume of MPN literature necessitates the automation of chemical information extraction from the extensive corpus of unstructured data, including scientific publications. To address this challenge, we proposed a platform named MPNTEXT, which utilized natural language processing techniques and machine learning algorithms to efficiently identify and extract pertinent information, thereby assisting users in comprehending complex MPNs and their textual descriptions of applications. Users can enter keywords, such as "Fe", "drug delivery", or "tannic acid", to retrieve relevant information, which is then presented in a structured format. This study aims to provide a user-friendly tool for collecting and retrieving MPN data and promotes data-driven material design. The platform offers researchers a more convenient and efficient way to design versatile MPNs and explore their applications.
Affiliation(s)
- Zihui Huang
- School of Biomedical and Pharmaceutical Sciences, Guangdong University of Technology, Guangzhou 510006, China
- Xinyi Li
- School of Biomedical and Pharmaceutical Sciences, Guangdong University of Technology, Guangzhou 510006, China
- Andi Li
- School of Biomedical and Pharmaceutical Sciences, Guangdong University of Technology, Guangzhou 510006, China
- Yuhang Yang
- School of Biomedical and Pharmaceutical Sciences, Guangdong University of Technology, Guangzhou 510006, China
- Liqiang He
- School of Biomedical and Pharmaceutical Sciences, Guangdong University of Technology, Guangzhou 510006, China
- Zhiwen Zhang
- School of Biomedical and Pharmaceutical Sciences, Guangdong University of Technology, Guangzhou 510006, China
- Siwei Wu
- School of Biomedical and Pharmaceutical Sciences, Guangdong University of Technology, Guangzhou 510006, China
- Yang Wang
- School of Biomedical and Pharmaceutical Sciences, Guangdong University of Technology, Guangzhou 510006, China
- Shuting Cai
- School of Integrated Circuits, Guangdong University of Technology, Guangzhou 510006, China
- Yan He
- School of Biomedical and Pharmaceutical Sciences, Guangdong University of Technology, Guangzhou 510006, China
- Xujie Liu
- School of Biomedical and Pharmaceutical Sciences, Guangdong University of Technology, Guangzhou 510006, China
5
Zhang X, Li Y, Li C, Zhu J, Gan Z, Wang L, Sun X, You H. A chemical reaction entity recognition method based on a natural language data augmentation strategy. Chem Commun (Camb) 2024; 60:9610-9613. [PMID: 39148332 DOI: 10.1039/d4cc01471e] [Indexed: 08/17/2024]
Abstract
Impressive applications of artificial intelligence in the field of chemical reaction prediction depend heavily on abundant, reliable datasets. The automated extraction of reaction procedures to build structured chemical databases is of growing importance. Here, we propose a novel model named DACRER for large-scale reaction extraction, in which transfer learning and a data augmentation strategy were employed. The model was evaluated on chemical datasets and showed good performance in identifying and processing chemical texts.
Affiliation(s)
- Xiaowen Zhang
- School of Science, Harbin Institute of Technology (Shenzhen), Shenzhen 518055, Guangdong, China.
- Yang Li
- School of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230601, Anhui, China
- Chaoyi Li
- School of Science, Harbin Institute of Technology (Shenzhen), Shenzhen 518055, Guangdong, China.
- Jingyuan Zhu
- School of Science, Harbin Institute of Technology (Shenzhen), Shenzhen 518055, Guangdong, China.
- Zhiqiang Gan
- School of Science, Harbin Institute of Technology (Shenzhen), Shenzhen 518055, Guangdong, China.
- Lei Wang
- School of Information Science and Engineering, Zaozhuang University, Zaozhuang 277160, Shandong, China
- Xiaofei Sun
- School of Information Science and Engineering, Zaozhuang University, Zaozhuang 277160, Shandong, China
- Hengzhi You
- School of Science, Harbin Institute of Technology (Shenzhen), Shenzhen 518055, Guangdong, China.
- Green Pharmaceutical Engineering Research Center, Harbin Institute of Technology (Shenzhen), Shenzhen 518055, Guangdong, China
6
Rebello NJ, Arora A, Mochigase H, Lin TS, Shi J, Audus DJ, Muckley ES, Osmani A, Olsen BD. The Block Copolymer Phase Behavior Database. J Chem Inf Model 2024; 64:6464-6476. [PMID: 39126359 DOI: 10.1021/acs.jcim.4c00242] [Indexed: 08/12/2024]
Abstract
The Block Copolymer Database (BCDB) is a platform that allows users to search, submit, visualize, benchmark, and download experimental phase measurements and their associated characterization information for di- and multiblock copolymers. To the best of our knowledge, there is no widely accepted data model for publishing experimental and simulation data on block copolymer self-assembly. The proposed data schema, which carries traceable information, can accommodate any number of blocks. At the time of publication, the database contains over 5400 block copolymer melt phase measurements in total, mined from the literature and manually curated, along with simulation data points of the phase diagram generated from self-consistent field theory, which can be rapidly augmented. This database can be accessed via the Community Resource for Innovation in Polymer Technology (CRIPT) web application and the Materials Data Facility. The chemical structure of the polymer is encoded in BigSMILES, an extension of the Simplified Molecular-Input Line-Entry System (SMILES) into the macromolecular domain, and the user can search repeat units and functional groups using the SMARTS search syntax (SMILES Arbitrary Target Specification). The user can also query characterization and phase information using Structured Query Language (SQL) and download custom sets of block copolymer data to train machine learning models. Finally, a protocol is presented in which GPT-4, an AI-powered large language model, can be used to rapidly screen and identify block copolymer papers from the literature using only the abstract text and determine whether they have BCDB data, allowing the database to grow as the number of published papers on the World Wide Web increases. The F1 score for this model is 0.74. This platform is an important step in making polymer data more accessible to the broader community.
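To illustrate the kind of SQL query this abstract describes, here is a minimal sketch using Python's built-in sqlite3 module. The table name, column names, and rows are hypothetical placeholders invented for illustration. They are not the BCDB schema and not BCDB measurements, and the real database is reached through CRIPT and the Materials Data Facility rather than a local SQLite file.

```python
import sqlite3

# In-memory stand-in for a phase-behavior table (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE phase_measurements (
        polymer TEXT,            -- short polymer label
        temperature_c REAL,      -- measurement temperature in deg C
        volume_fraction_a REAL,  -- volume fraction of block A
        phase TEXT               -- observed morphology
    )
    """
)

# Made-up toy rows, used only so the query has something to return.
toy_rows = [
    ("PS-b-PI", 120.0, 0.30, "cylinders"),
    ("PS-b-PI", 120.0, 0.50, "lamellae"),
    ("PS-b-PMMA", 180.0, 0.52, "lamellae"),
    ("PS-b-PMMA", 180.0, 0.15, "spheres"),
]
conn.executemany("INSERT INTO phase_measurements VALUES (?, ?, ?, ?)", toy_rows)

# Select a custom subset: every lamellar entry, e.g. as ML training data.
lamellar = conn.execute(
    "SELECT polymer, volume_fraction_a FROM phase_measurements "
    "WHERE phase = ? ORDER BY polymer",
    ("lamellae",),
).fetchall()

print(lamellar)
conn.close()
```

The parameterized `WHERE phase = ?` form keeps user-supplied filter values out of the SQL string, which is the usual choice when queries are built from search-form input.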
Affiliation(s)
- Nathan J Rebello
- Department of Chemical Engineering, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, Massachusetts 02139, United States
- Akash Arora
- Department of Chemical Engineering, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, Massachusetts 02139, United States
- Hidenobu Mochigase
- Department of Chemical Engineering, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, Massachusetts 02139, United States
- Tzyy-Shyang Lin
- Department of Chemical Engineering, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, Massachusetts 02139, United States
- Jiale Shi
- Department of Chemical Engineering, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, Massachusetts 02139, United States
- Debra J Audus
- Materials Science and Engineering Division, National Institute of Standards and Technology, 100 Bureau Drive, Gaithersburg, Maryland 20899, United States
- Eric S Muckley
- Citrine Informatics, Redwood City, California 94063-2483, United States
- Ardiana Osmani
- Department of Chemical Engineering, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, Massachusetts 02139, United States
- Bradley D Olsen
- Department of Chemical Engineering, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, Massachusetts 02139, United States
7
Su Y, Wang X, Ye Y, Xie Y, Xu Y, Jiang Y, Wang C. Automation and machine learning augmented by large language models in a catalysis study. Chem Sci 2024; 15:12200-12233. [PMID: 39118602 PMCID: PMC11304797 DOI: 10.1039/d3sc07012c] [Received: 12/31/2023] [Accepted: 06/21/2024] [Indexed: 08/10/2024]
Abstract
Recent advancements in artificial intelligence and automation are transforming catalyst discovery and design from a traditional, manual trial-and-error mode into intelligent, high-throughput digital methodologies. This transformation is driven by four key components: high-throughput information extraction, automated robotic experimentation, real-time feedback for iterative optimization, and interpretable machine learning for generating new knowledge. These innovations have given rise to self-driving labs and significantly accelerated materials research. Over the past two years, the emergence of large language models (LLMs) has added a new dimension to this field, providing unprecedented flexibility in information integration, decision-making, and interaction with human researchers. This review explores how LLMs are reshaping catalyst design, heralding a revolutionary change in the field.
Affiliation(s)
- Yuming Su
- iChem, State Key Laboratory of Physical Chemistry of Solid Surfaces, College of Chemistry and Chemical Engineering, Xiamen University Xiamen 361005 P. R. China
- Innovation Laboratory for Sciences and Technologies of Energy Materials of Fujian Province (IKKEM) Xiamen 361005 P. R. China
- Xue Wang
- iChem, State Key Laboratory of Physical Chemistry of Solid Surfaces, College of Chemistry and Chemical Engineering, Xiamen University Xiamen 361005 P. R. China
- Yuanxiang Ye
- Institute of Artificial Intelligence, Xiamen University Xiamen 361005 P. R. China
- Yibo Xie
- Institute of Artificial Intelligence, Xiamen University Xiamen 361005 P. R. China
- Yujing Xu
- iChem, State Key Laboratory of Physical Chemistry of Solid Surfaces, College of Chemistry and Chemical Engineering, Xiamen University Xiamen 361005 P. R. China
- Yibin Jiang
- Innovation Laboratory for Sciences and Technologies of Energy Materials of Fujian Province (IKKEM) Xiamen 361005 P. R. China
- Cheng Wang
- iChem, State Key Laboratory of Physical Chemistry of Solid Surfaces, College of Chemistry and Chemical Engineering, Xiamen University Xiamen 361005 P. R. China
- Innovation Laboratory for Sciences and Technologies of Energy Materials of Fujian Province (IKKEM) Xiamen 361005 P. R. China
8
Fan V, Qian Y, Wang A, Wang A, Coley CW, Barzilay R. OpenChemIE: An Information Extraction Toolkit for Chemistry Literature. J Chem Inf Model 2024; 64:5521-5534. [PMID: 38950894 DOI: 10.1021/acs.jcim.4c00572] [Indexed: 07/03/2024]
Abstract
Information extraction from chemistry literature is vital for constructing up-to-date reaction databases for data-driven chemistry. Complete extraction requires combining information across text, tables, and figures, whereas prior work has mainly investigated extracting reactions from single modalities. In this paper, we present OpenChemIE to address this complex challenge and enable the extraction of reaction data at the document level. OpenChemIE approaches the problem in two steps: extracting relevant information from individual modalities and then integrating the results to obtain a final list of reactions. For the first step, we employ specialized neural models that each address a specific task for chemistry information extraction, such as parsing molecules or reactions from text or figures. We then integrate the information from these modules using chemistry-informed algorithms, allowing for the extraction of fine-grained reaction data from reaction condition and substrate scope investigations. Our machine learning models attain state-of-the-art performance when evaluated individually, and we meticulously annotate a challenging dataset of reaction schemes with R-groups to evaluate our pipeline as a whole, achieving an F1 score of 69.5%. Additionally, the reaction extraction results of OpenChemIE attain an accuracy score of 64.3% when directly compared against the Reaxys chemical database. OpenChemIE is most suited for information extraction on organic chemistry literature, where molecules are generally depicted as planar graphs or written in text and can be consolidated into a SMILES format. We provide OpenChemIE freely to the public as an open-source package, as well as through a web interface.
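For readers unfamiliar with the F1 metric quoted in this abstract, the sketch below computes precision, recall, and F1 for a set of extracted reactions against a gold-standard set. The abstract does not state how extracted reactions are matched to annotations, so exact tuple equality is an assumption made here, and the function name and toy reactions are hypothetical.

```python
from typing import Hashable, Set, Tuple


def precision_recall_f1(
    predicted: Set[Hashable], gold: Set[Hashable]
) -> Tuple[float, float, float]:
    """Score predicted items against gold items using exact set matching."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2.0 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy reactions written as (reactant, reagent, product) labels; not real data.
gold_reactions = {
    ("A", "B", "C"),
    ("D", "E", "F"),
    ("G", "H", "I"),
    ("J", "K", "L"),
}
extracted_reactions = {
    ("A", "B", "C"),
    ("D", "E", "F"),
    ("X", "Y", "Z"),  # spurious extraction
}

precision, recall, f1 = precision_recall_f1(extracted_reactions, gold_reactions)
print(precision, recall, f1)
```

In this toy case two of three extractions are correct and two of four gold reactions are recovered, so F1 sits between the two rates and closer to the lower one.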
Affiliation(s)
- Vincent Fan
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
- Yujie Qian
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
- Alex Wang
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
- Amber Wang
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
- Connor W Coley
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
- Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
- Regina Barzilay
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
- Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
9
Zhang W, Wang Q, Kong X, Xiong J, Ni S, Cao D, Niu B, Chen M, Li Y, Zhang R, Wang Y, Zhang L, Li X, Xiong Z, Shi Q, Huang Z, Fu Z, Zheng M. Fine-tuning large language models for chemical text mining. Chem Sci 2024; 15:10600-10611. [PMID: 38994403 PMCID: PMC11234886 DOI: 10.1039/d4sc00924j] [Received: 02/06/2024] [Accepted: 06/02/2024] [Indexed: 07/13/2024]
Abstract
Extracting knowledge from complex and diverse chemical texts is a pivotal task for both experimental and computational chemists. The task is still considered to be extremely challenging due to the complexity of the chemical language and scientific literature. This study explored the power of fine-tuned large language models (LLMs) on five intricate chemical text mining tasks: compound entity recognition, reaction role labelling, metal-organic framework (MOF) synthesis information extraction, nuclear magnetic resonance spectroscopy (NMR) data extraction, and the conversion of reaction paragraphs to action sequences. The fine-tuned LLMs demonstrated impressive performance, significantly reducing the need for repetitive and extensive prompt engineering experiments. For comparison, we guided ChatGPT (GPT-3.5-turbo) and GPT-4 with prompt engineering and fine-tuned GPT-3.5-turbo as well as other open-source LLMs such as Mistral, Llama3, Llama2, T5, and BART. The results showed that the fine-tuned ChatGPT models excelled in all tasks. They achieved exact accuracy levels ranging from 69% to 95% on these tasks with minimal annotated data. They even outperformed those task-adaptive pre-training and fine-tuning models that were based on a significantly larger amount of in-domain data. Notably, fine-tuned Mistral and Llama3 show competitive abilities. Given their versatility, robustness, and low-code capability, leveraging fine-tuned LLMs as flexible and effective toolkits for automated data acquisition could revolutionize chemical knowledge extraction.
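The abstract reports "exact accuracy" without defining it. One common reading is exact-match accuracy: a prediction counts only if the whole output equals the reference. The sketch below implements that reading as an assumption, with whitespace trimming as a further assumption; the function name and toy strings are hypothetical and do not come from the paper.

```python
from typing import Sequence


def exact_match_accuracy(
    predictions: Sequence[str], references: Sequence[str]
) -> float:
    """Fraction of predictions identical to their reference after trimming."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    matches = sum(
        pred.strip() == ref.strip() for pred, ref in zip(predictions, references)
    )
    return matches / len(references)


# Toy outputs for a reaction-role labelling task; invented for illustration.
reference_outputs = [
    "reactant: benzaldehyde",
    "solvent: THF",
    "product: benzyl alcohol",
    "catalyst: Pd/C",
]
model_outputs = [
    "reactant: benzaldehyde",
    "solvent: THF ",            # trailing space is ignored
    "product: benzyl alcohol",
    "catalyst: Pd",             # partial answer earns no credit
]

accuracy = exact_match_accuracy(model_outputs, reference_outputs)
print(accuracy)
```

Because partially correct outputs earn no credit, this metric is stricter than token-level F1.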
Affiliation(s)
- Wei Zhang
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences 555 Zuchongzhi Road Shanghai 201203 China
- University of Chinese Academy of Sciences No. 19A Yuquan Road Beijing 100049 China
- Qinggong Wang
- Nanjing University of Chinese Medicine 138 Xianlin Road Nanjing 210023 China
- Xiangtai Kong
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences 555 Zuchongzhi Road Shanghai 201203 China
- University of Chinese Academy of Sciences No. 19A Yuquan Road Beijing 100049 China
- Jiacheng Xiong
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences 555 Zuchongzhi Road Shanghai 201203 China
- University of Chinese Academy of Sciences No. 19A Yuquan Road Beijing 100049 China
- Shengkun Ni
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences 555 Zuchongzhi Road Shanghai 201203 China
- University of Chinese Academy of Sciences No. 19A Yuquan Road Beijing 100049 China
- Duanhua Cao
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences 555 Zuchongzhi Road Shanghai 201203 China
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University Hangzhou Zhejiang 310058 China
- Buying Niu
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences 555 Zuchongzhi Road Shanghai 201203 China
- University of Chinese Academy of Sciences No. 19A Yuquan Road Beijing 100049 China
- Mingan Chen
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences 555 Zuchongzhi Road Shanghai 201203 China
- School of Physical Science and Technology, ShanghaiTech University Shanghai 201210 China
- Lingang Laboratory Shanghai 200031 China
- Yameng Li
- ProtonUnfold Technology Co., Ltd Suzhou China
- Runze Zhang
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences 555 Zuchongzhi Road Shanghai 201203 China
- University of Chinese Academy of Sciences No. 19A Yuquan Road Beijing 100049 China
- Yitian Wang
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences 555 Zuchongzhi Road Shanghai 201203 China
- University of Chinese Academy of Sciences No. 19A Yuquan Road Beijing 100049 China
- Lehan Zhang
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences 555 Zuchongzhi Road Shanghai 201203 China
- University of Chinese Academy of Sciences No. 19A Yuquan Road Beijing 100049 China
- Xutong Li
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences 555 Zuchongzhi Road Shanghai 201203 China
- University of Chinese Academy of Sciences No. 19A Yuquan Road Beijing 100049 China
- Qian Shi
- Lingang Laboratory Shanghai 200031 China
- Ziming Huang
- Medizinische Klinik und Poliklinik I, Klinikum der Universität München, Ludwig-Maximilians-Universität Munich Germany
- Zunyun Fu
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences 555 Zuchongzhi Road Shanghai 201203 China
- Mingyue Zheng
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences 555 Zuchongzhi Road Shanghai 201203 China
- University of Chinese Academy of Sciences No. 19A Yuquan Road Beijing 100049 China
- Nanjing University of Chinese Medicine 138 Xianlin Road Nanjing 210023 China
10
Shaw WJ, Kidder MK, Bare SR, Delferro M, Morris JR, Toma FM, Senanayake SD, Autrey T, Biddinger EJ, Boettcher S, Bowden ME, Britt PF, Brown RC, Bullock RM, Chen JG, Daniel C, Dorhout PK, Efroymson RA, Gaffney KJ, Gagliardi L, Harper AS, Heldebrant DJ, Luca OR, Lyubovsky M, Male JL, Miller DJ, Prozorov T, Rallo R, Rana R, Rioux RM, Sadow AD, Schaidle JA, Schulte LA, Tarpeh WA, Vlachos DG, Vogt BD, Weber RS, Yang JY, Arenholz E, Helms BA, Huang W, Jordahl JL, Karakaya C, Kian KC, Kothandaraman J, Lercher J, Liu P, Malhotra D, Mueller KT, O'Brien CP, Palomino RM, Qi L, Rodriguez JA, Rousseau R, Russell JC, Sarazen ML, Sholl DS, Smith EA, Stevens MB, Surendranath Y, Tassone CJ, Tran B, Tumas W, Walton KS. A US perspective on closing the carbon cycle to defossilize difficult-to-electrify segments of our economy. Nat Rev Chem 2024; 8:376-400. [PMID: 38693313 DOI: 10.1038/s41570-024-00587-1] [Accepted: 02/16/2024] [Indexed: 05/03/2024]
Abstract
Electrification to reduce or eliminate greenhouse gas emissions is essential to mitigate climate change. However, a substantial portion of our manufacturing and transportation infrastructure will be difficult to electrify and/or will continue to use carbon as a key component, including areas in aviation, heavy-duty and marine transportation, and the chemical industry. In this Roadmap, we explore how multidisciplinary approaches will enable us to close the carbon cycle and create a circular economy by defossilizing these difficult-to-electrify areas and those that will continue to need carbon. We discuss two approaches for this: developing carbon alternatives and improving our ability to reuse carbon, enabled by separations. Furthermore, we posit that co-design and use-driven fundamental science are essential to reach aggressive greenhouse gas reduction targets.
Affiliation(s)
- Wendy J Shaw
- Pacific Northwest National Laboratory, Richland, WA, USA.
- Simon R Bare
- SLAC National Accelerator Laboratory, Menlo Park, CA, USA.
- Francesca M Toma
- Lawrence Berkeley National Laboratory, Berkeley, CA, USA.
- Institute of Functional Materials for Sustainability, Helmholtz Zentrum Hereon, Teltow, Brandenburg, Germany.
- Tom Autrey
- Pacific Northwest National Laboratory, Richland, WA, USA
- Shannon Boettcher
- Lawrence Berkeley National Laboratory, Berkeley, CA, USA
- Department of Chemical & Biomolecular Engineering and Department of Chemistry, University of California, Berkeley, Berkeley, CA, USA
- Mark E Bowden
- Pacific Northwest National Laboratory, Richland, WA, USA
- Robert C Brown
- Department of Mechanical Engineering, Iowa State University, Ames, IA, USA
- Jingguang G Chen
- Brookhaven National Laboratory, Upton, NY, USA
- Department of Chemical Engineering, Columbia University, New York, NY, USA
| | | | - Peter K Dorhout
- Vice President for Research, Iowa State University, Ames, IA, USA
| | | | | | - Laura Gagliardi
- Department of Chemistry, The University of Chicago, Chicago, IL, USA
| | - Aaron S Harper
- Pacific Northwest National Laboratory, Richland, WA, USA
| | - David J Heldebrant
- Pacific Northwest National Laboratory, Richland, WA, USA
- Chemical Engineering and Bioengineering, Washington State University, Pullman, WA, USA
| | - Oana R Luca
- Department of Chemistry, University of Colorado Boulder, Boulder, CO, USA
| | | | - Jonathan L Male
- Pacific Northwest National Laboratory, Richland, WA, USA
- Biological Systems Engineering Department, Washington State University, Pullman, WA, USA
| | | | | | - Robert Rallo
- Pacific Northwest National Laboratory, Richland, WA, USA
| | - Rachita Rana
- Department of Chemical Engineering, University of California, Davis, CA, USA
| | - Robert M Rioux
- Department of Chemical Engineering, The Pennsylvania State University, University Park, PA, USA
| | - Aaron D Sadow
- Ames National Laboratory, Ames, IA, USA
- Department of Chemistry, Iowa State University, Ames, IA, USA
| | | | - Lisa A Schulte
- Department of Natural Resource Ecology and Management, Iowa State University, Ames, IA, USA
| | - William A Tarpeh
- Department of Chemical Engineering, Stanford University, Stanford, CA, USA
| | - Dionisios G Vlachos
- Department of Chemical and Biomolecular Engineering, University of Delaware, Newark, DE, USA
| | - Bryan D Vogt
- Department of Chemical Engineering, The Pennsylvania State University, University Park, PA, USA
| | - Robert S Weber
- Pacific Northwest National Laboratory, Richland, WA, USA
| | - Jenny Y Yang
- Department of Chemistry, University of California Irvine, Irvine, CA, USA
| | - Elke Arenholz
- Pacific Northwest National Laboratory, Richland, WA, USA
| | - Brett A Helms
- Lawrence Berkeley National Laboratory, Berkeley, CA, USA
| | - Wenyu Huang
- Ames National Laboratory, Ames, IA, USA
- Department of Chemistry, Iowa State University, Ames, IA, USA
| | - James L Jordahl
- Department of Natural Resource Ecology and Management, Iowa State University, Ames, IA, USA
| | | | - Kourosh Cyrus Kian
- Independent consultant, Washington DC, USA
- Department of Chemical Engineering, Worcester Polytechnic Institute, Worcester, MA, USA
| | | | - Johannes Lercher
- Pacific Northwest National Laboratory, Richland, WA, USA
- Department of Chemistry, Technical University of Munich, Munich, Germany
| | - Ping Liu
- Brookhaven National Laboratory, Upton, NY, USA
| | | | - Karl T Mueller
- Pacific Northwest National Laboratory, Richland, WA, USA
| | - Casey P O'Brien
- Department of Chemical and Biomolecular Engineering, University of Notre Dame, Notre Dame, IN, USA
| | | | - Long Qi
- Ames National Laboratory, Ames, IA, USA
| | | | | | - Jake C Russell
- Advanced Research Projects Agency - Energy, Department of Energy, Washington DC, USA
| | - Michele L Sarazen
- Department of Chemical and Biological Engineering, Princeton University, Princeton, NJ, USA
| | | | - Emily A Smith
- Ames National Laboratory, Ames, IA, USA
- Department of Chemistry, Iowa State University, Ames, IA, USA
| | | | - Yogesh Surendranath
- Department of Chemistry, Massachusetts Institute of Technology, Cambridge, MA, USA
| | | | - Ba Tran
- Pacific Northwest National Laboratory, Richland, WA, USA
| | - William Tumas
- National Renewable Energy Laboratory, Golden, CO, USA
| | - Krista S Walton
- School of Chemical and Biomolecular Engineering, Georgia Institute of Technology, Atlanta, GA, USA
| |
Collapse
|
11
|
Bai J, Mosbach S, Taylor CJ, Karan D, Lee KF, Rihm SD, Akroyd J, Lapkin AA, Kraft M. A dynamic knowledge graph approach to distributed self-driving laboratories. Nat Commun 2024; 15:462. [PMID: 38263405 PMCID: PMC10805810 DOI: 10.1038/s41467-023-44599-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/14/2023] [Accepted: 12/21/2023] [Indexed: 01/25/2024] [Open access]
Abstract
The ability to integrate resources and share knowledge across organisations empowers scientists to expedite the scientific discovery process. This is especially crucial in addressing emerging global challenges that require global solutions. In this work, we develop an architecture for distributed self-driving laboratories within The World Avatar project, which seeks to create an all-encompassing digital twin based on a dynamic knowledge graph. We employ ontologies to capture data and material flows in design-make-test-analyse cycles, utilising autonomous agents as executable knowledge components to carry out the experimentation workflow. Data provenance is recorded to ensure its findability, accessibility, interoperability, and reusability. We demonstrate the practical application of our framework by linking two robots in Cambridge and Singapore for a collaborative closed-loop optimisation for a pharmaceutically-relevant aldol condensation reaction in real-time. The knowledge graph autonomously evolves toward the scientist's research goals, with the two robots effectively generating a Pareto front for cost-yield optimisation in three days.
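As a rough illustration of the Pareto front for cost-yield optimisation mentioned in this abstract, the sketch below keeps only the non-dominated experiments when cost is minimised and yield is maximised. The function name and the data points are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple


def pareto_front(points: List[Tuple[float, float]]) -> List[Tuple[float, float]]:
    """Return the non-dominated (cost, yield) points.

    A point is dominated when another point is at least as good on both
    objectives (lower or equal cost, higher or equal yield) and strictly
    better on at least one of them.
    """
    front = []
    for cost, yld in points:
        dominated = any(
            (c <= cost and y >= yld) and (c < cost or y > yld)
            for c, y in points
        )
        if not dominated:
            front.append((cost, yld))
    return sorted(front)


# Hypothetical experiments: (cost per run, yield in %)
experiments = [(1.0, 40.0), (2.0, 65.0), (2.5, 60.0), (3.0, 80.0), (3.5, 78.0)]
front = pareto_front(experiments)
print(front)
```

The quadratic scan is adequate for the small number of runs in a closed-loop campaign; a sort-based sweep would be the usual choice for larger sets.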
Affiliation(s)
- Jiaru Bai
- Department of Chemical Engineering and Biotechnology, University of Cambridge, Philippa Fawcett Drive, Cambridge, CB3 0AS, UK
- Sebastian Mosbach
- Department of Chemical Engineering and Biotechnology, University of Cambridge, Philippa Fawcett Drive, Cambridge, CB3 0AS, UK
- Cambridge Centre for Advanced Research and Education in Singapore (CARES), 1 Create Way, CREATE Tower, #05-05, Singapore, 138602, Singapore
- Connor J Taylor
- Astex Pharmaceuticals, 436 Cambridge Science Park Milton Road, Cambridge, CB4 0QA, UK
- Innovation Centre in Digital Molecular Technologies, Yusuf Hamied Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge, CB2 1EW, UK
- Faculty of Engineering, University of Nottingham, University Park, Nottingham, NG7 2RD, UK
- Dogancan Karan
- Cambridge Centre for Advanced Research and Education in Singapore (CARES), 1 Create Way, CREATE Tower, #05-05, Singapore, 138602, Singapore
- Kok Foong Lee
- CMCL Innovations, Sheraton House, Cambridge, CB3 0AX, UK
- Simon D Rihm
- Department of Chemical Engineering and Biotechnology, University of Cambridge, Philippa Fawcett Drive, Cambridge, CB3 0AS, UK
- Cambridge Centre for Advanced Research and Education in Singapore (CARES), 1 Create Way, CREATE Tower, #05-05, Singapore, 138602, Singapore
- Jethro Akroyd
- Department of Chemical Engineering and Biotechnology, University of Cambridge, Philippa Fawcett Drive, Cambridge, CB3 0AS, UK
- Cambridge Centre for Advanced Research and Education in Singapore (CARES), 1 Create Way, CREATE Tower, #05-05, Singapore, 138602, Singapore
- Alexei A Lapkin
- Department of Chemical Engineering and Biotechnology, University of Cambridge, Philippa Fawcett Drive, Cambridge, CB3 0AS, UK
- Cambridge Centre for Advanced Research and Education in Singapore (CARES), 1 Create Way, CREATE Tower, #05-05, Singapore, 138602, Singapore
- Innovation Centre in Digital Molecular Technologies, Yusuf Hamied Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge, CB2 1EW, UK
- Markus Kraft
- Department of Chemical Engineering and Biotechnology, University of Cambridge, Philippa Fawcett Drive, Cambridge, CB3 0AS, UK
- Cambridge Centre for Advanced Research and Education in Singapore (CARES), 1 Create Way, CREATE Tower, #05-05, Singapore, 138602, Singapore
- School of Chemical and Biomedical Engineering, Nanyang Technological University, 62 Nanyang Drive, 637459, Singapore, Singapore
- The Alan Turing Institute, London, NW1 2DB, UK
12
Zhang B, Xiao H, Ye G, Song Z, Han T, Sharman E, Luo M, Cheng A, Zhu Q, Zhao H, Zhang G, Wang S, Jiang J. Label-Free Data Mining of Scientific Literature by Unsupervised Syntactic Distance Analysis. J Phys Chem Lett 2024; 15:212-219. [PMID: 38157213 DOI: 10.1021/acs.jpclett.3c03345] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/03/2024]
Abstract
Label-free data mining can efficiently feed large amounts of data from the vast scientific literature into artificial intelligence (AI) processing systems. Here, we demonstrate an unsupervised syntactic distance analysis (SDA) approach that is capable of mining chemical substances, functions, properties, and operations without annotation. This SDA approach was evaluated in several areas of research from the physical sciences and achieved performance in information mining comparable to that of supervised learning, as shown by its satisfactory scores of 0.62-0.72, 0.60-0.82, and 0.86-0.95 in precision, recall, and accuracy, respectively. We also showcase how our approach can assist robotic chemists programmed to perform research focused on double-perovskite colloidal nanocrystals, gold colloidal nanocrystals, oxygen evolution reaction catalysts, and enzyme-like catalysts by designing materials, formulations, and synthesis parameters based on data mined from 1.1 million literature references.
Affiliation(s)
- Baicheng Zhang
- Key Laboratory of Precision and Intelligent Chemistry, School of Chemistry and Materials Science, University of Science and Technology of China, Hefei, Anhui 230026, China
- Hengyu Xiao
- Key Laboratory of Precision and Intelligent Chemistry, School of Chemistry and Materials Science, University of Science and Technology of China, Hefei, Anhui 230026, China
- Guilin Ye
- Hefei JiShu Quantum Technology Co. Ltd., Hefei 230026, China
- Zhaokun Song
- Hefei JiShu Quantum Technology Co. Ltd., Hefei 230026, China
- Tiantian Han
- Hefei JiShu Quantum Technology Co. Ltd., Hefei 230026, China
- Edward Sharman
- Department of Neurology, University of California, Irvine, California 92697, United States
- Man Luo
- Key Laboratory of Precision and Intelligent Chemistry, School of Chemistry and Materials Science, University of Science and Technology of China, Hefei, Anhui 230026, China
- Aoyuan Cheng
- Key Laboratory of Precision and Intelligent Chemistry, School of Chemistry and Materials Science, University of Science and Technology of China, Hefei, Anhui 230026, China
- Qing Zhu
- Key Laboratory of Precision and Intelligent Chemistry, School of Chemistry and Materials Science, University of Science and Technology of China, Hefei, Anhui 230026, China
- Haitao Zhao
- Materials Interfaces Center, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China
- Guoqing Zhang
- Key Laboratory of Precision and Intelligent Chemistry, School of Chemistry and Materials Science, University of Science and Technology of China, Hefei, Anhui 230026, China
- Song Wang
- Key Laboratory of Precision and Intelligent Chemistry, School of Chemistry and Materials Science, University of Science and Technology of China, Hefei, Anhui 230026, China
- Jun Jiang
- Key Laboratory of Precision and Intelligent Chemistry, School of Chemistry and Materials Science, University of Science and Technology of China, Hefei, Anhui 230026, China
13
Matsumoto Y, Gotoh H. Compound Classification and Consideration of Correlation with Chemical Descriptors from Articles on Antioxidant Capacity Using Natural Language Processing. J Chem Inf Model 2024; 64:119-127. [PMID: 38118462 DOI: 10.1021/acs.jcim.3c01826] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/22/2023]
Abstract
In recent times, there has been a substantial increase in the number of articles focusing on antioxidants. However, the development of a comprehensive estimator for antioxidant capacity remains elusive due to the challenge of integrating information from these articles. Furthermore, the complexity of the antioxidant mechanism, which involves a multitude of factors, makes it difficult to establish a simple equation or correlation. Hence, there is a pressing need for a model that can effectively interpret the collective knowledge from these articles, especially from a chemistry perspective. In this research, we employed natural language processing techniques, specifically Word2Vec, to analyze articles related to antioxidant capacity. We extracted representation vectors of compound names from these documents and organized them into 10 distinct clusters. In our investigation of two of these clusters, we unveiled that the majority of the compounds in question were flavonoids and flavonoid glycosides. To establish a link between the descriptors and clusters, we utilized kernel density estimation and generated scatter plots to visualize their similarity. These visualizations clearly indicated a strong relationship between the descriptors and clusters, affirming that a tangible connection exists between word vectors and compound descriptors through a document analysis conducted with natural language processing techniques. This study represents a pioneering approach that utilizes document analysis to shed light on the field of antioxidant capacity research, marking a significant advancement in this domain.
Affiliation(s)
- Yuto Matsumoto
- Department of Chemistry and Life Science, Yokohama National University, 79-5 Tokiwadai, Hodogaya-ku, Yokohama 240-8501, Japan
- Hiroaki Gotoh
- Department of Chemistry and Life Science, Yokohama National University, 79-5 Tokiwadai, Hodogaya-ku, Yokohama 240-8501, Japan
14
Zhang Y, Liu C, Liu M, Liu T, Lin H, Huang CB, Ning L. Attention is all you need: utilizing attention in AI-enabled drug discovery. Brief Bioinform 2023; 25:bbad467. [PMID: 38189543 PMCID: PMC10772984 DOI: 10.1093/bib/bbad467] [Citation(s) in RCA: 10] [Impact Index Per Article: 10.0] [Received: 09/30/2023] [Revised: 11/03/2023] [Accepted: 11/25/2023] [Indexed: 01/09/2024] [Open access]
Abstract
Recently, attention mechanism and derived models have gained significant traction in drug development due to their outstanding performance and interpretability in handling complex data structures. This review offers an in-depth exploration of the principles underlying attention-based models and their advantages in drug discovery. We further elaborate on their applications in various aspects of drug development, from molecular screening and target binding to property prediction and molecule generation. Finally, we discuss the current challenges faced in the application of attention mechanisms and Artificial Intelligence technologies, including data quality, model interpretability and computational resource constraints, along with future directions for research. Given the accelerating pace of technological advancement, we believe that attention-based models will have an increasingly prominent role in future drug discovery. We anticipate that these models will usher in revolutionary breakthroughs in the pharmaceutical domain, significantly accelerating the pace of drug development.
Affiliation(s)
- Yang Zhang
- Innovative Institute of Chinese Medicine and Pharmacy, Academy for Interdiscipline, Chengdu University of Traditional Chinese Medicine, Chengdu, China
- Caiqi Liu
- Department of Gastrointestinal Medical Oncology, Harbin Medical University Cancer Hospital, No.150 Haping Road, Nangang District, Harbin, Heilongjiang 150081, China
- Key Laboratory of Molecular Oncology of Heilongjiang Province, No.150 Haping Road, Nangang District, Harbin, Heilongjiang 150081, China
- Mujiexin Liu
- Chongqing Key Laboratory of Sichuan-Chongqing Co-construction for Diagnosis and Treatment of Infectious Diseases Integrated Traditional Chinese and Western Medicine, College of Medical Technology, Chengdu University of Traditional Chinese Medicine, Chengdu, China
- Tianyuan Liu
- Graduate School of Science and Technology, University of Tsukuba, Tsukuba, Japan
- Hao Lin
- School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, China
- Cheng-Bing Huang
- School of Computer Science and Technology, Aba Teachers University, Aba, China
- Lin Ning
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, Zhejiang, China
- School of Healthcare Technology, Chengdu Neusoft University, Chengdu 611844, China
15
Machi K, Akiyama S, Nagata Y, Yoshioka M. OSPAR: A Corpus for Extraction of Organic Synthesis Procedures with Argument Roles. J Chem Inf Model 2023; 63:6619-6628. [PMID: 37859303 PMCID: PMC10647022 DOI: 10.1021/acs.jcim.3c01449] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/08/2023] [Revised: 10/05/2023] [Accepted: 10/06/2023] [Indexed: 10/21/2023]
Abstract
There is a pressing need for the automated extraction of chemical reaction information because of the rapid growth of scientific documents. The previously reported works in the literature for the procedure extraction either (a) did not consider the semantic relations between the action and argument or (b) defined a detailed schema for the extraction. The former method was insufficient for reproducing the reaction, while the latter methods were too specific to their own schema and did not consider the general semantic relation between the verb and argument. In addition, they did not provide an annotated text that aligned with the structured procedure. Along these lines, in this work, we propose a corpus named organic synthesis procedures with argument roles (OSPAR) that is annotated with rolesets to consider the semantic relation between the verb and argument. We also provide rolesets for chemical reactions, especially for organic synthesis, which represent the argument roles of actions in the corpus. More specifically, we annotated 112 organic synthesis procedures in journal articles from Organic Syntheses and defined 19 new rolesets in addition to 29 rolesets from an existing language resource (Proposition Bank). After that, we constructed a simple deep learning system trained on OSPAR and discussed the usefulness of the corpus by comparing it with chemical description language (XDL) generated by a natural language processing tool, namely, SynthReader. While our system's output required more detailed parsing, it covered comparable information against XDL. Moreover, we confirmed that the validation of the output action sequence was easy as it was aligned with the original text.
Affiliation(s)
- Kojiro Machi
- Graduate School of Information Science and Technology, Hokkaido University, Kita 14, Nishi 9, Kita-ku, Sapporo, Hokkaido 060-0814, Japan
- Seiji Akiyama
- Institute for Chemical Reaction Design and Discovery (WPI-ICReDD), Hokkaido University, Kita 21, Nishi 10, Kita-ku, Sapporo, Hokkaido 001-0021, Japan
- Yuuya Nagata
- Institute for Chemical Reaction Design and Discovery (WPI-ICReDD), Hokkaido University, Kita 21, Nishi 10, Kita-ku, Sapporo, Hokkaido 001-0021, Japan
- Masaharu Yoshioka
- Graduate School of Information Science and Technology, Hokkaido University, Kita 14, Nishi 9, Kita-ku, Sapporo, Hokkaido 060-0814, Japan
- Institute for Chemical Reaction Design and Discovery (WPI-ICReDD), Hokkaido University, Kita 21, Nishi 10, Kita-ku, Sapporo, Hokkaido 001-0021, Japan
- Faculty of Information Science and Technology, Hokkaido University, Kita 14, Nishi 9, Kita-ku, Sapporo, Hokkaido 060-0814, Japan
16
Li S, Zhang Y, Fang Z, Meng K, Tian R, He H, Sun S. Extracting the Synthetic Route of Pd-Based Catalysts in Methanol Steam Reforming from the Scientific Literature. J Chem Inf Model 2023; 63:6249-6260. [PMID: 37807535 DOI: 10.1021/acs.jcim.3c01442] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/10/2023]
Abstract
The structured material synthesis route is crucial for chemists in performing experiments and modern applications such as machine learning material design. With the exponential growth of the chemical literature in recent years, manual extraction from the published literature is time-consuming and labor-intensive. This study focuses on developing an automated method for extracting Pd-based catalyst synthesis routes from the chemical literature. First, a paragraph classification model based on regular expressions is employed to identify paragraphs that contain material synthesis processes. The identified paragraphs are verified using machine learning techniques. Second, natural language processing techniques are applied to automatically parse the material synthesis routes from the identified paragraphs, generate regularized flowcharts, and output structured data. Lastly, we utilized the structured data of the synthesis routes to train machine learning models and predict the performance of the materials. The extracted material entities include the product, preparation method, precursor, support, loading, synthesis operation, and operation condition. This method avoids extensive manual data annotation and improves the scientific literature information acquisition efficiency. The accuracy of the 11 material entities exceeds 80%, and the accuracy of the method, support, precursor, drying time, and reduction time exceeds 90%.
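The first step described in this abstract, a regular-expression paragraph classifier, could look roughly like the sketch below. The keyword pattern, the threshold, and the function name are assumptions made for illustration; the abstract does not give the expressions the authors actually used.

```python
import re

# Hypothetical keyword pattern for synthesis-related vocabulary.
# The expressions used in the paper are not stated in the abstract.
SYNTHESIS_PATTERN = re.compile(
    r"\b(impregnat\w*|calcin\w*|co-?precipitat\w*|dried|reduced)\b",
    re.IGNORECASE,
)


def is_synthesis_paragraph(text: str, min_hits: int = 2) -> bool:
    """Flag a paragraph that contains at least `min_hits` synthesis keywords."""
    return len(SYNTHESIS_PATTERN.findall(text)) >= min_hits


# Made-up example paragraphs, not quotations from any paper.
synthesis_text = (
    "The Pd/ZnO catalyst was prepared by incipient wetness impregnation, "
    "dried at 110 C for 12 h, and calcined at 400 C."
)
background_text = "Methanol steam reforming is a promising route for hydrogen production."

print(is_synthesis_paragraph(synthesis_text))
print(is_synthesis_paragraph(background_text))
```

A keyword filter like this is cheap but over-inclusive, which is consistent with the abstract's second step of verifying the flagged paragraphs with a machine learning model.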
Affiliation(s)
- Shuyuan Li
- Beijing Key Laboratory for Green Catalysis and Separation, Faculty of Environment and Life, Beijing University of Technology, Beijing 100124, China
- Yunjiang Zhang
- Beijing Key Laboratory for Green Catalysis and Separation, Faculty of Environment and Life, Beijing University of Technology, Beijing 100124, China
- Zhaolin Fang
- Beijing Key Laboratory for Green Catalysis and Separation, Faculty of Environment and Life, Beijing University of Technology, Beijing 100124, China
- Kong Meng
- Beijing Key Laboratory for Green Catalysis and Separation, Faculty of Environment and Life, Beijing University of Technology, Beijing 100124, China
- Rui Tian
- Beijing Engineering Research Center for IoT Software and Systems, Faculty of Information Technology, Beijing University of Technology, Beijing 100124, China
- Hong He
- Beijing Key Laboratory for Green Catalysis and Separation, Faculty of Environment and Life, Beijing University of Technology, Beijing 100124, China
- Shaorui Sun
- Beijing Key Laboratory for Green Catalysis and Separation, Faculty of Environment and Life, Beijing University of Technology, Beijing 100124, China
17
Jablonka KM, Ai Q, Al-Feghali A, Badhwar S, Bocarsly JD, Bran AM, Bringuier S, Brinson LC, Choudhary K, Circi D, Cox S, de Jong WA, Evans ML, Gastellu N, Genzling J, Gil MV, Gupta AK, Hong Z, Imran A, Kruschwitz S, Labarre A, Lála J, Liu T, Ma S, Majumdar S, Merz GW, Moitessier N, Moubarak E, Mouriño B, Pelkie B, Pieler M, Ramos MC, Ranković B, Rodriques SG, Sanders JN, Schwaller P, Schwarting M, Shi J, Smit B, Smith BE, Van Herck J, Völker C, Ward L, Warren S, Weiser B, Zhang S, Zhang X, Zia GA, Scourtas A, Schmidt KJ, Foster I, White AD, Blaiszik B. 14 examples of how LLMs can transform materials science and chemistry: a reflection on a large language model hackathon. Digital Discovery 2023; 2:1233-1250. [PMID: 38013906 PMCID: PMC10561547 DOI: 10.1039/d3dd00113j] [Citation(s) in RCA: 15] [Impact Index Per Article: 15.0] [Received: 06/12/2023] [Accepted: 08/08/2023] [Indexed: 11/04/2023]
Abstract
Large-language models (LLMs) such as GPT-4 caught the interest of many scientists. Recent studies suggested that these models could be useful in chemistry and materials science. To explore these possibilities, we organized a hackathon. This article chronicles the projects built as part of this hackathon. Participants employed LLMs for various applications, including predicting properties of molecules and materials, designing novel interfaces for tools, extracting knowledge from unstructured data, and developing new educational applications. The diverse topics and the fact that working prototypes could be generated in less than two days highlight that LLMs will profoundly impact the future of our fields. The rich collection of ideas and projects also indicates that the applications of LLMs are not limited to materials science and chemistry but offer potential benefits to a wide range of scientific disciplines.
Affiliation(s)
- Kevin Maik Jablonka
- Laboratory of Molecular Simulation (LSMO), Institut des Sciences et Ingénierie Chimiques, Ecole Polytechnique Fédérale de Lausanne (EPFL) Sion Valais Switzerland
- Qianxiang Ai
- Department of Chemical Engineering, Massachusetts Institute of Technology Cambridge Massachusetts 02139 USA
- Joshua D Bocarsly
- Yusuf Hamied Department of Chemistry, University of Cambridge Lensfield Road Cambridge CB2 1EW UK
- Andres M Bran
- Laboratory of Artificial Chemical Intelligence (LIAC), Institut des Sciences et Ingénierie Chimiques, Ecole Polytechnique Fédérale de Lausanne (EPFL) Lausanne Switzerland
- National Centre of Competence in Research (NCCR) Catalysis, Ecole Polytechnique Fédérale de Lausanne (EPFL) Lausanne Switzerland
- Kamal Choudhary
- Material Measurement Laboratory, National Institute of Standards and Technology Maryland 20899 USA
- Defne Circi
- Mechanical Engineering and Materials Science, Duke University USA
- Sam Cox
- Department of Chemical Engineering, University of Rochester USA
- Wibe A de Jong
- Applied Mathematics and Computational Research Division, Lawrence Berkeley National Laboratory Berkeley CA 94720 USA
- Matthew L Evans
- Institut de la Matière Condensée et des Nanosciences (IMCN), UCLouvain Chemin des Étoiles 8 Louvain-la-Neuve 1348 Belgium
- Matgenix SRL 185 Rue Armand Bury 6534 Gozée Belgium
- Nicolas Gastellu
- Department of Chemistry, McGill University Montreal Quebec Canada
- Jerome Genzling
- Department of Chemistry, McGill University Montreal Quebec Canada
- María Victoria Gil
- Instituto de Ciencia y Tecnología del Carbono (INCAR), CSIC Francisco Pintado Fe 26 33011 Oviedo Spain
- Ankur K Gupta
- Applied Mathematics and Computational Research Division, Lawrence Berkeley National Laboratory Berkeley CA 94720 USA
- Zhi Hong
- Department of Computer Science, University of Chicago Chicago Illinois 60637 USA
- Alishba Imran
- Computer Science, University of California Berkeley CA 94704 USA
- Sabine Kruschwitz
- Bundesanstalt für Materialforschung und -prüfung Unter den Eichen 87 12205 Berlin Germany
- Anne Labarre
- Department of Chemistry, McGill University Montreal Quebec Canada
- Jakub Lála
- Francis Crick Institute 1 Midland Rd London NW1 1AT UK
- Tao Liu
- Department of Chemistry, McGill University Montreal Quebec Canada
- Steven Ma
- Department of Chemistry, McGill University Montreal Quebec Canada
- Sauradeep Majumdar
- Laboratory of Molecular Simulation (LSMO), Institut des Sciences et Ingénierie Chimiques, Ecole Polytechnique Fédérale de Lausanne (EPFL) Sion Valais Switzerland
- Garrett W Merz
- American Family Insurance Data Science Institute, University of Wisconsin-Madison Madison WI 53706 USA
- Elias Moubarak
- Laboratory of Molecular Simulation (LSMO), Institut des Sciences et Ingénierie Chimiques, Ecole Polytechnique Fédérale de Lausanne (EPFL) Sion Valais Switzerland
- Beatriz Mouriño
- Laboratory of Molecular Simulation (LSMO), Institut des Sciences et Ingénierie Chimiques, Ecole Polytechnique Fédérale de Lausanne (EPFL) Sion Valais Switzerland
- Brenden Pelkie
- Department of Chemical Engineering, University of Washington Seattle WA 98105 USA
- Bojana Ranković
- Laboratory of Artificial Chemical Intelligence (LIAC), Institut des Sciences et Ingénierie Chimiques, Ecole Polytechnique Fédérale de Lausanne (EPFL) Lausanne Switzerland
- National Centre of Competence in Research (NCCR) Catalysis, Ecole Polytechnique Fédérale de Lausanne (EPFL) Lausanne Switzerland
- Jacob N Sanders
- Department of Chemistry and Biochemistry, University of California Los Angeles CA 90095 USA
- Philippe Schwaller
- Laboratory of Artificial Chemical Intelligence (LIAC), Institut des Sciences et Ingénierie Chimiques, Ecole Polytechnique Fédérale de Lausanne (EPFL) Lausanne Switzerland
- National Centre of Competence in Research (NCCR) Catalysis, Ecole Polytechnique Fédérale de Lausanne (EPFL) Lausanne Switzerland
- Marcus Schwarting
- Department of Computer Science, University of Chicago Chicago IL 60490 USA
- Jiale Shi
- Department of Chemical Engineering, Massachusetts Institute of Technology Cambridge Massachusetts 02139 USA
- Berend Smit
- Laboratory of Molecular Simulation (LSMO), Institut des Sciences et Ingénierie Chimiques, Ecole Polytechnique Fédérale de Lausanne (EPFL) Sion Valais Switzerland
- Ben E Smith
- Yusuf Hamied Department of Chemistry, University of Cambridge Lensfield Road Cambridge CB2 1EW UK
- Joren Van Herck
- Laboratory of Molecular Simulation (LSMO), Institut des Sciences et Ingénierie Chimiques, Ecole Polytechnique Fédérale de Lausanne (EPFL) Sion Valais Switzerland
- Christoph Völker
- Bundesanstalt für Materialforschung und -prüfung Unter den Eichen 87 12205 Berlin Germany
- Logan Ward
- Data Science and Learning Division, Argonne National Lab USA
- Sean Warren
- Department of Chemistry, McGill University Montreal Quebec Canada
- Benjamin Weiser
- Department of Chemistry, McGill University Montreal Quebec Canada
- Sylvester Zhang
- Department of Chemistry, McGill University Montreal Quebec Canada
- Xiaoqi Zhang
- Laboratory of Molecular Simulation (LSMO), Institut des Sciences et Ingénierie Chimiques, Ecole Polytechnique Fédérale de Lausanne (EPFL) Sion Valais Switzerland
- Ghezal Ahmad Zia
- Bundesanstalt für Materialforschung und -prüfung Unter den Eichen 87 12205 Berlin Germany
- Aristana Scourtas
- Globus, University of Chicago, Data Science and Learning Division, Argonne National Lab USA
- K J Schmidt
- Globus, University of Chicago, Data Science and Learning Division, Argonne National Lab USA
- Ian Foster
- Department of Computer Science, University of Chicago, Data Science and Learning Division, Argonne National Lab USA
- Andrew D White
- Department of Chemical Engineering, University of Rochester USA
- Ben Blaiszik
- Globus, University of Chicago, Data Science and Learning Division, Argonne National Lab USA
18
Panayi A, Ward K, Benhadji-Schaff A, Ibanez-Lopez AS, Xia A, Barzilay R. Evaluation of a prototype machine learning tool to semi-automate data extraction for systematic literature reviews. Syst Rev 2023; 12:187. [PMID: 37803451 PMCID: PMC10557215 DOI: 10.1186/s13643-023-02351-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/08/2023] [Accepted: 09/13/2023] [Indexed: 10/08/2023] [Open access]
Abstract
BACKGROUND Evidence-based medicine requires synthesis of research through rigorous and time-intensive systematic literature reviews (SLRs), with significant resource expenditure for data extraction from scientific publications. Machine learning may enable the timely completion of SLRs and reduce errors by automating data identification and extraction.
METHODS We evaluated the use of machine learning to extract data from publications related to SLRs in oncology (SLR 1) and Fabry disease (SLR 2). SLR 1 predominantly contained interventional studies and SLR 2 observational studies. Predefined key terms and data were manually annotated to train and test bidirectional encoder representations from transformers (BERT) and bidirectional long short-term memory machine learning models. Using human annotation as a reference, we assessed the ability of the models to identify biomedical terms of interest (entities) and their relations. We also pretrained BERT on a corpus of 100,000 open access clinical publications and/or enhanced context-dependent entity classification with a conditional random field (CRF) model. Performance was measured using the F1 score, a metric that combines precision and recall. We defined successful matches as partial overlap of entities of the same type.
RESULTS For entity recognition, the pretrained BERT+CRF model had the best performance, with an F1 score of 73% in SLR 1 and 70% in SLR 2. Entity types identified with the highest accuracy were metrics for progression-free survival (SLR 1, F1 score 88%) or for patient age (SLR 2, F1 score 82%). Treatment arm dosage was identified less successfully (F1 scores 60% [SLR 1] and 49% [SLR 2]). The best-performing model for relation extraction, pretrained BERT relation classification, exhibited F1 scores higher than 90% in cases with at least 80 relation examples for a pair of related entity types.
CONCLUSIONS The performance of BERT is enhanced by pretraining with biomedical literature and by combining with a CRF model. With refinement, machine learning may assist with manual data extraction for SLRs.
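The scoring rule described in the abstract above (F1 computed from precision and recall, with a match counted whenever a predicted entity partially overlaps a reference entity of the same type) can be illustrated with a minimal Python sketch. The span representation, the function names, the entity labels, and the toy data are assumptions made for illustration and are not taken from the paper; the authors' evaluation code may, for example, treat one-to-many overlaps differently.

```python
from typing import List, Tuple

# Hypothetical representation: (start_offset, end_offset, entity_type), end exclusive.
Entity = Tuple[int, int, str]


def overlaps(a: Entity, b: Entity) -> bool:
    """True when two spans have the same type and share at least one character."""
    return a[2] == b[2] and a[0] < b[1] and b[0] < a[1]


def partial_match_f1(gold: List[Entity], pred: List[Entity]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) under partial-overlap matching."""
    # A prediction is correct if it overlaps any reference entity of the same type.
    correct_pred = sum(any(overlaps(p, g) for g in gold) for p in pred)
    # A reference entity is found if any prediction of the same type overlaps it.
    found_gold = sum(any(overlaps(g, p) for p in pred) for g in gold)
    precision = correct_pred / len(pred) if pred else 0.0
    recall = found_gold / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    return precision, recall, 2 * precision * recall / (precision + recall)


# Toy data (invented): three reference entities, four predictions.
gold = [(0, 10, "AGE"), (20, 35, "DOSAGE"), (50, 60, "PFS")]
pred = [(2, 8, "AGE"), (22, 30, "PFS"), (48, 55, "PFS"), (70, 80, "AGE")]
precision, recall, f1 = partial_match_f1(gold, pred)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

In this toy case two of the four predictions overlap a reference entity of the same type, and two of the three reference entities are found, so precision is 1/2 and recall is 2/3. The second prediction overlaps the dosage span but carries a different type, so it does not count as a match.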
Affiliation(s)
- Antonia Panayi
  - Takeda Pharmaceuticals International AG, Thurgauerstrasse 130, 8152, Glattpark-Opfikon, Zurich, Switzerland.
- Andrew Xia
  - Takeda Pharmaceuticals International AG, Thurgauerstrasse 130, 8152, Glattpark-Opfikon, Zurich, Switzerland
|
19
|
Reid JP, Betinol IO, Kuang Y. Mechanism to model: a physical organic chemistry approach to reaction prediction. Chem Commun (Camb) 2023; 59:10711-10721. [PMID: 37552047 DOI: 10.1039/d3cc03229a] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Indexed: 08/09/2023]
Abstract
The application of mechanistic generalizations is at the core of chemical reaction development and application. These strategies are rooted in physical organic chemistry where mechanistic understandings can be derived from one reaction and applied to explain another. Over time these techniques have evolved from rationalizing observed outcomes to leading experimental design through reaction prediction. In parallel, significant progression in asymmetric organocatalysis has expanded the reach of chiral transfer to new reactions with increased efficiency. However, the complex and diverse catalyst structures applied in this arena have rendered the generalization of asymmetric catalytic processes to be exceptionally challenging. Recognizing this, a portion of our research has been focused on understanding the transferability of chemical observations between similar reactions and exploiting this phenomenon as a platform for prediction. Through these experiences, we have relied on a working knowledge of reaction mechanism to guide the development and application of our models which have been advanced from simple qualitative rules to large statistical models for quantitative predictions. In this feature article, we describe the models acquired to generalize organocatalytic reaction mechanisms and demonstrate their use as a powerful approach for accelerating enantioselective synthesis.
Affiliation(s)
- Jolene P Reid
  - Department of Chemistry, University of British Columbia, 2036 Main Mall, Vancouver, British Columbia, V6T 1Z1, Canada.
- Isaiah O Betinol
  - Department of Chemistry, University of British Columbia, 2036 Main Mall, Vancouver, British Columbia, V6T 1Z1, Canada.
- Yutao Kuang
  - Department of Chemistry, University of British Columbia, 2036 Main Mall, Vancouver, British Columbia, V6T 1Z1, Canada.
|
20
|
Qian Y, Guo J, Tu Z, Coley CW, Barzilay R. RxnScribe: A Sequence Generation Model for Reaction Diagram Parsing. J Chem Inf Model 2023. [PMID: 37368970 DOI: 10.1021/acs.jcim.3c00439] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Indexed: 06/29/2023]
Abstract
Reaction diagram parsing is the task of extracting reaction schemes from a diagram in the chemistry literature. The reaction diagrams can be arbitrarily complex; thus, robustly parsing them into structured data is an open challenge. In this paper, we present RxnScribe, a machine learning model for parsing reaction diagrams of varying styles. We formulate this structured prediction task with a sequence generation approach, which condenses the traditional pipeline into an end-to-end model. We train RxnScribe on a dataset of 1378 diagrams and evaluate it with cross validation, achieving an 80.0% soft match F1 score, with significant improvements over previous models. Our code and data are publicly available at https://github.com/thomas0809/RxnScribe.
Affiliation(s)
- Yujie Qian
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
- Jiang Guo
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
- Zhengkai Tu
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
- Connor W Coley
  - Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
- Regina Barzilay
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
|
21
|
Shetty P, Rajan AC, Kuenneth C, Gupta S, Panchumarti LP, Holm L, Zhang C, Ramprasad R. A general-purpose material property data extraction pipeline from large polymer corpora using natural language processing. NPJ Comput Mater 2023; 9:52. [PMID: 37033291 PMCID: PMC10073792 DOI: 10.1038/s41524-023-01003-w] [Citation(s) in RCA: 7] [Impact Index Per Article: 7.0] [Received: 09/16/2022] [Accepted: 03/16/2023] [Indexed: 06/19/2023]
Abstract
The ever-increasing number of materials science articles makes it hard to infer chemistry-structure-property relations from literature. We used natural language processing methods to automatically extract material property data from the abstracts of polymer literature. As a component of our pipeline, we trained MaterialsBERT, a language model, using 2.4 million materials science abstracts, which outperforms other baseline models in three out of five named entity recognition datasets. Using this pipeline, we obtained ~300,000 material property records from ~130,000 abstracts in 60 hours. The extracted data was analyzed for a diverse range of applications such as fuel cells, supercapacitors, and polymer solar cells to recover non-trivial insights. The data extracted through our pipeline is made available at polymerscholar.org which can be used to locate material property data recorded in abstracts. This work demonstrates the feasibility of an automatic pipeline that starts from published literature and ends with extracted material property information.
Affiliation(s)
- Pranav Shetty
  - School of Computational Science & Engineering, Atlanta, GA USA
- Arunkumar Chitteth Rajan
  - School of Materials Science and Engineering, Georgia Institute of Technology, 771 Ferst Drive NW, Atlanta, 30332 GA USA
- Chris Kuenneth
  - School of Materials Science and Engineering, Georgia Institute of Technology, 771 Ferst Drive NW, Atlanta, 30332 GA USA
- Sonakshi Gupta
  - Department of Metallurgy Engineering and Materials Science, Indian Institute of Technology, Indore, Madhya Pradesh India
- Lakshmi Prerana Panchumarti
  - School of Materials Science and Engineering, Georgia Institute of Technology, 771 Ferst Drive NW, Atlanta, 30332 GA USA
- Lauren Holm
  - School of Materials Science and Engineering, Georgia Institute of Technology, 771 Ferst Drive NW, Atlanta, 30332 GA USA
- Chao Zhang
  - School of Computational Science & Engineering, Atlanta, GA USA
- Rampi Ramprasad
  - School of Materials Science and Engineering, Georgia Institute of Technology, 771 Ferst Drive NW, Atlanta, 30332 GA USA
|
22
|
Qian Y, Guo J, Tu Z, Li Z, Coley CW, Barzilay R. MolScribe: Robust Molecular Structure Recognition with Image-to-Graph Generation. J Chem Inf Model 2023; 63:1925-1934. [PMID: 36971363 DOI: 10.1021/acs.jcim.2c01480] [Citation(s) in RCA: 11] [Impact Index Per Article: 11.0] [Indexed: 03/29/2023]
Abstract
Molecular structure recognition is the task of translating a molecular image into its graph structure. Significant variation in drawing styles and conventions exhibited in chemical literature poses a significant challenge for automating this task. In this paper, we propose MolScribe, a novel image-to-graph generation model that explicitly predicts atoms and bonds, along with their geometric layouts, to construct the molecular structure. Our model flexibly incorporates symbolic chemistry constraints to recognize chirality and expand abbreviated structures. We further develop data augmentation strategies to enhance the model robustness against domain shifts. In experiments on both synthetic and realistic molecular images, MolScribe significantly outperforms previous models, achieving 76-93% accuracy on public benchmarks. Chemists can also easily verify MolScribe's prediction, informed by its confidence estimation and atom-level alignment with the input image. MolScribe is publicly available through Python and web interfaces: https://github.com/thomas0809/MolScribe.
Affiliation(s)
- Yujie Qian
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
- Jiang Guo
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
- Zhengkai Tu
  - Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
- Zhening Li
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
- Connor W Coley
  - Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
- Regina Barzilay
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
|
23
|
Wang W, Liu Y, Wang Z, Hao G, Song B. The way to AI-controlled synthesis: how far do we need to go? Chem Sci 2022; 13:12604-12615. [PMID: 36519036 PMCID: PMC9645373 DOI: 10.1039/d2sc04419f] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 08/08/2022] [Accepted: 09/26/2022] [Indexed: 09/08/2024] Open
Abstract
Chemical synthesis plays an irreplaceable role in the chemical, materials, and pharmacological fields. Meanwhile, artificial intelligence (AI) is causing a rapid technological revolution in many fields by replacing manual chemical synthesis, and it has proven much more economical and time-efficient. However, the rate-determining step of AI-controlled synthesis systems is rarely mentioned, which makes it difficult to apply them in general laboratories. Here, we review and summarize the history of AI-aided synthesis. We propose that the hardware of AI-controlled synthesis systems should be more adaptive, so that it can execute reactions with reagents in different phases and under different reaction conditions, and that the software of AI-controlled synthesis systems should offer a richer set of reaction prediction modules. An updated system would handle a wider variety of syntheses. Our viewpoint could help scientists advance the revolution that combines AI and synthesis to achieve more progress in complicated systems.
Affiliation(s)
- Wei Wang
  - State Key Laboratory Breeding Base of Green Pesticide and Agricultural Bioengineering, Key Laboratory of Green Pesticide and Agricultural Bioengineering, Ministry of Education, Research and Development Center for Fine Chemicals, Guizhou University Guiyang 550025 P. R. China
- Yingwei Liu
  - State Key Laboratory of Public Big Data, Guizhou University Guiyang 550025 P. R. China
- Zheng Wang
  - State Key Laboratory Breeding Base of Green Pesticide and Agricultural Bioengineering, Key Laboratory of Green Pesticide and Agricultural Bioengineering, Ministry of Education, Research and Development Center for Fine Chemicals, Guizhou University Guiyang 550025 P. R. China
- Gefei Hao
  - State Key Laboratory Breeding Base of Green Pesticide and Agricultural Bioengineering, Key Laboratory of Green Pesticide and Agricultural Bioengineering, Ministry of Education, Research and Development Center for Fine Chemicals, Guizhou University Guiyang 550025 P. R. China
- Baoan Song
  - State Key Laboratory Breeding Base of Green Pesticide and Agricultural Bioengineering, Key Laboratory of Green Pesticide and Agricultural Bioengineering, Ministry of Education, Research and Development Center for Fine Chemicals, Guizhou University Guiyang 550025 P. R. China
|
24
|
Shavalieva G, Papadokonstantakis S, Peters G. Prior Knowledge for Predictive Modeling: The Case of Acute Aquatic Toxicity. J Chem Inf Model 2022; 62:4018-4031. [PMID: 35998659 PMCID: PMC9472271 DOI: 10.1021/acs.jcim.1c01079] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/06/2021] [Indexed: 11/30/2022]
Abstract
Early assessment of the potential impact of chemicals on health and the environment requires toxicological properties of the molecules. Predictive modeling is often used to estimate the property values in silico from pre-existing experimental data, which is often scarce and uncertain. One of the ways to advance the predictive modeling procedure might be the use of knowledge existing in the field. Scientific publications contain a vast amount of knowledge. However, the amount of manual work required to process the enormous volumes of information gathered in scientific articles might hinder its utilization. This work explores the opportunity of semiautomated knowledge extraction from scientific papers and investigates a few potential ways of its use for predictive modeling. The knowledge extraction and predictive modeling are applied to the field of acute aquatic toxicity. Acute aquatic toxicity is an important parameter of the safety assessment of chemicals. The extensive amount of diverse information existing in the field makes acute aquatic toxicity an attractive area for investigation of knowledge use for predictive modeling. The work demonstrates that the knowledge collection and classification procedure could be useful in hybrid modeling studies concerning the model and predictor selection, addressing data gaps, and evaluation of models' performance.
Affiliation(s)
- Gulnara Shavalieva
  - Department of Space, Earth and Environment, Division of Energy Technology, Chalmers University of Technology, SE-412 96 Gothenburg, Sweden
- Stavros Papadokonstantakis
  - Department of Space, Earth and Environment, Division of Energy Technology, Chalmers University of Technology, SE-412 96 Gothenburg, Sweden
  - Institute of Chemical, Environmental and Bioscience Engineering, TU Wien, Getreidemarkt 9, 1060 Vienna, Austria
- Gregory Peters
  - Department of Technology Management and Economics, Chalmers University of Technology, SE-411 33 Gothenburg, Sweden
|
25
|
Tang C, McInnes BT. Cascade Processes with Micellar Reaction Media: Recent Advances and Future Directions. Molecules 2022; 27:5611. [PMID: 36080376 PMCID: PMC9458028 DOI: 10.3390/molecules27175611] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 08/18/2022] [Revised: 08/27/2022] [Accepted: 08/29/2022] [Indexed: 11/26/2022] Open
Abstract
Reducing the use of solvents is an important aim of green chemistry. Using micelles self-assembled from amphiphilic molecules dispersed in water (considered a green solvent) has facilitated reactions of organic compounds. When performing reactions in micelles, the hydrophobic effect can considerably accelerate apparent reaction rates, as well as enhance selectivity. Here, we review micellar reaction media and their potential role in sustainable chemical production. The focus of this review is applications of engineered amphiphilic systems for reactions (surface-active ionic liquids, designer surfactants, and block copolymers) as reaction media. Micelles are a versatile platform for performing a large array of organic chemistries using water as the bulk solvent. Building on this foundation, synthetic sequences combining several reaction steps in one pot have been developed. Telescoping multiple reactions can reduce solvent waste by limiting the volume of solvents, as well as eliminating purification processes. Thus, in particular, we review recent advances in “one-pot” multistep reactions achieved using micellar reaction media with potential applications in medicinal chemistry and agrochemistry. Photocatalyzed reactions in micellar reaction media are also discussed. In addition to the use of micelles, we emphasize the process (steps to isolate the product and reuse the catalyst).
Affiliation(s)
- Christina Tang
  - Chemical and Life Science Engineering Department, Virginia Commonwealth University, Richmond, VA 23284, USA
- Bridget T. McInnes
  - Computer Science Department, Virginia Commonwealth University, Richmond, VA 23284, USA
|
26
|
Gao H, Zhu LT, Luo ZH, Fraga MA, Hsing IM. Machine Learning and Data Science in Chemical Engineering. Ind Eng Chem Res 2022. [DOI: 10.1021/acs.iecr.2c01788] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/28/2022]
Affiliation(s)
- Hanyu Gao
  - Department of Chemical and Biological Engineering, Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong SAR, People’s Republic of China
- Li-Tao Zhu
  - Department of Chemical Engineering, School of Chemistry and Chemical Engineering, State Key Laboratory of Metal Matrix Composites, Shanghai Jiao Tong University, Shanghai 200240, People’s Republic of China
- Zheng-Hong Luo
  - Department of Chemical Engineering, School of Chemistry and Chemical Engineering, State Key Laboratory of Metal Matrix Composites, Shanghai Jiao Tong University, Shanghai 200240, People’s Republic of China
- Marco A. Fraga
  - Instituto Nacional de Tecnologia − INT, Av. Venezuela, 82/518, Rio de Janeiro, RJ 20081-312, Brazil
- I-Ming Hsing
  - Department of Chemical and Biological Engineering, Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong SAR, People’s Republic of China
|
27
|
Rarey M, Nicklaus MC, Warr W. Special Issue on Reaction Informatics and Chemical Space. J Chem Inf Model 2022; 62:2009-2010. [DOI: 10.1021/acs.jcim.2c00390] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 01/13/2023]
Affiliation(s)
- Matthias Rarey
  - Universität Hamburg, ZBH − Center for Bioinformatics, 20146 Hamburg, Germany
- Marc C. Nicklaus
  - NCI, NIH, CADD Group, NCI-Frederick, Frederick, Maryland 21702, United States
- Wendy Warr
  - Wendy Warr & Associates, Cheshire CW4 7HZ, U.K
|
28
|
Zhang L, He M. Prediction of solar cell materials via unsupervised literature learning. J Phys Condens Matter 2021; 34:095902. [PMID: 34844235 DOI: 10.1088/1361-648x/ac3e1e] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/23/2021] [Accepted: 11/29/2021] [Indexed: 06/13/2023]
Abstract
Despite the significant advancement of data-driven studies in physical science, the textual data that abound in the literature have not been fully embraced by the physics and materials community. In this manuscript, we employ a natural language processing (NLP) technique to predict, in an unsupervised manner and without human annotation, the existence of solar cell types, including dye-sensitized solar cells and perovskite solar cells, based on literature published prior to their first discovery. Building on this, we further identify possible solar cell material candidates via NLP, starting with a comprehensive training database of 3.2 million paper abstracts published before 2021. The NLP model effectively predicts the existing solar cell materials, while an uncommon solar cell material, namely PtSe2, is suggested as an appropriate candidate for future solar cells. Its optoelectronic properties are comprehensively investigated via first-principles calculations, which reveal the decent stability and optoelectronic performance of the NLP-predicted candidate. This study demonstrates the viability of textual data for data-driven materials prediction and highlights the NLP method as a powerful tool to reliably predict solar cell materials.
Affiliation(s)
- Lei Zhang
  - Institute of Advanced Materials and Flexible Electronics (IAMFE), School of Chemistry and Materials Science, Nanjing University of Information Science & Technology, 210044, Nanjing, People's Republic of China
  - Department of Materials Physics, School of Chemistry and Materials Science, Nanjing University of Information Science & Technology, 210044, Nanjing, People's Republic of China
- Mu He
  - Institute of Advanced Materials and Flexible Electronics (IAMFE), School of Chemistry and Materials Science, Nanjing University of Information Science & Technology, 210044, Nanjing, People's Republic of China
|