1
|
Yang L, Guo Q, Zhang L. AI-assisted chemistry research: a comprehensive analysis of evolutionary paths and hotspots through knowledge graphs. Chem Commun (Camb) 2024. [PMID: 38910536 DOI: 10.1039/d4cc01892c] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/25/2024]
Abstract
Artificial intelligence (AI) offers transformative potential for chemical research through its ability to optimize reactions and processes, enhance energy efficiency, and reduce waste. AI-assisted chemical research (AI + chem) has become a global hotspot. To better understand the current research status of "AI + chem", this study conducted a bibliometric investigation using CiteSpace. The Web of Science Core Collection was used to retrieve original articles related to "AI + chem" published from 2000 to 2024. The obtained data allowed for the visualization of the knowledge background, current research status, and latest knowledge structure of "AI + chem". The "AI + chem" field has entered a stage of explosive growth, and the number of papers will maintain long-term high-speed growth. This article systematically analyzes the latest progress in "AI + chem" and objectively predicts future trends, including molecular design, reaction prediction, materials design, drug design, and quantum chemistry. The outcomes of this study will provide readers with a comprehensive understanding of the overall landscape of "AI + chem".
Affiliation(s)
- Lin Yang
- School of Intellectual Property, Dalian University of Technology, Dalian 116024, Liaoning, P. R. China
| | - Qingle Guo
- School of Intellectual Property, Dalian University of Technology, Dalian 116024, Liaoning, P. R. China
| | - Lijing Zhang
- School of Chemistry, Dalian University of Technology, Dalian 116024, Liaoning, P. R. China.
| |
|
2
|
Wigh D, Arrowsmith J, Pomberger A, Felton KC, Lapkin AA. ORDerly: Data Sets and Benchmarks for Chemical Reaction Data. J Chem Inf Model 2024; 64:3790-3798. [PMID: 38648077 PMCID: PMC11094788 DOI: 10.1021/acs.jcim.4c00292] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/20/2024] [Revised: 04/03/2024] [Accepted: 04/04/2024] [Indexed: 04/25/2024]
Abstract
Machine learning has the potential to provide tremendous value to life sciences by providing models that aid in the discovery of new molecules and reduce the time for new products to come to market. Chemical reactions play a significant role in these fields, but there is a lack of high-quality open-source chemical reaction data sets for training machine learning models. Herein, we present ORDerly, an open-source Python package for the customizable and reproducible preparation of reaction data stored in accordance with the increasingly popular Open Reaction Database (ORD) schema. We use ORDerly to clean United States patent data stored in ORD and generate data sets for forward prediction, retrosynthesis, as well as the first benchmark for reaction condition prediction. We train neural networks on data sets generated with ORDerly for condition prediction and show that data sets missing key cleaning steps can lead to silently overinflated performance metrics. Additionally, we train transformers for forward and retrosynthesis prediction and demonstrate how non-patent data can be used to evaluate model generalization. By providing a customizable open-source solution for cleaning and preparing large chemical reaction data, ORDerly is poised to push forward the boundaries of machine learning applications in chemistry.
Affiliation(s)
- Daniel S. Wigh
- Department of Chemical Engineering and Biotechnology, University of Cambridge, Cambridge CB3 0AS, U.K.
| | - Joe Arrowsmith
- Department of Chemical Engineering and Biotechnology, University of Cambridge, Cambridge CB3 0AS, U.K.
| | - Alexander Pomberger
- Department of Chemical Engineering and Biotechnology, University of Cambridge, Cambridge CB3 0AS, U.K.
| | - Kobi C. Felton
- Department of Chemical Engineering and Biotechnology, University of Cambridge, Cambridge CB3 0AS, U.K.
| | - Alexei A. Lapkin
- Department of Chemical Engineering and Biotechnology, University of Cambridge, Cambridge CB3 0AS, U.K.
| |
|
3
|
Wang X, Hsieh CY, Yin X, Wang J, Li Y, Deng Y, Jiang D, Wu Z, Du H, Chen H, Li Y, Liu H, Wang Y, Luo P, Hou T, Yao X. Generic Interpretable Reaction Condition Predictions with Open Reaction Condition Datasets and Unsupervised Learning of Reaction Center. Research (Washington, D.C.) 2023; 6:0231. [PMID: 37849643 PMCID: PMC10578430 DOI: 10.34133/research.0231] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 06/01/2023] [Accepted: 08/29/2023] [Indexed: 10/19/2023]
Abstract
Effective synthesis planning powered by deep learning (DL) can significantly accelerate the discovery of new drugs and materials. However, most DL-assisted synthesis planning methods offer either none or very limited capability to recommend suitable reaction conditions (RCs) for their reaction predictions. Currently, the prediction of RCs with a DL framework is hindered by several factors, including: (a) lack of a standardized dataset for benchmarking, (b) lack of a general prediction model with powerful representation, and (c) lack of interpretability. To address these issues, we first created 2 standardized RC datasets covering a broad range of reaction classes and then proposed a powerful and interpretable Transformer-based RC predictor named Parrot. Through careful design of the model architecture, pretraining method, and training strategy, Parrot improved the overall top-3 prediction accuracy on catalysis, solvents, and other reagents by as much as 13.44%, compared to the best previous model on a newly curated dataset. Additionally, the mean absolute error of the predicted temperatures was reduced by about 4 °C. Furthermore, Parrot manifests strong generalization capacity with superior cross-chemical-space prediction accuracy. Attention analysis indicates that Parrot effectively captures crucial chemical information and exhibits a high level of interpretability in the prediction of RCs. The proposed model Parrot exemplifies how modern neural network architecture when appropriately pretrained can be versatile in making reliable, generalizable, and interpretable recommendation for RCs even when the underlying training dataset may still be limited in diversity.
Affiliation(s)
- Xiaorui Wang
- Dr. Neher’s Biophysics Laboratory for Innovative Drug Discovery, State Key Laboratory of Quality Research in Chinese Medicine, Macau Institute for Applied Research in Medicine and Health, Macau University of Science and Technology, Macao, 999078, China
- CarbonSilicon AI Technology Co., Ltd, Hangzhou, Zhejiang 310018, China
| | - Chang-Yu Hsieh
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310058, China
| | - Xiaodan Yin
- Dr. Neher’s Biophysics Laboratory for Innovative Drug Discovery, State Key Laboratory of Quality Research in Chinese Medicine, Macau Institute for Applied Research in Medicine and Health, Macau University of Science and Technology, Macao, 999078, China
- CarbonSilicon AI Technology Co., Ltd, Hangzhou, Zhejiang 310018, China
| | - Jike Wang
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310058, China
- CarbonSilicon AI Technology Co., Ltd, Hangzhou, Zhejiang 310018, China
| | - Yuquan Li
- College of Chemistry and Chemical Engineering, Lanzhou University, Lanzhou, 730000, China
| | - Yafeng Deng
- CarbonSilicon AI Technology Co., Ltd, Hangzhou, Zhejiang 310018, China
| | - Dejun Jiang
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310058, China
- CarbonSilicon AI Technology Co., Ltd, Hangzhou, Zhejiang 310018, China
| | - Zhenxing Wu
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310058, China
- CarbonSilicon AI Technology Co., Ltd, Hangzhou, Zhejiang 310018, China
| | - Hongyan Du
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310058, China
| | - Hongming Chen
- Center of Chemistry and Chemical Biology, Guangzhou Regenerative Medicine and Health Guangdong Laboratory, Guangzhou 510530, China
| | - Yun Li
- College of Chemistry and Chemical Engineering, Lanzhou University, Lanzhou, 730000, China
| | - Huanxiang Liu
- Faculty of Applied Sciences, Macao Polytechnic University, Macao, 999078, China
| | - Yuwei Wang
- College of Pharmacy, Shaanxi University of Chinese Medicine, Xianyang, Shaanxi, 712044, China
| | - Pei Luo
- Dr. Neher’s Biophysics Laboratory for Innovative Drug Discovery, State Key Laboratory of Quality Research in Chinese Medicine, Macau Institute for Applied Research in Medicine and Health, Macau University of Science and Technology, Macao, 999078, China
| | - Tingjun Hou
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310058, China
| | - Xiaojun Yao
- Faculty of Applied Sciences, Macao Polytechnic University, Macao, 999078, China
| |
|
4
|
Nizam ZM, Stowe AM, Mckinney JK, Ohata J. Iron-sensitive protein conjugates formed with a Wittig reaction precursor in ionic liquid. Chem Commun (Camb) 2023; 59:12160-12163. [PMID: 37743738 DOI: 10.1039/d3cc03825d] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/26/2023]
Abstract
In this report, formation of protein conjugates with an iron-sensitive enamine linkage is demonstrated through the ionic liquid-based bioconjugation method.
Affiliation(s)
- Zeinab M Nizam
- Department of Chemistry, North Carolina State University, Raleigh, North Carolina 27695, USA.
| | - Ashton M Stowe
- Department of Chemistry, North Carolina State University, Raleigh, North Carolina 27695, USA.
| | - Jada K Mckinney
- Department of Chemistry, North Carolina State University, Raleigh, North Carolina 27695, USA.
| | - Jun Ohata
- Department of Chemistry, North Carolina State University, Raleigh, North Carolina 27695, USA.
| |
|
5
|
Shim E, Tewari A, Cernak T, Zimmerman PM. Machine Learning Strategies for Reaction Development: Toward the Low-Data Limit. J Chem Inf Model 2023; 63:3659-3668. [PMID: 37312524 PMCID: PMC11163943 DOI: 10.1021/acs.jcim.3c00577] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Indexed: 06/15/2023]
Abstract
Machine learning models are increasingly being utilized to predict outcomes of organic chemical reactions. A large amount of reaction data is used to train these models, which is in stark contrast to how expert chemists discover and develop new reactions by leveraging information from a small number of relevant transformations. Transfer learning and active learning are two strategies that can operate in low-data situations, which may help fill this gap and promote the use of machine learning for tackling real-world challenges in organic synthesis. This Perspective introduces active and transfer learning and connects these to potential opportunities and directions for further research, especially in the area of prospective development of chemical transformations.
Affiliation(s)
- Eunjae Shim
- Department of Chemistry, University of Michigan, Ann Arbor, Michigan 48109, United States
| | - Ambuj Tewari
- Department of Statistics, University of Michigan, Ann Arbor, Michigan 48109, United States
- Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, Michigan 48109, United States
| | - Tim Cernak
- Department of Chemistry, University of Michigan, Ann Arbor, Michigan 48109, United States
- Department of Medicinal Chemistry, University of Michigan, Ann Arbor, Michigan 48109, United States
| | - Paul M Zimmerman
- Department of Chemistry, University of Michigan, Ann Arbor, Michigan 48109, United States
| |
|
6
|
Chen Y, Ou Y, Zheng P, Huang Y, Ge F, Dral PO. Benchmark of general-purpose machine learning-based quantum mechanical method AIQM1 on reaction barrier heights. J Chem Phys 2023; 158:074103. [PMID: 36813722 DOI: 10.1063/5.0137101] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Indexed: 02/01/2023]
Abstract
Artificial intelligence-enhanced quantum mechanical method 1 (AIQM1) is a general-purpose method that was shown to achieve high accuracy for many applications with a speed close to its baseline semiempirical quantum mechanical (SQM) method ODM2*. Here, we evaluate the hitherto unknown performance of out-of-the-box AIQM1 without any refitting for reaction barrier heights on eight datasets, including a total of ∼24 thousand reactions. This evaluation shows that AIQM1's accuracy strongly depends on the type of transition state and ranges from excellent for rotation barriers to poor for, e.g., pericyclic reactions. AIQM1 clearly outperforms its baseline ODM2* method and, even more so, a popular universal potential, ANI-1ccx. Overall, however, AIQM1 accuracy largely remains similar to SQM methods (and B3LYP/6-31G* for most reaction types) suggesting that it is desirable to focus on improving AIQM1 performance for barrier heights in the future. We also show that the built-in uncertainty quantification helps in identifying confident predictions. The accuracy of confident AIQM1 predictions is approaching the level of popular density functional theory methods for most reaction types. Encouragingly, AIQM1 is rather robust for transition state optimizations, even for the type of reactions it struggles with the most. Single-point calculations with high-level methods on AIQM1-optimized geometries can be used to significantly improve barrier heights, which cannot be said for its baseline ODM2* method.
Affiliation(s)
- Yuxinxin Chen
- State Key Laboratory of Physical Chemistry of Solid Surfaces, Fujian Provincial Key Laboratory of Theoretical and Computational Chemistry, Department of Chemistry, and College of Chemistry and Chemical Engineering, Xiamen University, Xiamen 361005, China
| | - Yanchi Ou
- State Key Laboratory of Physical Chemistry of Solid Surfaces, Fujian Provincial Key Laboratory of Theoretical and Computational Chemistry, Department of Chemistry, and College of Chemistry and Chemical Engineering, Xiamen University, Xiamen 361005, China
| | - Peikun Zheng
- State Key Laboratory of Physical Chemistry of Solid Surfaces, Fujian Provincial Key Laboratory of Theoretical and Computational Chemistry, Department of Chemistry, and College of Chemistry and Chemical Engineering, Xiamen University, Xiamen 361005, China
| | - Yaohuang Huang
- State Key Laboratory of Physical Chemistry of Solid Surfaces, Fujian Provincial Key Laboratory of Theoretical and Computational Chemistry, Department of Chemistry, and College of Chemistry and Chemical Engineering, Xiamen University, Xiamen 361005, China
| | - Fuchun Ge
- State Key Laboratory of Physical Chemistry of Solid Surfaces, Fujian Provincial Key Laboratory of Theoretical and Computational Chemistry, Department of Chemistry, and College of Chemistry and Chemical Engineering, Xiamen University, Xiamen 361005, China
| | - Pavlo O Dral
- State Key Laboratory of Physical Chemistry of Solid Surfaces, Fujian Provincial Key Laboratory of Theoretical and Computational Chemistry, Department of Chemistry, and College of Chemistry and Chemical Engineering, Xiamen University, Xiamen 361005, China
| |
|
7
|
Zhang SQ, Xu LC, Li SW, Oliveira JCA, Li X, Ackermann L, Hong X. Bridging Chemical Knowledge and Machine Learning for Performance Prediction of Organic Synthesis. Chemistry 2023; 29:e202202834. [PMID: 36206170 PMCID: PMC10099903 DOI: 10.1002/chem.202202834] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/12/2022] [Indexed: 11/29/2022]
Abstract
Recent years have witnessed a boom of machine learning (ML) applications in chemistry, which reveals the potential of data-driven prediction of synthesis performance. Digitalization and ML modelling are the key strategies to fully exploit the unique potential within the synergistic interplay between experimental data and the robust prediction of performance and selectivity. A series of exciting studies have demonstrated the importance of chemical knowledge implementation in ML, which improves the model's capability for making predictions that are challenging and often go beyond the abilities of human beings. This Minireview summarizes the cutting-edge embedding techniques and model designs in synthetic performance prediction, elaborating how chemical knowledge can be incorporated into machine learning until June 2022. By merging organic synthesis tactics and chemical informatics, we hope this Review can provide a guide map and intrigue chemists to revisit the digitalization and computerization of organic chemistry principles.
Affiliation(s)
- Shuo-Qing Zhang
- Center of Chemistry for Frontier Technologies, Department of Chemistry, State Key Laboratory of Clean Energy Utilization, Zhejiang University, 38 Zheda Road, Hangzhou, 310027, P. R. China
| | - Li-Cheng Xu
- Center of Chemistry for Frontier Technologies, Department of Chemistry, State Key Laboratory of Clean Energy Utilization, Zhejiang University, 38 Zheda Road, Hangzhou, 310027, P. R. China
| | - Shu-Wen Li
- Center of Chemistry for Frontier Technologies, Department of Chemistry, State Key Laboratory of Clean Energy Utilization, Zhejiang University, 38 Zheda Road, Hangzhou, 310027, P. R. China
| | - João C A Oliveira
- Institut für Organische und Biomolekulare Chemie, Wöhler Research Institute for Sustainable Chemistry (WISCh), Georg-August-Universität, Tammannstraße 2, 37077, Göttingen, Germany
| | - Xin Li
- Center of Chemistry for Frontier Technologies, Department of Chemistry, State Key Laboratory of Clean Energy Utilization, Zhejiang University, 38 Zheda Road, Hangzhou, 310027, P. R. China
| | - Lutz Ackermann
- Institut für Organische und Biomolekulare Chemie, Wöhler Research Institute for Sustainable Chemistry (WISCh), Georg-August-Universität, Tammannstraße 2, 37077, Göttingen, Germany
| | - Xin Hong
- Center of Chemistry for Frontier Technologies, Department of Chemistry, State Key Laboratory of Clean Energy Utilization, Zhejiang University, 38 Zheda Road, Hangzhou, 310027, P. R. China
- Beijing National Laboratory for Molecular Sciences, Zhongguancun North First Street No. 2, Beijing, 100190, P. R. China
- Key Laboratory of Precise Synthesis of Functional Molecules of Zhejiang Province, School of Science, Westlake University, 18 Shilongshan Road, Hangzhou, 310024, Zhejiang Province, P. R. China
| |
|
8
|
A Review on Artificial Intelligence Enabled Design, Synthesis, and Process Optimization of Chemical Products for Industry 4.0. Processes (Basel) 2023. [DOI: 10.3390/pr11020330] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/20/2023]
Abstract
With the development of Industry 4.0, artificial intelligence (AI) is gaining increasing attention for its performance in solving particularly complex problems in industrial chemistry and chemical engineering. Therefore, this review provides an overview of the application of AI techniques, in particular machine learning, in chemical design, synthesis, and process optimization over the past years. In this review, the focus is on the application of AI for structure-function relationship analysis, synthetic route planning, and automated synthesis. Finally, we discuss the challenges and future of AI in making chemical products.
|
9
|
Andronov M, Voinarovska V, Andronova N, Wand M, Clevert DA, Schmidhuber J. Reagent prediction with a molecular transformer improves reaction data quality. Chem Sci 2023; 14:3235-3246. [PMID: 36970100 PMCID: PMC10034139 DOI: 10.1039/d2sc06798f] [Citation(s) in RCA: 6] [Impact Index Per Article: 6.0] [Received: 12/09/2022] [Accepted: 02/12/2023] [Indexed: 03/05/2023]
Abstract
A molecular transformer predicts reagents for organic reactions. It is also able to replace questionable reagents in reaction data, e.g. USPTO, to enable better product prediction models to be trained on these new data.
Affiliation(s)
- Mikhail Andronov
- IDSIA, USI, SUPSI, 6900 Lugano, Switzerland
- Machine Learning Research, Pfizer Worldwide Research Development and Medical, Linkstr.10, Berlin, Germany
| | - Varvara Voinarovska
- Institute of Structural Biology, Molecular Targets and Therapeutics Center, Helmholtz Munich – Deutsches Forschungszentrum für Gesundheit und Umwelt (GmbH), 85764 Neuherberg, Germany
| | | | - Michael Wand
- IDSIA, USI, SUPSI, 6900 Lugano, Switzerland
- Institute for Digital Technologies for Personalized Healthcare, SUPSI, 6900 Lugano, Switzerland
| | - Djork-Arné Clevert
- Machine Learning Research, Pfizer Worldwide Research Development and Medical, Linkstr.10, Berlin, Germany
| | | |
|
10
|
Kwon Y, Kim S, Choi YS, Kang S. Generative Modeling to Predict Multiple Suitable Conditions for Chemical Reactions. J Chem Inf Model 2022; 62:5952-5960. [PMID: 36413480 DOI: 10.1021/acs.jcim.2c01085] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 11/23/2022]
Abstract
In synthesis planning, it is important to determine suitable reaction conditions such that a chemical reaction proceeds as intended. Recent research attempts based on machine learning have proven to be effective in recommending reaction elements for specific categories regarding critical chemical context and operating conditions. However, existing methods can only make a single prediction per reaction and do not directly provide a complete specification of the reaction elements as the prediction. Therefore, their achievable performance is limited. In this study, we propose a generative modeling approach to predict multiple different reaction conditions for a chemical reaction, each of which fully specifies critical reaction elements such that these elements can be directly used as a feasible reaction condition. We formulate the problem of predicting reaction conditions as sampling from a generative distribution. We model the distribution by introducing a variational autoencoder augmented with a graph neural network and learn it from a reaction dataset. For a query reaction, multiple predictions can be obtained by repeated sampling from the distribution. Through experimental investigation on the reaction datasets of four major types of cross-coupling reactions, we demonstrate that the proposed method significantly outperforms existing methods in retrieving ground-truth reaction conditions.
Affiliation(s)
- Youngchun Kwon
- Samsung Advanced Institute of Technology, Samsung Electronics Co., Ltd., 130 Samsung-ro, Yeongtong-gu, Suwon 16678, Republic of Korea
- Department of Computer Science and Engineering, Seoul National University, 1 Gwanak-ro, Gwanak-gu, Seoul 08826, Republic of Korea
| | - Sun Kim
- Department of Computer Science and Engineering, Seoul National University, 1 Gwanak-ro, Gwanak-gu, Seoul 08826, Republic of Korea
| | - Youn-Suk Choi
- Samsung Advanced Institute of Technology, Samsung Electronics Co., Ltd., 130 Samsung-ro, Yeongtong-gu, Suwon 16678, Republic of Korea
| | - Seokho Kang
- Department of Industrial Engineering, Sungkyunkwan University, 2066 Seobu-ro, Jangan-gu, Suwon 16419, Republic of Korea
| |
|
11
|
Mroz AM, Posligua V, Tarzia A, Wolpert EH, Jelfs KE. Into the Unknown: How Computation Can Help Explore Uncharted Material Space. J Am Chem Soc 2022; 144:18730-18743. [PMID: 36206484 PMCID: PMC9585593 DOI: 10.1021/jacs.2c06833] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Indexed: 11/28/2022]
Abstract
Novel functional materials are urgently needed to help combat the major global challenges facing humanity, such as climate change and resource scarcity. Yet, the traditional experimental materials discovery process is slow and the material space at our disposal is too vast to effectively explore using intuition-guided experimentation alone. Most experimental materials discovery programs necessarily focus on exploring the local space of known materials, so we are not fully exploiting the enormous potential material space, where more novel materials with unique properties may exist. Computation, facilitated by improvements in open-source software and databases, as well as computer hardware has the potential to significantly accelerate the rational development of materials, but all too often is only used to postrationalize experimental observations. Thus, the true predictive power of computation, where theory leads experimentation, is not fully utilized. Here, we discuss the challenges to successful implementation of computation-driven materials discovery workflows, and then focus on the progress of the field, with a particular emphasis on the challenges to reaching novel materials.
Affiliation(s)
- Austin M Mroz
- Department of Chemistry, Molecular Sciences Research Hub, Imperial College London, White City Campus, Wood Lane, London, W12 0BZ, U.K
| | - Victor Posligua
- Department of Chemistry, Molecular Sciences Research Hub, Imperial College London, White City Campus, Wood Lane, London, W12 0BZ, U.K
| | - Andrew Tarzia
- Department of Chemistry, Molecular Sciences Research Hub, Imperial College London, White City Campus, Wood Lane, London, W12 0BZ, U.K
| | - Emma H Wolpert
- Department of Chemistry, Molecular Sciences Research Hub, Imperial College London, White City Campus, Wood Lane, London, W12 0BZ, U.K
| | - Kim E Jelfs
- Department of Chemistry, Molecular Sciences Research Hub, Imperial College London, White City Campus, Wood Lane, London, W12 0BZ, U.K
| |
|
12
|
Shim E, Kammeraad JA, Xu Z, Tewari A, Cernak T, Zimmerman PM. Predicting reaction conditions from limited data through active transfer learning. Chem Sci 2022; 13:6655-6668. [PMID: 35756521 PMCID: PMC9172577 DOI: 10.1039/d1sc06932b] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 12/10/2021] [Accepted: 05/10/2022] [Indexed: 12/30/2022]
Abstract
Transfer and active learning have the potential to accelerate the development of new chemical reactions, using prior data and new experiments to inform models that adapt to the target area of interest. This article shows how specifically tuned machine learning models, based on random forest classifiers, can expand the applicability of Pd-catalyzed cross-coupling reactions to types of nucleophiles unknown to the model. First, model transfer is shown to be effective when reaction mechanisms and substrates are closely related, even when models are trained on relatively small numbers of data points. Then, a model simplification scheme is tested and found to provide comparative predictivity on reactions of new nucleophiles that include unseen reagent combinations. Lastly, for a challenging target where model transfer only provides a modest benefit over random selection, an active transfer learning strategy is introduced to improve model predictions. Simple models, composed of a small number of decision trees with limited depths, are crucial for securing generalizability, interpretability, and performance of active transfer learning.
Affiliation(s)
- Eunjae Shim
- Department of Chemistry, University of Michigan, Ann Arbor, MI, USA
| | - Joshua A. Kammeraad
- Department of Chemistry, University of Michigan, Ann Arbor, MI, USA
- Department of Statistics, University of Michigan, Ann Arbor, MI, USA
| | - Ziping Xu
- Department of Statistics, University of Michigan, Ann Arbor, MI, USA
| | - Ambuj Tewari
- Department of Statistics, University of Michigan, Ann Arbor, MI, USA
- Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, MI, USA
| | - Tim Cernak
- Department of Chemistry, University of Michigan, Ann Arbor, MI, USA
- Department of Medicinal Chemistry, University of Michigan, Ann Arbor, MI, USA
| | | |
|
13
|
|
14
|
Park H, Kang Y, Choe W, Kim J. Mining Insights on Metal-Organic Framework Synthesis from Scientific Literature Texts. J Chem Inf Model 2022; 62:1190-1198. [PMID: 35195419 DOI: 10.1021/acs.jcim.1c01297] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Indexed: 11/29/2022]
Abstract
Identifying optimal synthesis conditions for metal-organic frameworks (MOFs) is a major challenge that can serve as a bottleneck for new materials discovery and development. A trial-and-error approach that relies on a chemist's intuition and knowledge has limitations in efficiency due to the large MOF synthesis space. To this end, 46,701 MOFs were data mined using our in-house developed code to extract their synthesis information from 28,565 MOF papers. The joint machine-learning/rule-based algorithm yields an average F1 score of 90.3% across different synthesis parameters (i.e., metal precursors, organic precursors, solvents, temperature, time, and composition). From this data set, a positive-unlabeled learning algorithm was developed to predict the synthesis of a given MOF material using synthesis conditions as inputs, and this algorithm successfully predicted successful synthesis in 83.1% of the synthesized data in the test set. Finally, our model correctly predicted three amorphous MOFs (with their representative experimental synthesis conditions) as having low synthesizability scores, while the counterpart crystalline MOFs showed high synthesizability scores. Our results show that big data extracted from the texts of MOF papers can be used to rationally predict synthesis conditions for these materials, which can accelerate the speed in which new MOFs are synthesized.
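For readers unfamiliar with the metric quoted in this abstract, the following is a minimal sketch of how an F1 score averaged over synthesis parameters could be computed. It is not the authors' code: the function names, the parameter names, and the example values are all hypothetical, and the paper may define matches and averaging differently.

```python
from typing import Dict, Set


def f1(extracted: Set[str], reference: Set[str]) -> float:
    """F1 (harmonic mean of precision and recall) for one synthesis parameter."""
    true_positives = len(extracted & reference)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(extracted)
    recall = true_positives / len(reference)
    return 2 * precision * recall / (precision + recall)


def mean_f1(extracted: Dict[str, Set[str]], reference: Dict[str, Set[str]]) -> float:
    """Unweighted average of per-parameter F1 scores (an assumed averaging scheme)."""
    if not reference:
        raise ValueError("reference must contain at least one parameter")
    scores = [f1(extracted.get(name, set()), ref) for name, ref in reference.items()]
    return sum(scores) / len(scores)


# Hypothetical hand-labelled values and text-mined values for one paper.
reference = {"solvent": {"DMF", "water"}, "metal_precursor": {"Zn(NO3)2"}}
extracted = {"solvent": {"DMF", "ethanol"}, "metal_precursor": {"Zn(NO3)2"}}

print(mean_f1(extracted, reference))
```

In this toy case the solvent field has one correct and one spurious value, so its F1 is lower than that of the exactly matched metal precursor field, and the average falls between the two.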
Affiliation(s)
- Hyunsoo Park
- Department of Chemical and Biomolecular Engineering, Korea Advanced Institute of Science and Technology (KAIST), 291, Daehak-ro, Yuseong-gu, Daejeon 34141, Republic of Korea
| | - Yeonghun Kang
- Department of Chemical and Biomolecular Engineering, Korea Advanced Institute of Science and Technology (KAIST), 291, Daehak-ro, Yuseong-gu, Daejeon 34141, Republic of Korea
| | - Wonyoung Choe
- Department of Chemistry, Ulsan National Institute of Science and Technology (UNIST), 50, UNIST-gil, Eonyang-eup, Ulju-gun, Ulsan 44919, Republic of Korea
| | - Jihan Kim
- Department of Chemical and Biomolecular Engineering, Korea Advanced Institute of Science and Technology (KAIST), 291, Daehak-ro, Yuseong-gu, Daejeon 34141, Republic of Korea
| |
|
15
|
Genheden S, Mårdh A, Lahti G, Engkvist O, Olsson S, Kogej T. Prediction of the chemical context for Buchwald‐Hartwig coupling reactions. Mol Inform 2022; 41:e2100294. [PMID: 35122702 PMCID: PMC9540548 DOI: 10.1002/minf.202100294] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 10/28/2021] [Accepted: 02/05/2022] [Indexed: 11/10/2022]
Abstract
We present machine learning models for predicting the chemical context for Buchwald‐Hartwig coupling reactions, i.e., what chemicals to add to the reactants to give a productive reaction. Using reaction data from in‐house electronic lab notebooks, we train two models: one based on single‐label data and one based on multi‐label data. Both models show excellent top‐3 accuracy of approximately 90%, which suggests strong predictivity. Furthermore, there seems to be an advantage of including multi‐label data because the multi‐label model shows higher accuracy and better sensitivity for the individual contexts than the single‐label model. Although the models are performant, we also show that such models need to be re‐trained periodically as there is a strong temporal characteristic to the usage of different contexts. Therefore, a model trained on historical data will decrease in usefulness with time as newer and better contexts emerge and replace older ones. We hypothesize that such significant transitions in the context‐usage will likely affect any model predicting chemical contexts trained on historical data. Consequently, training context prediction models warrants careful planning of what data is used for training and how often the model needs to be re‐trained.
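The top-3 accuracy quoted in this abstract can be illustrated with a minimal sketch. It assumes each reaction has one recorded context and that a model returns a ranked list of candidate contexts, best first. The function name `top_k_accuracy` and the placeholder labels such as `context_A` are hypothetical and do not come from the paper.

```python
from typing import Sequence


def top_k_accuracy(ranked_predictions: Sequence[Sequence[str]],
                   true_labels: Sequence[str],
                   k: int = 3) -> float:
    """Fraction of reactions whose recorded context is among the k best-ranked predictions."""
    if len(ranked_predictions) != len(true_labels):
        raise ValueError("predictions and labels must have the same length")
    if not true_labels:
        raise ValueError("at least one reaction is required")
    hits = sum(
        1
        for ranked, truth in zip(ranked_predictions, true_labels)
        if truth in ranked[:k]
    )
    return hits / len(true_labels)


# Hypothetical ranked context predictions (best first) for four reactions.
predictions = [
    ["context_A", "context_B", "context_C"],
    ["context_A", "context_B", "context_C"],
    ["context_C", "context_A", "context_B"],
    ["context_B", "context_A", "context_D"],
]
# Hypothetical recorded (true) context for each reaction.
truths = ["context_A", "context_D", "context_B", "context_C"]

print(top_k_accuracy(predictions, truths, k=3))  # 0.5
print(top_k_accuracy(predictions, truths, k=1))  # 0.25
```

A multi-label variant, where several contexts count as correct for one reaction, would replace the membership test with a set intersection; that extension is not shown here.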
|
16
|
Afonina VA, Mazitov DA, Nurmukhametova A, Shevelev MD, Khasanova DA, Nugmanov RI, Burilov VA, Madzhidov TI, Varnek A. Prediction of Optimal Conditions of Hydrogenation Reaction Using the Likelihood Ranking Approach. Int J Mol Sci 2021; 23:ijms23010248. [PMID: 35008674 PMCID: PMC8745269 DOI: 10.3390/ijms23010248] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 11/23/2021] [Revised: 12/18/2021] [Accepted: 12/23/2021] [Indexed: 11/20/2022]
Abstract
The selection of experimental conditions leading to a reasonable yield is an important and essential element for the automated development of a synthesis plan and the subsequent synthesis of the target compound. The classical QSPR approach, requiring one-to-one correspondence between chemical structure and a target property, can be used for optimal reaction conditions prediction only on a limited scale when only one condition component (e.g., catalyst or solvent) is considered. However, a particular reaction can proceed under several different conditions. In this paper, we describe the Likelihood Ranking Model representing an artificial neural network that outputs a list of different conditions ranked according to their suitability to a given chemical transformation. Benchmarking calculations demonstrated that our model outperformed some popular approaches to the theoretical assessment of reaction conditions, such as k Nearest Neighbors, and a recurrent artificial neural network performance prediction of condition components (reagents, solvents, catalysts, and temperature). The ability of the Likelihood Ranking model, trained on a hydrogenation reactions dataset (~42,000 reactions) from the Reaxys® database, to propose conditions that led to the desired product was validated experimentally on a set of three reactions with rich selectivity issues.
Affiliation(s)
- Valentina A. Afonina
- Chemoinformatics and Molecular Modelling Lab, A.M. Butlerov Institute of Chemistry, Kazan Federal University, Kremlyovskaya Str. 18, 420008 Kazan, Russia; (V.A.A.); (D.A.M.); (A.N.); (M.D.S.); (D.A.K.); (R.I.N.); (V.A.B.)
| | - Daniyar A. Mazitov
- Chemoinformatics and Molecular Modelling Lab, A.M. Butlerov Institute of Chemistry, Kazan Federal University, Kremlyovskaya Str. 18, 420008 Kazan, Russia; (V.A.A.); (D.A.M.); (A.N.); (M.D.S.); (D.A.K.); (R.I.N.); (V.A.B.)
| | - Albina Nurmukhametova
- Chemoinformatics and Molecular Modelling Lab, A.M. Butlerov Institute of Chemistry, Kazan Federal University, Kremlyovskaya Str. 18, 420008 Kazan, Russia; (V.A.A.); (D.A.M.); (A.N.); (M.D.S.); (D.A.K.); (R.I.N.); (V.A.B.)
| | - Maxim D. Shevelev
- Chemoinformatics and Molecular Modelling Lab, A.M. Butlerov Institute of Chemistry, Kazan Federal University, Kremlyovskaya Str. 18, 420008 Kazan, Russia; (V.A.A.); (D.A.M.); (A.N.); (M.D.S.); (D.A.K.); (R.I.N.); (V.A.B.)
- Laboratory of Chemoinformatics (UMR 7140 CNRS/UniStra), Université de Strasbourg, 4, Rue Blaise Pascal, 67000 Strasbourg, France
| | - Dina A. Khasanova
- Chemoinformatics and Molecular Modelling Lab, A.M. Butlerov Institute of Chemistry, Kazan Federal University, Kremlyovskaya Str. 18, 420008 Kazan, Russia; (V.A.A.); (D.A.M.); (A.N.); (M.D.S.); (D.A.K.); (R.I.N.); (V.A.B.)
| | - Ramil I. Nugmanov
- Chemoinformatics and Molecular Modelling Lab, A.M. Butlerov Institute of Chemistry, Kazan Federal University, Kremlyovskaya Str. 18, 420008 Kazan, Russia; (V.A.A.); (D.A.M.); (A.N.); (M.D.S.); (D.A.K.); (R.I.N.); (V.A.B.)
| | - Vladimir A. Burilov
- Chemoinformatics and Molecular Modelling Lab, A.M. Butlerov Institute of Chemistry, Kazan Federal University, Kremlyovskaya Str. 18, 420008 Kazan, Russia; (V.A.A.); (D.A.M.); (A.N.); (M.D.S.); (D.A.K.); (R.I.N.); (V.A.B.)
| | - Timur I. Madzhidov
- Chemoinformatics and Molecular Modelling Lab, A.M. Butlerov Institute of Chemistry, Kazan Federal University, Kremlyovskaya Str. 18, 420008 Kazan, Russia; (V.A.A.); (D.A.M.); (A.N.); (M.D.S.); (D.A.K.); (R.I.N.); (V.A.B.)
- Correspondence: (T.I.M.); (A.V.)
| | - Alexandre Varnek
- Laboratory of Chemoinformatics (UMR 7140 CNRS/UniStra), Université de Strasbourg, 4, Rue Blaise Pascal, 67000 Strasbourg, France
- Institute for Chemical Reaction Design and Discovery (WPI-ICReDD), Hokkaido University, Kita 21 Nishi 10, Kita-ku, Sapporo 001-0021, Japan
- Correspondence: (T.I.M.); (A.V.)
| |
|
17
|
Gong Y, Xue D, Chuai G, Yu J, Liu Q. DeepReac+: deep active learning for quantitative modeling of organic chemical reactions. Chem Sci 2021; 12:14459-14472. [PMID: 34880997 PMCID: PMC8580052 DOI: 10.1039/d1sc02087k] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 04/14/2021] [Accepted: 10/08/2021] [Indexed: 11/21/2022]
Abstract
Various computational methods have been developed for quantitative modeling of organic chemical reactions; however, the lack of universality as well as the requirement of large amounts of experimental data limit their broad applications. Here, we present DeepReac+, an efficient and universal computational framework for prediction of chemical reaction outcomes and identification of optimal reaction conditions based on deep active learning. Under this framework, DeepReac is designed as a graph-neural-network-based model, which directly takes 2D molecular structures as inputs and automatically adapts to different prediction tasks. In addition, carefully designed active learning strategies are incorporated to substantially reduce the number of necessary experiments for model training. We demonstrate the universality and high efficiency of DeepReac+ by achieving state-of-the-art results with a minimum of labeled data on three diverse chemical reaction datasets in several scenarios. Collectively, DeepReac+ has great potential and utility in the development of AI-aided chemical synthesis. DeepReac+ is freely accessible at https://github.com/bm2-lab/DeepReac.
Affiliation(s)
- Yukang Gong
- Department of Ophthalmology, Shanghai Tenth People's Hospital, Bioinformatics Department, School of Life Sciences and Technology, Tongji University, Shanghai 200072, China
| | - Dongyu Xue
- Department of Ophthalmology, Shanghai Tenth People's Hospital, Bioinformatics Department, School of Life Sciences and Technology, Tongji University, Shanghai 200072, China
| | - Guohui Chuai
- Department of Ophthalmology, Shanghai Tenth People's Hospital, Bioinformatics Department, School of Life Sciences and Technology, Tongji University, Shanghai 200072, China
| | - Jing Yu
- Department of Ophthalmology, Shanghai Tenth People's Hospital, Bioinformatics Department, School of Life Sciences and Technology, Tongji University, Shanghai 200072, China
| | - Qi Liu
- Department of Ophthalmology, Shanghai Tenth People's Hospital, Bioinformatics Department, School of Life Sciences and Technology, Tongji University, Shanghai 200072, China
| |
|
18
|
Machine learning modelling of chemical reaction characteristics: yesterday, today, tomorrow. MENDELEEV COMMUNICATIONS 2021. [DOI: 10.1016/j.mencom.2021.11.003] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 11/23/2022]
|
19
|
Haywood AL, Redshaw J, Hanson-Heine MWD, Taylor A, Brown A, Mason AM, Gärtner T, Hirst JD. Kernel Methods for Predicting Yields of Chemical Reactions. J Chem Inf Model 2021; 62:2077-2092. [PMID: 34699222 DOI: 10.1021/acs.jcim.1c00699] [Citation(s) in RCA: 14] [Impact Index Per Article: 4.7] [Indexed: 12/18/2022]
Abstract
The use of machine learning methods for the prediction of reaction yield is an emerging area. We demonstrate the applicability of support vector regression (SVR) for predicting reaction yields, using combinatorial data. Molecular descriptors used in regression tasks related to chemical reactivity have often been based on time-consuming, computationally demanding quantum chemical calculations, usually density functional theory. Structure-based descriptors (molecular fingerprints and molecular graphs) are quicker and easier to calculate and are applicable to any molecule. In this study, SVR models built on structure-based descriptors were compared to models built on quantum chemical descriptors. The models were evaluated along the dimension of each reaction component in a set of Buchwald-Hartwig amination reactions. The structure-based SVR models outperformed the quantum chemical SVR models, along the dimension of each reaction component. The applicability of the models was assessed with respect to similarity to training. Prospective predictions of unseen Buchwald-Hartwig reactions are presented for synthetic assessment, to validate the generalizability of the models, with particular interest along the aryl halide dimension.
Affiliation(s)
- Alexe L Haywood
- School of Chemistry, University of Nottingham, University Park, Nottingham NG7 2RD, U.K
| | - Joseph Redshaw
- School of Chemistry, University of Nottingham, University Park, Nottingham NG7 2RD, U.K
| | | | - Adam Taylor
- GlaxoSmithKline, Gunnels Wood Road, Stevenage SG1 2NY, U.K
| | - Alex Brown
- GlaxoSmithKline, Gunnels Wood Road, Stevenage SG1 2NY, U.K
| | - Andrew M Mason
- GlaxoSmithKline, Gunnels Wood Road, Stevenage SG1 2NY, U.K
| | - Thomas Gärtner
- Machine Learning Research Unit, TU Wien Informatics, Vienna 1040, Austria
| | - Jonathan D Hirst
- School of Chemistry, University of Nottingham, University Park, Nottingham NG7 2RD, U.K
| |
|
20
|
Nakajima M, Nemoto T. Machine learning enabling prediction of the bond dissociation enthalpy of hypervalent iodine from SMILES. Sci Rep 2021; 11:20207. [PMID: 34642360 PMCID: PMC8511102 DOI: 10.1038/s41598-021-99369-8] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 07/09/2021] [Accepted: 09/22/2021] [Indexed: 12/17/2022]
Abstract
Machine learning to create models on the basis of big data enables predictions from new input data. Many tasks formerly performed by humans can now be achieved by machine learning algorithms in various fields, including scientific areas. Hypervalent iodine compounds (HVIs) have long been applied as useful reactive molecules. The bond dissociation enthalpy (BDE) value is an important indicator of reactivity and stability. Experimentally measuring the BDE value of HVIs is difficult, however, and the value has been estimated by quantum calculations, especially density functional theory (DFT) calculations. Although DFT calculations can access the BDE value with high accuracy, the process is highly time-consuming. Thus, we aimed to reduce the time for predicting the BDE by applying machine learning. We calculated the BDE of more than 1000 HVIs using DFT calculations, and performed machine learning. Converting SMILES strings to Avalon fingerprints and learning using a traditional Elastic Net made it possible to predict the BDE value with high accuracy. Furthermore, an applicability domain search revealed that the learning model could accurately predict the BDE even for uncovered inputs that were not completely included in the training data.
Affiliation(s)
- Masaya Nakajima
- Graduate School of Pharmaceutical Sciences, Chiba University, Chiba, Japan.
| | - Tetsuhiro Nemoto
- Graduate School of Pharmaceutical Sciences, Chiba University, Chiba, Japan.
| |
|
21
|
Abstract
Computational methods have emerged as a powerful tool to augment traditional experimental molecular catalyst design by providing useful predictions of catalyst performance and decreasing the time needed for catalyst screening. In this perspective, we discuss three approaches for computational molecular catalyst design: (i) the reaction mechanism-based approach that calculates all relevant elementary steps, finds the rate and selectivity determining steps, and ultimately makes predictions on catalyst performance based on kinetic analysis, (ii) the descriptor-based approach where physical/chemical considerations are used to find molecular properties as predictors of catalyst performance, and (iii) the data-driven approach where statistical analysis as well as machine learning (ML) methods are used to obtain relationships between available data/features and catalyst performance. Following an introduction to these approaches, we cover their strengths and weaknesses and highlight some recent key applications. Furthermore, we present an outlook on how the currently applied approaches may evolve in the near future by addressing how recent developments in building automated computational workflows and implementing advanced ML models hold promise for reducing human workload, eliminating human bias, and speeding up computational catalyst design at the same time. Finally, we provide our viewpoint on how some of the challenges associated with the up-and-coming approaches driven by automation and ML may be resolved.
Affiliation(s)
- Ademola Soyemi
- Department of Chemical and Biological Engineering, The University of Alabama, Tuscaloosa, AL 35487, USA.
| | - Tibor Szilvási
- Department of Chemical and Biological Engineering, The University of Alabama, Tuscaloosa, AL 35487, USA.
| |
|
22
|
Poltavsky I, Tkatchenko A. Machine Learning Force Fields: Recent Advances and Remaining Challenges. J Phys Chem Lett 2021; 12:6551-6564. [PMID: 34242032 DOI: 10.1021/acs.jpclett.1c01204] [Citation(s) in RCA: 41] [Impact Index Per Article: 13.7] [Indexed: 06/13/2023]
Abstract
In chemistry and physics, machine learning (ML) methods promise transformative impacts by advancing modeling and improving our understanding of complex molecules and materials. Each ML method comprises a mathematically well-defined procedure, and an increasingly large number of easy-to-use ML packages for modeling atomistic systems are becoming available. In this Perspective, we discuss the general aspects of ML techniques in the context of creating ML force fields. We describe common features of ML modeling and quantum-mechanical approximations, so-called global and local ML models, and the physical differences behind these two classes of approaches. Finally, we describe the recent developments and emerging directions in the field of ML-driven molecular modeling. This Perspective aims to inspire interdisciplinary collaborations crossing the borders between physical chemistry, chemical physics, computer science, and data science.
Affiliation(s)
- Igor Poltavsky
- Department of Physics and Materials Science, University of Luxembourg, L-1511 Luxembourg City, Luxembourg
| | - Alexandre Tkatchenko
- Department of Physics and Materials Science, University of Luxembourg, L-1511 Luxembourg City, Luxembourg
| |
|
23
|
Vaucher AC, Schwaller P, Geluykens J, Nair VH, Iuliano A, Laino T. Inferring experimental procedures from text-based representations of chemical reactions. Nat Commun 2021; 12:2573. [PMID: 33958589 PMCID: PMC8102565 DOI: 10.1038/s41467-021-22951-1] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Received: 11/15/2020] [Accepted: 04/07/2021] [Indexed: 11/19/2022]
Abstract
The experimental execution of chemical reactions is a context-dependent and time-consuming process, often solved using the experience collected over multiple decades of laboratory work or by searching for similar, already executed, experimental protocols. Although data-driven schemes, such as retrosynthetic models, are becoming established technologies in synthetic organic chemistry, the conversion of proposed synthetic routes to experimental procedures remains a burden on the shoulders of domain experts. In this work, we present data-driven models for predicting the entire sequence of synthesis steps starting from a textual representation of a chemical equation, for application in batch organic chemistry. We generated a data set of 693,517 chemical equations and associated action sequences by extracting and processing experimental procedure text from patents, using state-of-the-art natural language models. We used the attained data set to train three different models: a nearest-neighbor model based on recently introduced reaction fingerprints, and two deep-learning sequence-to-sequence models based on the Transformer and BART architectures. An analysis by a trained chemist revealed that the predicted action sequences are adequate for execution without human intervention in more than 50% of the cases.
Affiliation(s)
| | | | | | | | - Anna Iuliano
- Dipartimento di Chimica e Chimica Industriale, Università di Pisa, Pisa, Italy
| | | |
|
24
|
|
25
|
Abstract
As more data are introduced in the building of models of chemical reactivity, the mechanistic component can be reduced until 'big data' applications are reached. These methods no longer depend on underlying mechanistic hypotheses, potentially learning them implicitly through extensive data training. Reactivity models often focus on reaction barriers, but can also be trained to directly predict lab-relevant properties, such as yields or conditions. Calculations with a quantum-mechanical component are still preferred for quantitative predictions of reactivity. Although big data applications tend to be more qualitative, they have the advantage of being broadly applicable to different kinds of reactions. There is a continuum of methods in between these extremes, such as methods that use quantum-derived data or descriptors in machine learning models. Here, we present an overview of the recent machine learning applications in the field of chemical reactivity from a mechanistic perspective. Starting with a summary of how reactivity questions are addressed by quantum-mechanical methods, we discuss methods that augment or replace quantum-based modelling with faster alternatives relying on machine learning.
|
26
|
Gao H, Pauphilet J, Struble TJ, Coley CW, Jensen KF. Direct Optimization across Computer-Generated Reaction Networks Balances Materials Use and Feasibility of Synthesis Plans for Molecule Libraries. J Chem Inf Model 2020; 61:493-504. [PMID: 33331158 DOI: 10.1021/acs.jcim.0c01032] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 11/28/2022]
Abstract
The synthesis of thousands of candidate compounds in drug discovery and development offers opportunities for computer-aided synthesis planning to simplify the synthesis of molecule libraries by leveraging common starting materials and reaction conditions. We develop an optimization-based method to analyze large organic chemical reaction networks and design overlapping synthesis plans for entire molecule libraries so as to minimize the overall number of unique chemical compounds needed as either starting materials or reaction conditions. We consider multiple objectives, including the number of starting materials, the number of catalysts/solvents/reagents, and the likelihood of success of the overall synthesis plan, to select an optimal reaction network to access the target molecules. The library synthesis planning task was formulated as a network flow optimization problem, and we design an efficient decomposition scheme that reduces solution time by a factor of 5 and scales to instances with 48 target molecules and nearly 8000 intermediate reactions within hours. In four case studies of pharmaceutical compounds, the approach reduces the number of starting materials and catalysts/solvents/reagents needed by 32.2 and 66.0% on average and up to 63.2 and 80.0% in the best cases. The code implementation can be found at https://github.com/Coughy1991/Molecule_library_synthesis.
Affiliation(s)
- Hanyu Gao
- Department of Chemical Engineering, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, Massachusetts 02139, United States
| | - Jean Pauphilet
- London Business School, Regent's Park, London NW1 4SA, U.K
| | - Thomas J Struble
- Department of Chemical Engineering, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, Massachusetts 02139, United States
| | - Connor W Coley
- Department of Chemical Engineering, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, Massachusetts 02139, United States
| | - Klavs F Jensen
- Department of Chemical Engineering, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, Massachusetts 02139, United States
| |
|
27
|
Walker EA, Ravisankar K, Savara A. CheKiPEUQ Intro 2: Harnessing Uncertainties from Data Sets, Bayesian Design of Experiments in Chemical Kinetics. ChemCatChem 2020. [DOI: 10.1002/cctc.202000976] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Indexed: 01/21/2023]
Affiliation(s)
- Eric A. Walker
- Institute for Computational and Data Sciences, Chemical and Biological Engineering, University at Buffalo, Buffalo, NY 14260, USA
| | - Kishore Ravisankar
- Institute for Computational and Data Sciences, Chemical and Biological Engineering, University at Buffalo, Buffalo, NY 14260, USA
| | - Aditya Savara
- Surface Chemistry and Catalysis Group, Oak Ridge National Laboratory, 1 Bethel Valley Road, Oak Ridge, TN 37830, USA
| |
|
28
|
Walker EA, Mohammadi MM, Swihart MT. Graph Theory Model of Dry Reforming of Methane Using Rh(111). J Phys Chem Lett 2020; 11:4917-4922. [PMID: 32459487 DOI: 10.1021/acs.jpclett.0c01038] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 06/11/2023]
Abstract
The adsorption energies of intermediates of the dry reforming of methane reaction (DRM, CH₄ + CO₂ ⇔ 2CO + 2H₂) using Rh(111) are approximated. Graph theory creates descriptors of the intermediates. The information recorded in these descriptors includes the elemental identities of each atom, its neighbors, and its next-nearest neighbors. Graph theory is employed because it is a rapid approximation of more expensive density functional theory (DFT) calculations and because the descriptors created by graph theory are both human and machine interpretable. DRM contains a significant number of adsorbates, and side reactions, including reverse water-gas shift, may occur simultaneously. Therefore, DRM is well-poised for analysis by a graph theory model to predict large numbers of adsorption energies. Adsorption energies for a portion of the adsorbates were calculated with DFT. Then, predictions were reported for the remaining adsorption energies not calculated with DFT.
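The descriptor described in this abstract (element of each atom, its neighbors, and its next-nearest neighbors) can be sketched as follows. This is only an illustration under stated assumptions: the adsorbate is given as a list of element symbols plus a list of bonded index pairs, the surface atoms are ignored, and the function name `atom_descriptors` and the tuple layout are hypothetical rather than the paper's actual encoding.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Set, Tuple

Descriptor = Tuple[str, Tuple[str, ...], Tuple[str, ...]]


def atom_descriptors(elements: Sequence[str],
                     bonds: Sequence[Tuple[int, int]]) -> List[Descriptor]:
    """For each atom, return (element, neighbor elements, next-nearest-neighbor elements)."""
    # Build an undirected adjacency map from the bond list.
    neighbors: Dict[int, Set[int]] = defaultdict(set)
    for i, j in bonds:
        neighbors[i].add(j)
        neighbors[j].add(i)

    descriptors: List[Descriptor] = []
    for i, element in enumerate(elements):
        first_shell = set(neighbors[i])
        # Next-nearest neighbors: neighbors of neighbors, excluding the atom
        # itself and anything already in the first shell.
        second_shell: Set[int] = set()
        for j in first_shell:
            second_shell |= neighbors[j]
        second_shell -= first_shell | {i}
        descriptors.append((
            element,
            tuple(sorted(elements[j] for j in first_shell)),
            tuple(sorted(elements[j] for j in second_shell)),
        ))
    return descriptors


# Hypothetical example: a CHO fragment with C bonded to both H and O.
cho_elements = ["C", "H", "O"]
cho_bonds = [(0, 1), (0, 2)]
for descriptor in atom_descriptors(cho_elements, cho_bonds):
    print(descriptor)
```

Because each descriptor is a plain tuple of element symbols, it stays readable by a person while also being usable as a feature or dictionary key in a regression model.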
Affiliation(s)
- Eric A Walker
- Department of Chemical and Biological Engineering, University at Buffalo, The State University of New York, Buffalo, New York 14260, United States
- Institute for Computational and Data Sciences, University at Buffalo, The State University of New York, Buffalo, New York 14260, United States
| | - Mohammad Moein Mohammadi
- Department of Chemical and Biological Engineering, University at Buffalo, The State University of New York, Buffalo, New York 14260, United States
| | - Mark T Swihart
- Department of Chemical and Biological Engineering, University at Buffalo, The State University of New York, Buffalo, New York 14260, United States
- RENEW Institute (Research and Education in eNergy, Environment and Water), University at Buffalo, The State University of New York, Buffalo, New York 14260, United States
| |
|
29
|
Kammeraad JA, Goetz J, Walker EA, Tewari A, Zimmerman PM. What Does the Machine Learn? Knowledge Representations of Chemical Reactivity. J Chem Inf Model 2020; 60:1290-1301. [PMID: 32091880 DOI: 10.1021/acs.jcim.9b00721] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.8] [Indexed: 12/30/2022]
Abstract
In a departure from conventional chemical approaches, data-driven models of chemical reactions have recently been shown to be statistically successful using machine learning. These models, however, are largely black box in character and have not provided the kind of chemical insights that historically advanced the field of chemistry. To examine the knowledgebase of machine-learning models (what does the machine learn?), this article deconstructs black-box machine-learning models of a diverse chemical reaction data set. Through experimentation with chemical representations and modeling techniques, the analysis provides insights into the nature of how statistical accuracy can arise, even when the model lacks informative physical principles. By peeling back the layers of these complicated models we arrive at a minimal, chemically intuitive model (with no machine learning involved). This model is based on systematic reaction-type classification and Evans-Polanyi relationships within reaction types which are easily visualized and interpreted. Through exploring this simple model, we gain deeper understanding of the data set and uncover a means for expert interactions to improve the model's reliability.
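An Evans-Polanyi relationship is a linear relation between activation energy and reaction enthalpy, E_a = E_0 + α·ΔH, fitted separately within each reaction type. The sketch below shows one way such a per-type fit could look, using ordinary least squares. The function names, the reaction-type labels, and the numbers are all hypothetical and chosen only for illustration; they are not data or code from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Reaction = Tuple[str, float, float]  # (reaction type, reaction enthalpy, barrier)


def fit_evans_polanyi(reactions: Iterable[Reaction]) -> Dict[str, Tuple[float, float]]:
    """Least-squares fit of E_a = E_0 + alpha * dH per reaction type.

    Returns {reaction_type: (alpha, E_0)}.
    """
    grouped: Dict[str, List[Tuple[float, float]]] = defaultdict(list)
    for reaction_type, delta_h, barrier in reactions:
        grouped[reaction_type].append((delta_h, barrier))

    params: Dict[str, Tuple[float, float]] = {}
    for reaction_type, points in grouped.items():
        n = len(points)
        mean_x = sum(x for x, _ in points) / n
        mean_y = sum(y for _, y in points) / n
        sxx = sum((x - mean_x) ** 2 for x, _ in points)
        if sxx == 0:
            raise ValueError(f"need at least two distinct enthalpies for {reaction_type!r}")
        sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
        alpha = sxy / sxx
        params[reaction_type] = (alpha, mean_y - alpha * mean_x)
    return params


def predict_barrier(params: Dict[str, Tuple[float, float]],
                    reaction_type: str, delta_h: float) -> float:
    """Estimate a barrier from the fitted line of the given reaction type."""
    alpha, e0 = params[reaction_type]
    return e0 + alpha * delta_h


# Hypothetical, made-up data points (arbitrary energy units).
toy_reactions = [
    ("type_A", -10.0, 15.0), ("type_A", 0.0, 20.0), ("type_A", 10.0, 25.0),
    ("type_B", -20.0, 25.0), ("type_B", 0.0, 30.0), ("type_B", 20.0, 35.0),
]
toy_params = fit_evans_polanyi(toy_reactions)
print(toy_params["type_A"])                         # (0.5, 20.0)
print(predict_barrier(toy_params, "type_A", 4.0))   # 22.0
```

Each reaction type ends up with just two interpretable numbers, a slope and an intercept, which is what makes this kind of model easy to visualize and check by hand.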
Affiliation(s)
- Joshua A Kammeraad
- Department of Chemistry, University of Michigan, 930 North University Avenue, Ann Arbor, Michigan 48109, United States
| | - Jack Goetz
- Department of Statistics, University of Michigan, 1085 South University Avenue, Ann Arbor, Michigan 48109, United States
| | - Eric A Walker
- Department of Chemistry, University of Michigan, 930 North University Avenue, Ann Arbor, Michigan 48109, United States
| | - Ambuj Tewari
- Department of Statistics, University of Michigan, 1085 South University Avenue, Ann Arbor, Michigan 48109, United States
| | - Paul M Zimmerman
- Department of Chemistry, University of Michigan, 930 North University Avenue, Ann Arbor, Michigan 48109, United States
| |
|