1
Van Herck J, Gil MV, Jablonka KM, Abrudan A, Anker AS, Asgari M, Blaiszik B, Buffo A, Choudhury L, Corminboeuf C, Daglar H, Elahi AM, Foster IT, Garcia S, Garvin M, Godin G, Good LL, Gu J, Xiao Hu N, Jin X, Junkers T, Keskin S, Knowles TPJ, Laplaza R, Lessona M, Majumdar S, Mashhadimoslem H, McIntosh RD, Moosavi SM, Mouriño B, Nerli F, Pevida C, Poudineh N, Rajabi-Kochi M, Saar KL, Hooriabad Saboor F, Sagharichiha M, Schmidt KJ, Shi J, Simone E, Svatunek D, Taddei M, Tetko I, Tolnai D, Vahdatifar S, Whitmer J, Wieland DCF, Willumeit-Römer R, Züttel A, Smit B. Assessment of fine-tuned large language models for real-world chemistry and material science applications. Chem Sci 2025; 16:670-684. [PMID: 39664810 PMCID: PMC11629507 DOI: 10.1039/d4sc04401k] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/03/2024] [Accepted: 11/12/2024] [Indexed: 12/13/2024] Open
Abstract
The current generation of large language models (LLMs) has limited chemical knowledge. Recently, it has been shown that these LLMs can learn and predict chemical properties through fine-tuning. Using natural language to train machine learning models opens doors to a wider chemical audience, as field-specific featurization techniques can be omitted. In this work, we explore the potential and limitations of this approach. We studied the performance of fine-tuning three open-source LLMs (GPT-J-6B, Llama-3.1-8B, and Mistral-7B) for a range of different chemical questions. We benchmark their performances against "traditional" machine learning models and find that, in most cases, the fine-tuning approach is superior for a simple classification problem. Depending on the size of the dataset and the type of questions, we also successfully address more sophisticated problems. The most important conclusions of this work are that, for all datasets considered, their conversion into an LLM fine-tuning training set is straightforward and that fine-tuning with even relatively small datasets leads to predictive models. These results suggest that the systematic use of LLMs to guide experiments and simulations will be a powerful technique in any research study, significantly reducing unnecessary experiments or computations.
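The conversion step described in the abstract can be pictured with a minimal sketch. It assumes a prompt/completion JSON Lines layout, which is a common convention for fine-tuning data; the question template, the SMILES strings, and the labels are all hypothetical and are not taken from the paper.

```python
import json

# Hypothetical labelled examples: (text representation, class label).
# The SMILES strings and labels are made up for illustration only.
EXAMPLES = [
    ("CCO", "low"),
    ("c1ccccc1", "high"),
]

# Hypothetical question template; the paper's actual prompts are not shown here.
QUESTION = "What is the property class of the molecule with SMILES {rep}?"


def make_record(rep, label, question=QUESTION):
    """Turn one labelled example into a prompt/completion pair."""
    return {"prompt": question.format(rep=rep), "completion": str(label)}


def to_jsonl(examples, question=QUESTION):
    """Serialize all examples as JSON Lines, one training record per line."""
    return "\n".join(
        json.dumps(make_record(rep, label, question)) for rep, label in examples
    )


if __name__ == "__main__":
    print(to_jsonl(EXAMPLES))
```

No featurization is involved: each row of a table becomes one sentence-like record, which is the property the abstract highlights.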
Affiliation(s)
- Joren Van Herck
- Laboratory of Molecular Simulation (LSMO), Institut des Sciences et Ingénierie Chimiques, École Polytechnique Fédérale de Lausanne (EPFL) Rue de l'Industrie 17 CH-1951 Sion Switzerland
| | - María Victoria Gil
- Laboratory of Molecular Simulation (LSMO), Institut des Sciences et Ingénierie Chimiques, École Polytechnique Fédérale de Lausanne (EPFL) Rue de l'Industrie 17 CH-1951 Sion Switzerland
- Instituto de Ciencia y Tecnología del Carbono (INCAR), CSIC Francisco Pintado Fe 26 33011 Oviedo Spain
| | - Kevin Maik Jablonka
- Laboratory of Molecular Simulation (LSMO), Institut des Sciences et Ingénierie Chimiques, École Polytechnique Fédérale de Lausanne (EPFL) Rue de l'Industrie 17 CH-1951 Sion Switzerland
- Laboratory of Organic and Macromolecular Chemistry (IOMC), Friedrich Schiller University Jena Humboldtstrasse 10 07743 Jena Germany
- Helmholtz Institute for Polymers in Energy Applications Jena (HIPOLE Jena) Lessingstrasse 12-14 07743 Jena Germany
| | - Alex Abrudan
- Yusuf Hamied Department of Chemistry, University of Cambridge Cambridge CB2 1EW UK
| | - Andy S Anker
- Department of Energy Conversion and Storage, Technical University of Denmark DK-2800 Kgs. Lyngby Denmark
- Department of Chemistry, University of Oxford Oxford OX1 3TA UK
| | - Mehrdad Asgari
- Department of Chemical Engineering & Biotechnology, University of Cambridge Philippa Fawcett Drive Cambridge CB3 0AS UK
| | - Ben Blaiszik
- Department of Computer Science, University of Chicago Chicago IL 60637 USA
- Data Science and Learning Division, Argonne National Laboratory Lemont IL 60439 USA
| | - Antonio Buffo
- Department of Applied Science and Technology (DISAT), Politecnico di Torino 10129 Torino Italy
| | - Leander Choudhury
- Laboratory of Catalysis and Organic Synthesis (LCSO), Institute of Chemical Sciences and Engineering (ISIC), École Polytechnique Fédérale de Lausanne (EPFL) CH-1015 Lausanne Switzerland
| | - Clemence Corminboeuf
- Laboratory for Computational Molecular Design (LCMD), Institute of Chemical Sciences and Engineering (ISIC), École Polytechnique Fédérale de Lausanne (EPFL) CH-1015 Lausanne Switzerland
| | - Hilal Daglar
- Department of Chemical and Biological Engineering, Koç University Rumelifeneri Yolu, Sariyer 34450 Istanbul Turkey
| | - Amir Mohammad Elahi
- Laboratory of Molecular Simulation (LSMO), Institut des Sciences et Ingénierie Chimiques, École Polytechnique Fédérale de Lausanne (EPFL) Rue de l'Industrie 17 CH-1951 Sion Switzerland
| | - Ian T Foster
- Department of Computer Science, University of Chicago Chicago IL 60637 USA
- Data Science and Learning Division, Argonne National Laboratory Lemont IL 60439 USA
| | - Susana Garcia
- The Research Centre for Carbon Solutions (RCCS), School of Engineering and Physical Sciences, Heriot-Watt University Edinburgh EH14 4AS UK
| | - Matthew Garvin
- The Research Centre for Carbon Solutions (RCCS), School of Engineering and Physical Sciences, Heriot-Watt University Edinburgh EH14 4AS UK
| | | | - Lydia L Good
- Yusuf Hamied Department of Chemistry, University of Cambridge Cambridge CB2 1EW UK
- Laboratory of Chemical Physics, National Institute of Diabetes and Digestive and Kidney Diseases, National Institutes of Health Bethesda Maryland 20892 USA
| | - Jianan Gu
- Institute of Metallic Biomaterials, Helmholtz Zentrum Hereon Geesthacht Germany
| | - Noémie Xiao Hu
- Laboratory of Molecular Simulation (LSMO), Institut des Sciences et Ingénierie Chimiques, École Polytechnique Fédérale de Lausanne (EPFL) Rue de l'Industrie 17 CH-1951 Sion Switzerland
| | - Xin Jin
- Laboratory of Molecular Simulation (LSMO), Institut des Sciences et Ingénierie Chimiques, École Polytechnique Fédérale de Lausanne (EPFL) Rue de l'Industrie 17 CH-1951 Sion Switzerland
| | - Tanja Junkers
- Polymer Reaction Design Group, School of Chemistry, Monash University Clayton VIC 3800 Australia
| | - Seda Keskin
- Department of Chemical and Biological Engineering, Koç University Rumelifeneri Yolu, Sariyer 34450 Istanbul Turkey
| | - Tuomas P J Knowles
- Yusuf Hamied Department of Chemistry, University of Cambridge Cambridge CB2 1EW UK
- Cavendish Laboratory, Department of Physics, University of Cambridge Cambridge CB3 0HE UK
| | - Ruben Laplaza
- Laboratory for Computational Molecular Design (LCMD), Institute of Chemical Sciences and Engineering (ISIC), École Polytechnique Fédérale de Lausanne (EPFL) CH-1015 Lausanne Switzerland
| | - Michele Lessona
- Department of Applied Science and Technology (DISAT), Politecnico di Torino 10129 Torino Italy
| | - Sauradeep Majumdar
- Laboratory of Molecular Simulation (LSMO), Institut des Sciences et Ingénierie Chimiques, École Polytechnique Fédérale de Lausanne (EPFL) Rue de l'Industrie 17 CH-1951 Sion Switzerland
| | | | - Ruaraidh D McIntosh
- Institute of Chemical Sciences, School of Engineering and Physical Sciences, Heriot-Watt University Edinburgh EH14 4AS UK
| | - Seyed Mohamad Moosavi
- Chemical Engineering & Applied Chemistry, University of Toronto Toronto Ontario M5S 3E5 Canada
| | - Beatriz Mouriño
- Laboratory of Molecular Simulation (LSMO), Institut des Sciences et Ingénierie Chimiques, École Polytechnique Fédérale de Lausanne (EPFL) Rue de l'Industrie 17 CH-1951 Sion Switzerland
| | - Francesca Nerli
- Dipartimento di Chimica e Chimica Industriale, Unità di Ricerca INSTM, Università di Pisa Via Giuseppe Moruzzi 13 56124 Pisa Italy
| | - Covadonga Pevida
- Instituto de Ciencia y Tecnología del Carbono (INCAR), CSIC Francisco Pintado Fe 26 33011 Oviedo Spain
| | - Neda Poudineh
- The Research Centre for Carbon Solutions (RCCS), School of Engineering and Physical Sciences, Heriot-Watt University Edinburgh EH14 4AS UK
| | - Mahyar Rajabi-Kochi
- Chemical Engineering & Applied Chemistry, University of Toronto Toronto Ontario M5S 3E5 Canada
| | - Kadi L Saar
- Yusuf Hamied Department of Chemistry, University of Cambridge Cambridge CB2 1EW UK
| | | | - Morteza Sagharichiha
- Department of Chemical Engineering, College of Engineering, University of Tehran Tehran Iran
| | - K J Schmidt
- Department of Computer Science, University of Chicago Chicago IL 60637 USA
| | - Jiale Shi
- Department of Chemical Engineering, Massachusetts Institute of Technology Cambridge MA 02139 USA
- Department of Chemical and Biomolecular Engineering, University of Notre Dame Notre Dame Indiana 46556 USA
| | - Elena Simone
- Department of Applied Science and Technology (DISAT), Politecnico di Torino 10129 Torino Italy
| | - Dennis Svatunek
- Institute of Applied Synthetic Chemistry, TU Wien Getreidemarkt 9 1060 Vienna Austria
| | - Marco Taddei
- Dipartimento di Chimica e Chimica Industriale, Unità di Ricerca INSTM, Università di Pisa Via Giuseppe Moruzzi 13 56124 Pisa Italy
| | - Igor Tetko
- BIGCHEM GmbH Valerystraße 49 85716 Unterschleißheim Germany
- Institute of Structural Biology, Molecular Targets and Therapeutics Center, Helmholtz Munich - Deutsches Forschungszentrum für Gesundheit und Umwelt (GmbH) Ingolstädter Landstraße 1 85764 Neuherberg Germany
| | - Domonkos Tolnai
- Institute of Metallic Biomaterials, Helmholtz Zentrum Hereon Geesthacht Germany
| | - Sahar Vahdatifar
- Department of Chemical Engineering, College of Engineering, University of Tehran Tehran Iran
| | - Jonathan Whitmer
- Department of Chemical and Biomolecular Engineering, University of Notre Dame Notre Dame Indiana 46556 USA
- Department of Chemistry and Biochemistry, University of Notre Dame Notre Dame Indiana 46556 USA
| | - D C Florian Wieland
- Institute of Metallic Biomaterials, Helmholtz Zentrum Hereon Geesthacht Germany
| | | | - Andreas Züttel
- Laboratory of Materials for Renewable Energy (LMER), Institut des Sciences et Ingénierie Chimiques, École Polytechnique Fédérale de Lausanne (EPFL) Rue de l'Industrie 17 CH-1951 Sion Switzerland
| | - Berend Smit
- Laboratory of Molecular Simulation (LSMO), Institut des Sciences et Ingénierie Chimiques, École Polytechnique Fédérale de Lausanne (EPFL) Rue de l'Industrie 17 CH-1951 Sion Switzerland
2
Wang Z, Lin K, Pei J, Lai L. Reacon: a template- and cluster-based framework for reaction condition prediction. Chem Sci 2025; 16:854-866. [PMID: 39650221 PMCID: PMC11622862 DOI: 10.1039/d4sc05946h] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/03/2024] [Accepted: 11/27/2024] [Indexed: 12/11/2024] Open
Abstract
Computer-assisted synthesis planning has emerged as a valuable tool for organic synthesis. Prediction of reaction conditions is crucial for applying the planned synthesis routes. However, achieving diverse suggestions while ensuring the reasonableness of predictions remains an underexplored challenge. In this study, we introduce an innovative method for forecasting reaction conditions using a combination of graph neural networks, reaction templates, and a clustering algorithm. Our method, trained on the refined USPTO dataset, achieves a top-3 accuracy of 63.48% in recalling the recorded conditions. Moreover, when focusing solely on recalling reactions within the same cluster, the top-3 accuracy increases to 85.65%. Finally, by applying the method to recently published molecule synthesis routes and achieving an 85.00% top-3 accuracy at the cluster level, we demonstrate our approach's capability to deliver reliable and diverse condition predictions.
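The difference between the two top-3 figures quoted above can be illustrated with a small sketch. It is not the Reacon implementation: the condition labels, the cluster assignments, and the function name are hypothetical, and it only shows how exact-match and cluster-level top-k accuracy are usually computed.

```python
def top_k_accuracy(ranked_predictions, recorded, k=3, to_group=None):
    """Fraction of reactions whose recorded condition appears in the top-k list.

    to_group maps a condition to the unit being compared; the default
    compares conditions exactly, while a cluster lookup gives the
    cluster-level score.
    """
    to_group = to_group or (lambda condition: condition)
    hits = 0
    for predictions, truth in zip(ranked_predictions, recorded):
        top_groups = {to_group(p) for p in predictions[:k]}
        if to_group(truth) in top_groups:
            hits += 1
    return hits / len(recorded)


# Hypothetical condition labels and cluster assignments (illustration only).
CLUSTER_OF = {"A1": "A", "A2": "A", "B1": "B", "B2": "B", "C1": "C"}

RANKED = [["A1", "B1", "C1"], ["B2", "C1", "A2"]]  # model output per reaction
RECORDED = ["A1", "B1"]                            # condition found in the data

if __name__ == "__main__":
    print(top_k_accuracy(RANKED, RECORDED))                           # exact match
    print(top_k_accuracy(RANKED, RECORDED, to_group=CLUSTER_OF.get))  # same cluster
```

In this toy data the second reaction misses the exact recorded condition but hits its cluster, which is why a cluster-level score can only be equal to or higher than the exact-match score.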
Affiliation(s)
- Zihan Wang
- BNLMS, Peking-Tsinghua Center for Life Sciences, College of Chemistry and Molecular Engineering, Peking University Beijing 100871 China
| | - Kangjie Lin
- BNLMS, Peking-Tsinghua Center for Life Sciences, College of Chemistry and Molecular Engineering, Peking University Beijing 100871 China
| | - Jianfeng Pei
- Center for Quantitative Biology, Academy for Advanced Interdisciplinary Studies, Peking University Beijing 100871 China
| | - Luhua Lai
- BNLMS, Peking-Tsinghua Center for Life Sciences, College of Chemistry and Molecular Engineering, Peking University Beijing 100871 China
- Center for Quantitative Biology, Academy for Advanced Interdisciplinary Studies, Peking University Beijing 100871 China
3
Uceda RG, Gijón A, Míguez‐Lago S, Cruz CM, Blanco V, Fernández‐Álvarez F, Álvarez de Cienfuegos L, Molina‐Solana M, Gómez‐Romero J, Miguel D, Mota AJ, Cuerva JM. Can Deep Learning Search for Exceptional Chiroptical Properties? The Halogenated [6]Helicene Case. Angew Chem Int Ed Engl 2024; 63:e202409998. [PMID: 39329214 PMCID: PMC11586703 DOI: 10.1002/anie.202409998] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/27/2024] [Revised: 09/11/2024] [Accepted: 09/24/2024] [Indexed: 09/28/2024]
Abstract
The relationship between chemical structure and chiroptical properties is not always clearly understood. Currently, efforts to develop new systems with enhanced optical properties follow a trial-and-error approach. A larger body of data would allow us to draw more robust conclusions and guide research toward molecules with practical applications. To this end, in this work we predict the chiroptical properties of millions of halogenated [6]helicenes in terms of the rotatory strength (R). We used DFT calculations on randomly created derivatives containing from 1 to 16 halogen atoms, which were then used as a data set to train different deep neural network models. These models allow us i) to predict the Rmax for any halogenated [6]helicene at very low computational cost, and ii) to understand the physical reasons that favour some substitutions over others. Finally, we synthesized derivatives with higher predicted Rmax, obtaining excellent correlation between the experimental and predicted values.
Affiliation(s)
- Rafael G. Uceda
- Departamento de Química Orgánica, Unidad de Excelencia de Química Aplicada a la Biomedicina y Medioambiente (UEQ), Universidad de Granada (UGR), Facultad de Ciencias, C. U. Fuentenueva, 18071 Granada, Spain
| | - Alfonso Gijón
- Departamento de Ciencias de la Computación e Inteligencia Artificial, UGR, E.T.S. de Ingenierías Informática y de Telecomunicación, C/ Periodista Daniel Saucedo Aranda S/N, 18071 Granada, Spain
| | - Sandra Míguez‐Lago
- Departamento de Química Orgánica, Unidad de Excelencia de Química Aplicada a la Biomedicina y Medioambiente (UEQ), Universidad de Granada (UGR), Facultad de Ciencias, C. U. Fuentenueva, 18071 Granada, Spain
| | - Carlos M. Cruz
- Departamento de Química Orgánica, Unidad de Excelencia de Química Aplicada a la Biomedicina y Medioambiente (UEQ), Universidad de Granada (UGR), Facultad de Ciencias, C. U. Fuentenueva, 18071 Granada, Spain
| | - Víctor Blanco
- Departamento de Química Orgánica, Unidad de Excelencia de Química Aplicada a la Biomedicina y Medioambiente (UEQ), Universidad de Granada (UGR), Facultad de Ciencias, C. U. Fuentenueva, 18071 Granada, Spain
| | - Fátima Fernández‐Álvarez
- Departamento de Química Orgánica, Unidad de Excelencia de Química Aplicada a la Biomedicina y Medioambiente (UEQ), Universidad de Granada (UGR), Facultad de Ciencias, C. U. Fuentenueva, 18071 Granada, Spain
| | - Luis Álvarez de Cienfuegos
- Departamento de Química Orgánica, Unidad de Excelencia de Química Aplicada a la Biomedicina y Medioambiente (UEQ), Universidad de Granada (UGR), Facultad de Ciencias, C. U. Fuentenueva, 18071 Granada, Spain
- Instituto de Investigación Biosanitaria, Avda. Madrid, 15, 18016 Granada, Spain
| | - Miguel Molina‐Solana
- Departamento de Ciencias de la Computación e Inteligencia Artificial, UGR, E.T.S. de Ingenierías Informática y de Telecomunicación, C/ Periodista Daniel Saucedo Aranda S/N, 18071 Granada, Spain
| | - Juan Gómez‐Romero
- Departamento de Ciencias de la Computación e Inteligencia Artificial, UGR, E.T.S. de Ingenierías Informática y de Telecomunicación, C/ Periodista Daniel Saucedo Aranda S/N, 18071 Granada, Spain
| | - Delia Miguel
- Departamento de Fisicoquímica, UEQ, UGR, Facultad de Farmacia, Avda. Profesor Clavera s/n, C. U. Cartuja, 18071 Granada, Spain
| | - Antonio J. Mota
- Departamento de Química Inorgánica, UEQ, UGR, Facultad de Ciencias, C. U. Fuentenueva, 18071 Granada, Spain
| | - Juan M. Cuerva
- Departamento de Química Orgánica, Unidad de Excelencia de Química Aplicada a la Biomedicina y Medioambiente (UEQ), Universidad de Granada (UGR), Facultad de Ciencias, C. U. Fuentenueva, 18071 Granada, Spain
4
Sharma P, Chowdhury PR, Jain A, Patwari GN. Machine Learned Potential Enables Molecular Dynamics Simulation to Predict the Experimental Branching Ratios in the NO Release Channel of Nitroaromatic Compounds. J Phys Chem A 2024; 128:10137-10142. [PMID: 39550764 DOI: 10.1021/acs.jpca.4c04703] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/19/2024]
Abstract
This study employs a machine learning (ML) model using the Gaussian process regression algorithm to generate potential energy surfaces (PES) from density functional theory calculations, facilitating the investigation of photodissociation dynamics of nitroaromatic compounds, resulting in NO release. The experimentally observed trends in the slow-to-fast branching ratios of the NO moiety were captured by estimating the branching ratio between the two distinct reaction pathways, viz., roaming and oxaziridine mechanisms, calculated from molecular dynamics simulations performed on a reduced two-dimensional T1 surface. The qualitative agreement between the calculated and experimental results suggests that the mechanism dictating NO release is primarily governed by the dynamics on the T1 surface.
Affiliation(s)
- Pooja Sharma
- Department of Chemistry, Indian Institute of Technology Bombay, Mumbai 400076, India
| | - Prahlad Roy Chowdhury
- Department of Chemistry, Indian Institute of Technology Bombay, Mumbai 400076, India
| | - Amber Jain
- Department of Chemistry, Indian Institute of Technology Bombay, Mumbai 400076, India
| | - G Naresh Patwari
- Department of Chemistry, Indian Institute of Technology Bombay, Mumbai 400076, India
5
Ruan Y, Lu C, Xu N, He Y, Chen Y, Zhang J, Xuan J, Pan J, Fang Q, Gao H, Shen X, Ye N, Zhang Q, Mo Y. An automatic end-to-end chemical synthesis development platform powered by large language models. Nat Commun 2024; 15:10160. [PMID: 39580482 PMCID: PMC11585555 DOI: 10.1038/s41467-024-54457-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/03/2024] [Accepted: 11/07/2024] [Indexed: 11/25/2024] Open
Abstract
The rapid emergence of large language model (LLM) technology presents promising opportunities to facilitate the development of synthetic reactions. In this work, we leveraged the power of GPT-4 to build an LLM-based reaction development framework (LLM-RDF) to handle fundamental tasks involved throughout chemical synthesis development. LLM-RDF comprises six specialized LLM-based agents, including Literature Scouter, Experiment Designer, Hardware Executor, Spectrum Analyzer, Separation Instructor, and Result Interpreter, which are pre-prompted to accomplish the designated tasks. A web application with LLM-RDF as the backend was built to allow chemist users to interact with automated experimental platforms and analyze results via natural language, thus eliminating the need for coding skills and ensuring accessibility for all chemists. We demonstrated the capabilities of LLM-RDF in guiding the end-to-end synthesis development process for the copper/TEMPO-catalyzed aerobic alcohol oxidation to aldehyde reaction, including literature search and information extraction, substrate scope and condition screening, reaction kinetics study, reaction condition optimization, reaction scale-up and product purification. Furthermore, LLM-RDF's broader applicability and versatility were validated on various synthesis tasks of three distinct reactions (SNAr reaction, photoredox C-C cross-coupling reaction, and heterogeneous photoelectrochemical reaction).
Affiliation(s)
- Yixiang Ruan
- College of Chemical and Biological Engineering, Zhejiang University, Hangzhou, 310027, China
- Zhejiang-Hong Kong Joint Laboratory for Intelligent Molecule and Material Design and Synthesis, ZJU-Hangzhou Global Scientific and Technological Innovation Center, Hangzhou, 311215, China
| | - Chenyin Lu
- Zhejiang-Hong Kong Joint Laboratory for Intelligent Molecule and Material Design and Synthesis, ZJU-Hangzhou Global Scientific and Technological Innovation Center, Hangzhou, 311215, China
| | - Ning Xu
- College of Chemical and Biological Engineering, Zhejiang University, Hangzhou, 310027, China
- Zhejiang-Hong Kong Joint Laboratory for Intelligent Molecule and Material Design and Synthesis, ZJU-Hangzhou Global Scientific and Technological Innovation Center, Hangzhou, 311215, China
| | - Yuchen He
- College of Chemical and Biological Engineering, Zhejiang University, Hangzhou, 310027, China
- Zhejiang-Hong Kong Joint Laboratory for Intelligent Molecule and Material Design and Synthesis, ZJU-Hangzhou Global Scientific and Technological Innovation Center, Hangzhou, 311215, China
| | - Yixin Chen
- College of Chemical and Biological Engineering, Zhejiang University, Hangzhou, 310027, China
- Zhejiang-Hong Kong Joint Laboratory for Intelligent Molecule and Material Design and Synthesis, ZJU-Hangzhou Global Scientific and Technological Innovation Center, Hangzhou, 311215, China
| | - Jian Zhang
- Zhejiang-Hong Kong Joint Laboratory for Intelligent Molecule and Material Design and Synthesis, ZJU-Hangzhou Global Scientific and Technological Innovation Center, Hangzhou, 311215, China
| | - Jun Xuan
- Zhejiang-Hong Kong Joint Laboratory for Intelligent Molecule and Material Design and Synthesis, ZJU-Hangzhou Global Scientific and Technological Innovation Center, Hangzhou, 311215, China
| | - Jianzhang Pan
- Zhejiang-Hong Kong Joint Laboratory for Intelligent Molecule and Material Design and Synthesis, ZJU-Hangzhou Global Scientific and Technological Innovation Center, Hangzhou, 311215, China
- Institute of Microanalytical Systems, Department of Chemistry, Zhejiang University, Hangzhou, 310058, China
| | - Qun Fang
- Zhejiang-Hong Kong Joint Laboratory for Intelligent Molecule and Material Design and Synthesis, ZJU-Hangzhou Global Scientific and Technological Innovation Center, Hangzhou, 311215, China
- Institute of Microanalytical Systems, Department of Chemistry, Zhejiang University, Hangzhou, 310058, China
| | - Hanyu Gao
- Department of Chemical and Biological Engineering, The Hong Kong University of Science and Technology, Hong Kong, 999077, China
| | - Xiaodong Shen
- Chemical & Analytical Development, Suzhou Novartis Technical Development Co. Ltd., Changshu, 215537, China
| | - Ning Ye
- Rezubio Pharmaceuticals Co. Ltd., Zhuhai, 519070, China
| | - Qiang Zhang
- Zhejiang-Hong Kong Joint Laboratory for Intelligent Molecule and Material Design and Synthesis, ZJU-Hangzhou Global Scientific and Technological Innovation Center, Hangzhou, 311215, China
- College of Computer Science and Technology, Zhejiang University, Hangzhou, 310027, China
| | - Yiming Mo
- College of Chemical and Biological Engineering, Zhejiang University, Hangzhou, 310027, China
- Zhejiang-Hong Kong Joint Laboratory for Intelligent Molecule and Material Design and Synthesis, ZJU-Hangzhou Global Scientific and Technological Innovation Center, Hangzhou, 311215, China
6
Maziarz K, Tripp A, Liu G, Stanley M, Xie S, Gaiński P, Seidl P, Segler MHS. Re-evaluating retrosynthesis algorithms with Syntheseus. Faraday Discuss 2024. [PMID: 39485491 DOI: 10.1039/d4fd00093e] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/03/2024]
Abstract
Automated synthesis planning has recently re-emerged as a research area at the intersection of chemistry and machine learning. Despite the appearance of steady progress, we argue that imperfect benchmarks and inconsistent comparisons mask systematic shortcomings of existing techniques, and unnecessarily hamper progress. To remedy this, we present a synthesis planning library with an extensive benchmarking framework, called SYNTHESEUS, which promotes best practice by default, enabling consistent meaningful evaluation of single-step and multi-step synthesis planning algorithms. We demonstrate the capabilities of SYNTHESEUS by re-evaluating several previous retrosynthesis algorithms, and find that the ranking of state-of-the-art models changes in controlled evaluation experiments. We end with guidance for future works in this area, and call on the community to engage in the discussion on how to improve benchmarks for synthesis planning.
7
Burke AJ, Carreiro EP. 5th International Symposium on Synthesis and Catalysis (ISySyCat2023). Beilstein J Org Chem 2024; 20:2704-2707. [PMID: 39498448 PMCID: PMC11533119 DOI: 10.3762/bjoc.20.227] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/06/2024] [Accepted: 10/15/2024] [Indexed: 11/07/2024] Open
Affiliation(s)
- Anthony J Burke
- Faculty of Pharmacy, University of Coimbra, Pólo das Ciências da Saúde, Azinhaga de Santa Comba, 3000-548 Coimbra, Portugal
- Coimbra Chemistry Centre, Institute of Molecular Sciences, Chemistry Department, Faculty of Science and Technology, University of Coimbra, 3004-535 Coimbra, Portugal
| | - Elisabete P Carreiro
- LAQV-REQUIMTE, Institute for Research and Advanced Training (IIFA), University of Évora, Rua Romão Ramalho, 59, 7000-671 Évora, Portugal
8
Yu M, Jia Q, Wang Q, Luo ZH, Yan F, Zhou YN. Data science-centric design, discovery, and evaluation of novel synthetically accessible polyimides with desired dielectric constants. Chem Sci 2024:d4sc05000b. [PMID: 39416299 PMCID: PMC11474456 DOI: 10.1039/d4sc05000b] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/26/2024] [Accepted: 10/01/2024] [Indexed: 10/19/2024] Open
Abstract
Rapidly advancing computer technology has demonstrated great potential in recent years to assist in the generation and discovery of promising molecular structures. Herein, we present a data science-centric "Design-Discovery-Evaluation" scheme for exploring novel polyimides (PIs) with desired dielectric constants (ε). A virtual library of over 100 000 synthetically accessible PIs is created by extending existing PIs. Within the framework of quantitative structure-property relationship (QSPR), a model sufficient to predict ε at multiple frequencies is developed with an R² of 0.9768, allowing further high-throughput screening of the prior structures with desired ε. Furthermore, the structural feature representation method of atomic adjacent group (AAG) is introduced, using which the reliability of high-throughput screening results is evaluated. This workflow identifies 9 novel PIs (ε > 5 at 10³ Hz and glass transition temperatures between 250 °C and 350 °C) with potential applications in high-temperature capacitive energy storage, and confirms these promising findings by high-fidelity molecular dynamics (MD) simulations.
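The two quantitative steps in this abstract, scoring a model by its coefficient of determination and screening candidates against property windows, can be sketched as follows. The candidate names and values are hypothetical, the field names are illustrative, and treating the glass-transition window as inclusive is an assumption; none of this is the authors' code.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def screen(candidates, min_eps=5.0, tg_window=(250.0, 350.0)):
    """Keep candidates with predicted eps above min_eps and Tg inside the window.

    The window bounds are treated as inclusive, which is an assumption.
    """
    low, high = tg_window
    return [
        c for c in candidates
        if c["eps_1khz"] > min_eps and low <= c["tg_celsius"] <= high
    ]


# Hypothetical predicted properties (illustration only).
CANDIDATES = [
    {"name": "PI-A", "eps_1khz": 5.4, "tg_celsius": 300.0},
    {"name": "PI-B", "eps_1khz": 3.2, "tg_celsius": 280.0},  # eps too low
    {"name": "PI-C", "eps_1khz": 6.1, "tg_celsius": 380.0},  # Tg too high
]

if __name__ == "__main__":
    print([c["name"] for c in screen(CANDIDATES)])
    print(r_squared([1.0, 2.0, 3.0], [1.1, 1.9, 3.2]))
```

An R² of 1 means the predictions reproduce the reference values exactly, while 0 means they do no better than always predicting the mean.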
Affiliation(s)
- Mengxian Yu
- School of Chemical Engineering and Material Science, Tianjin University of Science and Technology Tianjin 300457 P. R. China
| | - Qingzhu Jia
- School of Chemical Engineering and Material Science, Tianjin University of Science and Technology Tianjin 300457 P. R. China
| | - Qiang Wang
- School of Chemical Engineering and Material Science, Tianjin University of Science and Technology Tianjin 300457 P. R. China
| | - Zheng-Hong Luo
- Department of Chemical Engineering, School of Chemistry and Chemical Engineering, State Key Laboratory of Metal Matrix Composites, Shanghai Jiao Tong University Shanghai 200240 P. R. China
| | - Fangyou Yan
- School of Chemical Engineering and Material Science, Tianjin University of Science and Technology Tianjin 300457 P. R. China
| | - Yin-Ning Zhou
- Department of Chemical Engineering, School of Chemistry and Chemical Engineering, State Key Laboratory of Metal Matrix Composites, Shanghai Jiao Tong University Shanghai 200240 P. R. China
9
Han Y, Deng M, Liu K, Chen J, Wang Y, Xu YN, Dian L. Computer-Aided Synthesis Planning (CASP) and Machine Learning: Optimizing Chemical Reaction Conditions. Chemistry 2024; 30:e202401626. [PMID: 39083362 DOI: 10.1002/chem.202401626] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/26/2024] [Revised: 07/27/2024] [Accepted: 07/28/2024] [Indexed: 08/02/2024]
Abstract
Computer-aided synthesis planning (CASP) has garnered increasing attention in light of recent advancements in machine learning models. While the focus is on reverse synthesis or forward outcome prediction, optimizing reaction conditions remains a significant challenge. For datasets with multiple variables, the choice of descriptors and models is pivotal. This selection dictates the effective extraction of conditional features and the achievement of higher prediction accuracy. This review delineates the origins of data in conditional optimization, the criteria for descriptor selection, the response models, and the metrics for outcome evaluation, aiming to acquaint readers with the latest research trends and facilitate more informed research in this domain.
Affiliation(s)
- Yu Han
- State Key Laboratory of Microbial Technology, Institute of Microbial Technology, Shandong University, No. 72 Binhai Avenue, Qingdao, 266237, P. R. China
| | - Mingjing Deng
- State Key Laboratory of Microbial Technology, Institute of Microbial Technology, Shandong University, No. 72 Binhai Avenue, Qingdao, 266237, P. R. China
| | - Ke Liu
- State Key Laboratory of Microbial Technology, Institute of Microbial Technology, Shandong University, No. 72 Binhai Avenue, Qingdao, 266237, P. R. China
| | - Jia Chen
- State Key Laboratory of Microbial Technology, Institute of Microbial Technology, Shandong University, No. 72 Binhai Avenue, Qingdao, 266237, P. R. China
| | - Yuting Wang
- State Key Laboratory of Microbial Technology, Institute of Microbial Technology, Shandong University, No. 72 Binhai Avenue, Qingdao, 266237, P. R. China
| | - Yu-Ning Xu
- State Key Laboratory of Microbial Technology, Institute of Microbial Technology, Shandong University, No. 72 Binhai Avenue, Qingdao, 266237, P. R. China
| | - Longyang Dian
- State Key Laboratory of Microbial Technology, Institute of Microbial Technology, Shandong University, No. 72 Binhai Avenue, Qingdao, 266237, P. R. China
- Suzhou Institute of Shandong University, No. 388 Ruoshui Road, Suzhou Industrial Park, Suzhou, 215123, P. R. China
10
Singh S, Hernández-Lobato JM. Data-Driven Insights into the Transition-Metal-Catalyzed Asymmetric Hydrogenation of Olefins. J Org Chem 2024; 89:12467-12478. [PMID: 39149801 PMCID: PMC11382158 DOI: 10.1021/acs.joc.4c01396] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/17/2024]
Abstract
The transition-metal-catalyzed asymmetric hydrogenation of olefins is one of the key transformations with great utility in various industrial applications. The field has been dominated by the use of noble metal catalysts, such as iridium and rhodium. The reactions with the earth-abundant cobalt metal have increased only in recent years. In this work, we analyze the large amount of literature data available on iridium- and rhodium-catalyzed asymmetric hydrogenation. The limited data on reactions using Co catalysts are then examined in the context of Ir and Rh to obtain a better understanding of the reactivity pattern. A detailed data-driven study of the types of olefins, ligands, and reaction conditions such as solvent, temperature, and pressure is carried out. Our analysis provides an understanding of the literature trends and demonstrates that only a few olefin-ligand combinations or reaction conditions are frequently used. The knowledge of this bias in the literature data toward a certain group of substrates or reaction conditions can be useful for practitioners to design new reaction data sets that are suitable to obtain meaningful predictions from machine-learning models.
Affiliation(s)
- Sukriti Singh
  - Department of Engineering, University of Cambridge, Cambridge CB2 1PZ, U.K
11
Schäfer F, Lückemeier L, Glorius F. Improving reproducibility through condition-based sensitivity assessments: application, advancement and prospect. Chem Sci 2024:d4sc03017f. [PMID: 39263664 PMCID: PMC11382186 DOI: 10.1039/d4sc03017f] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/08/2024] [Accepted: 08/29/2024] [Indexed: 09/13/2024] Open
Abstract
The fluctuating reproducibility of scientific reports presents a well-recognised issue, frequently stemming from insufficient standardisation, transparency and a lack of information in scientific publications. Consequently, the incorporation of newly developed synthetic methods into practical applications often occurs at a slow rate. In recent years, various efforts have been made to analyse the sensitivity of chemical methodologies and the variation in quantitative outcome observed across different laboratory environments. For today's chemists, determining the key factors that really matter for a reaction's outcome from all the different aspects of chemical methodology can be a challenging task. In response, we provide a detailed examination and customised recommendations surrounding the sensitivity screen, offering a comprehensive assessment of various strategies and exploring their diverse applications by research groups to improve the practicality of their methodologies.
Affiliation(s)
- Felix Schäfer
  - Universität Münster, Organisch-Chemisches Institut Corrensstraße 36 48149 Münster Germany
- Lukas Lückemeier
  - Universität Münster, Organisch-Chemisches Institut Corrensstraße 36 48149 Münster Germany
- Frank Glorius
  - Universität Münster, Organisch-Chemisches Institut Corrensstraße 36 48149 Münster Germany
12
Feng Y, Morato NM, Huang KH, Lin M, Cooks RG. High-throughput label-free opioid receptor binding assays using an automated desorption electrospray ionization mass spectrometry platform. Chem Commun (Camb) 2024; 60:8224-8227. [PMID: 39007214 PMCID: PMC11293027 DOI: 10.1039/d4cc02346c] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/14/2024] [Accepted: 07/09/2024] [Indexed: 07/16/2024]
Abstract
The current opioid epidemic has incentivized the discovery of new non-addictive analgesics, a process that requires the screening of opioid receptor binding, traditionally performed using radiometric assays. Here we describe a label-free alternative based on high-throughput (1 Hz) ambient mass spectrometry for screening the receptor binding of new opioid analogues.
Affiliation(s)
- Yunfei Feng
  - Department of Chemistry, Bindley Bioscience Center, and Purdue Institute for Cancer Research, Purdue University, West Lafayette, IN 47907, USA.
- Nicolás M Morato
  - Department of Chemistry, Bindley Bioscience Center, and Purdue Institute for Cancer Research, Purdue University, West Lafayette, IN 47907, USA.
- Kai-Hung Huang
  - Department of Chemistry, Bindley Bioscience Center, and Purdue Institute for Cancer Research, Purdue University, West Lafayette, IN 47907, USA.
- Mina Lin
  - Department of Chemistry, Bindley Bioscience Center, and Purdue Institute for Cancer Research, Purdue University, West Lafayette, IN 47907, USA.
- R Graham Cooks
  - Department of Chemistry, Bindley Bioscience Center, and Purdue Institute for Cancer Research, Purdue University, West Lafayette, IN 47907, USA.
13
Lan T, Wang H, An Q. Enabling high throughput deep reinforcement learning with first principles to investigate catalytic reaction mechanisms. Nat Commun 2024; 15:6281. [PMID: 39060277 PMCID: PMC11282263 DOI: 10.1038/s41467-024-50531-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/10/2023] [Accepted: 07/11/2024] [Indexed: 07/28/2024] Open
Abstract
Exploring catalytic reaction mechanisms is crucial for understanding chemical processes, optimizing reaction conditions, and developing more effective catalysts. We present a reaction-agnostic framework based on high-throughput deep reinforcement learning with first principles (HDRL-FP) that offers excellent generalizability for investigating catalytic reactions. HDRL-FP introduces a generalizable reinforcement learning representation of catalytic reactions constructed solely from atomic positions, which are subsequently mapped to first-principles-derived potential energy landscapes. By leveraging thousands of simultaneous simulations on a single GPU, HDRL-FP enables rapid convergence to the optimal reaction path at a low cost. Its effectiveness is demonstrated through the studies of hydrogen and nitrogen migration in Haber-Bosch ammonia synthesis on the Fe(111) surface. Our findings reveal that the Langmuir-Hinshelwood mechanism shares the same transition state as the Eley-Rideal mechanism for H migration to NH2, forming ammonia. Furthermore, the reaction path identified herein exhibits a lower energy barrier compared to that through nudged elastic band calculation.
Affiliation(s)
- Tian Lan
  - Salesforce A.I. Research, Palo Alto, CA, USA
- Huan Wang
  - Salesforce A.I. Research, Palo Alto, CA, USA
- Qi An
  - Department of Materials Science and Engineering, Iowa State University, Ames, IA, USA.
14
Fan V, Qian Y, Wang A, Wang A, Coley CW, Barzilay R. OpenChemIE: An Information Extraction Toolkit for Chemistry Literature. J Chem Inf Model 2024; 64:5521-5534. [PMID: 38950894 DOI: 10.1021/acs.jcim.4c00572] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/03/2024]
Abstract
Information extraction from chemistry literature is vital for constructing up-to-date reaction databases for data-driven chemistry. Complete extraction requires combining information across text, tables, and figures, whereas prior work has mainly investigated extracting reactions from single modalities. In this paper, we present OpenChemIE to address this complex challenge and enable the extraction of reaction data at the document level. OpenChemIE approaches the problem in two steps: extracting relevant information from individual modalities and then integrating the results to obtain a final list of reactions. For the first step, we employ specialized neural models that each address a specific task for chemistry information extraction, such as parsing molecules or reactions from text or figures. We then integrate the information from these modules using chemistry-informed algorithms, allowing for the extraction of fine-grained reaction data from reaction condition and substrate scope investigations. Our machine learning models attain state-of-the-art performance when evaluated individually, and we meticulously annotate a challenging dataset of reaction schemes with R-groups to evaluate our pipeline as a whole, achieving an F1 score of 69.5%. Additionally, the reaction extraction results of OpenChemIE attain an accuracy score of 64.3% when directly compared against the Reaxys chemical database. OpenChemIE is most suited for information extraction on organic chemistry literature, where molecules are generally depicted as planar graphs or written in text and can be consolidated into a SMILES format. We provide OpenChemIE freely to the public as an open-source package, as well as through a web interface.
Affiliation(s)
- Vincent Fan
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
- Yujie Qian
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
- Alex Wang
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
- Amber Wang
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
- Connor W Coley
  - Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
  - Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
- Regina Barzilay
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
  - Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
15
Singh S, Hernández-Lobato JM. Deep Kernel learning for reaction outcome prediction and optimization. Commun Chem 2024; 7:136. [PMID: 38877182 PMCID: PMC11178803 DOI: 10.1038/s42004-024-01219-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/10/2024] [Accepted: 06/05/2024] [Indexed: 06/16/2024] Open
Abstract
Recent years have seen a rapid growth in the application of various machine learning methods for reaction outcome prediction. Deep learning models have gained popularity due to their ability to learn representations directly from the molecular structure. Gaussian processes (GPs), on the other hand, provide reliable uncertainty estimates but are unable to learn representations from the data. We combine the feature learning ability of neural networks (NNs) with uncertainty quantification of GPs in a deep kernel learning (DKL) framework to predict the reaction outcome. The DKL model is observed to obtain very good predictive performance across different input representations. It significantly outperforms standard GPs and provides comparable performance to graph neural networks, but with uncertainty estimation. Additionally, the uncertainty estimates on predictions provided by the DKL model facilitated its incorporation as a surrogate model for Bayesian optimization (BO). The proposed method, therefore, has a great potential towards accelerating reaction discovery by integrating accurate predictive models that provide reliable uncertainty estimates with BO.
Affiliation(s)
- Sukriti Singh
  - Department of Engineering, University of Cambridge, Cambridge, UK.
16
Rezaee M, Ekrami S, Hashemianzadeh SM. Comparing ANI-2x, ANI-1ccx neural networks, force field, and DFT methods for predicting conformational potential energy of organic molecules. Sci Rep 2024; 14:11791. [PMID: 38783010 PMCID: PMC11116541 DOI: 10.1038/s41598-024-62242-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/06/2023] [Accepted: 05/15/2024] [Indexed: 05/25/2024] Open
Abstract
In this study, the conformational potential energy surfaces of Amylmetacresol, Benzocaine, Dopamine, Betazole, and Betahistine molecules were scanned and analyzed using the neural network architecture ANI-2x and ANI-1ccx, the force field method OPLS, and density functional theory with the exchange-correlation functional B3LYP and the basis set 6-31G(d). The ANI-1ccx and ANI-2x methods demonstrated the highest accuracy in predicting torsional energy profiles, effectively capturing the minimum and maximum values of these profiles. Conformational potential energy values calculated by B3LYP and the OPLS force field method differ from those calculated by ANI-1ccx and ANI-2x, which account for non-bonded intramolecular interactions, since the B3LYP functional and OPLS force field weakly consider van der Waals and other intramolecular forces in torsional energy profiles. For a more comprehensive analysis, electronic parameters such as dipole moment, HOMO, and LUMO energies for different torsional angles were calculated at two levels of theory, B3LYP/6-31G(d) and ωB97X/6-31G(d). These calculations confirmed that ANI predictions are more accurate than density functional theory calculations with B3LYP functional and OPLS force field for determining potential energy surfaces. This research successfully addressed the challenges in determining conformational potential energy levels and shows how machine learning and deep neural networks offer a more accurate, cost-effective, and rapid alternative for predicting torsional energy profiles.
Affiliation(s)
- Mozafar Rezaee
  - Molecular Simulation Research Laboratory, Department of Chemistry, Iran University of Science and Technology, Tehran, Iran
- Saeid Ekrami
  - CNRS, LCPME, Université de Lorraine, 54000, Nancy, France
- Seyed Majid Hashemianzadeh
  - Molecular Simulation Research Laboratory, Department of Chemistry, Iran University of Science and Technology, Tehran, Iran.
17
Zhang J, Li L, Xie X, Song XQ, Schaefer HF. Biomimetic Frustrated Lewis Pair Catalysts for Hydrogenation of CO to Methanol at Low Temperatures. ACS Organic & Inorganic Au 2024; 4:258-267. [PMID: 38585511 PMCID: PMC10996047 DOI: 10.1021/acsorginorgau.3c00064] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/27/2023] [Revised: 01/12/2024] [Accepted: 01/16/2024] [Indexed: 04/09/2024]
Abstract
The industrial production of methanol through CO hydrogenation using the Cu/ZnO/Al2O3 catalyst requires harsh conditions, and the development of new catalysts with low operating temperatures is highly desirable. In this study, organic biomimetic FLP catalysts with good tolerance to CO poison are theoretically designed. The base-free catalytic reaction contains the 1,1-addition of CO into a formic acid intermediate and the hydrogenation of the formic acid intermediate into methanol. Low-energy spans (25.6, 22.1, and 20.6 kcal/mol) are achieved, indicating that CO can be hydrogenated into methanol at low temperatures. The new extended aromatization-dearomatization effect involving multiple rings is proposed to effectively facilitate the rate-determining CO 1,1-addition step, and a new CO activation model is proposed for organic catalysts.
Affiliation(s)
- Jiejing Zhang
  - College of Pharmacy, Key Laboratory of Pharmaceutical Quality Control of Hebei Province, Key Laboratory of Medicinal Chemistry and Molecular Diagnosis of Ministry of Education, Hebei University, Baoding 071002, Hebei, P. R. China
- Longfei Li
  - College of Pharmacy, Key Laboratory of Pharmaceutical Quality Control of Hebei Province, Key Laboratory of Medicinal Chemistry and Molecular Diagnosis of Ministry of Education, Hebei University, Baoding 071002, Hebei, P. R. China
- Xiaofeng Xie
  - College of Pharmacy, Key Laboratory of Pharmaceutical Quality Control of Hebei Province, Key Laboratory of Medicinal Chemistry and Molecular Diagnosis of Ministry of Education, Hebei University, Baoding 071002, Hebei, P. R. China
- Xue-Qing Song
  - College of Pharmacy, Key Laboratory of Pharmaceutical Quality Control of Hebei Province, Key Laboratory of Medicinal Chemistry and Molecular Diagnosis of Ministry of Education, Hebei University, Baoding 071002, Hebei, P. R. China
- Henry F. Schaefer
  - Center for Computational Quantum Chemistry, University of Georgia, Athens, Georgia 30602, United States
18
Pasquini M, Stenta M. LinChemIn: Route Arithmetic─Operations on Digital Synthetic Routes. J Chem Inf Model 2024; 64:1765-1771. [PMID: 38480486 DOI: 10.1021/acs.jcim.3c01819] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/26/2024]
Abstract
Computational tools are revolutionizing our understanding and prediction of chemical reactivity by combining traditional data analysis techniques with new predictive models. These tools extract additional value from the reaction data corpus, but to effectively convert this value into actionable knowledge, domain specialists need to interact easily with the computer-generated output. In this application note, we demonstrate the capabilities of the open-source Python toolkit LinChemIn, which simplifies the manipulation of reaction networks and provides advanced functionality for working with synthetic routes. LinChemIn ensures chemical consistency when merging, editing, mining, and analyzing reaction networks. Its flexible input interface can process routes from various sources, including predictive models and expert input. The toolkit also efficiently extracts individual routes from the combined synthetic tree, identifying alternative paths and reaction combinations. By reducing the operational barrier to accessing and analyzing synthetic routes from multiple sources, LinChemIn facilitates a constructive interplay between artificial intelligence and human expertise.
Affiliation(s)
- Marta Pasquini
  - Syngenta Crop Protection AG, Schaffhauserstrasse, 4332 Stein, AG, Switzerland
- Marco Stenta
  - Syngenta Crop Protection AG, Schaffhauserstrasse, 4332 Stein, AG, Switzerland
19
Gallarati S, van Gerwen P, Laplaza R, Brey L, Makaveev A, Corminboeuf C. A genetic optimization strategy with generality in asymmetric organocatalysis as a primary target. Chem Sci 2024; 15:3640-3660. [PMID: 38455002 PMCID: PMC10915838 DOI: 10.1039/d3sc06208b] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/20/2023] [Accepted: 01/30/2024] [Indexed: 03/09/2024] Open
Abstract
A catalyst possessing a broad substrate scope, in terms of both turnover and enantioselectivity, is sometimes called "general". Despite their great utility in asymmetric synthesis, truly general catalysts are difficult or expensive to discover via traditional high-throughput screening and are, therefore, rare. Existing computational tools accelerate the evaluation of reaction conditions from a pre-defined set of experiments to identify the most general ones, but cannot generate entirely new catalysts with enhanced substrate breadth. For these reasons, we report an inverse design strategy based on the open-source genetic algorithm NaviCatGA and on the OSCAR database of organocatalysts to simultaneously probe the catalyst and substrate scope and optimize generality as a primary target. We apply this strategy to the Pictet-Spengler condensation, for which we curate a database of 820 reactions, used to train statistical models of selectivity and activity. Starting from OSCAR, we define a combinatorial space of millions of catalyst possibilities, and perform evolutionary experiments on a diverse substrate scope that is representative of the whole chemical space of tetrahydro-β-carboline products. While privileged catalysts emerge, we show how genetic optimization can address the broader question of generality in asymmetric synthesis, extracting structure-performance relationships from the challenging areas of chemical space.
Affiliation(s)
- Simone Gallarati
  - Laboratory for Computational Molecular Design, Institute of Chemical Sciences and Engineering, Ecole Polytechnique Fédérale de Lausanne (EPFL) 1015 Lausanne Switzerland
- Puck van Gerwen
  - Laboratory for Computational Molecular Design, Institute of Chemical Sciences and Engineering, Ecole Polytechnique Fédérale de Lausanne (EPFL) 1015 Lausanne Switzerland
  - National Center for Competence in Research - Catalysis (NCCR-Catalysis), Ecole Polytechnique Fédérale de Lausanne (EPFL) 1015 Lausanne Switzerland
- Ruben Laplaza
  - Laboratory for Computational Molecular Design, Institute of Chemical Sciences and Engineering, Ecole Polytechnique Fédérale de Lausanne (EPFL) 1015 Lausanne Switzerland
  - National Center for Competence in Research - Catalysis (NCCR-Catalysis), Ecole Polytechnique Fédérale de Lausanne (EPFL) 1015 Lausanne Switzerland
- Lucien Brey
  - Laboratory for Computational Molecular Design, Institute of Chemical Sciences and Engineering, Ecole Polytechnique Fédérale de Lausanne (EPFL) 1015 Lausanne Switzerland
- Alexander Makaveev
  - Laboratory for Computational Molecular Design, Institute of Chemical Sciences and Engineering, Ecole Polytechnique Fédérale de Lausanne (EPFL) 1015 Lausanne Switzerland
- Clemence Corminboeuf
  - Laboratory for Computational Molecular Design, Institute of Chemical Sciences and Engineering, Ecole Polytechnique Fédérale de Lausanne (EPFL) 1015 Lausanne Switzerland
  - National Center for Competence in Research - Catalysis (NCCR-Catalysis), Ecole Polytechnique Fédérale de Lausanne (EPFL) 1015 Lausanne Switzerland
  - National Center for Computational Design and Discovery of Novel Materials (MARVEL), Ecole Polytechnique Fédérale de Lausanne (EPFL) 1015 Lausanne Switzerland
20
Sakai M, Kaneshige M, Yasuda K. Learning organo-transition metal catalyzed reactions by graph neural networks. J Comput Chem 2024; 45:341-351. [PMID: 37877461 DOI: 10.1002/jcc.27243] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/05/2023] [Revised: 09/21/2023] [Accepted: 10/04/2023] [Indexed: 10/26/2023]
Abstract
Chemical reaction outcome prediction presents a fundamental challenge in synthetic chemistry. Most existing machine learning (ML) approaches focus on chemical reactions of typical elements. We developed a simple ML model focused on organo-transition metal-catalyzed reactions (OMCRs). Instead of overall reactions observed in experiments, we let the ML model learn the sequence of simplified elementary reactions. This drastically reduced the complexity of the model and helped it find common patterns from distinct reactions. We let a graph neural network learn the reactivity index of a pair of atoms. The model was able to learn a wide variety of OMCRs, and the accuracy of reaction prediction reached 97%, even though the model has extremely fewer learnable parameters than other standards. The learned reactivity indices of bonds nicely summarize the knowledge of reactions in the dataset.
Affiliation(s)
- Motoji Sakai
  - Department of Informatics, Nagoya University, Nagoya, Japan
- Koji Yasuda
  - Department of Informatics, Nagoya University, Nagoya, Japan
  - Institute of Materials and Systems for Sustainability, Nagoya University, Nagoya, Japan
21
Nicolle A, Deng S, Ihme M, Kuzhagaliyeva N, Ibrahim EA, Farooq A. Mixtures Recomposition by Neural Nets: A Multidisciplinary Overview. J Chem Inf Model 2024; 64:597-620. [PMID: 38284618 DOI: 10.1021/acs.jcim.3c01633] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/30/2024]
Abstract
Artificial Neural Networks (ANNs) are transforming how we understand chemical mixtures, providing an expressive view of the chemical space and multiscale processes. Their hybridization with physical knowledge can bridge the gap between predictivity and understanding of the underlying processes. This overview explores recent progress in ANNs, particularly their potential in the 'recomposition' of chemical mixtures. Graph-based representations reveal patterns among mixture components, and deep learning models excel in capturing complexity and symmetries when compared to traditional Quantitative Structure-Property Relationship models. Key components, such as Hamiltonian networks and convolution operations, play a central role in representing multiscale mixtures. The integration of ANNs with Chemical Reaction Networks and Physics-Informed Neural Networks for inverse chemical kinetic problems is also examined. The combination of sensors with ANNs shows promise in optical and biomimetic applications. A common ground is identified in the context of statistical physics, where ANN-based methods iteratively adapt their models by blending their initial states with training data. The concept of mixture recomposition unveils a reciprocal inspiration between ANNs and reactive mixtures, highlighting learning behaviors influenced by the training environment.
Affiliation(s)
- Andre Nicolle
  - Aramco Fuel Research Center, Rueil-Malmaison 92852, France
- Sili Deng
  - Massachusetts Institute of Technology, Cambridge 02139, Massachusetts, United States
- Matthias Ihme
  - Stanford University, Stanford 94305, California, United States
- Emad Al Ibrahim
  - King Abdullah University of Science and Technology, Thuwal 23955, Saudi Arabia
- Aamir Farooq
  - King Abdullah University of Science and Technology, Thuwal 23955, Saudi Arabia
22
Voinarovska V, Kabeshov M, Dudenko D, Genheden S, Tetko IV. When Yield Prediction Does Not Yield Prediction: An Overview of the Current Challenges. J Chem Inf Model 2024; 64:42-56. [PMID: 38116926 PMCID: PMC10778086 DOI: 10.1021/acs.jcim.3c01524] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/21/2023] [Revised: 11/29/2023] [Accepted: 11/30/2023] [Indexed: 12/21/2023]
Abstract
Machine Learning (ML) techniques face significant challenges when predicting advanced chemical properties, such as yield, feasibility of chemical synthesis, and optimal reaction conditions. These challenges stem from the high-dimensional nature of the prediction task and the myriad essential variables involved, ranging from reactants and reagents to catalysts, temperature, and purification processes. Successfully developing a reliable predictive model not only holds the potential for optimizing high-throughput experiments but can also elevate existing retrosynthetic predictive approaches and bolster a plethora of applications within the field. In this review, we systematically evaluate the efficacy of current ML methodologies in chemoinformatics, shedding light on their milestones and inherent limitations. Additionally, a detailed examination of a representative case study provides insights into the prevailing issues related to data availability and transferability in the discipline.
Affiliation(s)
- Varvara Voinarovska
  - Molecular AI, Discovery Sciences R&D, AstraZeneca, 431 83 Gothenburg, Sweden
  - TUM Graduate School, Faculty of Chemistry, Technical University of Munich, 85748 Garching, Germany
- Mikhail Kabeshov
  - Molecular AI, Discovery Sciences R&D, AstraZeneca, 431 83 Gothenburg, Sweden
- Dmytro Dudenko
  - Enamine Ltd., 78 Chervonotkatska str., 02094 Kyiv, Ukraine
- Samuel Genheden
  - Molecular AI, Discovery Sciences R&D, AstraZeneca, 431 83 Gothenburg, Sweden
- Igor V. Tetko
  - Molecular Targets and Therapeutics Center, Helmholtz Munich - Deutsches Forschungszentrum für Gesundheit und Umwelt (GmbH), Institute of Structural Biology, 85764 Neuherberg, Germany
23
Sadeghi S, Bateni F, Kim T, Son DY, Bennett JA, Orouji N, Punati VS, Stark C, Cerra TD, Awad R, Delgado-Licona F, Xu J, Mukhin N, Dickerson H, Reyes KG, Abolhasani M. Autonomous nanomanufacturing of lead-free metal halide perovskite nanocrystals using a self-driving fluidic lab. Nanoscale 2024; 16:580-591. [PMID: 38116636 DOI: 10.1039/d3nr05034c] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Indexed: 12/21/2023]
Abstract
Lead-based metal halide perovskite (MHP) nanocrystals (NCs) have emerged as a promising class of semiconducting nanomaterials for a wide range of optoelectronic and photoelectronic applications. However, the intrinsic lead toxicity of MHP NCs has significantly hampered their large-scale device applications. Copper-based MHP NCs with composition-tunable optical properties have emerged as a prominent lead-free MHP NC candidate. However, comprehensive synthesis space exploration, development, and synthesis science studies of copper-based MHP NCs have been limited by the manual nature of flask-based synthesis and characterization methods. In this study, we present an autonomous approach for the development of lead-free MHP NCs via seamless integration of a modular microfluidic platform with machine learning-assisted NC synthesis modeling and experiment selection to establish a self-driving fluidic lab for accelerated NC synthesis science studies. For the first time, a successful and reproducible in-flow synthesis of Cs3Cu2I5 NCs is presented. Autonomous experimentation is then employed for rapid in-flow synthesis science studies of Cs3Cu2I5 NCs. The autonomously generated experimental NC synthesis dataset is then utilized for fast-tracked synthetic route optimization of high-performing Cs3Cu2I5 NCs.
Collapse
Affiliation(s)
- Sina Sadeghi
- Department of Chemical and Biomolecular Engineering, North Carolina State University, Raleigh, NC 27695, USA.
| | - Fazel Bateni
- Department of Chemical and Biomolecular Engineering, North Carolina State University, Raleigh, NC 27695, USA.
| | - Taekhoon Kim
- Synthesis Technical Unit, Material Research Center, Samsung Advanced Institute of Technology, SEC, 130, Samsung-ro, Yeongtong-gu, Suwon-si, Gyeonggi-do, Republic of Korea
| | - Dae Yong Son
- Synthesis Technical Unit, Material Research Center, Samsung Advanced Institute of Technology, SEC, 130, Samsung-ro, Yeongtong-gu, Suwon-si, Gyeonggi-do, Republic of Korea
| | - Jeffrey A Bennett
- Department of Chemical and Biomolecular Engineering, North Carolina State University, Raleigh, NC 27695, USA.
| | - Negin Orouji
- Department of Chemical and Biomolecular Engineering, North Carolina State University, Raleigh, NC 27695, USA.
| | - Venkat S Punati
- Department of Chemical and Biomolecular Engineering, North Carolina State University, Raleigh, NC 27695, USA.
| | - Christine Stark
- Department of Chemical and Biomolecular Engineering, North Carolina State University, Raleigh, NC 27695, USA.
| | - Teagan D Cerra
- Department of Physics, Weber State University, Ogden, UT 84408, USA
| | - Rami Awad
- Department of Chemical and Biomolecular Engineering, North Carolina State University, Raleigh, NC 27695, USA.
| | - Fernando Delgado-Licona
- Department of Chemical and Biomolecular Engineering, North Carolina State University, Raleigh, NC 27695, USA.
| | - Jinge Xu
- Department of Chemical and Biomolecular Engineering, North Carolina State University, Raleigh, NC 27695, USA.
| | - Nikolai Mukhin
- Department of Chemical and Biomolecular Engineering, North Carolina State University, Raleigh, NC 27695, USA.
| | - Hannah Dickerson
- Department of Chemical and Biomolecular Engineering, North Carolina State University, Raleigh, NC 27695, USA.
| | - Kristofer G Reyes
- Department of Materials Design and Innovation, University at Buffalo, Buffalo, NY 14260, USA
| | - Milad Abolhasani
- Department of Chemical and Biomolecular Engineering, North Carolina State University, Raleigh, NC 27695, USA.
| |
|
24
|
Xie Z, Evangelopoulos X, Omar ÖH, Troisi A, Cooper AI, Chen L. Fine-tuning GPT-3 for machine learning electronic and functional properties of organic molecules. Chem Sci 2024; 15:500-510. [PMID: 38179524 PMCID: PMC10762956 DOI: 10.1039/d3sc04610a] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/31/2023] [Accepted: 12/04/2023] [Indexed: 01/06/2024] Open
Abstract
We evaluate the effectiveness of fine-tuning GPT-3 for the prediction of electronic and functional properties of organic molecules. Our findings show that fine-tuned GPT-3 can successfully identify and distinguish between chemically meaningful patterns, and discern subtle differences among them, exhibiting robust predictive performance for the prediction of molecular properties. We focus on assessing the fine-tuned models' resilience to information loss, resulting from the absence of atoms or chemical groups, and to noise that we introduce via random alterations in atomic identities. We discuss the challenges and limitations inherent to the use of GPT-3 in molecular machine-learning tasks and suggest potential directions for future research and improvements to address these issues.
Affiliation(s)
- Zikai Xie
- Leverhulme Research Centre for Functional Materials Design, Materials Innovation Factory and Department of Chemistry, University of Liverpool, Liverpool L7 3NY, UK
| | - Xenophon Evangelopoulos
- Leverhulme Research Centre for Functional Materials Design, Materials Innovation Factory and Department of Chemistry, University of Liverpool, Liverpool L7 3NY, UK
| | - Ömer H Omar
- Department of Chemistry, University of Liverpool, Liverpool L69 3BX, UK
| | - Alessandro Troisi
- Department of Chemistry, University of Liverpool, Liverpool L69 3BX, UK
| | - Andrew I Cooper
- Leverhulme Research Centre for Functional Materials Design, Materials Innovation Factory and Department of Chemistry, University of Liverpool, Liverpool L7 3NY, UK
| | - Linjiang Chen
- School of Chemistry, School of Computer Science, University of Birmingham, Birmingham B15 2TT, UK
| |
|
25
|
Raghavan P, Haas BC, Ruos ME, Schleinitz J, Doyle AG, Reisman SE, Sigman MS, Coley CW. Dataset Design for Building Models of Chemical Reactivity. ACS Cent Sci 2023; 9:2196-2204. [PMID: 38161380 PMCID: PMC10755851 DOI: 10.1021/acscentsci.3c01163] [Citation(s) in RCA: 14] [Impact Index Per Article: 7.0] [Received: 09/20/2023] [Revised: 11/06/2023] [Accepted: 11/15/2023] [Indexed: 01/03/2024]
Abstract
Models can codify our understanding of chemical reactivity and serve a useful purpose in the development of new synthetic processes via, for example, evaluating hypothetical reaction conditions or in silico substrate tolerance. Perhaps the most determining factor is the composition of the training data and whether it is sufficient to train a model that can make accurate predictions over the full domain of interest. Here, we discuss the design of reaction datasets in ways that are conducive to data-driven modeling, emphasizing the idea that training set diversity and model generalizability rely on the choice of molecular or reaction representation. We additionally discuss the experimental constraints associated with generating common types of chemistry datasets and how these considerations should influence dataset design and model building.
Affiliation(s)
- Priyanka Raghavan
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
| | - Brittany C. Haas
- Department of Chemistry, University of Utah, Salt Lake City, Utah 84112, United States
| | - Madeline E. Ruos
- Department of Chemistry & Biochemistry, University of California, Los Angeles, Los Angeles, California 90095, United States
| | - Jules Schleinitz
- Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, California 91125, United States
| | - Abigail G. Doyle
- Department of Chemistry & Biochemistry, University of California, Los Angeles, Los Angeles, California 90095, United States
| | - Sarah E. Reisman
- Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, California 91125, United States
| | - Matthew S. Sigman
- Department of Chemistry, University of Utah, Salt Lake City, Utah 84112, United States
| | - Connor W. Coley
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
- Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
| |
|
26
|
Dolfus U, Briem H, Gutermuth T, Rarey M. Full Modification Control over Retrosynthetic Routes for Guided Optimization of Lead Structures. J Chem Inf Model 2023; 63:6587-6597. [PMID: 37910814 DOI: 10.1021/acs.jcim.3c01155] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/03/2023]
Abstract
Synthesizability is essential for compounds designed in silico. Regardless, synthetic accessibility is often considered only as an afterthought in the design and optimization process. In addition, the trend with modern computer-aided drug design methods is going toward full automation and away from the possibility of incorporating user knowledge. With this work, we present the second major release of our software tool, Synthesia, for synthesis-aware lead structure modification, where the user's expertise is now fully utilized. A provided retrosynthetic route is used as a pathway to guide structural modifications that introduce desired structural changes in the target compound. Moreover, the approach allows the user to define the exact position or component in the retrosynthetic route, which should be modified, further integrating the user's expert knowledge. This paper describes the functionality of Synthesia, its basic concepts, and several application scenarios ranging from simple examples to a comparison of the effects of the different exchange functions to an analysis of a set of bioisosteric linker structures, highlighting potential synthetically feasible replacements.
Affiliation(s)
- Uschi Dolfus
- Universität Hamburg, ZBH - Center for Bioinformatics, Bundesstraße 43, 20146 Hamburg, Germany
| | - Hans Briem
- Bayer AG, Research & Development, Pharmaceuticals, Computational Molecular Design Berlin, Building S110, 711, 13342 Berlin, Germany
| | - Torben Gutermuth
- Universität Hamburg, ZBH - Center for Bioinformatics, Bundesstraße 43, 20146 Hamburg, Germany
| | - Matthias Rarey
- Universität Hamburg, ZBH - Center for Bioinformatics, Bundesstraße 43, 20146 Hamburg, Germany
| |
|
27
|
Schrier J, Norquist AJ, Buonassisi T, Brgoch J. In Pursuit of the Exceptional: Research Directions for Machine Learning in Chemical and Materials Science. J Am Chem Soc 2023; 145:21699-21716. [PMID: 37754929 DOI: 10.1021/jacs.3c04783] [Citation(s) in RCA: 11] [Impact Index Per Article: 5.5] [Indexed: 09/28/2023]
Abstract
Exceptional molecules and materials with one or more extraordinary properties are both technologically valuable and fundamentally interesting, because they often involve new physical phenomena or new compositions that defy expectations. Historically, exceptionality has been achieved through serendipity, but recently, machine learning (ML) and automated experimentation have been widely proposed to accelerate target identification and synthesis planning. In this Perspective, we argue that the data-driven methods commonly used today are well-suited for optimization but not for the realization of new exceptional materials or molecules. Finding such outliers should be possible using ML, but only by shifting away from using traditional ML approaches that tweak the composition, crystal structure, or reaction pathway. We highlight case studies of high-Tc oxide superconductors and superhard materials to demonstrate the challenges of ML-guided discovery and discuss the limitations of automation for this task. We then provide six recommendations for the development of ML methods capable of exceptional materials discovery: (i) Avoid the tyranny of the middle and focus on extrema; (ii) When data are limited, qualitative predictions that provide direction are more valuable than interpolative accuracy; (iii) Sample what can be made and how to make it and defer optimization; (iv) Create room (and look) for the unexpected while pursuing your goal; (v) Try to fill-in-the-blanks of input and output space; (vi) Do not confuse human understanding with model interpretability. We conclude with a description of how these recommendations can be integrated into automated discovery workflows, which should enable the discovery of exceptional molecules and materials.
Affiliation(s)
- Joshua Schrier
- Department of Chemistry, Fordham University, The Bronx, New York 10458, United States
| | - Alexander J Norquist
- Department of Chemistry, Haverford College, Haverford, Pennsylvania 19041, United States
| | - Tonio Buonassisi
- Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
| | - Jakoah Brgoch
- Department of Chemistry and Texas Center for Superconductivity, University of Houston, Houston, Texas 77204, United States
| |
|
28
|
Hagg A, Kirschner KN. Open-Source Machine Learning in Computational Chemistry. J Chem Inf Model 2023; 63:4505-4532. [PMID: 37466636 PMCID: PMC10430767 DOI: 10.1021/acs.jcim.3c00643] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 05/04/2023] [Indexed: 07/20/2023]
Abstract
The field of computational chemistry has seen a significant increase in the integration of machine learning concepts and algorithms. In this Perspective, we surveyed 179 open-source software projects, with corresponding peer-reviewed papers published within the last 5 years, to better understand the topics within the field being investigated by machine learning approaches. For each project, we provide a short description, the link to the code, the accompanying license type, and whether the training data and resulting models are made publicly available. Based on those deposited in GitHub repositories, the most popular employed Python libraries are identified. We hope that this survey will serve as a resource to learn about machine learning or specific architectures thereof by identifying accessible codes with accompanying papers on a topic basis. To this end, we also include computational chemistry open-source software for generating training data and fundamental Python libraries for machine learning. Based on our observations and considering the three pillars of collaborative machine learning work, open data, open source (code), and open models, we provide some suggestions to the community.
Affiliation(s)
- Alexander Hagg
- Institute of Technology, Resource and Energy-Efficient Engineering (TREE), University of Applied Sciences Bonn-Rhein-Sieg, 53757 Sankt Augustin, Germany
- Department of Electrical Engineering, Mechanical Engineering and Technical Journalism, University of Applied Sciences Bonn-Rhein-Sieg, 53757 Sankt Augustin, Germany
| | - Karl N. Kirschner
- Institute of Technology, Resource and Energy-Efficient Engineering (TREE), University of Applied Sciences Bonn-Rhein-Sieg, 53757 Sankt Augustin, Germany
- Department of Computer Science, University of Applied Sciences Bonn-Rhein-Sieg, 53757 Sankt Augustin, Germany
| |
|
29
|
Karl TM, Bouayad-Gervais S, Hueffel JA, Sperger T, Wellig S, Kaldas SJ, Dabranskaya U, Ward JS, Rissanen K, Tizzard GJ, Schoenebeck F. Machine Learning-Guided Development of Trialkylphosphine Ni(I) Dimers and Applications in Site-Selective Catalysis. J Am Chem Soc 2023. [PMID: 37411044 DOI: 10.1021/jacs.3c03403] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Indexed: 07/08/2023]
Abstract
Owing to the unknown correlation of a metal's ligand and its resulting preferred speciation in terms of oxidation state, geometry, and nuclearity, a rational design of multinuclear catalysts remains challenging. With the goal to accelerate the identification of suitable ligands that form trialkylphosphine-derived dihalogen-bridged Ni(I) dimers, we herein employed an assumption-based machine learning approach. The workflow offers guidance in ligand space for a desired speciation without (or only minimal) prior experimental data points. We experimentally verified the predictions and synthesized numerous novel Ni(I) dimers as well as explored their potential in catalysis. We demonstrate C-I selective arylations of polyhalogenated arenes bearing competing C-Br and C-Cl sites in under 5 min at room temperature using 0.2 mol % of the newly developed dimer, [Ni(I)(μ-Br)PAd2(n-Bu)]2, which is so far unmet with alternative dinuclear or mononuclear Ni or Pd catalysts.
Affiliation(s)
- Teresa M Karl
- Institute of Organic Chemistry, RWTH Aachen University, Landoltweg 1, 52074 Aachen, Germany
| | - Samir Bouayad-Gervais
- Institute of Organic Chemistry, RWTH Aachen University, Landoltweg 1, 52074 Aachen, Germany
| | - Julian A Hueffel
- Institute of Organic Chemistry, RWTH Aachen University, Landoltweg 1, 52074 Aachen, Germany
| | - Theresa Sperger
- Institute of Organic Chemistry, RWTH Aachen University, Landoltweg 1, 52074 Aachen, Germany
| | - Sebastian Wellig
- Institute of Organic Chemistry, RWTH Aachen University, Landoltweg 1, 52074 Aachen, Germany
| | - Sherif J Kaldas
- Institute of Organic Chemistry, RWTH Aachen University, Landoltweg 1, 52074 Aachen, Germany
| | | | - Jas S Ward
- Department of Chemistry, University of Jyvaskyla, FIN40014 Jyväskylä, Finland
| | - Kari Rissanen
- Department of Chemistry, University of Jyvaskyla, FIN40014 Jyväskylä, Finland
| | - Graham J Tizzard
- UK National Crystallography Service, School of Chemistry, University of Southampton, SO17 1BJ Southampton, U.K.
| | - Franziska Schoenebeck
- Institute of Organic Chemistry, RWTH Aachen University, Landoltweg 1, 52074 Aachen, Germany
| |
|
30
|
Shim E, Tewari A, Cernak T, Zimmerman PM. Machine Learning Strategies for Reaction Development: Toward the Low-Data Limit. J Chem Inf Model 2023; 63:3659-3668. [PMID: 37312524 PMCID: PMC11163943 DOI: 10.1021/acs.jcim.3c00577] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 06/15/2023]
Abstract
Machine learning models are increasingly being utilized to predict outcomes of organic chemical reactions. A large amount of reaction data is used to train these models, which is in stark contrast to how expert chemists discover and develop new reactions by leveraging information from a small number of relevant transformations. Transfer learning and active learning are two strategies that can operate in low-data situations, which may help fill this gap and promote the use of machine learning for tackling real-world challenges in organic synthesis. This Perspective introduces active and transfer learning and connects these to potential opportunities and directions for further research, especially in the area of prospective development of chemical transformations.
Affiliation(s)
- Eunjae Shim
- Department of Chemistry, University of Michigan, Ann Arbor, Michigan 48109, United States
| | - Ambuj Tewari
- Department of Statistics, University of Michigan, Ann Arbor, Michigan 48109, United States
- Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, Michigan 48109, United States
| | - Tim Cernak
- Department of Chemistry, University of Michigan, Ann Arbor, Michigan 48109, United States
- Department of Medicinal Chemistry, University of Michigan, Ann Arbor, Michigan 48109, United States
| | - Paul M Zimmerman
- Department of Chemistry, University of Michigan, Ann Arbor, Michigan 48109, United States
| |
|
31
|
Pasquini M, Stenta M. LinChemIn: SynGraph-a data model and a toolkit to analyze and compare synthetic routes. J Cheminform 2023; 15:41. [PMID: 37005691 PMCID: PMC10067316 DOI: 10.1186/s13321-023-00714-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/12/2022] [Accepted: 03/20/2023] [Indexed: 04/04/2023] Open
Abstract
BACKGROUND: The increasing amount of chemical reaction data makes traditional ways to navigate its corpus less effective, while the demand for novel approaches and instruments is rising. Recent data science and machine learning techniques support the development of new ways to extract value from the available reaction data. On the one side, Computer-Aided Synthesis Planning tools can predict synthetic routes in a model-driven approach; on the other side, experimental routes can be extracted from the Network of Organic Chemistry, in which reaction data are linked in a network. In this context, the need to combine, compare and analyze synthetic routes generated by different sources arises naturally. RESULTS: Here we present LinChemIn, a Python toolkit that allows chemoinformatics operations on synthetic routes and reaction networks. Wrapping some third-party packages for handling graph arithmetic and chemoinformatics and implementing new data models and functionalities, LinChemIn allows the interconversion between data formats and data models and enables route-level analysis and operations, including route comparison and descriptor calculation. Object-Oriented Design principles inspire the software architecture, and the modules are structured to maximize code reusability and support code testing and refactoring. The code structure should facilitate external contributions, thus encouraging open and collaborative software development. CONCLUSIONS: The current version of LinChemIn allows users to combine synthetic routes generated from various tools and analyze them, and constitutes an open and extensible framework capable of incorporating contributions from the community and fostering scientific discussion. Our roadmap envisages the development of sophisticated metrics for route evaluation, a multi-parameter scoring system, and the implementation of an entire "ecosystem" of functionalities operating on synthetic routes.
LinChemIn is freely available at https://github.com/syngenta/linchemin.
Affiliation(s)
- Marta Pasquini
- Syngenta Crop Protection AG, Schaffhauserstrasse, 4332, Stein, AG, Switzerland.
| | - Marco Stenta
- Syngenta Crop Protection AG, Schaffhauserstrasse, 4332, Stein, AG, Switzerland
| |
|