1
|
Maziarz K, Tripp A, Liu G, Stanley M, Xie S, Gaiński P, Seidl P, Segler MHS. Re-evaluating retrosynthesis algorithms with Syntheseus. Faraday Discuss 2024. [PMID: 39485491 DOI: 10.1039/d4fd00093e] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/03/2024]
Abstract
Automated synthesis planning has recently re-emerged as a research area at the intersection of chemistry and machine learning. Despite the appearance of steady progress, we argue that imperfect benchmarks and inconsistent comparisons mask systematic shortcomings of existing techniques, and unnecessarily hamper progress. To remedy this, we present a synthesis planning library with an extensive benchmarking framework, called SYNTHESEUS, which promotes best practice by default, enabling consistent meaningful evaluation of single-step and multi-step synthesis planning algorithms. We demonstrate the capabilities of SYNTHESEUS by re-evaluating several previous retrosynthesis algorithms, and find that the ranking of state-of-the-art models changes in controlled evaluation experiments. We end with guidance for future works in this area, and call on the community to engage in the discussion on how to improve benchmarks for synthesis planning.
Collapse
|
2
|
Cheng AH, Ser CT, Skreta M, Guzmán-Cordero A, Thiede L, Burger A, Aldossary A, Leong SX, Pablo-García S, Strieth-Kalthoff F, Aspuru-Guzik A. Spiers Memorial Lecture: How to do impactful research in artificial intelligence for chemistry and materials science. Faraday Discuss 2024. [PMID: 39400305 DOI: 10.1039/d4fd00153b] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/15/2024]
Abstract
Machine learning has been pervasively touching many fields of science. Chemistry and materials science are no exception. While machine learning has been making a great impact, it is still not reaching its full potential or maturity. In this perspective, we first outline current applications across a diversity of problems in chemistry. Then, we discuss how machine learning researchers view and approach problems in the field. Finally, we provide our considerations for maximizing impact when researching machine learning for chemistry.
Collapse
Affiliation(s)
- Austin H Cheng
- Department of Chemistry, University of Toronto, Toronto, Ontario M5S 3H6, Canada.
- Department of Computer Science, University of Toronto, Toronto, Ontario M5S 2E4, Canada
- Vector Institute for Artificial Intelligence, Toronto, Ontario M5G 1M1, Canada
| | - Cher Tian Ser
- Department of Chemistry, University of Toronto, Toronto, Ontario M5S 3H6, Canada.
- Department of Computer Science, University of Toronto, Toronto, Ontario M5S 2E4, Canada
- Vector Institute for Artificial Intelligence, Toronto, Ontario M5G 1M1, Canada
| | - Marta Skreta
- Department of Computer Science, University of Toronto, Toronto, Ontario M5S 2E4, Canada
- Vector Institute for Artificial Intelligence, Toronto, Ontario M5G 1M1, Canada
| | - Andrés Guzmán-Cordero
- Vector Institute for Artificial Intelligence, Toronto, Ontario M5G 1M1, Canada
- Tinbergen Institute, University of Amsterdam, Amsterdam, Netherlands
| | - Luca Thiede
- Department of Computer Science, University of Toronto, Toronto, Ontario M5S 2E4, Canada
- Vector Institute for Artificial Intelligence, Toronto, Ontario M5G 1M1, Canada
| | - Andreas Burger
- Department of Computer Science, University of Toronto, Toronto, Ontario M5S 2E4, Canada
- Vector Institute for Artificial Intelligence, Toronto, Ontario M5G 1M1, Canada
| | | | - Shi Xuan Leong
- Department of Chemistry, University of Toronto, Toronto, Ontario M5S 3H6, Canada.
- School of Chemistry, Chemical Engineering and Biotechnology, Nanyang Technological University, Singapore 63737, Singapore
| | | | | | - Alán Aspuru-Guzik
- Department of Chemistry, University of Toronto, Toronto, Ontario M5S 3H6, Canada.
- Department of Computer Science, University of Toronto, Toronto, Ontario M5S 2E4, Canada
- Vector Institute for Artificial Intelligence, Toronto, Ontario M5G 1M1, Canada
- Acceleration Consortium, Toronto, Ontario M5G 1X6, Canada
- Department of Chemical Engineering and Applied Chemistry, University of Toronto, Canada
- Department of Materials Science and Engineering, University of Toronto, Canada
- Lebovic Fellow, Canadian Institute for Advanced Research (CIFAR), Canada
| |
Collapse
|
3
|
Chen LY, Li YP. Machine learning-guided strategies for reaction conditions design and optimization. Beilstein J Org Chem 2024; 20:2476-2492. [PMID: 39376489 PMCID: PMC11457048 DOI: 10.3762/bjoc.20.212] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/28/2024] [Accepted: 09/19/2024] [Indexed: 10/09/2024] Open
Abstract
This review surveys the recent advances and challenges in predicting and optimizing reaction conditions using machine learning techniques. The paper emphasizes the importance of acquiring and processing large and diverse datasets of chemical reactions, and the use of both global and local models to guide the design of synthetic processes. Global models exploit the information from comprehensive databases to suggest general reaction conditions for new reactions, while local models fine-tune the specific parameters for a given reaction family to improve yield and selectivity. The paper also identifies the current limitations and opportunities in this field, such as the data quality and availability, and the integration of high-throughput experimentation. The paper demonstrates how the combination of chemical engineering, data science, and ML algorithms can enhance the efficiency and effectiveness of reaction conditions design, and enable novel discoveries in synthetic chemistry.
Collapse
Affiliation(s)
- Lung-Yi Chen
- Department of Chemical Engineering, National Taiwan University, No. 1, Sec. 4, Roosevelt Road, Taipei 10617, Taiwan
| | - Yi-Pei Li
- Department of Chemical Engineering, National Taiwan University, No. 1, Sec. 4, Roosevelt Road, Taipei 10617, Taiwan
- Taiwan International Graduate Program on Sustainable Chemical Science and Technology (TIGP-SCST), No. 128, Sec. 2, Academia Road, Taipei 11529, Taiwan
| |
Collapse
|
4
|
Han Y, Deng M, Liu K, Chen J, Wang Y, Xu YN, Dian L. Computer-Aided Synthesis Planning (CASP) and Machine Learning: Optimizing Chemical Reaction Conditions. Chemistry 2024; 30:e202401626. [PMID: 39083362 DOI: 10.1002/chem.202401626] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/26/2024] [Revised: 07/27/2024] [Accepted: 07/28/2024] [Indexed: 08/02/2024]
Abstract
Computer-aided synthesis planning (CASP) has garnered increasing attention in light of recent advancements in machine learning models. While the focus is on reverse synthesis or forward outcome prediction, optimizing reaction conditions remains a significant challenge. For datasets with multiple variables, the choice of descriptors and models is pivotal. This selection dictates the effective extraction of conditional features and the achievement of higher prediction accuracy. This review delineates the origins of data in conditional optimization, the criteria for descriptor selection, the response models, and the metrics for outcome evaluation, aiming to acquaint readers with the latest research trends and facilitate more informed research in this domain.
Collapse
Affiliation(s)
- Yu Han
- State Key Laboratory of Microbial Technology, Institute of Microbial Technology, Shandong University, No. 72 Binhai Avenue, Qingdao, 266237, P. R. China
| | - Mingjing Deng
- State Key Laboratory of Microbial Technology, Institute of Microbial Technology, Shandong University, No. 72 Binhai Avenue, Qingdao, 266237, P. R. China
| | - Ke Liu
- State Key Laboratory of Microbial Technology, Institute of Microbial Technology, Shandong University, No. 72 Binhai Avenue, Qingdao, 266237, P. R. China
| | - Jia Chen
- State Key Laboratory of Microbial Technology, Institute of Microbial Technology, Shandong University, No. 72 Binhai Avenue, Qingdao, 266237, P. R. China
| | - Yuting Wang
- State Key Laboratory of Microbial Technology, Institute of Microbial Technology, Shandong University, No. 72 Binhai Avenue, Qingdao, 266237, P. R. China
| | - Yu-Ning Xu
- State Key Laboratory of Microbial Technology, Institute of Microbial Technology, Shandong University, No. 72 Binhai Avenue, Qingdao, 266237, P. R. China
| | - Longyang Dian
- State Key Laboratory of Microbial Technology, Institute of Microbial Technology, Shandong University, No. 72 Binhai Avenue, Qingdao, 266237, P. R. China
- Suzhou Institute of Shandong University, No. 388 Ruoshui Road, Suzhou Industrial Park, Suzhou, 215123, P. R. China
| |
Collapse
|
5
|
Kevlishvili I, St Michel RG, Garrison AG, Toney JW, Adamji H, Jia H, Román-Leshkov Y, Kulik HJ. Leveraging natural language processing to curate the tmCAT, tmPHOTO, tmBIO, and tmSCO datasets of functional transition metal complexes. Faraday Discuss 2024. [PMID: 39301698 DOI: 10.1039/d4fd00087k] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 09/22/2024]
Abstract
The breadth of transition metal chemical space covered by databases such as the Cambridge Structural Database and the derived computational database tmQM is not conducive to application-specific modeling and the development of structure-property relationships. Here, we employ both supervised and unsupervised natural language processing (NLP) techniques to link experimentally synthesized compounds in the tmQM database to their respective applications. Leveraging NLP models, we curate four distinct datasets: tmCAT for catalysis, tmPHOTO for photophysical activity, tmBIO for biological relevance, and tmSCO for magnetism. Analyzing the chemical substructures within each dataset reveals common chemical motifs in each of the designated applications. We then use these common chemical structures to augment our initial datasets for each application, yielding a total of 21 631 compounds in tmCAT, 4599 in tmPHOTO, 2782 in tmBIO, and 983 in tmSCO. These datasets are expected to accelerate the more targeted computational screening and development of refined structure-property relationships with machine learning.
Collapse
Affiliation(s)
- Ilia Kevlishvili
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA.
| | - Roland G St Michel
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA.
- Department of Materials Science and Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
| | - Aaron G Garrison
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA.
| | - Jacob W Toney
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA.
| | - Husain Adamji
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA.
| | - Haojun Jia
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA.
- Department of Chemistry, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
| | - Yuriy Román-Leshkov
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA.
- Department of Chemistry, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
| | - Heather J Kulik
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA.
- Department of Chemistry, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
| |
Collapse
|
6
|
Ai Q, Meng F, Shi J, Pelkie B, Coley CW. Extracting structured data from organic synthesis procedures using a fine-tuned large language model. DIGITAL DISCOVERY 2024; 3:1822-1831. [PMID: 39157760 PMCID: PMC11322921 DOI: 10.1039/d4dd00091a] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 04/06/2024] [Accepted: 07/30/2024] [Indexed: 08/20/2024]
Abstract
The popularity of data-driven approaches and machine learning (ML) techniques in the field of organic chemistry and its various subfields has increased the value of structured reaction data. Most data in chemistry is represented by unstructured text, and despite the vastness of the organic chemistry literature (papers, patents), manual conversion from unstructured text to structured data remains a largely manual endeavor. Software tools for this task would facilitate downstream applications such as reaction prediction and condition recommendation. In this study, we fine-tune a large language model (LLM) to extract reaction information from organic synthesis procedure text into structured data following the Open Reaction Database (ORD) schema, a comprehensive data structure designed for organic reactions. The fine-tuned model produces syntactically correct ORD records with an average accuracy of 91.25% for ORD "messages" (e.g., full compound, workups, or condition definitions) and 92.25% for individual data fields (e.g., compound identifiers, mass quantities), with the ability to recognize compound-referencing tokens and to infer reaction roles. We investigate its failure modes and evaluate performance on specific subtasks such as reaction role classification.
Collapse
Affiliation(s)
- Qianxiang Ai
- Department of Chemical Engineering, Massachusetts Institute of Technology Cambridge MA USA
| | - Fanwang Meng
- Department of Chemical Engineering, Massachusetts Institute of Technology Cambridge MA USA
| | - Jiale Shi
- Department of Chemical Engineering, Massachusetts Institute of Technology Cambridge MA USA
| | - Brenden Pelkie
- Department of Chemical Engineering, University of Washington Seattle WA USA
| | - Connor W Coley
- Department of Chemical Engineering, Massachusetts Institute of Technology Cambridge MA USA
| |
Collapse
|
7
|
Schmid SP, Schlosser L, Glorius F, Jorner K. Catalysing (organo-)catalysis: Trends in the application of machine learning to enantioselective organocatalysis. Beilstein J Org Chem 2024; 20:2280-2304. [PMID: 39290209 PMCID: PMC11406055 DOI: 10.3762/bjoc.20.196] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/09/2024] [Accepted: 08/09/2024] [Indexed: 09/19/2024] Open
Abstract
Organocatalysis has established itself as a third pillar of homogeneous catalysis, besides transition metal catalysis and biocatalysis, as its use for enantioselective reactions has gathered significant interest over the last decades. Concurrent to this development, machine learning (ML) has been increasingly applied in the chemical domain to efficiently uncover hidden patterns in data and accelerate scientific discovery. While the uptake of ML in organocatalysis has been comparably slow, the last two decades have showed an increased interest from the community. This review gives an overview of the work in the field of ML in organocatalysis. The review starts by giving a short primer on ML for experimental chemists, before discussing its application for predicting the selectivity of organocatalytic transformations. Subsequently, we review ML employed for privileged catalysts, before focusing on its application for catalyst and reaction design. Concluding, we give our view on current challenges and future directions for this field, drawing inspiration from the application of ML to other scientific domains.
Collapse
Affiliation(s)
- Stefan P Schmid
- Institute of Chemical and Bioengineering, Department of Chemistry and Applied Biosciences, ETH Zurich, Zurich CH-8093, Switzerland
| | - Leon Schlosser
- Organisch-Chemisches Institut, Universität Münster, 48149 Münster, Germany
| | - Frank Glorius
- Organisch-Chemisches Institut, Universität Münster, 48149 Münster, Germany
| | - Kjell Jorner
- Institute of Chemical and Bioengineering, Department of Chemistry and Applied Biosciences, ETH Zurich, Zurich CH-8093, Switzerland
- National Centre of Competence in Research (NCCR) Catalysis, ETH Zurich, Zurich CH-8093, Switzerland
| |
Collapse
|
8
|
Zhang X, Li Y, Li C, Zhu J, Gan Z, Wang L, Sun X, You H. A chemical reaction entity recognition method based on a natural language data augmentation strategy. Chem Commun (Camb) 2024; 60:9610-9613. [PMID: 39148332 DOI: 10.1039/d4cc01471e] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 08/17/2024]
Abstract
Impressive applications of artificial intelligence in the field of chemical reaction prediction heavily depend on abundant reliable datasets. The automated extraction of reaction procedures to build structured chemical databases is of growing importance. Here, we propose a novel model named DACRER for large-scale reaction extraction, in which transfer learning and a data augmentation strategy were employed. This model was evaluated for chemical datasets and shows good performance in identifying and processing chemical texts.
Collapse
Affiliation(s)
- Xiaowen Zhang
- School of Science, Harbin Institute of Technology (Shenzhen), Shenzhen 518055, Guangdong, China.
| | - Yang Li
- School of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230601, Anhui, China
| | - Chaoyi Li
- School of Science, Harbin Institute of Technology (Shenzhen), Shenzhen 518055, Guangdong, China.
| | - Jingyuan Zhu
- School of Science, Harbin Institute of Technology (Shenzhen), Shenzhen 518055, Guangdong, China.
| | - Zhiqiang Gan
- School of Science, Harbin Institute of Technology (Shenzhen), Shenzhen 518055, Guangdong, China.
| | - Lei Wang
- School of Information Science and Engineering, Zaozhuang University, Zaozhuang 277160, Shandong, China
| | - Xiaofei Sun
- School of Information Science and Engineering, Zaozhuang University, Zaozhuang 277160, Shandong, China
| | - Hengzhi You
- School of Science, Harbin Institute of Technology (Shenzhen), Shenzhen 518055, Guangdong, China.
- Green Pharmaceutical Engineering Research Center, Harbin Institute of Technology (Shenzhen), Shenzhen 518055, Guangdong, China
| |
Collapse
|
9
|
Meza-González B, Ramírez-Palma DI, Carpio-Martínez P, Vázquez-Cuevas D, Martínez-Mayorga K, Cortés-Guzmán F. Quantum Topological Atomic Properties of 44K molecules. Sci Data 2024; 11:945. [PMID: 39209874 PMCID: PMC11362522 DOI: 10.1038/s41597-024-03723-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/13/2024] [Accepted: 07/29/2024] [Indexed: 09/04/2024] Open
Abstract
We present a data set of quantum topological properties of atoms of 44K randomly selected molecules from the GDB-9 data set. These atomic properties were obtained as defined by the quantum theory of atoms in molecules (QTAIM) within an atomic basin, a region of real space bounded by zero-flux surfaces in the electron density gradient vector field. The wave function files were generated through DFT static calculations (B3LYP/6-31G), and the atomic properties were calculated using QTAIM. The calculated atomic properties include the energy of the atomic basin, the electronic population, the magnitude of the total dipole moment, and the magnitude of the total quadrupole moment. The atomic properties allow one to understand the chemical structure, reactivity, and molecular recognition. They can be incorporated into force fields for molecular dynamics or for predicting reactive sites. We believe that this data set could facilitate new studies in chemical informatics, machine learning applied to chemistry, and computational molecular design.
Collapse
Affiliation(s)
- Brandon Meza-González
- Facultad de Química, Universidad Nacional Autónoma de México, Ciudad de Méxinclude thexico, Mexico City, Mexico
| | - David I Ramírez-Palma
- Instituto de Química, Unidad Mérida, Universidad Nacional Autónoma de México, Mérida, Yucatán, Mexico
| | - Pablo Carpio-Martínez
- Centro Conjunto de Investigación en Química Sustentable UAEM-UNAM, Carretera Toluca-Atlacomulco, km. 14.5, Toluca, Estado de México, C.P. 50200, Mexico
| | - David Vázquez-Cuevas
- Instituto de Química, Unidad Mérida, Universidad Nacional Autónoma de México, Mérida, Yucatán, Mexico
| | - Karina Martínez-Mayorga
- Instituto de Química, Unidad Mérida, Universidad Nacional Autónoma de México, Mérida, Yucatán, Mexico
| | - Fernando Cortés-Guzmán
- Facultad de Química, Universidad Nacional Autónoma de México, Ciudad de Méxinclude thexico, Mexico City, Mexico.
| |
Collapse
|
10
|
Tom G, Schmid SP, Baird SG, Cao Y, Darvish K, Hao H, Lo S, Pablo-García S, Rajaonson EM, Skreta M, Yoshikawa N, Corapi S, Akkoc GD, Strieth-Kalthoff F, Seifrid M, Aspuru-Guzik A. Self-Driving Laboratories for Chemistry and Materials Science. Chem Rev 2024; 124:9633-9732. [PMID: 39137296 PMCID: PMC11363023 DOI: 10.1021/acs.chemrev.4c00055] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 08/15/2024]
Abstract
Self-driving laboratories (SDLs) promise an accelerated application of the scientific method. Through the automation of experimental workflows, along with autonomous experimental planning, SDLs hold the potential to greatly accelerate research in chemistry and materials discovery. This review provides an in-depth analysis of the state-of-the-art in SDL technology, its applications across various scientific disciplines, and the potential implications for research and industry. This review additionally provides an overview of the enabling technologies for SDLs, including their hardware, software, and integration with laboratory infrastructure. Most importantly, this review explores the diverse range of scientific domains where SDLs have made significant contributions, from drug discovery and materials science to genomics and chemistry. We provide a comprehensive review of existing real-world examples of SDLs, their different levels of automation, and the challenges and limitations associated with each domain.
Collapse
Affiliation(s)
- Gary Tom
- Department
of Chemistry, University of Toronto, 80 St. George St, Toronto, Ontario M5S 3H6, Canada
- Department
of Computer Science, University of Toronto, 40 St. George St, Toronto, Ontario M5S 2E4, Canada
- Vector Institute
for Artificial Intelligence, 661 University Ave Suite 710, Toronto, Ontario M5G 1M1, Canada
| | - Stefan P. Schmid
- Department
of Chemistry and Applied Biosciences, ETH
Zurich, Vladimir-Prelog-Weg 1, CH-8093 Zurich, Switzerland
| | - Sterling G. Baird
- Acceleration
Consortium, 80 St. George
St, Toronto, Ontario M5S 3H6, Canada
| | - Yang Cao
- Department
of Chemistry, University of Toronto, 80 St. George St, Toronto, Ontario M5S 3H6, Canada
- Department
of Computer Science, University of Toronto, 40 St. George St, Toronto, Ontario M5S 2E4, Canada
- Acceleration
Consortium, 80 St. George
St, Toronto, Ontario M5S 3H6, Canada
| | - Kourosh Darvish
- Department
of Computer Science, University of Toronto, 40 St. George St, Toronto, Ontario M5S 2E4, Canada
- Vector Institute
for Artificial Intelligence, 661 University Ave Suite 710, Toronto, Ontario M5G 1M1, Canada
- Acceleration
Consortium, 80 St. George
St, Toronto, Ontario M5S 3H6, Canada
| | - Han Hao
- Department
of Chemistry, University of Toronto, 80 St. George St, Toronto, Ontario M5S 3H6, Canada
- Department
of Computer Science, University of Toronto, 40 St. George St, Toronto, Ontario M5S 2E4, Canada
- Acceleration
Consortium, 80 St. George
St, Toronto, Ontario M5S 3H6, Canada
| | - Stanley Lo
- Department
of Chemistry, University of Toronto, 80 St. George St, Toronto, Ontario M5S 3H6, Canada
| | - Sergio Pablo-García
- Department
of Chemistry, University of Toronto, 80 St. George St, Toronto, Ontario M5S 3H6, Canada
- Department
of Computer Science, University of Toronto, 40 St. George St, Toronto, Ontario M5S 2E4, Canada
| | - Ella M. Rajaonson
- Department
of Chemistry, University of Toronto, 80 St. George St, Toronto, Ontario M5S 3H6, Canada
- Vector Institute
for Artificial Intelligence, 661 University Ave Suite 710, Toronto, Ontario M5G 1M1, Canada
| | - Marta Skreta
- Department
of Computer Science, University of Toronto, 40 St. George St, Toronto, Ontario M5S 2E4, Canada
- Vector Institute
for Artificial Intelligence, 661 University Ave Suite 710, Toronto, Ontario M5G 1M1, Canada
| | - Naruki Yoshikawa
- Department
of Computer Science, University of Toronto, 40 St. George St, Toronto, Ontario M5S 2E4, Canada
- Vector Institute
for Artificial Intelligence, 661 University Ave Suite 710, Toronto, Ontario M5G 1M1, Canada
| | - Samantha Corapi
- Department
of Chemistry, University of Toronto, 80 St. George St, Toronto, Ontario M5S 3H6, Canada
| | - Gun Deniz Akkoc
- Forschungszentrum
Jülich GmbH, Helmholtz Institute
for Renewable Energy Erlangen-Nürnberg, Cauerstr. 1, 91058 Erlangen, Germany
- Department
of Chemical and Biological Engineering, Friedrich-Alexander Universität Erlangen-Nürnberg, Egerlandstr. 3, 91058 Erlangen, Germany
| | - Felix Strieth-Kalthoff
- Department
of Chemistry, University of Toronto, 80 St. George St, Toronto, Ontario M5S 3H6, Canada
- Department
of Computer Science, University of Toronto, 40 St. George St, Toronto, Ontario M5S 2E4, Canada
- School of
Mathematics and Natural Sciences, University
of Wuppertal, Gaußstraße
20, 42119 Wuppertal, Germany
| | - Martin Seifrid
- Department
of Chemistry, University of Toronto, 80 St. George St, Toronto, Ontario M5S 3H6, Canada
- Department
of Computer Science, University of Toronto, 40 St. George St, Toronto, Ontario M5S 2E4, Canada
- Department
of Materials Science and Engineering, North
Carolina State University, Raleigh, North Carolina 27695, United States of America
| | - Alán Aspuru-Guzik
- Department
of Chemistry, University of Toronto, 80 St. George St, Toronto, Ontario M5S 3H6, Canada
- Department
of Computer Science, University of Toronto, 40 St. George St, Toronto, Ontario M5S 2E4, Canada
- Vector Institute
for Artificial Intelligence, 661 University Ave Suite 710, Toronto, Ontario M5G 1M1, Canada
- Acceleration
Consortium, 80 St. George
St, Toronto, Ontario M5S 3H6, Canada
- Department
of Chemical Engineering & Applied Chemistry, University of Toronto, Toronto, Ontario M5S 3E5, Canada
- Department
of Materials Science & Engineering, University of Toronto, Toronto, Ontario M5S 3E4, Canada
- Lebovic
Fellow, Canadian Institute for Advanced
Research (CIFAR), 661
University Ave, Toronto, Ontario M5G 1M1, Canada
| |
Collapse
|
11
|
Gricourt G, Meyer P, Duigou T, Faulon JL. Artificial Intelligence Methods and Models for Retro-Biosynthesis: A Scoping Review. ACS Synth Biol 2024; 13:2276-2294. [PMID: 39047143 PMCID: PMC11334239 DOI: 10.1021/acssynbio.4c00091] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/09/2024] [Revised: 06/14/2024] [Accepted: 06/14/2024] [Indexed: 07/27/2024]
Abstract
Retrosynthesis aims to efficiently plan the synthesis of desirable chemicals by strategically breaking down molecules into readily available building block compounds. Having a long history in chemistry, retro-biosynthesis has also been used in the fields of biocatalysis and synthetic biology. Artificial intelligence (AI) is driving us toward new frontiers in synthesis planning and the exploration of chemical spaces, arriving at an opportune moment for promoting bioproduction that would better align with green chemistry, enhancing environmental practices. In this review, we summarize the recent advancements in the application of AI methods and models for retrosynthetic and retro-biosynthetic pathway design. These techniques can be based either on reaction templates or generative models and require scoring functions and planning strategies to navigate through the retrosynthetic graph of possibilities. We finally discuss limitations and promising research directions in this field.
Collapse
Affiliation(s)
- Guillaume Gricourt
- Université
Paris-Saclay, INRAE, AgroParisTech, Micalis
Institute, 78350 Jouy-en-Josas, France
| | - Philippe Meyer
- Université
Paris-Saclay, INRAE, AgroParisTech, Micalis
Institute, 78350 Jouy-en-Josas, France
| | - Thomas Duigou
- Université
Paris-Saclay, INRAE, AgroParisTech, Micalis
Institute, 78350 Jouy-en-Josas, France
| | - Jean-Loup Faulon
- Université
Paris-Saclay, INRAE, AgroParisTech, Micalis
Institute, 78350 Jouy-en-Josas, France
- The
University of Manchester, Manchester Institute
of Biotechnology, Manchester M1 7DN, U.K.
| |
Collapse
|
12
|
Wiest O, Bauer C, Helquist P, Norrby PO, Genheden S. Finding Relevant Retrosynthetic Disconnections for Stereocontrolled Reactions. J Chem Inf Model 2024; 64:5796-5805. [PMID: 38995078 DOI: 10.1021/acs.jcim.4c00370] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 07/13/2024]
Abstract
Machine learning-driven computer-aided synthesis planning (CASP) tools have become important tools for idea generation in the design of complex molecule synthesis but do not adequately address the stereochemical features of the target compounds. A novel approach to automated extraction of templates used in CASP that includes stereochemical information included in the US Patent and Trademark Office (USPTO) and an internal AstraZeneca database containing reactions from Reaxys, Pistachio, and AstraZeneca electronic lab notebooks is implemented in the freely available AiZynthFinder software. Three hundred sixty-seven templates covering reagent- and substrate-controlled as well as stereospecific reactions were extracted from the USPTO, while 20,724 templates were from the AstraZeneca database. The performance of these templates in multistep CASP is evaluated for 936 targets from the ChEMBL database and an in-house selection of 791 AZ designs. The potential and limitations are discussed for four case studies from ChEMBL and examples of FDA-approved drugs.
Collapse
Affiliation(s)
- Olaf Wiest
- Department of Chemistry and Biochemistry, University of Notre Dame, Notre Dame, Indiana 46556, United States
| | - Christoph Bauer
- Data Science and Modelling, Pharmaceutical Sciences, R&D, AstraZeneca, Gothenburg, Pepparedsleden 1, SE-431 83 Mölndal, Sweden
| | - Paul Helquist
- Department of Chemistry and Biochemistry, University of Notre Dame, Notre Dame, Indiana 46556, United States
| | - Per-Ola Norrby
- Data Science and Modelling, Pharmaceutical Sciences, R&D, AstraZeneca, Gothenburg, Pepparedsleden 1, SE-431 83 Mölndal, Sweden
| | - Samuel Genheden
- Molecular AI, Discovery Sciences, R&D, AstraZeneca, Gothenburg, Pepparedsleden 1, SE-431 83 Mölndal, Sweden
| |
Collapse
|
13
|
Atz K, Nippa DF, Müller AT, Jost V, Anelli A, Reutlinger M, Kramer C, Martin RE, Grether U, Schneider G, Wuitschik G. Geometric deep learning-guided Suzuki reaction conditions assessment for applications in medicinal chemistry. RSC Med Chem 2024; 15:2310-2321. [PMID: 39026644 PMCID: PMC11253849 DOI: 10.1039/d4md00196f] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/21/2024] [Accepted: 05/25/2024] [Indexed: 07/20/2024] Open
Abstract
Suzuki cross-coupling reactions are considered a valuable tool for constructing carbon-carbon bonds in small molecule drug discovery. However, the synthesis of chemical matter often represents a time-consuming and labour-intensive bottleneck. We demonstrate how machine learning methods trained on high-throughput experimentation (HTE) data can be leveraged to enable fast reaction condition selection for novel coupling partners. We show that the trained models support chemists in determining suitable catalyst-solvent-base combinations for individual transformations including an evaluation of the need for HTE screening. We introduce an algorithm for designing 96-well plates optimized towards reaction yields and discuss the model performance of zero- and few-shot machine learning. The best-performing machine learning model achieved a three-category classification accuracy of 76.3% (±0.2%) and an F 1-score for a binary classification of 79.1% (±0.9%). Validation on eight reactions revealed a receiver operating characteristic (ROC) curve (AUC) value of 0.82 (±0.07) for few-shot machine learning. On the other hand, zero-shot machine learning models achieved a mean ROC-AUC value of 0.63 (±0.16). This study positively advocates the application of few-shot machine learning-guided reaction condition selection for HTE campaigns in medicinal chemistry and highlights practical applications as well as challenges associated with zero-shot machine learning.
Collapse
Affiliation(s)
- Kenneth Atz
- Roche Pharma Research and Early Development (pRED), Roche Innovation Center Basel, F. Hoffmann-La Roche Ltd. Grenzacherstrasse 124 4070 Basel Switzerland
| | - David F Nippa
- Roche Pharma Research and Early Development (pRED), Roche Innovation Center Basel, F. Hoffmann-La Roche Ltd. Grenzacherstrasse 124 4070 Basel Switzerland
| | - Alex T Müller
- Roche Pharma Research and Early Development (pRED), Roche Innovation Center Basel, F. Hoffmann-La Roche Ltd. Grenzacherstrasse 124 4070 Basel Switzerland
| | - Vera Jost
- Roche Pharma Research and Early Development (pRED), Roche Innovation Center Basel, F. Hoffmann-La Roche Ltd. Grenzacherstrasse 124 4070 Basel Switzerland
| | - Andrea Anelli
- Roche Pharma Research and Early Development (pRED), Roche Innovation Center Basel, F. Hoffmann-La Roche Ltd. Grenzacherstrasse 124 4070 Basel Switzerland
| | - Michael Reutlinger
- Roche Pharma Research and Early Development (pRED), Roche Innovation Center Basel, F. Hoffmann-La Roche Ltd. Grenzacherstrasse 124 4070 Basel Switzerland
| | - Christian Kramer
- Roche Pharma Research and Early Development (pRED), Roche Innovation Center Basel, F. Hoffmann-La Roche Ltd. Grenzacherstrasse 124 4070 Basel Switzerland
| | - Rainer E Martin
- Roche Pharma Research and Early Development (pRED), Roche Innovation Center Basel, F. Hoffmann-La Roche Ltd. Grenzacherstrasse 124 4070 Basel Switzerland
| | - Uwe Grether
- Roche Pharma Research and Early Development (pRED), Roche Innovation Center Basel, F. Hoffmann-La Roche Ltd. Grenzacherstrasse 124 4070 Basel Switzerland
| | - Gisbert Schneider
- Department of Chemistry and Applied Biosciences, ETH Zurich Vladimir-Prelog-Weg 4 8093 Zurich Switzerland
| | - Georg Wuitschik
- Roche Pharma Research and Early Development (pRED), Roche Innovation Center Basel, F. Hoffmann-La Roche Ltd. Grenzacherstrasse 124 4070 Basel Switzerland
| |
Collapse
|
14
|
Zhang W, Wang Q, Kong X, Xiong J, Ni S, Cao D, Niu B, Chen M, Li Y, Zhang R, Wang Y, Zhang L, Li X, Xiong Z, Shi Q, Huang Z, Fu Z, Zheng M. Fine-tuning large language models for chemical text mining. Chem Sci 2024; 15:10600-10611. [PMID: 38994403 PMCID: PMC11234886 DOI: 10.1039/d4sc00924j] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/06/2024] [Accepted: 06/02/2024] [Indexed: 07/13/2024] Open
Abstract
Extracting knowledge from complex and diverse chemical texts is a pivotal task for both experimental and computational chemists. The task is still considered to be extremely challenging due to the complexity of the chemical language and scientific literature. This study explored the power of fine-tuned large language models (LLMs) on five intricate chemical text mining tasks: compound entity recognition, reaction role labelling, metal-organic framework (MOF) synthesis information extraction, nuclear magnetic resonance spectroscopy (NMR) data extraction, and the conversion of reaction paragraphs to action sequences. The fine-tuned LLMs demonstrated impressive performance, significantly reducing the need for repetitive and extensive prompt engineering experiments. For comparison, we guided ChatGPT (GPT-3.5-turbo) and GPT-4 with prompt engineering and fine-tuned GPT-3.5-turbo as well as other open-source LLMs such as Mistral, Llama3, Llama2, T5, and BART. The results showed that the fine-tuned ChatGPT models excelled in all tasks. They achieved exact accuracy levels ranging from 69% to 95% on these tasks with minimal annotated data. They even outperformed those task-adaptive pre-training and fine-tuning models that were based on a significantly larger amount of in-domain data. Notably, fine-tuned Mistral and Llama3 show competitive abilities. Given their versatility, robustness, and low-code capability, leveraging fine-tuned LLMs as flexible and effective toolkits for automated data acquisition could revolutionize chemical knowledge extraction.
Collapse
Affiliation(s)
- Wei Zhang
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences 555 Zuchongzhi Road Shanghai 201203 China
- University of Chinese Academy of Sciences No. 19A Yuquan Road Beijing 100049 China
| | - Qinggong Wang
- Nanjing University of Chinese Medicine 138 Xianlin Road Nanjing 210023 China
| | - Xiangtai Kong
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences 555 Zuchongzhi Road Shanghai 201203 China
- University of Chinese Academy of Sciences No. 19A Yuquan Road Beijing 100049 China
| | - Jiacheng Xiong
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences 555 Zuchongzhi Road Shanghai 201203 China
- University of Chinese Academy of Sciences No. 19A Yuquan Road Beijing 100049 China
| | - Shengkun Ni
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences 555 Zuchongzhi Road Shanghai 201203 China
- University of Chinese Academy of Sciences No. 19A Yuquan Road Beijing 100049 China
| | - Duanhua Cao
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences 555 Zuchongzhi Road Shanghai 201203 China
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University Hangzhou Zhejiang 310058 China
| | - Buying Niu
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences 555 Zuchongzhi Road Shanghai 201203 China
- University of Chinese Academy of Sciences No. 19A Yuquan Road Beijing 100049 China
| | - Mingan Chen
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences 555 Zuchongzhi Road Shanghai 201203 China
- School of Physical Science and Technology, ShanghaiTech University Shanghai 201210 China
- Lingang Laboratory Shanghai 200031 China
| | - Yameng Li
- ProtonUnfold Technology Co., Ltd Suzhou China
| | - Runze Zhang
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences 555 Zuchongzhi Road Shanghai 201203 China
- University of Chinese Academy of Sciences No. 19A Yuquan Road Beijing 100049 China
| | - Yitian Wang
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences 555 Zuchongzhi Road Shanghai 201203 China
- University of Chinese Academy of Sciences No. 19A Yuquan Road Beijing 100049 China
| | - Lehan Zhang
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences 555 Zuchongzhi Road Shanghai 201203 China
- University of Chinese Academy of Sciences No. 19A Yuquan Road Beijing 100049 China
| | - Xutong Li
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences 555 Zuchongzhi Road Shanghai 201203 China
- University of Chinese Academy of Sciences No. 19A Yuquan Road Beijing 100049 China
| | | | - Qian Shi
- Lingang Laboratory Shanghai 200031 China
| | - Ziming Huang
- Medizinische Klinik und Poliklinik I, Klinikum der Universität München, Ludwig-Maximilians-Universität Munich Germany
| | - Zunyun Fu
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences 555 Zuchongzhi Road Shanghai 201203 China
| | - Mingyue Zheng
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences 555 Zuchongzhi Road Shanghai 201203 China
- University of Chinese Academy of Sciences No. 19A Yuquan Road Beijing 100049 China
- Nanjing University of Chinese Medicine 138 Xianlin Road Nanjing 210023 China
| |
Collapse
|
15
|
Kalikadien AV, Mirza A, Hossaini AN, Sreenithya A, Pidko EA. Paving the road towards automated homogeneous catalyst design. Chempluschem 2024; 89:e202300702. [PMID: 38279609 DOI: 10.1002/cplu.202300702] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/29/2023] [Revised: 12/20/2023] [Indexed: 01/28/2024]
Abstract
In the past decade, computational tools have become integral to catalyst design. They continue to offer significant support to experimental organic synthesis and catalysis researchers aiming for optimal reaction outcomes. More recently, data-driven approaches utilizing machine learning have garnered considerable attention for their expansive capabilities. This Perspective provides an overview of diverse initiatives in the realm of computational catalyst design and introduces our automated tools tailored for high-throughput in silico exploration of the chemical space. While valuable insights are gained through methods for high-throughput in silico exploration and analysis of chemical space, their degree of automation and modularity are key. We argue that the integration of data-driven, automated and modular workflows is key to enhancing homogeneous catalyst design on an unprecedented scale, contributing to the advancement of catalysis research.
Collapse
Affiliation(s)
- Adarsh V Kalikadien
- Inorganic Systems Engineering, Department of Chemical Engineering, Faculty of Applied Sciences, Delft University of Technology, Van der Maasweg 9, 2629 HZ, Delft, The Netherlands
| | - Adrian Mirza
- Inorganic Systems Engineering, Department of Chemical Engineering, Faculty of Applied Sciences, Delft University of Technology, Van der Maasweg 9, 2629 HZ, Delft, The Netherlands
| | - Aydin Najl Hossaini
- Inorganic Systems Engineering, Department of Chemical Engineering, Faculty of Applied Sciences, Delft University of Technology, Van der Maasweg 9, 2629 HZ, Delft, The Netherlands
| | - Avadakkam Sreenithya
- Inorganic Systems Engineering, Department of Chemical Engineering, Faculty of Applied Sciences, Delft University of Technology, Van der Maasweg 9, 2629 HZ, Delft, The Netherlands
| | - Evgeny A Pidko
- Inorganic Systems Engineering, Department of Chemical Engineering, Faculty of Applied Sciences, Delft University of Technology, Van der Maasweg 9, 2629 HZ, Delft, The Netherlands
| |
Collapse
|
16
|
Chen LY, Li YP. AutoTemplate: enhancing chemical reaction datasets for machine learning applications in organic chemistry. J Cheminform 2024; 16:74. [PMID: 38937840 PMCID: PMC11212196 DOI: 10.1186/s13321-024-00869-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/02/2024] [Accepted: 06/09/2024] [Indexed: 06/29/2024] Open
Abstract
This paper presents AutoTemplate, an innovative data preprocessing protocol, addressing the crucial need for high-quality chemical reaction datasets in the realm of machine learning applications in organic chemistry. Recent advances in artificial intelligence have expanded the application of machine learning in chemistry, particularly in yield prediction, retrosynthesis, and reaction condition prediction. However, the effectiveness of these models hinges on the integrity of chemical reaction datasets, which are often plagued by inconsistencies like missing reactants, incorrect atom mappings, and outright erroneous reactions. AutoTemplate introduces a two-stage approach to refine these datasets. The first stage involves extracting meaningful reaction transformation rules and formulating generic reaction templates using a simplified SMARTS representation. This simplification broadens the applicability of templates across various chemical reactions. The second stage is template-guided reaction curation, where these templates are systematically applied to validate and correct the reaction data. This process effectively amends missing reactant information, rectifies atom-mapping errors, and eliminates incorrect data entries. A standout feature of AutoTemplate is its capability to concurrently identify and correct false chemical reactions. It operates on the premise that most reactions in datasets are accurate, using these as templates to guide the correction of flawed entries. The protocol demonstrates its efficacy across a range of chemical reactions, significantly enhancing dataset quality. This advancement provides a more robust foundation for developing reliable machine learning models in chemistry, thereby improving the accuracy of forward and retrosynthetic predictions. AutoTemplate marks a significant progression in the preprocessing of chemical reaction datasets, bridging a vital gap and facilitating more precise and efficient machine learning applications in organic synthesis. SCIENTIFIC CONTRIBUTION: The proposed automated preprocessing tool for chemical reaction data aims to identify errors within chemical databases. Specifically, if the errors involve atom mapping or the absence of reactant types, corrections can be systematically applied using reaction templates, ultimately elevating the overall quality of the database.
Collapse
Affiliation(s)
- Lung-Yi Chen
- Department of Chemical Engineering, National Taiwan University, No. 1, Sec. 4, Roosevelt Road, Taipei, 10617, Taiwan
| | - Yi-Pei Li
- Department of Chemical Engineering, National Taiwan University, No. 1, Sec. 4, Roosevelt Road, Taipei, 10617, Taiwan.
- Taiwan International Graduate Program on Sustainable Chemical Science and Technology (TIGP-SCST), No. 128, Sec. 2, Academia Road, Taipei, 11529, Taiwan.
| |
Collapse
|
17
|
Raghavan P, Rago AJ, Verma P, Hassan MM, Goshu GM, Dombrowski AW, Pandey A, Coley CW, Wang Y. Incorporating Synthetic Accessibility in Drug Design: Predicting Reaction Yields of Suzuki Cross-Couplings by Leveraging AbbVie's 15-Year Parallel Library Data Set. J Am Chem Soc 2024; 146:15070-15084. [PMID: 38768950 PMCID: PMC11157529 DOI: 10.1021/jacs.4c00098] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/03/2024] [Revised: 04/24/2024] [Accepted: 04/25/2024] [Indexed: 05/22/2024]
Abstract
Despite the increased use of computational tools to supplement medicinal chemists' expertise and intuition in drug design, predicting synthetic yields in medicinal chemistry endeavors remains an unsolved challenge. Existing design workflows could profoundly benefit from reaction yield prediction, as precious material waste could be reduced, and a greater number of relevant compounds could be delivered to advance the design, make, test, analyze (DMTA) cycle. In this work, we detail the evaluation of AbbVie's medicinal chemistry library data set to build machine learning models for the prediction of Suzuki coupling reaction yields. The combination of density functional theory (DFT)-derived features and Morgan fingerprints was identified to perform better than one-hot encoded baseline modeling, furnishing encouraging results. Overall, we observe modest generalization to unseen reactant structures within the 15-year retrospective library data set. Additionally, we compare predictions made by the model to those made by expert medicinal chemists, finding that the model can often predict both reaction success and reaction yields with greater accuracy. Finally, we demonstrate the application of this approach to suggest structurally and electronically similar building blocks to replace those predicted or observed to be unsuccessful prior to or after synthesis, respectively. The yield prediction model was used to select similar monomers predicted to have higher yields, resulting in greater synthesis efficiency of relevant drug-like molecules.
Collapse
Affiliation(s)
- Priyanka Raghavan
- Department
of Chemical Engineering, Massachusetts Institute
of Technology, 77 Massachusetts Ave, Cambridge, Massachusetts 02139, United States
| | - Alexander J. Rago
- Advanced
Chemistry Technologies Group, AbbVie, Inc., 1 N Waukegan Rd, North Chicago, Illinois 60064, United States
| | - Pritha Verma
- Advanced
Chemistry Technologies Group, AbbVie, Inc., 1 N Waukegan Rd, North Chicago, Illinois 60064, United States
| | - Majdi M. Hassan
- RAIDERS
Group, AbbVie, Inc., 1 N Waukegan Rd, North Chicago, Illinois 60064, United States
| | - Gashaw M. Goshu
- Advanced
Chemistry Technologies Group, AbbVie, Inc., 1 N Waukegan Rd, North Chicago, Illinois 60064, United States
| | - Amanda W. Dombrowski
- Advanced
Chemistry Technologies Group, AbbVie, Inc., 1 N Waukegan Rd, North Chicago, Illinois 60064, United States
| | - Abhishek Pandey
- RAIDERS
Group, AbbVie, Inc., 1 N Waukegan Rd, North Chicago, Illinois 60064, United States
| | - Connor W. Coley
- Department
of Chemical Engineering, Massachusetts Institute
of Technology, 77 Massachusetts Ave, Cambridge, Massachusetts 02139, United States
| | - Ying Wang
- Advanced
Chemistry Technologies Group, AbbVie, Inc., 1 N Waukegan Rd, North Chicago, Illinois 60064, United States
| |
Collapse
|
18
|
Wigh D, Arrowsmith J, Pomberger A, Felton KC, Lapkin AA. ORDerly: Data Sets and Benchmarks for Chemical Reaction Data. J Chem Inf Model 2024; 64:3790-3798. [PMID: 38648077 PMCID: PMC11094788 DOI: 10.1021/acs.jcim.4c00292] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/20/2024] [Revised: 04/03/2024] [Accepted: 04/04/2024] [Indexed: 04/25/2024]
Abstract
Machine learning has the potential to provide tremendous value to life sciences by providing models that aid in the discovery of new molecules and reduce the time for new products to come to market. Chemical reactions play a significant role in these fields, but there is a lack of high-quality open-source chemical reaction data sets for training machine learning models. Herein, we present ORDerly, an open-source Python package for the customizable and reproducible preparation of reaction data stored in accordance with the increasingly popular Open Reaction Database (ORD) schema. We use ORDerly to clean United States patent data stored in ORD and generate data sets for forward prediction, retrosynthesis, as well as the first benchmark for reaction condition prediction. We train neural networks on data sets generated with ORDerly for condition prediction and show that data sets missing key cleaning steps can lead to silently overinflated performance metrics. Additionally, we train transformers for forward and retrosynthesis prediction and demonstrate how non-patent data can be used to evaluate model generalization. By providing a customizable open-source solution for cleaning and preparing large chemical reaction data, ORDerly is poised to push forward the boundaries of machine learning applications in chemistry.
Collapse
Affiliation(s)
- Daniel
S. Wigh
- Department of Chemical Engineering
and Biotechnology, University of Cambridge, Cambridge CB3 0AS, U.K.
| | - Joe Arrowsmith
- Department of Chemical Engineering
and Biotechnology, University of Cambridge, Cambridge CB3 0AS, U.K.
| | - Alexander Pomberger
- Department of Chemical Engineering
and Biotechnology, University of Cambridge, Cambridge CB3 0AS, U.K.
| | - Kobi C. Felton
- Department of Chemical Engineering
and Biotechnology, University of Cambridge, Cambridge CB3 0AS, U.K.
| | - Alexei A. Lapkin
- Department of Chemical Engineering
and Biotechnology, University of Cambridge, Cambridge CB3 0AS, U.K.
| |
Collapse
|
19
|
Duke R, McCoy R, Risko C, Bursten JRS. Promises and Perils of Big Data: Philosophical Constraints on Chemical Ontologies. J Am Chem Soc 2024; 146:11579-11591. [PMID: 38640489 DOI: 10.1021/jacs.3c11399] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 04/21/2024]
Abstract
Chemistry is experiencing a paradigm shift in the way it interacts with data. So-called "big data" are collected and used at unprecedented scales with the idea that algorithms can be designed to aid in chemical discovery. As data-enabled practices become ever more ubiquitous, chemists must consider the organization and curation of their data, especially as it is presented to both humans and increasingly intelligent algorithms. One of the most promising organizational schemes for big data is a construct termed an ontology. In data science, ontologies are systems that represent relations among objects and properties in a domain of discourse. As chemistry encounters larger and larger data sets, the ontologies that support chemical research will likewise increase in complexity, and the future of chemistry will be shaped by the choices made in developing big data chemical ontologies. How such ontologies will work should therefore be a subject of significant attention in the chemical community. Now is the time for chemists to ask questions about ontology design and use: How should chemical data be organized? What can be reasonably expected from an organizational structure? Is a universal ontology tenable? As some of these questions may be new to chemists, we recommend an interdisciplinary approach that draws on the long history of philosophers of science asking questions about the organization of scientific concepts, constructs, models, and theories. This Perspective presents insights from these long-standing studies and initiates new conversations between chemists and philosophers.
Collapse
Affiliation(s)
- Rebekah Duke
- Department of Chemistry & Center for Applied Energy Research, University of Kentucky, Lexington, Kentucky 40506, United States
| | - Ryan McCoy
- Department of Philosophy, University of Kentucky, Lexington, Kentucky 40508, United States
| | - Chad Risko
- Department of Chemistry & Center for Applied Energy Research, University of Kentucky, Lexington, Kentucky 40506, United States
| | - Julia R S Bursten
- Department of Philosophy, University of Kentucky, Lexington, Kentucky 40508, United States
| |
Collapse
|
20
|
Zhang C, Arun A, Lapkin AA. Completing and Balancing Database Excerpted Chemical Reactions with a Hybrid Mechanistic-Machine Learning Approach. ACS OMEGA 2024; 9:18385-18399. [PMID: 38680356 PMCID: PMC11044172 DOI: 10.1021/acsomega.4c00262] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 01/11/2024] [Revised: 03/31/2024] [Accepted: 04/03/2024] [Indexed: 05/01/2024]
Abstract
Computer-aided synthesis planning (CASP) development of reaction routes requires an understanding of complete reaction structures. However, most reactions in the current databases are missing reaction coparticipants. Although reaction prediction and atom mapping tools can predict major reaction participants and trace atom rearrangements in reactions, they fail to identify the missing molecules to complete reactions. This is because these approaches are data-driven models trained on the current reaction databases, which comprise incomplete reactions. In this work, a workflow was developed to tackle the reaction completion challenge. This includes a heuristic-based method to identify balanced reactions from reaction databases and complete some imbalanced reactions by adding candidate molecules. A machine learning masked language model (MLM) was trained to learn from simplified molecular input line entry system (SMILES) sentences of these completed reactions. The model predicted missing molecules for the incomplete reactions, a workflow analogous to predicting missing words in sentences. The model is promising for the prediction of small- and middle-sized missing molecules in incomplete reaction records. The workflow combining both the heuristic and machine learning methods completed more than half of the entire reaction space.
Collapse
Affiliation(s)
- Chonghuan Zhang
- Department
of Chemical Engineering and Biotechnology, University of Cambridge, Philippa Fawcett Drive, Cambridge CB3 0AS, U.K.
| | - Adarsh Arun
- Department
of Chemical Engineering and Biotechnology, University of Cambridge, Philippa Fawcett Drive, Cambridge CB3 0AS, U.K.
- Cambridge
Centre for Advanced Research and Education in Singapore, CARES Ltd., 1 CREATE Way, CREATE Tower #05-05, Singapore 138602 Singapore
- Chemical
Data Intelligence (CDI) Pte., Ltd., 9 Raffles Place #26-01, Republic Plaza, Singapore 048619 Singapore
| | - Alexei A. Lapkin
- Department
of Chemical Engineering and Biotechnology, University of Cambridge, Philippa Fawcett Drive, Cambridge CB3 0AS, U.K.
- Cambridge
Centre for Advanced Research and Education in Singapore, CARES Ltd., 1 CREATE Way, CREATE Tower #05-05, Singapore 138602 Singapore
- Chemical
Data Intelligence (CDI) Pte., Ltd., 9 Raffles Place #26-01, Republic Plaza, Singapore 048619 Singapore
| |
Collapse
|
21
|
Ding Y, Qiang B, Chen Q, Liu Y, Zhang L, Liu Z. Exploring Chemical Reaction Space with Machine Learning Models: Representation and Feature Perspective. J Chem Inf Model 2024; 64:2955-2970. [PMID: 38489239 DOI: 10.1021/acs.jcim.4c00004] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 03/17/2024]
Abstract
Chemical reactions serve as foundational building blocks for organic chemistry and drug design. In the era of large AI models, data-driven approaches have emerged to innovate the design of novel reactions, optimize existing ones for higher yields, and discover new pathways for synthesizing chemical structures comprehensively. To effectively address these challenges with machine learning models, it is imperative to derive robust and informative representations or engage in feature engineering using extensive data sets of reactions. This work aims to provide a comprehensive review of established reaction featurization approaches, offering insights into the selection of representations and the design of features for a wide array of tasks. The advantages and limitations of employing SMILES, molecular fingerprints, molecular graphs, and physics-based properties are meticulously elaborated. Solutions to bridge the gap between different representations will also be critically evaluated. Additionally, we introduce a new frontier in chemical reaction pretraining, holding promise as an innovative yet unexplored avenue.
Collapse
Affiliation(s)
- Yuheng Ding
- Department of Pharmaceutical Science, Peking University, Beijing 100191, China
| | - Bo Qiang
- Department of Pharmaceutical Science, Peking University, Beijing 100191, China
| | - Qixuan Chen
- Department of Pharmaceutical Science, Peking University, Beijing 100191, China
| | - Yiqiao Liu
- Department of Pharmaceutical Science, Peking University, Beijing 100191, China
| | - Liangren Zhang
- Department of Pharmaceutical Science, Peking University, Beijing 100191, China
| | - Zhenming Liu
- Department of Pharmaceutical Science, Peking University, Beijing 100191, China
| |
Collapse
|
22
|
Strieth-Kalthoff F, Szymkuć S, Molga K, Aspuru-Guzik A, Glorius F, Grzybowski BA. Artificial Intelligence for Retrosynthetic Planning Needs Both Data and Expert Knowledge. J Am Chem Soc 2024. [PMID: 38598363 DOI: 10.1021/jacs.4c00338] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 04/12/2024]
Abstract
Rapid advancements in artificial intelligence (AI) have enabled breakthroughs across many scientific disciplines. In organic chemistry, the challenge of planning complex multistep chemical syntheses should conceptually be well-suited for AI. Yet, the development of AI synthesis planners trained solely on reaction-example-data has stagnated and is not on par with the performance of "hybrid" algorithms combining AI with expert knowledge. This Perspective examines possible causes of these shortcomings, extending beyond the established reasoning of insufficient quantities of reaction data. Drawing attention to the intricacies and data biases that are specific to the domain of synthetic chemistry, we advocate augmenting the unique capabilities of AI with the knowledge base and the reasoning strategies of domain experts. By actively involving synthetic chemists, who are the end users of any synthesis planning software, into the development process, we envision to bridge the gap between computer algorithms and the intricate nature of chemical synthesis.
Collapse
Affiliation(s)
- Felix Strieth-Kalthoff
- University of Toronto, Department of Chemistry and Department of Computer Science, 80 St. George St., Toronto, Ontario M5S 3H6, Canada
- University of Toronto, Department of Computer Science, 10 King's College Road, Toronto, Ontario M5S 3G4, Canada
| | - Sara Szymkuć
- Allchemy, 2145 45th Street #201, Highland, Indiana 46322, United States
- Institute of Organic Chemistry, Polish Academy of Sciences, ul. Kasprzaka 44/52, Warsaw 01-224, Poland
| | - Karol Molga
- Allchemy, 2145 45th Street #201, Highland, Indiana 46322, United States
- Institute of Organic Chemistry, Polish Academy of Sciences, ul. Kasprzaka 44/52, Warsaw 01-224, Poland
| | - Alán Aspuru-Guzik
- University of Toronto, Department of Chemistry and Department of Computer Science, 80 St. George St., Toronto, Ontario M5S 3H6, Canada
- University of Toronto, Department of Computer Science, 10 King's College Road, Toronto, Ontario M5S 3G4, Canada
- Vector Institute for Artificial Intelligence, 661 University Ave., Toronto, Ontario M5G 1M1, Canada
- University of Toronto, Department of Chemical Engineering and Applied Chemistry, 200 College St., Toronto, Ontario M5S 3E5, Canada
- University of Toronto, Department of Materials Science and Engineering, 184 College St., Toronto, Ontario M5S 3E4, Canada
| | - Frank Glorius
- Universität Münster, Organisch-Chemisches Institut, Corrensstr. 36, 48149 Münster, Germany
| | - Bartosz A Grzybowski
- Institute of Organic Chemistry, Polish Academy of Sciences, ul. Kasprzaka 44/52, Warsaw 01-224, Poland
- IBS Center for Algorithmic and Robotized Synthesis, CARS, UNIST 50, UNIST-gil, Eonyang-eup, Ulju-gun, Ulsan 689-798, South Korea
- Department of Chemistry, UNIST, 50, UNIST-gil, Eonyang-eup, Ulju-gun, Ulsan 689-798, South Korea
| |
Collapse
|
23
|
Schrader ML, Schäfer FR, Schäfers F, Glorius F. Bridging the information gap in organic chemical reactions. Nat Chem 2024; 16:491-498. [PMID: 38548884 DOI: 10.1038/s41557-024-01470-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/03/2023] [Accepted: 02/02/2024] [Indexed: 04/07/2024]
Abstract
The varying quality of scientific reports is a well-recognized problem and often results from a lack of standardization and transparency in scientific publications. This situation ultimately leads to prominent complications such as reproducibility issues and the slow uptake of newly developed synthetic methods for pharmaceutical and agrochemical applications. In recent years, various impactful approaches have been advocated to bridge information gaps and to improve the quality of experimental protocols in synthetic organic publications. Here we provide a critical overview of these strategies and present the reader with a versatile set of tools to augment their standard procedures. We formulate eight principles to improve data management in scientific publications relating to data standardization, reproducibility and evaluation, and encourage scientists to go beyond current publication standards. We are aware that this is a substantial effort, but we are convinced that the resulting improved data situation will greatly benefit the progress of chemistry.
Collapse
Affiliation(s)
- Malte L Schrader
- Organisch-Chemisches Institut, Universität Münster, Münster, Germany
| | - Felix R Schäfer
- Organisch-Chemisches Institut, Universität Münster, Münster, Germany
| | - Felix Schäfers
- Organisch-Chemisches Institut, Universität Münster, Münster, Germany
| | - Frank Glorius
- Organisch-Chemisches Institut, Universität Münster, Münster, Germany.
| |
Collapse
|
24
|
King-Smith E, Berritt S, Bernier L, Hou X, Klug-McLeod JL, Mustakis J, Sach NW, Tucker JW, Yang Q, Howard RM, Lee AA. Probing the chemical 'reactome' with high-throughput experimentation data. Nat Chem 2024; 16:633-643. [PMID: 38168924 PMCID: PMC10997498 DOI: 10.1038/s41557-023-01393-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/11/2022] [Accepted: 11/06/2023] [Indexed: 01/05/2024]
Abstract
High-throughput experimentation (HTE) has the potential to improve our understanding of organic chemistry by systematically interrogating reactivity across diverse chemical spaces. Notable bottlenecks include few publicly available large-scale datasets and the need for facile interpretation of these data's hidden chemical insights. Here we report the development of a high-throughput experimentation analyser, a robust and statistically rigorous framework, which is applicable to any HTE dataset regardless of size, scope or target reaction outcome, which yields interpretable correlations between starting material(s), reagents and outcomes. We improve the HTE data landscape with the disclosure of 39,000+ previously proprietary HTE reactions that cover a breadth of chemistry, including cross-coupling reactions and chiral salt resolutions. The high-throughput experimentation analyser was validated on cross-coupling and hydrogenation datasets, showcasing the elucidation of statistically significant hidden relationships between reaction components and outcomes, as well as highlighting areas of dataset bias and the specific reaction spaces that necessitate further investigation.
Collapse
Affiliation(s)
- Emma King-Smith
- Cavendish Laboratory, University of Cambridge, Cambridge, UK
| | | | | | - Xinjun Hou
- Pfizer Research and Development, Cambridge, MA, USA
| | | | | | - Neal W Sach
- Pfizer Research and Development, La Jolla, CA, USA
| | | | - Qingyi Yang
- Pfizer Research and Development, Cambridge, MA, USA
| | | | - Alpha A Lee
- Cavendish Laboratory, University of Cambridge, Cambridge, UK.
| |
Collapse
|
25
|
Dobbelaere MR, Lengyel I, Stevens CV, Van Geem KM. Rxn-INSIGHT: fast chemical reaction analysis using bond-electron matrices. J Cheminform 2024; 16:37. [PMID: 38553720 PMCID: PMC10980627 DOI: 10.1186/s13321-024-00834-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/21/2023] [Accepted: 03/23/2024] [Indexed: 04/02/2024] Open
Abstract
The challenge of devising pathways for organic synthesis remains a central issue in the field of medicinal chemistry. Over the span of six decades, computer-aided synthesis planning has given rise to a plethora of potent tools for formulating synthetic routes. Nevertheless, a significant expert task still looms: determining the appropriate solvent, catalyst, and reagents when provided with a set of reactants to achieve and optimize the desired product for a specific step in the synthesis process. Typically, chemists identify key functional groups and rings that exert crucial influences at the reaction center, classify reactions into categories, and may assign them names. This research introduces Rxn-INSIGHT, an open-source algorithm based on the bond-electron matrix approach, with the purpose of automating this endeavor. Rxn-INSIGHT not only streamlines the process but also facilitates extensive querying of reaction databases, effectively replicating the thought processes of an organic chemist. The core functions of the algorithm encompass the classification and naming of reactions, extraction of functional groups, rings, and scaffolds from the involved chemical entities. The provision of reaction condition recommendations based on the similarity and prevalence of reactions eventually arises as a side application. The performance of our rule-based model has been rigorously assessed against a carefully curated benchmark dataset, exhibiting an accuracy rate exceeding 90% in reaction classification and surpassing 95% in reaction naming. Notably, it has been discerned that a pivotal factor in selecting analogous reactions lies in the analysis of ring structures participating in the reactions. An examination of ring structures within the USPTO chemical reaction database reveals that with just 35 unique rings, a remarkable 75% of all rings found in nearly 1 million products can be encompassed. Furthermore, Rxn-INSIGHT is proficient in suggesting appropriate choices for solvents, catalysts, and reagents in entirely novel reactions, all within the span of a second, utilizing nothing more than an everyday laptop.
Collapse
Affiliation(s)
- Maarten R Dobbelaere
- Laboratory for Chemical Technology, Department of Materials, Textiles and Chemical Engineering, Faculty of Engineering and Architecture, Ghent University, Technologiepark 125, 9052, Ghent, Belgium
| | - István Lengyel
- Laboratory for Chemical Technology, Department of Materials, Textiles and Chemical Engineering, Faculty of Engineering and Architecture, Ghent University, Technologiepark 125, 9052, Ghent, Belgium
- ChemInsights LLC, Dover, DE, 19901, USA
| | - Christian V Stevens
- SynBioC Research Group, Department of Green Chemistry and Technology, Faculty of Bioscience Engineering, Ghent University, Coupure Links 653, 9000, Ghent, Belgium
| | - Kevin M Van Geem
- Laboratory for Chemical Technology, Department of Materials, Textiles and Chemical Engineering, Faculty of Engineering and Architecture, Ghent University, Technologiepark 125, 9052, Ghent, Belgium.
| |
Collapse
|
26
|
Pasquini M, Stenta M. LinChemIn: Route Arithmetic─Operations on Digital Synthetic Routes. J Chem Inf Model 2024; 64:1765-1771. [PMID: 38480486 DOI: 10.1021/acs.jcim.3c01819] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 03/26/2024]
Abstract
Computational tools are revolutionizing our understanding and prediction of chemical reactivity by combining traditional data analysis techniques with new predictive models. These tools extract additional value from the reaction data corpus, but to effectively convert this value into actionable knowledge, domain specialists need to interact easily with the computer-generated output. In this application note, we demonstrate the capabilities of the open-source Python toolkit LinChemIn, which simplifies the manipulation of reaction networks and provides advanced functionality for working with synthetic routes. LinChemIn ensures chemical consistency when merging, editing, mining, and analyzing reaction networks. Its flexible input interface can process routes from various sources, including predictive models and expert input. The toolkit also efficiently extracts individual routes from the combined synthetic tree, identifying alternative paths and reaction combinations. By reducing the operational barrier to accessing and analyzing synthetic routes from multiple sources, LinChemIn facilitates a constructive interplay between artificial intelligence and human expertise.
Collapse
Affiliation(s)
- Marta Pasquini
- Syngenta Crop Protection AG, Schaffhauserstrasse, 4332 Stein, AG, Switzerland
| | - Marco Stenta
- Syngenta Crop Protection AG, Schaffhauserstrasse, 4332 Stein, AG, Switzerland
| |
Collapse
|
27
|
Tavakoli M, Miller RJ, Angel MC, Pfeiffer MA, Gutman ES, Mood AD, Van Vranken D, Baldi P. PMechDB: A Public Database of Elementary Polar Reaction Steps. J Chem Inf Model 2024; 64:1975-1983. [PMID: 38483315 PMCID: PMC10966657 DOI: 10.1021/acs.jcim.3c01810] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/08/2023] [Revised: 02/15/2024] [Accepted: 02/16/2024] [Indexed: 03/26/2024]
Abstract
Most online chemical reaction databases are not publicly accessible or are fully downloadable. These databases tend to contain reactions in noncanonicalized formats and often lack comprehensive information regarding reaction pathways, intermediates, and byproducts. Within the few publicly available databases, reactions are typically stored in the form of unbalanced, overall transformations with minimal interpretability of the underlying chemistry. These limitations present significant obstacles to data-driven applications including the development of machine learning models. As an effort to overcome these challenges, we introduce PMechDB, a publicly accessible platform designed to curate, aggregate, and share polar chemical reaction data in the form of elementary reaction steps. Our initial version of PMechDB consists of over 100,000 such steps. In the PMechDB, all reactions are stored as canonicalized and balanced elementary steps, featuring accurate atom mapping and arrow-pushing mechanisms. As an online interactive database, PMechDB provides multiple interfaces that enable users to search, download, and upload chemical reactions. We anticipate that the public availability of PMechDB and its standardized data representation will prove beneficial for chemoinformatics research and education and the development of data-driven, interpretable models for predicting reactions and pathways. PMechDB platform is accessible online at https://deeprxn.ics.uci.edu/pmechdb.
Collapse
Affiliation(s)
- Mohammadamin Tavakoli
- Department
of Computer Science, University of California,
Irvine, Irvine, California 92697, United States
| | - Ryan J. Miller
- Department
of Computer Science, University of California,
Irvine, Irvine, California 92697, United States
| | - Mirana Claire Angel
- Department
of Computer Science, University of California,
Irvine, Irvine, California 92697, United States
| | - Michael A. Pfeiffer
- Department
of Chemistry, University of California,
Irvine, Irvine, California 92697, United States
| | - Eugene S. Gutman
- Department
of Chemistry, University of California,
Irvine, Irvine, California 92697, United States
| | - Aaron D. Mood
- Department
of Chemistry, University of California,
Irvine, Irvine, California 92697, United States
| | - David Van Vranken
- Department
of Chemistry, University of California,
Irvine, Irvine, California 92697, United States
| | - Pierre Baldi
- Department
of Computer Science, University of California,
Irvine, Irvine, California 92697, United States
| |
Collapse
|
28
|
'Bandit' algorithms help chemists to discover generally applicable conditions for reactions. Nature 2024:10.1038/d41586-024-00446-5. [PMID: 38499804 DOI: 10.1038/d41586-024-00446-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 03/20/2024]
|
29
|
Listgarten J. The perpetual motion machine of AI-generated data and the distraction of ChatGPT as a 'scientist'. Nat Biotechnol 2024; 42:371-373. [PMID: 38273064 DOI: 10.1038/s41587-023-02103-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/27/2024]
Affiliation(s)
- Jennifer Listgarten
- Electrical Engineering and Computer Science, University of California, Berkeley, Berkeley, CA, USA.
- Center for Computational Biology, University of California, Berkeley, Berkeley, CA, USA.
| |
Collapse
|
30
|
Isbrandt ES, Chapple DE, Tu NTP, Dimakos V, Beardall AMM, Boyle PD, Rowley CN, Blacquiere JM, Newman SG. Controlling Reactivity and Selectivity in the Mizoroki-Heck Reaction: High Throughput Evaluation of 1,5-Diaza-3,7-diphosphacyclooctane Ligands. J Am Chem Soc 2024; 146:5650-5660. [PMID: 38359357 DOI: 10.1021/jacs.3c14612] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/17/2024]
Abstract
We report a high throughput evaluation of the Mizoroki-Heck reaction of diverse olefin coupling partners. Comparison of different ligands revealed the 1,5-diaza-3,7-diphosphacyclooctane (P2N2) scaffold to be more broadly applicable than common "gold standard" ligands, demonstrating that this family of readily accessible diphosphines has unrecognized potential in organic synthesis. In particular, two structurally related P2N2 ligands were identified to enable the regiodivergent arylation of styrenes. By simply altering the phosphorus substituent from a phenyl to tert-butyl group, both the linear and branched Mizoroki-Heck products can be obtained in high regioisomeric ratios. Experimental and computational mechanistic studies were performed to further probe the origin of selectivity, which suggests that both ligands coordinate to the metal in a similar manner but that rigid positioning of the phosphorus substituent forces contact with the incoming olefin in a π-π interaction (for P-Ph ligands) or with steric clash (for P-tBu ligands), dictating the regiocontrol.
Collapse
Affiliation(s)
- Eric S Isbrandt
- Centre for Catalysis Research and Innovation, Department of Chemistry and Biomolecular Sciences, University of Ottawa, 10 Marie Curie Private, Ottawa, Ontario K1N 6N5, Canada
| | - Devon E Chapple
- Department of Chemistry, Western University, 1151 Richmond Street, London, Ontario N6A 3K7, Canada
| | - Nguyen Thien Phuc Tu
- Department of Chemistry, Carleton University, 1125 Colonel By Drive, Ottawa, Ontario K1S 5B6, Canada
| | - Victoria Dimakos
- Centre for Catalysis Research and Innovation, Department of Chemistry and Biomolecular Sciences, University of Ottawa, 10 Marie Curie Private, Ottawa, Ontario K1N 6N5, Canada
| | - Anne Marie M Beardall
- Department of Chemistry, Western University, 1151 Richmond Street, London, Ontario N6A 3K7, Canada
| | - Paul D Boyle
- Department of Chemistry, Western University, 1151 Richmond Street, London, Ontario N6A 3K7, Canada
| | - Christopher N Rowley
- Department of Chemistry, Carleton University, 1125 Colonel By Drive, Ottawa, Ontario K1S 5B6, Canada
| | - Johanna M Blacquiere
- Department of Chemistry, Western University, 1151 Richmond Street, London, Ontario N6A 3K7, Canada
| | - Stephen G Newman
- Centre for Catalysis Research and Innovation, Department of Chemistry and Biomolecular Sciences, University of Ottawa, 10 Marie Curie Private, Ottawa, Ontario K1N 6N5, Canada
| |
Collapse
|
31
|
Nippa DF, Atz K, Hohler R, Müller AT, Marx A, Bartelmus C, Wuitschik G, Marzuoli I, Jost V, Wolfard J, Binder M, Stepan AF, Konrad DB, Grether U, Martin RE, Schneider G. Enabling late-stage drug diversification by high-throughput experimentation with geometric deep learning. Nat Chem 2024; 16:239-248. [PMID: 37996732 PMCID: PMC10849962 DOI: 10.1038/s41557-023-01360-5] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/21/2022] [Accepted: 10/03/2023] [Indexed: 11/25/2023]
Abstract
Late-stage functionalization is an economical approach to optimize the properties of drug candidates. However, the chemical complexity of drug molecules often makes late-stage diversification challenging. To address this problem, a late-stage functionalization platform based on geometric deep learning and high-throughput reaction screening was developed. Considering borylation as a critical step in late-stage functionalization, the computational model predicted reaction yields for diverse reaction conditions with a mean absolute error margin of 4-5%, while the reactivity of novel reactions with known and unknown substrates was classified with a balanced accuracy of 92% and 67%, respectively. The regioselectivity of the major products was accurately captured with a classifier F-score of 67%. When applied to 23 diverse commercial drug molecules, the platform successfully identified numerous opportunities for structural diversification. The influence of steric and electronic information on model performance was quantified, and a comprehensive simple user-friendly reaction format was introduced that proved to be a key enabler for seamlessly integrating deep learning and high-throughput experimentation for late-stage functionalization.
Collapse
Affiliation(s)
- David F Nippa
- Roche Pharma Research and Early Development (pRED), Roche Innovation Center Basel, F. Hoffmann-La Roche Ltd., Basel, Switzerland
- Department of Pharmacy, Ludwig-Maximilians-Universität München, Munich, Germany
| | - Kenneth Atz
- Department of Chemistry and Applied Biosciences, ETH Zurich, Zurich, Switzerland
| | - Remo Hohler
- Roche Pharma Research and Early Development (pRED), Roche Innovation Center Basel, F. Hoffmann-La Roche Ltd., Basel, Switzerland
| | - Alex T Müller
- Roche Pharma Research and Early Development (pRED), Roche Innovation Center Basel, F. Hoffmann-La Roche Ltd., Basel, Switzerland
| | - Andreas Marx
- Roche Pharma Research and Early Development (pRED), Roche Innovation Center Basel, F. Hoffmann-La Roche Ltd., Basel, Switzerland
| | - Christian Bartelmus
- Roche Pharma Research and Early Development (pRED), Roche Innovation Center Basel, F. Hoffmann-La Roche Ltd., Basel, Switzerland
| | - Georg Wuitschik
- Roche Pharma Research and Early Development (pRED), Roche Innovation Center Basel, F. Hoffmann-La Roche Ltd., Basel, Switzerland
| | - Irene Marzuoli
- Process Chemistry and Catalysis (PCC), F. Hoffmann-La Roche Ltd., Basel, Switzerland
| | - Vera Jost
- Roche Pharma Research and Early Development (pRED), Roche Innovation Center Basel, F. Hoffmann-La Roche Ltd., Basel, Switzerland
| | - Jens Wolfard
- Roche Pharma Research and Early Development (pRED), Roche Innovation Center Basel, F. Hoffmann-La Roche Ltd., Basel, Switzerland
| | - Martin Binder
- Roche Pharma Research and Early Development (pRED), Roche Innovation Center Basel, F. Hoffmann-La Roche Ltd., Basel, Switzerland
| | - Antonia F Stepan
- Roche Pharma Research and Early Development (pRED), Roche Innovation Center Basel, F. Hoffmann-La Roche Ltd., Basel, Switzerland
| | - David B Konrad
- Department of Pharmacy, Ludwig-Maximilians-Universität München, Munich, Germany.
| | - Uwe Grether
- Roche Pharma Research and Early Development (pRED), Roche Innovation Center Basel, F. Hoffmann-La Roche Ltd., Basel, Switzerland.
| | - Rainer E Martin
- Roche Pharma Research and Early Development (pRED), Roche Innovation Center Basel, F. Hoffmann-La Roche Ltd., Basel, Switzerland.
| | - Gisbert Schneider
- Department of Chemistry and Applied Biosciences, ETH Zurich, Zurich, Switzerland.
- ETH Singapore SEC Ltd, Singapore, Singapore.
| |
Collapse
|
32
|
Chen LY, Li YP. Enhancing chemical synthesis: a two-stage deep neural network for predicting feasible reaction conditions. J Cheminform 2024; 16:11. [PMID: 38268009 PMCID: PMC11301986 DOI: 10.1186/s13321-024-00805-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/22/2023] [Accepted: 01/14/2024] [Indexed: 01/26/2024] Open
Abstract
In the field of chemical synthesis planning, the accurate recommendation of reaction conditions is essential for achieving successful outcomes. This work introduces an innovative deep learning approach designed to address the complex task of predicting appropriate reagents, solvents, and reaction temperatures for chemical reactions. Our proposed methodology combines a multi-label classification model with a ranking model to offer tailored reaction condition recommendations based on relevance scores derived from anticipated product yields. To tackle the challenge of limited data for unfavorable reaction contexts, we employed the technique of hard negative sampling to generate reaction conditions that might be mistakenly classified as suitable, forcing the model to refine its decision boundaries, especially in challenging cases. Our developed model excels in proposing conditions where an exact match to the recorded solvents and reagents is found within the top-10 predictions 73% of the time. It also predicts temperatures within ± 20 [Formula: see text] of the recorded temperature in 89% of test cases. Notably, the model demonstrates its capacity to recommend multiple viable reaction conditions, with accuracy varying based on the availability of condition records associated with each reaction. What sets this model apart is its ability to suggest alternative reaction conditions beyond the constraints of the dataset. This underscores its potential to inspire innovative approaches in chemical research, presenting a compelling opportunity for advancing chemical synthesis planning and elevating the field of reaction engineering. Scientific contribution: The combination of multi-label classification and ranking models provides tailored recommendations for reaction conditions based on the reaction yields. A novel approach is presented to address the issue of data scarcity in negative reaction conditions through data augmentation.
Collapse
Affiliation(s)
- Lung-Yi Chen
- Department of Chemical Engineering, National Taiwan University, No. 1, Sec. 4, Roosevelt Road, Taipei, 10617, Taiwan
| | - Yi-Pei Li
- Department of Chemical Engineering, National Taiwan University, No. 1, Sec. 4, Roosevelt Road, Taipei, 10617, Taiwan.
- Taiwan International Graduate Program on Sustainable Chemical Science and Technology (TIGP-SCST), No. 128, Sec. 2, Academia Road, Taipei, 11529, Taiwan.
| |
Collapse
|
33
|
Bai J, Mosbach S, Taylor CJ, Karan D, Lee KF, Rihm SD, Akroyd J, Lapkin AA, Kraft M. A dynamic knowledge graph approach to distributed self-driving laboratories. Nat Commun 2024; 15:462. [PMID: 38263405 PMCID: PMC10805810 DOI: 10.1038/s41467-023-44599-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/14/2023] [Accepted: 12/21/2023] [Indexed: 01/25/2024] Open
Abstract
The ability to integrate resources and share knowledge across organisations empowers scientists to expedite the scientific discovery process. This is especially crucial in addressing emerging global challenges that require global solutions. In this work, we develop an architecture for distributed self-driving laboratories within The World Avatar project, which seeks to create an all-encompassing digital twin based on a dynamic knowledge graph. We employ ontologies to capture data and material flows in design-make-test-analyse cycles, utilising autonomous agents as executable knowledge components to carry out the experimentation workflow. Data provenance is recorded to ensure its findability, accessibility, interoperability, and reusability. We demonstrate the practical application of our framework by linking two robots in Cambridge and Singapore for a collaborative closed-loop optimisation for a pharmaceutically-relevant aldol condensation reaction in real-time. The knowledge graph autonomously evolves toward the scientist's research goals, with the two robots effectively generating a Pareto front for cost-yield optimisation in three days.
Collapse
Affiliation(s)
- Jiaru Bai
- Department of Chemical Engineering and Biotechnology, University of Cambridge, Philippa Fawcett Drive, Cambridge, CB3 0AS, UK
| | - Sebastian Mosbach
- Department of Chemical Engineering and Biotechnology, University of Cambridge, Philippa Fawcett Drive, Cambridge, CB3 0AS, UK
- Cambridge Centre for Advanced Research and Education in Singapore (CARES), 1 Create Way, CREATE Tower, #05-05, Singapore, 138602, Singapore
| | - Connor J Taylor
- Astex Pharmaceuticals, 436 Cambridge Science Park Milton Road, Cambridge, CB4 0QA, UK
- Innovation Centre in Digital Molecular Technologies, Yusuf Hamied Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge, CB2 1EW, UK
- Faculty of Engineering, University of Nottingham, University Park, Nottingham, NG7 2RD, UK
| | - Dogancan Karan
- Cambridge Centre for Advanced Research and Education in Singapore (CARES), 1 Create Way, CREATE Tower, #05-05, Singapore, 138602, Singapore
| | - Kok Foong Lee
- CMCL Innovations, Sheraton House, Cambridge, CB3 0AX, UK
| | - Simon D Rihm
- Department of Chemical Engineering and Biotechnology, University of Cambridge, Philippa Fawcett Drive, Cambridge, CB3 0AS, UK
- Cambridge Centre for Advanced Research and Education in Singapore (CARES), 1 Create Way, CREATE Tower, #05-05, Singapore, 138602, Singapore
| | - Jethro Akroyd
- Department of Chemical Engineering and Biotechnology, University of Cambridge, Philippa Fawcett Drive, Cambridge, CB3 0AS, UK
- Cambridge Centre for Advanced Research and Education in Singapore (CARES), 1 Create Way, CREATE Tower, #05-05, Singapore, 138602, Singapore
| | - Alexei A Lapkin
- Department of Chemical Engineering and Biotechnology, University of Cambridge, Philippa Fawcett Drive, Cambridge, CB3 0AS, UK
- Cambridge Centre for Advanced Research and Education in Singapore (CARES), 1 Create Way, CREATE Tower, #05-05, Singapore, 138602, Singapore
- Innovation Centre in Digital Molecular Technologies, Yusuf Hamied Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge, CB2 1EW, UK
| | - Markus Kraft
- Department of Chemical Engineering and Biotechnology, University of Cambridge, Philippa Fawcett Drive, Cambridge, CB3 0AS, UK.
- Cambridge Centre for Advanced Research and Education in Singapore (CARES), 1 Create Way, CREATE Tower, #05-05, Singapore, 138602, Singapore.
- School of Chemical and Biomedical Engineering, Nanyang Technological University, 62 Nanyang Drive, 637459, Singapore, Singapore.
- The Alan Turing Institute, London, NW1 2DB, UK.
| |
Collapse
|
34
|
Back S, Aspuru-Guzik A, Ceriotti M, Gryn'ova G, Grzybowski B, Gu GH, Hein J, Hippalgaonkar K, Hormázabal R, Jung Y, Kim S, Kim WY, Moosavi SM, Noh J, Park C, Schrier J, Schwaller P, Tsuda K, Vegge T, von Lilienfeld OA, Walsh A. Accelerated chemical science with AI. DIGITAL DISCOVERY 2024; 3:23-33. [PMID: 38239898 PMCID: PMC10793638 DOI: 10.1039/d3dd00213f] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 10/25/2023] [Accepted: 12/06/2023] [Indexed: 01/22/2024]
Abstract
In light of the pressing need for practical materials and molecular solutions to renewable energy and health problems, to name just two examples, one wonders how to accelerate research and development in the chemical sciences, so as to address the time it takes to bring materials from initial discovery to commercialization. Artificial intelligence (AI)-based techniques, in particular, are having a transformative and accelerating impact on many if not most, technological domains. To shed light on these questions, the authors and participants gathered in person for the ASLLA Symposium on the theme of 'Accelerated Chemical Science with AI' at Gangneung, Republic of Korea. We present the findings, ideas, comments, and often contentious opinions expressed during four panel discussions related to the respective general topics: 'Data', 'New applications', 'Machine learning algorithms', and 'Education'. All discussions were recorded, transcribed into text using Open AI's Whisper, and summarized using LG AI Research's EXAONE LLM, followed by revision by all authors. For the broader benefit of current researchers, educators in higher education, and academic bodies such as associations, publishers, librarians, and companies, we provide chemistry-specific recommendations and summarize the resulting conclusions.
Collapse
Affiliation(s)
- Seoin Back
- Department of Chemical and Biomolecular Engineering, Institute of Emergent Materials, Sogang University Seoul Republic of Korea
| | - Alán Aspuru-Guzik
- Departments of Chemistry, Computer Science, University of Toronto St. George Campus Toronto ON Canada
- Acceleration Consortium and Vector Institute for Artificial Intelligence Toronto ON M5S 1M1 Canada
| | - Michele Ceriotti
- Laboratory of Computational Science and Modeling (COSMO), École Polytechnique Fédérale de Lausanne Lausanne Switzerland
| | - Ganna Gryn'ova
- Heidelberg Institute for Theoretical Studies (HITS gGmbH) 69118 Heidelberg Germany
- Interdisciplinary Center for Scientific Computing, Heidelberg University 69120 Heidelberg Germany
| | - Bartosz Grzybowski
- Center for Algorithmic and Robotized Synthesis (CARS), Institute for Basic Science (IBS) Ulsan Republic of Korea
- Institute of Organic Chemistry, Polish Academy of Sciences Warsaw Poland
- Department of Chemistry, Ulsan National Institute of Science and Technology Ulsan Republic of Korea
| | - Geun Ho Gu
- Department of Energy Engineering, Korea Institute of Energy Technology (KENTECH) Naju 58330 Republic of Korea
| | - Jason Hein
- Department of Chemistry, University of British Columbia Vancouver BC V6T 1Z1 Canada
| | - Kedar Hippalgaonkar
- School of Materials Science and Engineering, Nanyang Technological University 50 Nanyang Avenue Singapore 639798 Singapore
- Institute of Materials Research and Engineering, Agency for Science Technology and Research 2 Fusionopolis Way, 08-03 Singapore 138634 Singapore
| | | | - Yousung Jung
- Department of Chemical and Biomolecular Engineering, KAIST Daejeon Republic of Korea
- School of Chemical and Biological Engineering, Interdisciplinary Program in Artificial Intelligence, Seoul National University 1 Gwanak-ro, Gwanak-gu Seoul 08826 Republic of Korea
| | - Seonah Kim
- Department of Chemistry, Colorado State University 1301 Center Avenue Fort Collins CO 80523 USA
| | - Woo Youn Kim
- Department of Chemistry, KAIST Daejeon Republic of Korea
| | - Seyed Mohamad Moosavi
- Chemical Engineering & Applied Chemistry, University of Toronto Toronto Ontario M5S 3E5 Canada
| | - Juhwan Noh
- Chemical Data-Driven Research Center, Korea Research Institute of Chemical Technology Daejeon 34114 Republic of Korea
| | | | - Joshua Schrier
- Department of Chemistry, Fordham University The Bronx NY 10458 USA
| | - Philippe Schwaller
- Laboratory of Artificial Chemical Intelligence (LIAC) & National Centre of Competence in Research (NCCR) Catalysis, École Polytechnique Fédérale de Lausanne Lausanne Switzerland
| | - Koji Tsuda
- Graduate School of Frontier Sciences, The University of Tokyo Kashiwa Chiba 277-8561 Japan
- Center for Basic Research on Materials, National Institute for Materials Science Tsukuba Ibaraki 305-0044 Japan
- RIKEN Center for Advanced Intelligence Project Tokyo 103-0027 Japan
| | - Tejs Vegge
- Department of Energy Conversion and Storage, Technical University of Denmark 301 Anker Engelunds vej, Kongens Lyngby Copenhagen 2800 Denmark
| | - O Anatole von Lilienfeld
- Acceleration Consortium and Vector Institute for Artificial Intelligence Toronto ON M5S 1M1 Canada
- Departments of Chemistry, Materials Science and Engineering, and Physics, University of Toronto, St George Campus Toronto ON Canada
- Machine Learning Group, Technische Universität Berlin and Berlin Institute for the Foundations of Learning and Data 10587 Berlin Germany
| | - Aron Walsh
- Department of Materials, Imperial College London London SW7 2AZ UK
- Department of Physics, Ewha Women's University Seoul Republic of Korea
| |
Collapse
|
35
|
Voinarovska V, Kabeshov M, Dudenko D, Genheden S, Tetko IV. When Yield Prediction Does Not Yield Prediction: An Overview of the Current Challenges. J Chem Inf Model 2024; 64:42-56. [PMID: 38116926 PMCID: PMC10778086 DOI: 10.1021/acs.jcim.3c01524] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/21/2023] [Revised: 11/29/2023] [Accepted: 11/30/2023] [Indexed: 12/21/2023]
Abstract
Machine Learning (ML) techniques face significant challenges when predicting advanced chemical properties, such as yield, feasibility of chemical synthesis, and optimal reaction conditions. These challenges stem from the high-dimensional nature of the prediction task and the myriad essential variables involved, ranging from reactants and reagents to catalysts, temperature, and purification processes. Successfully developing a reliable predictive model not only holds the potential for optimizing high-throughput experiments but can also elevate existing retrosynthetic predictive approaches and bolster a plethora of applications within the field. In this review, we systematically evaluate the efficacy of current ML methodologies in chemoinformatics, shedding light on their milestones and inherent limitations. Additionally, a detailed examination of a representative case study provides insights into the prevailing issues related to data availability and transferability in the discipline.
Collapse
Affiliation(s)
- Varvara Voinarovska
- Molecular
AI, Discovery Sciences R&D, AstraZeneca, 431 83 Gothenburg, Sweden
- TUM
Graduate School, Faculty of Chemistry, Technical
University of Munich, 85748 Garching, Germany
| | - Mikhail Kabeshov
- Molecular
AI, Discovery Sciences R&D, AstraZeneca, 431 83 Gothenburg, Sweden
| | - Dmytro Dudenko
- Enamine
Ltd., 78 Chervonotkatska str., 02094 Kyiv, Ukraine
| | - Samuel Genheden
- Molecular
AI, Discovery Sciences R&D, AstraZeneca, 431 83 Gothenburg, Sweden
| | - Igor V. Tetko
- Molecular
Targets and Therapeutics Center, Helmholtz Munich − Deutsches
Forschungszentrum für Gesundheit und Umwelt (GmbH), Institute of Structural Biology, 85764 Neuherberg, Germany
| |
Collapse
|
36
|
Djoumbou-Feunang Y, Wilmot J, Kinney J, Chanda P, Yu P, Sader A, Sharifi M, Smith S, Ou J, Hu J, Shipp E, Tomandl D, Kumpatla SP. Cheminformatics and artificial intelligence for accelerating agrochemical discovery. Front Chem 2023; 11:1292027. [PMID: 38093816 PMCID: PMC10716421 DOI: 10.3389/fchem.2023.1292027] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/10/2023] [Accepted: 11/09/2023] [Indexed: 10/17/2024] Open
Abstract
The global cost-benefit analysis of pesticide use during the last 30 years has been characterized by a significant increase during the period from 1990 to 2007 followed by a decline. This observation can be attributed to several factors including, but not limited to, pest resistance, lack of novelty with respect to modes of action or classes of chemistry, and regulatory action. Due to current and projected increases of the global population, it is evident that the demand for food, and consequently, the usage of pesticides to improve yields will increase. Addressing these challenges and needs while promoting new crop protection agents through an increasingly stringent regulatory landscape requires the development and integration of infrastructures for innovative, cost- and time-effective discovery and development of novel and sustainable molecules. Significant advances in artificial intelligence (AI) and cheminformatics over the last two decades have improved the decision-making power of research scientists in the discovery of bioactive molecules. AI- and cheminformatics-driven molecule discovery offers the opportunity of moving experiments from the greenhouse to a virtual environment where thousands to billions of molecules can be investigated at a rapid pace, providing unbiased hypothesis for lead generation, optimization, and effective suggestions for compound synthesis and testing. To date, this is illustrated to a far lesser extent in the publicly available agrochemical research literature compared to drug discovery. In this review, we provide an overview of the crop protection discovery pipeline and how traditional, cheminformatics, and AI technologies can help to address the needs and challenges of agrochemical discovery towards rapidly developing novel and more sustainable products.
Collapse
Affiliation(s)
| | - Jeremy Wilmot
- Corteva Agriscience, Crop Protection Discovery and Development, Indianapolis, IN, United States
| | - John Kinney
- Corteva Agriscience, Farming Solutions and Digital, Indianapolis, IN, United States
| | - Pritam Chanda
- Corteva Agriscience, Farming Solutions and Digital, Indianapolis, IN, United States
| | - Pulan Yu
- Corteva Agriscience, Crop Protection Discovery and Development, Indianapolis, IN, United States
| | - Avery Sader
- Corteva Agriscience, Crop Protection Discovery and Development, Indianapolis, IN, United States
| | - Max Sharifi
- Corteva Agriscience, Regulatory and Stewardship, Indianapolis, IN, United States
| | - Scott Smith
- Corteva Agriscience, Farming Solutions and Digital, Indianapolis, IN, United States
| | - Junjun Ou
- Corteva Agriscience, Crop Protection Discovery and Development, Indianapolis, IN, United States
| | - Jie Hu
- Corteva Agriscience, Farming Solutions and Digital, Indianapolis, IN, United States
| | - Elizabeth Shipp
- Corteva Agriscience UK Limited, Regulation Innovation Center, Abingdon, United Kingdom
| | | | | |
Collapse
|
37
|
Nippa DF, Atz K, Müller AT, Wolfard J, Isert C, Binder M, Scheidegger O, Konrad DB, Grether U, Martin RE, Schneider G. Identifying opportunities for late-stage C-H alkylation with high-throughput experimentation and in silico reaction screening. Commun Chem 2023; 6:256. [PMID: 37985850 PMCID: PMC10661846 DOI: 10.1038/s42004-023-01047-5] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/08/2023] [Accepted: 10/30/2023] [Indexed: 11/22/2023] Open
Abstract
Enhancing the properties of advanced drug candidates is aided by the direct incorporation of specific chemical groups, avoiding the need to construct the entire compound from the ground up. Nevertheless, their chemical intricacy often poses challenges in predicting reactivity for C-H activation reactions and planning their synthesis. We adopted a reaction screening approach that combines high-throughput experimentation (HTE) at a nanomolar scale with computational graph neural networks (GNNs). This approach aims to identify suitable substrates for late-stage C-H alkylation using Minisci-type chemistry. GNNs were trained using experimentally generated reactions derived from in-house HTE and literature data. These trained models were then used to predict, in a forward-looking manner, the coupling of 3180 advanced heterocyclic building blocks with a diverse set of sp3-rich carboxylic acids. This predictive approach aimed to explore the substrate landscape for Minisci-type alkylations. Promising candidates were chosen, their production was scaled up, and they were subsequently isolated and characterized. This process led to the creation of 30 novel, functionally modified molecules that hold potential for further refinement. These results positively advocate the application of HTE-based machine learning to virtual reaction screening.
Collapse
Affiliation(s)
- David F Nippa
- Roche Pharma Research and Early Development (pRED), Roche Innovation Center Basel, F. Hoffmann-La Roche Ltd., Grenzacherstrasse 124, 4070, Basel, Switzerland
- Department of Pharmacy, Ludwig-Maximilians-Universität München, Butenandtstrasse 5, 81377, Munich, Germany
| | - Kenneth Atz
- Department of Chemistry and Applied Biosciences, ETH Zurich, Vladimir-Prelog-Weg 4, 8093, Zurich, Switzerland
| | - Alex T Müller
- Roche Pharma Research and Early Development (pRED), Roche Innovation Center Basel, F. Hoffmann-La Roche Ltd., Grenzacherstrasse 124, 4070, Basel, Switzerland
| | - Jens Wolfard
- Roche Pharma Research and Early Development (pRED), Roche Innovation Center Basel, F. Hoffmann-La Roche Ltd., Grenzacherstrasse 124, 4070, Basel, Switzerland
| | - Clemens Isert
- Department of Chemistry and Applied Biosciences, ETH Zurich, Vladimir-Prelog-Weg 4, 8093, Zurich, Switzerland
| | - Martin Binder
- Roche Pharma Research and Early Development (pRED), Roche Innovation Center Basel, F. Hoffmann-La Roche Ltd., Grenzacherstrasse 124, 4070, Basel, Switzerland
| | - Oliver Scheidegger
- Roche Pharma Research and Early Development (pRED), Roche Innovation Center Basel, F. Hoffmann-La Roche Ltd., Grenzacherstrasse 124, 4070, Basel, Switzerland
| | - David B Konrad
- Department of Pharmacy, Ludwig-Maximilians-Universität München, Butenandtstrasse 5, 81377, Munich, Germany.
| | - Uwe Grether
- Roche Pharma Research and Early Development (pRED), Roche Innovation Center Basel, F. Hoffmann-La Roche Ltd., Grenzacherstrasse 124, 4070, Basel, Switzerland.
| | - Rainer E Martin
- Roche Pharma Research and Early Development (pRED), Roche Innovation Center Basel, F. Hoffmann-La Roche Ltd., Grenzacherstrasse 124, 4070, Basel, Switzerland.
| | - Gisbert Schneider
- Department of Chemistry and Applied Biosciences, ETH Zurich, Vladimir-Prelog-Weg 4, 8093, Zurich, Switzerland.
| |
Collapse
|
38
|
Wang X, Hsieh CY, Yin X, Wang J, Li Y, Deng Y, Jiang D, Wu Z, Du H, Chen H, Li Y, Liu H, Wang Y, Luo P, Hou T, Yao X. Generic Interpretable Reaction Condition Predictions with Open Reaction Condition Datasets and Unsupervised Learning of Reaction Center. RESEARCH (WASHINGTON, D.C.) 2023; 6:0231. [PMID: 37849643 PMCID: PMC10578430 DOI: 10.34133/research.0231] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 06/01/2023] [Accepted: 08/29/2023] [Indexed: 10/19/2023]
Abstract
Effective synthesis planning powered by deep learning (DL) can significantly accelerate the discovery of new drugs and materials. However, most DL-assisted synthesis planning methods offer either none or very limited capability to recommend suitable reaction conditions (RCs) for their reaction predictions. Currently, the prediction of RCs with a DL framework is hindered by several factors, including: (a) lack of a standardized dataset for benchmarking, (b) lack of a general prediction model with powerful representation, and (c) lack of interpretability. To address these issues, we first created 2 standardized RC datasets covering a broad range of reaction classes and then proposed a powerful and interpretable Transformer-based RC predictor named Parrot. Through careful design of the model architecture, pretraining method, and training strategy, Parrot improved the overall top-3 prediction accuracy on catalysis, solvents, and other reagents by as much as 13.44%, compared to the best previous model on a newly curated dataset. Additionally, the mean absolute error of the predicted temperatures was reduced by about 4 °C. Furthermore, Parrot manifests strong generalization capacity with superior cross-chemical-space prediction accuracy. Attention analysis indicates that Parrot effectively captures crucial chemical information and exhibits a high level of interpretability in the prediction of RCs. The proposed model Parrot exemplifies how modern neural network architecture when appropriately pretrained can be versatile in making reliable, generalizable, and interpretable recommendation for RCs even when the underlying training dataset may still be limited in diversity.
Collapse
Affiliation(s)
- Xiaorui Wang
- Dr. Neher’s Biophysics Laboratory for Innovative Drug Discovery, State Key Laboratory of Quality Research in Chinese Medicine, Macau Institute for Applied Research in Medicine and Health,
Macau University of Science and Technology, Macao, 999078, China
- CarbonSilicon AI Technology Co.,
Ltd, Hangzhou, Zhejiang310018, China
| | - Chang-Yu Hsieh
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences,
Zhejiang University, Hangzhou, 310058, China
| | - Xiaodan Yin
- Dr. Neher’s Biophysics Laboratory for Innovative Drug Discovery, State Key Laboratory of Quality Research in Chinese Medicine, Macau Institute for Applied Research in Medicine and Health,
Macau University of Science and Technology, Macao, 999078, China
- CarbonSilicon AI Technology Co.,
Ltd, Hangzhou, Zhejiang310018, China
| | - Jike Wang
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences,
Zhejiang University, Hangzhou, 310058, China
- CarbonSilicon AI Technology Co.,
Ltd, Hangzhou, Zhejiang310018, China
| | - Yuquan Li
- College of Chemistry and Chemical Engineering,
Lanzhou University, Lanzhou, 730000, China
| | - Yafeng Deng
- CarbonSilicon AI Technology Co.,
Ltd, Hangzhou, Zhejiang310018, China
| | - Dejun Jiang
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences,
Zhejiang University, Hangzhou, 310058, China
- CarbonSilicon AI Technology Co.,
Ltd, Hangzhou, Zhejiang310018, China
| | - Zhenxing Wu
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences,
Zhejiang University, Hangzhou, 310058, China
- CarbonSilicon AI Technology Co.,
Ltd, Hangzhou, Zhejiang310018, China
| | - Hongyan Du
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences,
Zhejiang University, Hangzhou, 310058, China
| | - Hongming Chen
- Center of Chemistry and Chemical Biology,
Guangzhou Regenerative Medicine and Health Guangdong Laboratory, Guangzhou 510530, China
| | - Yun Li
- College of Chemistry and Chemical Engineering,
Lanzhou University, Lanzhou, 730000, China
| | - Huanxiang Liu
- Faculty of Applied Sciences,
Macao Polytechnic University, Macao, 999078, China
| | - Yuwei Wang
- College of Pharmacy,
Shaanxi University of Chinese Medicine, Xianyang, Shaanxi, 712044, China
| | - Pei Luo
- Dr. Neher’s Biophysics Laboratory for Innovative Drug Discovery, State Key Laboratory of Quality Research in Chinese Medicine, Macau Institute for Applied Research in Medicine and Health,
Macau University of Science and Technology, Macao, 999078, China
| | - Tingjun Hou
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences,
Zhejiang University, Hangzhou, 310058, China
| | - Xiaojun Yao
- Faculty of Applied Sciences,
Macao Polytechnic University, Macao, 999078, China
| |
Collapse
|
39
|
Jablonka KM, Ai Q, Al-Feghali A, Badhwar S, Bocarsly JD, Bran AM, Bringuier S, Brinson LC, Choudhary K, Circi D, Cox S, de Jong WA, Evans ML, Gastellu N, Genzling J, Gil MV, Gupta AK, Hong Z, Imran A, Kruschwitz S, Labarre A, Lála J, Liu T, Ma S, Majumdar S, Merz GW, Moitessier N, Moubarak E, Mouriño B, Pelkie B, Pieler M, Ramos MC, Ranković B, Rodriques SG, Sanders JN, Schwaller P, Schwarting M, Shi J, Smit B, Smith BE, Van Herck J, Völker C, Ward L, Warren S, Weiser B, Zhang S, Zhang X, Zia GA, Scourtas A, Schmidt KJ, Foster I, White AD, Blaiszik B. 14 examples of how LLMs can transform materials science and chemistry: a reflection on a large language model hackathon. DIGITAL DISCOVERY 2023; 2:1233-1250. [PMID: 38013906 PMCID: PMC10561547 DOI: 10.1039/d3dd00113j] [Citation(s) in RCA: 15] [Impact Index Per Article: 15.0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 06/12/2023] [Accepted: 08/08/2023] [Indexed: 11/04/2023]
Abstract
Large-language models (LLMs) such as GPT-4 caught the interest of many scientists. Recent studies suggested that these models could be useful in chemistry and materials science. To explore these possibilities, we organized a hackathon. This article chronicles the projects built as part of this hackathon. Participants employed LLMs for various applications, including predicting properties of molecules and materials, designing novel interfaces for tools, extracting knowledge from unstructured data, and developing new educational applications. The diverse topics and the fact that working prototypes could be generated in less than two days highlight that LLMs will profoundly impact the future of our fields. The rich collection of ideas and projects also indicates that the applications of LLMs are not limited to materials science and chemistry but offer potential benefits to a wide range of scientific disciplines.
Collapse
Affiliation(s)
- Kevin Maik Jablonka
- Laboratory of Molecular Simulation (LSMO), Institut des Sciences et Ingénierie Chimiques, Ecole Polytechnique Fédérale de Lausanne (EPFL) Sion Valais Switzerland
| | - Qianxiang Ai
- Department of Chemical Engineering, Massachusetts Institute of Technology Cambridge Massachusetts 02139 USA
| | | | | | - Joshua D Bocarsly
- Yusuf Hamied Department of Chemistry, University of Cambridge Lensfield Road Cambridge CB2 1EW UK
| | - Andres M Bran
- Laboratory of Artificial Chemical Intelligence (LIAC), Institut des Sciences et Ingénierie Chimiques, Ecole Polytechnique Fédérale de Lausanne (EPFL) Lausanne Switzerland
- National Centre of Competence in Research (NCCR) Catalysis, Ecole Polytechnique Fédérale de Lausanne (EPFL) Lausanne Switzerland
| | | | | | - Kamal Choudhary
- Material Measurement Laboratory, National Institute of Standards and Technology Maryland 20899 USA
| | - Defne Circi
- Mechanical Engineering and Materials Science, Duke University USA
| | - Sam Cox
- Department of Chemical Engineering, University of Rochester USA
| | - Wibe A de Jong
- Applied Mathematics and Computational Research Division, Lawrence Berkeley National Laboratory Berkeley CA 94720 USA
| | - Matthew L Evans
- Institut de la Matière Condensée et des Nanosciences (IMCN), UCLouvain Chemin des Étoiles 8 Louvain-la-Neuve 1348 Belgium
- Matgenix SRL 185 Rue Armand Bury 6534 Gozée Belgium
| | - Nicolas Gastellu
- Department of Chemistry, McGill University Montreal Quebec Canada
| | - Jerome Genzling
- Department of Chemistry, McGill University Montreal Quebec Canada
| | - María Victoria Gil
- Instituto de Ciencia y Tecnología del Carbono (INCAR), CSIC Francisco Pintado Fe 26 33011 Oviedo Spain
| | - Ankur K Gupta
- Applied Mathematics and Computational Research Division, Lawrence Berkeley National Laboratory Berkeley CA 94720 USA
| | - Zhi Hong
- Department of Computer Science, University of Chicago Chicago Illinois 60637 USA
| | - Alishba Imran
- Computer Science, University of California Berkeley CA 94704 USA
| | - Sabine Kruschwitz
- Bundesanstalt für Materialforschung und -prüfung Unter den Eichen 87 12205 Berlin Germany
| | - Anne Labarre
- Department of Chemistry, McGill University Montreal Quebec Canada
| | - Jakub Lála
- Francis Crick Institute 1 Midland Rd London NW1 1AT UK
| | - Tao Liu
- Department of Chemistry, McGill University Montreal Quebec Canada
| | - Steven Ma
- Department of Chemistry, McGill University Montreal Quebec Canada
| | - Sauradeep Majumdar
- Laboratory of Molecular Simulation (LSMO), Institut des Sciences et Ingénierie Chimiques, Ecole Polytechnique Fédérale de Lausanne (EPFL) Sion Valais Switzerland
| | - Garrett W Merz
- American Family Insurance Data Science Institute, University of Wisconsin-Madison Madison WI 53706 USA
| | | | - Elias Moubarak
- Laboratory of Molecular Simulation (LSMO), Institut des Sciences et Ingénierie Chimiques, Ecole Polytechnique Fédérale de Lausanne (EPFL) Sion Valais Switzerland
| | - Beatriz Mouriño
- Laboratory of Molecular Simulation (LSMO), Institut des Sciences et Ingénierie Chimiques, Ecole Polytechnique Fédérale de Lausanne (EPFL) Sion Valais Switzerland
| | - Brenden Pelkie
- Department of Chemical Engineering, University of Washington Seattle WA 98105 USA
| | | | | | - Bojana Ranković
- Laboratory of Artificial Chemical Intelligence (LIAC), Institut des Sciences et Ingénierie Chimiques, Ecole Polytechnique Fédérale de Lausanne (EPFL) Lausanne Switzerland
- National Centre of Competence in Research (NCCR) Catalysis, Ecole Polytechnique Fédérale de Lausanne (EPFL) Lausanne Switzerland
| | | | - Jacob N Sanders
- Department of Chemistry and Biochemistry, University of California Los Angeles CA 90095 USA
| | - Philippe Schwaller
- Laboratory of Artificial Chemical Intelligence (LIAC), Institut des Sciences et Ingénierie Chimiques, Ecole Polytechnique Fédérale de Lausanne (EPFL) Lausanne Switzerland
- National Centre of Competence in Research (NCCR) Catalysis, Ecole Polytechnique Fédérale de Lausanne (EPFL) Lausanne Switzerland
| | - Marcus Schwarting
- Department of Computer Science, University of Chicago Chicago IL 60490 USA
| | - Jiale Shi
- Department of Chemical Engineering, Massachusetts Institute of Technology Cambridge Massachusetts 02139 USA
| | - Berend Smit
- Laboratory of Molecular Simulation (LSMO), Institut des Sciences et Ingénierie Chimiques, Ecole Polytechnique Fédérale de Lausanne (EPFL) Sion Valais Switzerland
| | - Ben E Smith
- Yusuf Hamied Department of Chemistry, University of Cambridge Lensfield Road Cambridge CB2 1EW UK
| | - Joren Van Herck
- Laboratory of Molecular Simulation (LSMO), Institut des Sciences et Ingénierie Chimiques, Ecole Polytechnique Fédérale de Lausanne (EPFL) Sion Valais Switzerland
| | - Christoph Völker
- Bundesanstalt für Materialforschung und -prüfung Unter den Eichen 87 12205 Berlin Germany
| | - Logan Ward
- Data Science and Learning Division, Argonne National Lab USA
| | - Sean Warren
- Department of Chemistry, McGill University Montreal Quebec Canada
| | - Benjamin Weiser
- Department of Chemistry, McGill University Montreal Quebec Canada
| | - Sylvester Zhang
- Department of Chemistry, McGill University Montreal Quebec Canada
| | - Xiaoqi Zhang
- Laboratory of Molecular Simulation (LSMO), Institut des Sciences et Ingénierie Chimiques, Ecole Polytechnique Fédérale de Lausanne (EPFL) Sion Valais Switzerland
| | - Ghezal Ahmad Zia
- Bundesanstalt für Materialforschung und -prüfung Unter den Eichen 87 12205 Berlin Germany
| | - Aristana Scourtas
- Globus, University of Chicago, Data Science and Learning Division, Argonne National Lab USA
| | - K J Schmidt
- Globus, University of Chicago, Data Science and Learning Division, Argonne National Lab USA
| | - Ian Foster
- Department of Computer Science, University of Chicago, Data Science and Learning Division, Argonne National Lab USA
| | - Andrew D White
- Department of Chemical Engineering, University of Rochester USA
| | - Ben Blaiszik
- Globus, University of Chicago, Data Science and Learning Division, Argonne National Lab USA
| |
Collapse
|
40
|
Reid JP, Betinol IO, Kuang Y. Mechanism to model: a physical organic chemistry approach to reaction prediction. Chem Commun (Camb) 2023; 59:10711-10721. [PMID: 37552047 DOI: 10.1039/d3cc03229a] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 08/09/2023]
Abstract
The application of mechanistic generalizations is at the core of chemical reaction development and application. These strategies are rooted in physical organic chemistry where mechanistic understandings can be derived from one reaction and applied to explain another. Over time these techniques have evolved from rationalizing observed outcomes to leading experimental design through reaction prediction. In parallel, significant progression in asymmetric organocatalysis has expanded the reach of chiral transfer to new reactions with increased efficiency. However, the complex and diverse catalyst structures applied in this arena have rendered the generalization of asymmetric catalytic processes to be exceptionally challenging. Recognizing this, a portion of our research has been focused on understanding the transferability of chemical observations between similar reactions and exploiting this phenomenon as a platform for prediction. Through these experiences, we have relied on a working knowledge of reaction mechanism to guide the development and application of our models which have been advanced from simple qualitative rules to large statistical models for quantitative predictions. In this feature article, we describe the models acquired to generalize organocatalytic reaction mechanisms and demonstrate their use as a powerful approach for accelerating enantioselective synthesis.
Collapse
Affiliation(s)
- Jolene P Reid
- Department of Chemistry, University of British Columbia, 2036 Main Mall, Vancouver, British Columbia, V6T 1Z1, Canada.
| | - Isaiah O Betinol
- Department of Chemistry, University of British Columbia, 2036 Main Mall, Vancouver, British Columbia, V6T 1Z1, Canada.
| | - Yutao Kuang
- Department of Chemistry, University of British Columbia, 2036 Main Mall, Vancouver, British Columbia, V6T 1Z1, Canada.
| |
Collapse
|
41
|
Rajan K, Brinkhaus HO, Agea MI, Zielesny A, Steinbeck C. DECIMER.ai: an open platform for automated optical chemical structure identification, segmentation and recognition in scientific publications. Nat Commun 2023; 14:5045. [PMID: 37598180 PMCID: PMC10439916 DOI: 10.1038/s41467-023-40782-0] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/07/2023] [Accepted: 08/09/2023] [Indexed: 08/21/2023] Open
Abstract
The number of publications describing chemical structures has increased steadily over the last decades. However, the majority of published chemical information is currently not available in machine-readable form in public databases. It remains a challenge to automate the process of information extraction in a way that requires less manual intervention - especially the mining of chemical structure depictions. As an open-source platform that leverages recent advancements in deep learning, computer vision, and natural language processing, DECIMER.ai (Deep lEarning for Chemical IMagE Recognition) strives to automatically segment, classify, and translate chemical structure depictions from the printed literature. The segmentation and classification tools are the only openly available packages of their kind, and the optical chemical structure recognition (OCSR) core application yields outstanding performance on all benchmark datasets. The source code, the trained models and the datasets developed in this work have been published under permissive licences. An instance of the DECIMER web application is available at https://decimer.ai .
Collapse
Affiliation(s)
- Kohulan Rajan
- Institute for Inorganic and Analytical Chemistry, Friedrich Schiller University Jena, Lessingstr. 8, 07743, Jena, Germany
| | - Henning Otto Brinkhaus
- Institute for Inorganic and Analytical Chemistry, Friedrich Schiller University Jena, Lessingstr. 8, 07743, Jena, Germany
| | - M Isabel Agea
- Department of Informatics and Chemistry, Faculty of Chemical Technology, University of Chemistry and Technology Prague, Technicka 5, 166 28, Prague, Czech Republic
| | - Achim Zielesny
- Institute for Bioinformatics and Chemoinformatics, Westphalian University of Applied Sciences, August-Schmidt-Ring 10, 45665, Recklinghausen, Germany
| | - Christoph Steinbeck
- Institute for Inorganic and Analytical Chemistry, Friedrich Schiller University Jena, Lessingstr. 8, 07743, Jena, Germany.
| |
Collapse
|
42
|
Li B, Su S, Zhu C, Lin J, Hu X, Su L, Yu Z, Liao K, Chen H. A deep learning framework for accurate reaction prediction and its application on high-throughput experimentation data. J Cheminform 2023; 15:72. [PMID: 37568183 PMCID: PMC10422736 DOI: 10.1186/s13321-023-00732-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/20/2022] [Accepted: 06/30/2023] [Indexed: 08/13/2023] Open
Abstract
In recent years, it has been seen that artificial intelligence (AI) starts to bring revolutionary changes to chemical synthesis. However, the lack of suitable ways of representing chemical reactions and the scarceness of reaction data has limited the wider application of AI to reaction prediction. Here, we introduce a novel reaction representation, GraphRXN, for reaction prediction. It utilizes a universal graph-based neural network framework to encode chemical reactions by directly taking two-dimension reaction structures as inputs. The GraphRXN model was evaluated by three publically available chemical reaction datasets and gave on-par or superior results compared with other baseline models. To further evaluate the effectiveness of GraphRXN, wet-lab experiments were carried out for the purpose of generating reaction data. GraphRXN model was then built on high-throughput experimentation data and a decent accuracy (R2 of 0.712) was obtained on our in-house data. This highlights that the GraphRXN model can be deployed in an integrated workflow which combines robotics and AI technologies for forward reaction prediction.
Collapse
Affiliation(s)
- Baiqing Li
- Guangzhou Laboratory, Guangzhou, 510005, Guangdong, China
| | - Shimin Su
- Guangzhou Laboratory, Guangzhou, 510005, Guangdong, China
| | - Chan Zhu
- Guangzhou Laboratory, Guangzhou, 510005, Guangdong, China
| | - Jie Lin
- Guangzhou Laboratory, Guangzhou, 510005, Guangdong, China
| | - Xinyue Hu
- Guangzhou Laboratory, Guangzhou, 510005, Guangdong, China
| | - Lebin Su
- Guangzhou Laboratory, Guangzhou, 510005, Guangdong, China
| | - Zhunzhun Yu
- Guangzhou Laboratory, Guangzhou, 510005, Guangdong, China
| | - Kuangbiao Liao
- Guangzhou Laboratory, Guangzhou, 510005, Guangdong, China.
| | - Hongming Chen
- Guangzhou Laboratory, Guangzhou, 510005, Guangdong, China.
| |
Collapse
|
43
|
Wang H, Fu T, Du Y, Gao W, Huang K, Liu Z, Chandak P, Liu S, Van Katwyk P, Deac A, Anandkumar A, Bergen K, Gomes CP, Ho S, Kohli P, Lasenby J, Leskovec J, Liu TY, Manrai A, Marks D, Ramsundar B, Song L, Sun J, Tang J, Veličković P, Welling M, Zhang L, Coley CW, Bengio Y, Zitnik M. Scientific discovery in the age of artificial intelligence. Nature 2023; 620:47-60. [PMID: 37532811 DOI: 10.1038/s41586-023-06221-2] [Citation(s) in RCA: 147] [Impact Index Per Article: 147.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/30/2022] [Accepted: 05/16/2023] [Indexed: 08/04/2023]
Abstract
Artificial intelligence (AI) is being increasingly integrated into scientific discovery to augment and accelerate research, helping scientists to generate hypotheses, design experiments, collect and interpret large datasets, and gain insights that might not have been possible using traditional scientific methods alone. Here we examine breakthroughs over the past decade that include self-supervised learning, which allows models to be trained on vast amounts of unlabelled data, and geometric deep learning, which leverages knowledge about the structure of scientific data to enhance model accuracy and efficiency. Generative AI methods can create designs, such as small-molecule drugs and proteins, by analysing diverse data modalities, including images and sequences. We discuss how these methods can help scientists throughout the scientific process and the central issues that remain despite such advances. Both developers and users of AI toolsneed a better understanding of when such approaches need improvement, and challenges posed by poor data quality and stewardship remain. These issues cut across scientific disciplines and require developing foundational algorithmic approaches that can contribute to scientific understanding or acquire it autonomously, making them critical areas of focus for AI innovation.
Collapse
Affiliation(s)
- Hanchen Wang
- Department of Engineering, University of Cambridge, Cambridge, UK
- Department of Computing and Mathematical Sciences, California Institute of Technology, Pasadena, CA, USA
- Department of Research and Early Development, Genentech Inc, South San Francisco, CA, USA
- Department of Computer Science, Stanford University, Stanford, CA, USA
| | - Tianfan Fu
- Department of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, GA, USA
| | - Yuanqi Du
- Department of Computer Science, Cornell University, Ithaca, NY, USA
| | - Wenhao Gao
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA, USA
| | - Kexin Huang
- Department of Computer Science, Stanford University, Stanford, CA, USA
| | - Ziming Liu
- Department of Physics, Massachusetts Institute of Technology, Cambridge, MA, USA
| | - Payal Chandak
- Harvard-MIT Program in Health Sciences and Technology, Cambridge, MA, USA
| | - Shengchao Liu
- Mila - Quebec AI Institute, Montreal, Quebec, Canada
- Université de Montréal, Montreal, Quebec, Canada
| | - Peter Van Katwyk
- Department of Earth, Environmental and Planetary Sciences, Brown University, Providence, RI, USA
- Data Science Institute, Brown University, Providence, RI, USA
| | - Andreea Deac
- Mila - Quebec AI Institute, Montreal, Quebec, Canada
- Université de Montréal, Montreal, Quebec, Canada
| | - Anima Anandkumar
- Department of Computing and Mathematical Sciences, California Institute of Technology, Pasadena, CA, USA
- NVIDIA, Santa Clara, CA, USA
| | - Karianne Bergen
- Department of Earth, Environmental and Planetary Sciences, Brown University, Providence, RI, USA
- Data Science Institute, Brown University, Providence, RI, USA
| | - Carla P Gomes
- Department of Computer Science, Cornell University, Ithaca, NY, USA
| | - Shirley Ho
- Center for Computational Astrophysics, Flatiron Institute, New York, NY, USA
- Department of Astrophysical Sciences, Princeton University, Princeton, NJ, USA
- Department of Physics, Carnegie Mellon University, Pittsburgh, PA, USA
- Department of Physics and Center for Data Science, New York University, New York, NY, USA
| | | | - Joan Lasenby
- Department of Engineering, University of Cambridge, Cambridge, UK
| | - Jure Leskovec
- Department of Computer Science, Stanford University, Stanford, CA, USA
| | | | - Arjun Manrai
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
| | - Debora Marks
- Department of Systems Biology, Harvard Medical School, Boston, MA, USA
- Broad Institute of MIT and Harvard, Cambridge, MA, USA
| | | | - Le Song
- BioMap, Beijing, China
- Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, United Arab Emirates
| | - Jimeng Sun
- University of Illinois at Urbana-Champaign, Champaign, IL, USA
| | - Jian Tang
- Mila - Quebec AI Institute, Montreal, Quebec, Canada
- HEC Montréal, Montreal, Quebec, Canada
- CIFAR AI Chair, Toronto, Ontario, Canada
| | - Petar Veličković
- Google DeepMind, London, UK
- Department of Computer Science and Technology, University of Cambridge, Cambridge, UK
| | - Max Welling
- University of Amsterdam, Amsterdam, Netherlands
- Microsoft Research Amsterdam, Amsterdam, Netherlands
| | - Linfeng Zhang
- DP Technology, Beijing, China
- AI for Science Institute, Beijing, China
| | - Connor W Coley
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA, USA
- Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA, USA
| | - Yoshua Bengio
- Mila - Quebec AI Institute, Montreal, Quebec, Canada
- Université de Montréal, Montreal, Quebec, Canada
| | - Marinka Zitnik
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA.
- Broad Institute of MIT and Harvard, Cambridge, MA, USA.
- Harvard Data Science Initiative, Cambridge, MA, USA.
- Kempner Institute for the Study of Natural and Artificial Intelligence, Harvard University, Cambridge, MA, USA.
| |
Collapse
|
44
|
Mahjour B, Zhang R, Shen Y, McGrath A, Zhao R, Mohamed OG, Lin Y, Zhang Z, Douthwaite JL, Tripathi A, Cernak T. Rapid planning and analysis of high-throughput experiment arrays for reaction discovery. Nat Commun 2023; 14:3924. [PMID: 37400469 DOI: 10.1038/s41467-023-39531-0] [Citation(s) in RCA: 11] [Impact Index Per Article: 11.0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/22/2023] [Accepted: 06/13/2023] [Indexed: 07/05/2023] Open
Abstract
High-throughput experimentation (HTE) is an increasingly important tool in reaction discovery. While the hardware for running HTE in the chemical laboratory has evolved significantly in recent years, there remains a need for software solutions to navigate data-rich experiments. Here we have developed phactor™, a software that facilitates the performance and analysis of HTE in a chemical laboratory. phactor™ allows experimentalists to rapidly design arrays of chemical reactions or direct-to-biology experiments in 24, 96, 384, or 1,536 wellplates. Users can access online reagent data, such as a chemical inventory, to virtually populate wells with experiments and produce instructions to perform the reaction array manually, or with the assistance of a liquid handling robot. After completion of the reaction array, analytical results can be uploaded for facile evaluation, and to guide the next series of experiments. All chemical data, metadata, and results are stored in machine-readable formats that are readily translatable to various software. We also demonstrate the use of phactor™ in the discovery of several chemistries, including the identification of a low micromolar inhibitor of the SARS-CoV-2 main protease. Furthermore, phactor™ has been made available for free academic use in 24- and 96-well formats via an online interface.
Collapse
Affiliation(s)
- Babak Mahjour
- Department of Medicinal Chemistry, University of Michigan, Ann Arbor, MI, USA
| | - Rui Zhang
- Department of Chemistry, University of Michigan, Ann Arbor, MI, USA
| | - Yuning Shen
- Department of Medicinal Chemistry, University of Michigan, Ann Arbor, MI, USA
| | - Andrew McGrath
- Department of Medicinal Chemistry, University of Michigan, Ann Arbor, MI, USA
| | - Ruheng Zhao
- Department of Medicinal Chemistry, University of Michigan, Ann Arbor, MI, USA
| | - Osama G Mohamed
- Natural Products Discovery Core, Life Sciences Institute, University of Michigan, Ann Arbor, MI, USA
| | - Yingfu Lin
- Department of Medicinal Chemistry, University of Michigan, Ann Arbor, MI, USA
| | - Zirong Zhang
- Department of Medicinal Chemistry, University of Michigan, Ann Arbor, MI, USA
| | - James L Douthwaite
- Department of Medicinal Chemistry, University of Michigan, Ann Arbor, MI, USA
| | - Ashootosh Tripathi
- Department of Medicinal Chemistry, University of Michigan, Ann Arbor, MI, USA
- Natural Products Discovery Core, Life Sciences Institute, University of Michigan, Ann Arbor, MI, USA
| | - Tim Cernak
- Department of Medicinal Chemistry, University of Michigan, Ann Arbor, MI, USA.
- Department of Chemistry, University of Michigan, Ann Arbor, MI, USA.
| |
Collapse
|
45
|
Shim E, Tewari A, Cernak T, Zimmerman PM. Machine Learning Strategies for Reaction Development: Toward the Low-Data Limit. J Chem Inf Model 2023; 63:3659-3668. [PMID: 37312524 PMCID: PMC11163943 DOI: 10.1021/acs.jcim.3c00577] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 06/15/2023]
Abstract
Machine learning models are increasingly being utilized to predict outcomes of organic chemical reactions. A large amount of reaction data is used to train these models, which is in stark contrast to how expert chemists discover and develop new reactions by leveraging information from a small number of relevant transformations. Transfer learning and active learning are two strategies that can operate in low-data situations, which may help fill this gap and promote the use of machine learning for tackling real-world challenges in organic synthesis. This Perspective introduces active and transfer learning and connects these to potential opportunities and directions for further research, especially in the area of prospective development of chemical transformations.
Collapse
Affiliation(s)
- Eunjae Shim
- Department of Chemistry, University of Michigan, Ann Arbor, Michigan 48109, United States
| | - Ambuj Tewari
- Department of Statistics, University of Michigan, Ann Arbor, Michigan 48109, United States
- Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, Michigan 48109, United States
| | - Tim Cernak
- Department of Chemistry, University of Michigan, Ann Arbor, Michigan 48109, United States
- Department of Medicinal Chemistry, University of Michigan, Ann Arbor, Michigan 48109, United States
| | - Paul M Zimmerman
- Department of Chemistry, University of Michigan, Ann Arbor, Michigan 48109, United States
| |
Collapse
|
46
|
Kraka E, Antonio JJ, Freindorf M. Reaction mechanism - explored with the unified reaction valley approach. Chem Commun (Camb) 2023; 59:7151-7165. [PMID: 37233449 DOI: 10.1039/d3cc01576a] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 05/27/2023]
Abstract
One of the ultimate goals of chemistry is to understand and manipulate chemical reactions, which implies the ability to monitor the reaction and its underlying mechanism at an atomic scale. In this article, we introduce the Unified Reaction Valley Approach (URVA) as a tool for elucidating reaction mechanisms, complementing existing computational procedures. URVA combines the concept of the potential energy surface with vibrational spectroscopy and describes a chemical reaction via the reaction path and the surrounding reaction valley traced out by the reacting species on the potential energy surface on their way from the entrance to the exit channel, where the products are located. The key feature of URVA is the focus on the curving of the reaction path. Moving along the reaction path, any electronic structure change of the reacting species is registered by a change in the normal vibrational modes spanning the reaction valley and their coupling with the path, which recovers the curvature of the reaction path. This leads to a unique curvature profile for each chemical reaction, with curvature minima reflecting minimal change and curvature maxima indicating the location of important chemical events such as bond breaking/formation, charge polarization and transfer, rehybridization, etc. A decomposition of the path curvature into internal coordinate components or other coordinates of relevance for the reaction under consideration, provides comprehensive insight into the origin of the chemical changes taking place. After giving an overview of current experimental and computational efforts to gain insight into the mechanism of a chemical reaction and presenting the theoretical background of URVA, we illustrate how URVA works for three diverse processes, (i) [1,3] hydrogen transfer reactions; (ii) α-keto-amino inhibitor for SARS-CoV-2 Mpro; (iii) Rh-catalyzed cyanation. We hope that this article will inspire our computational colleagues to add URVA to their repertoire and will serve as an incubator for new reaction mechanisms to be studied in collaboration with our experimental experts in the field.
Collapse
Affiliation(s)
- Elfi Kraka
- Computational and Theoretical Chemistry Group (CATCO), Department of Chemistry, Southern Methodist University, 3215 Daniel Ave, Dallas, TX 75275-0314, USA.
| | - Juliana J Antonio
- Computational and Theoretical Chemistry Group (CATCO), Department of Chemistry, Southern Methodist University, 3215 Daniel Ave, Dallas, TX 75275-0314, USA.
| | - Marek Freindorf
- Computational and Theoretical Chemistry Group (CATCO), Department of Chemistry, Southern Methodist University, 3215 Daniel Ave, Dallas, TX 75275-0314, USA.
| |
Collapse
|
47
|
Chen Z, Ayinde OR, Fuchs JR, Sun H, Ning X. G 2Retro as a two-step graph generative models for retrosynthesis prediction. Commun Chem 2023; 6:102. [PMID: 37253928 DOI: 10.1038/s42004-023-00897-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/06/2022] [Accepted: 05/04/2023] [Indexed: 06/01/2023] Open
Abstract
Retrosynthesis is a procedure where a target molecule is transformed into potential reactants and thus the synthesis routes can be identified. Recently, computational approaches have been developed to accelerate the design of synthesis routes. In this paper,we develop a generative framework G2Retro for one-step retrosynthesis prediction. G2Retro imitates the reversed logic of synthetic reactions. It first predicts the reaction centers in the target molecules (products), identifies the synthons needed to assemble the products, and transforms these synthons into reactants. G2Retro defines a comprehensive set of reaction center types, and learns from the molecular graphs of the products to predict potential reaction centers. To complete synthons into reactants, G2Retro considers all the involved synthon structures and the product structures to identify the optimal completion paths, and accordingly attaches small substructures sequentially to the synthons. Here we show that G2Retro is able to better predict the reactants for given products in the benchmark dataset than the state-of-the-art methods.
Collapse
Affiliation(s)
- Ziqi Chen
- Computer Science and Engineering, The Ohio State University, Columbus, OH, 43210, USA
| | - Oluwatosin R Ayinde
- Medicinal Chemistry and Pharmacognosy, College of Pharmacy, The Ohio State University, Columbus, OH, 43210, USA
| | - James R Fuchs
- Medicinal Chemistry and Pharmacognosy, College of Pharmacy, The Ohio State University, Columbus, OH, 43210, USA
| | - Huan Sun
- Computer Science and Engineering, The Ohio State University, Columbus, OH, 43210, USA
- Translational Data Analytics Institute, The Ohio State University, Columbus, OH, 43210, USA
| | - Xia Ning
- Computer Science and Engineering, The Ohio State University, Columbus, OH, 43210, USA.
- Translational Data Analytics Institute, The Ohio State University, Columbus, OH, 43210, USA.
- Biomedical Informatics, The Ohio State University, Columbus, OH, 43210, USA.
| |
Collapse
|
48
|
Williams JD, Kappe CO. Self-Optimizing Flow Reactors Get a Boost by Multitasking. ACS CENTRAL SCIENCE 2023; 9:864-866. [PMID: 37252365 PMCID: PMC10214518 DOI: 10.1021/acscentsci.3c00548] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Grants] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Indexed: 05/31/2023]
|
49
|
Saebi M, Nan B, Herr JE, Wahlers J, Guo Z, Zurański AM, Kogej T, Norrby PO, Doyle AG, Chawla NV, Wiest O. On the use of real-world datasets for reaction yield prediction. Chem Sci 2023; 14:4997-5005. [PMID: 37206399 PMCID: PMC10189898 DOI: 10.1039/d2sc06041h] [Citation(s) in RCA: 19] [Impact Index Per Article: 19.0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/01/2022] [Accepted: 03/09/2023] [Indexed: 09/30/2023] Open
Abstract
The lack of publicly available, large, and unbiased datasets is a key bottleneck for the application of machine learning (ML) methods in synthetic chemistry. Data from electronic laboratory notebooks (ELNs) could provide less biased, large datasets, but no such datasets have been made publicly available. The first real-world dataset from the ELNs of a large pharmaceutical company is disclosed and its relationship to high-throughput experimentation (HTE) datasets is described. For chemical yield predictions, a key task in chemical synthesis, an attributed graph neural network (AGNN) performs as well as or better than the best previous models on two HTE datasets for the Suzuki-Miyaura and Buchwald-Hartwig reactions. However, training the AGNN on an ELN dataset does not lead to a predictive model. The implications of using ELN data for training ML-based models are discussed in the context of yield predictions.
Collapse
Affiliation(s)
- Mandana Saebi
- Department of Computer Science and Engineering and Lucy Family Institute for Data and Society, University of Notre Dame Notre Dame IN 46556 USA
| | - Bozhao Nan
- Department of Chemistry and Biochemistry, University of Notre Dame Notre Dame IN 46556 USA
| | - John E Herr
- Department of Chemistry and Biochemistry, University of Notre Dame Notre Dame IN 46556 USA
| | - Jessica Wahlers
- Department of Chemistry and Biochemistry, University of Notre Dame Notre Dame IN 46556 USA
| | - Zhichun Guo
- Department of Computer Science and Engineering and Lucy Family Institute for Data and Society, University of Notre Dame Notre Dame IN 46556 USA
| | - Andrzej M Zurański
- Department of Chemistry, Princeton University Princeton New Jersey 08544 USA
| | - Thierry Kogej
- Molecular AI, Discovery Sciences, R&D, AstraZeneca Pepparedsleden 1, SE-431 83 Mölndal Gothenburg Sweden
| | - Per-Ola Norrby
- Data Science and Modelling, Pharmaceutical Sciences, R&D, AstraZeneca Pepparedsleden 1, SE-431 83 Mölndal Gothenburg Sweden
| | - Abigail G Doyle
- Department of Chemistry, Princeton University Princeton New Jersey 08544 USA
- Department of Chemistry and Biochemistry, University of California Los Angeles California 90095 USA
| | - Nitesh V Chawla
- Department of Computer Science and Engineering and Lucy Family Institute for Data and Society, University of Notre Dame Notre Dame IN 46556 USA
| | - Olaf Wiest
- Department of Chemistry and Biochemistry, University of Notre Dame Notre Dame IN 46556 USA
| |
Collapse
|
50
|
Boobier S, Davies JC, Derbenev IN, Handley CM, Hirst JD. AI4Green: An Open-Source ELN for Green and Sustainable Chemistry. J Chem Inf Model 2023; 63:2895-2901. [PMID: 37155346 DOI: 10.1021/acs.jcim.3c00306] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 05/10/2023]
Abstract
An Electronic Laboratory Notebook (ELN) combining features, including data archival, collaboration tools, and green and sustainability metrics for organic chemistry, is presented. AI4Green is a web-based application, available as open-source code and free to use. It offers the core functionality of an ELN, namely, the ability to store reactions securely and share them among different members of a research team. As users plan their reactions and record them in the ELN, green and sustainable chemistry is encouraged by automatically calculating green metrics and color-coding hazards, solvents, and reaction conditions. The interface links a database constructed from data extracted from PubChem, enabling the automatic collation of information for reactions. The application's design facilitates the development of auxiliary sustainability applications, such as our Solvent Guide. As more reaction data are captured, subsequent work will include providing "intelligent" sustainability suggestions to the user.
Collapse
Affiliation(s)
- Samuel Boobier
- School of Chemistry, University of Nottingham, University Park, Nottingham NG7 2RD, United Kingdom
| | - Joseph C Davies
- School of Chemistry, University of Nottingham, University Park, Nottingham NG7 2RD, United Kingdom
| | - Ivan N Derbenev
- Digital Research Service, University of Nottingham, Jubilee Campus, Nottingham NG8 1BB, United Kingdom
| | - Christopher M Handley
- Digital Research Service, University of Nottingham, Jubilee Campus, Nottingham NG8 1BB, United Kingdom
| | - Jonathan D Hirst
- School of Chemistry, University of Nottingham, University Park, Nottingham NG7 2RD, United Kingdom
| |
Collapse
|