1. Fan V, Qian Y, Wang A, Wang A, Coley CW, Barzilay R. OpenChemIE: An Information Extraction Toolkit for Chemistry Literature. J Chem Inf Model 2024. [PMID: 38950894 DOI: 10.1021/acs.jcim.4c00572] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/03/2024]
Abstract
Information extraction from chemistry literature is vital for constructing up-to-date reaction databases for data-driven chemistry. Complete extraction requires combining information across text, tables, and figures, whereas prior work has mainly investigated extracting reactions from single modalities. In this paper, we present OpenChemIE to address this complex challenge and enable the extraction of reaction data at the document level. OpenChemIE approaches the problem in two steps: extracting relevant information from individual modalities and then integrating the results to obtain a final list of reactions. For the first step, we employ specialized neural models that each address a specific task for chemistry information extraction, such as parsing molecules or reactions from text or figures. We then integrate the information from these modules using chemistry-informed algorithms, allowing for the extraction of fine-grained reaction data from reaction condition and substrate scope investigations. Our machine learning models attain state-of-the-art performance when evaluated individually, and we meticulously annotate a challenging dataset of reaction schemes with R-groups to evaluate our pipeline as a whole, achieving an F1 score of 69.5%. Additionally, the reaction extraction results of OpenChemIE attain an accuracy score of 64.3% when directly compared against the Reaxys chemical database. OpenChemIE is most suited for information extraction on organic chemistry literature, where molecules are generally depicted as planar graphs or written in text and can be consolidated into a SMILES format. We provide OpenChemIE freely to the public as an open-source package, as well as through a web interface.
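The abstract above reports pipeline quality as an F1 score. As a minimal sketch of how such a score is computed, assuming extracted and annotated reactions are compared by exact string match (the matching criteria actually used in the paper are not given here, and the reaction strings below are hypothetical):

```python
def precision_recall_f1(predicted, gold):
    """Set-based precision, recall, and F1 using exact-match comparison."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical reactions written as reaction SMILES (illustrative only).
gold = {"CCO>>CC=O", "CC(=O)O.CO>>CC(=O)OC", "c1ccccc1Br>>c1ccccc1"}
predicted = {"CCO>>CC=O", "CC(=O)O.CO>>CC(=O)OC", "CCN>>CC=N"}

p, r, f1 = precision_recall_f1(predicted, gold)
print(f"precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
```

F1 is the harmonic mean of precision and recall, so it penalizes both missed reactions and spurious ones.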
Affiliation(s)
- Vincent Fan
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
- Yujie Qian
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
- Alex Wang
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
- Amber Wang
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
- Connor W Coley
  - Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
  - Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
- Regina Barzilay
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
  - Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
2. Singh S, Hernández-Lobato JM. Deep Kernel learning for reaction outcome prediction and optimization. Commun Chem 2024; 7:136. [PMID: 38877182 PMCID: PMC11178803 DOI: 10.1038/s42004-024-01219-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/10/2024] [Accepted: 06/05/2024] [Indexed: 06/16/2024] Open
Abstract
Recent years have seen rapid growth in the application of various machine learning methods for reaction outcome prediction. Deep learning models have gained popularity due to their ability to learn representations directly from the molecular structure. Gaussian processes (GPs), on the other hand, provide reliable uncertainty estimates but are unable to learn representations from the data. We combine the feature learning ability of neural networks (NNs) with the uncertainty quantification of GPs in a deep kernel learning (DKL) framework to predict the reaction outcome. The DKL model achieves very good predictive performance across different input representations. It significantly outperforms standard GPs and provides comparable performance to graph neural networks, but with uncertainty estimation. Additionally, the uncertainty estimates on predictions provided by the DKL model facilitated its incorporation as a surrogate model for Bayesian optimization (BO). The proposed method, therefore, has great potential to accelerate reaction discovery by integrating accurate predictive models that provide reliable uncertainty estimates with BO.
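The structure of a deep kernel can be sketched in a few lines: a neural network maps inputs to features, and a standard GP kernel is applied to those features. The sketch below is an illustration under stated assumptions, not the authors' model: the network weights are random and untrained (a real DKL model learns them jointly with the kernel hyperparameters by maximizing the GP marginal likelihood), the inputs are synthetic vectors rather than reaction representations, and all function names are hypothetical.

```python
import numpy as np


def mlp_features(x, w1, b1, w2, b2):
    """Two-layer feature extractor; in DKL these weights would be learned."""
    hidden = np.tanh(x @ w1 + b1)
    return hidden @ w2 + b2


def rbf_kernel(a, b, lengthscale=1.0):
    """RBF kernel evaluated on (learned) feature vectors."""
    sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-0.5 * np.maximum(sq, 0.0) / lengthscale**2)


def dkl_posterior(x_train, y_train, x_test, params, noise=1e-2):
    """GP posterior mean and variance with the kernel applied to MLP features."""
    z_train = mlp_features(x_train, *params)
    z_test = mlp_features(x_test, *params)
    k_tt = rbf_kernel(z_train, z_train) + noise * np.eye(len(x_train))
    k_st = rbf_kernel(z_test, z_train)
    chol = np.linalg.cholesky(k_tt)
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y_train))
    mean = k_st @ alpha
    v = np.linalg.solve(chol, k_st.T)
    var = 1.0 - np.sum(v**2, axis=0)  # prior variance is 1 for the RBF kernel
    return mean, np.maximum(var, 0.0)


rng = np.random.default_rng(0)
# Untrained, randomly initialized weights (assumption for illustration).
params = (rng.normal(size=(4, 8)), np.zeros(8), rng.normal(size=(8, 2)), np.zeros(2))
x_train = rng.normal(size=(20, 4))
y_train = np.sin(x_train[:, 0])
x_test = rng.normal(size=(5, 4))

mean, var = dkl_posterior(x_train, y_train, x_test, params)
```

The predictive variance is what makes such a model usable as a surrogate in Bayesian optimization: an acquisition function can trade off the predicted mean against the uncertainty.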
Affiliation(s)
- Sukriti Singh
  - Department of Engineering, University of Cambridge, Cambridge, UK
3. Rezaee M, Ekrami S, Hashemianzadeh SM. Comparing ANI-2x, ANI-1ccx neural networks, force field, and DFT methods for predicting conformational potential energy of organic molecules. Sci Rep 2024; 14:11791. [PMID: 38783010 PMCID: PMC11116541 DOI: 10.1038/s41598-024-62242-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/06/2023] [Accepted: 05/15/2024] [Indexed: 05/25/2024] Open
Abstract
In this study, the conformational potential energy surfaces of Amylmetacresol, Benzocaine, Dopamine, Betazole, and Betahistine molecules were scanned and analyzed using the neural network architectures ANI-2x and ANI-1ccx, the force field method OPLS, and density functional theory with the exchange-correlation functional B3LYP and the basis set 6-31G(d). The ANI-1ccx and ANI-2x methods demonstrated the highest accuracy in predicting torsional energy profiles, effectively capturing the minimum and maximum values of these profiles. Conformational potential energy values calculated by B3LYP and the OPLS force field method differ from those calculated by ANI-1ccx and ANI-2x, which account for non-bonded intramolecular interactions, since the B3LYP functional and OPLS force field weakly consider van der Waals and other intramolecular forces in torsional energy profiles. For a more comprehensive analysis, electronic parameters such as dipole moment, HOMO, and LUMO energies for different torsional angles were calculated at two levels of theory, B3LYP/6-31G(d) and ωB97X/6-31G(d). These calculations confirmed that ANI predictions are more accurate than density functional theory calculations with the B3LYP functional and the OPLS force field for determining potential energy surfaces. This research successfully addressed the challenges in determining conformational potential energy levels and shows how machine learning and deep neural networks offer a more accurate, cost-effective, and rapid alternative for predicting torsional energy profiles.
Affiliation(s)
- Mozafar Rezaee
  - Molecular Simulation Research Laboratory, Department of Chemistry, Iran University of Science and Technology, Tehran, Iran
- Saeid Ekrami
  - CNRS, LCPME, Université de Lorraine, 54000, Nancy, France
- Seyed Majid Hashemianzadeh
  - Molecular Simulation Research Laboratory, Department of Chemistry, Iran University of Science and Technology, Tehran, Iran
4. Zhang J, Li L, Xie X, Song XQ, Schaefer HF. Biomimetic Frustrated Lewis Pair Catalysts for Hydrogenation of CO to Methanol at Low Temperatures. ACS Organic & Inorganic Au 2024; 4:258-267. [PMID: 38585511 PMCID: PMC10996047 DOI: 10.1021/acsorginorgau.3c00064] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/27/2023] [Revised: 01/12/2024] [Accepted: 01/16/2024] [Indexed: 04/09/2024]
Abstract
The industrial production of methanol through CO hydrogenation using the Cu/ZnO/Al2O3 catalyst requires harsh conditions, and the development of new catalysts with low operating temperatures is highly desirable. In this study, organic biomimetic FLP catalysts with good tolerance to CO poison are theoretically designed. The base-free catalytic reaction contains the 1,1-addition of CO into a formic acid intermediate and the hydrogenation of the formic acid intermediate into methanol. Low-energy spans (25.6, 22.1, and 20.6 kcal/mol) are achieved, indicating that CO can be hydrogenated into methanol at low temperatures. The new extended aromatization-dearomatization effect involving multiple rings is proposed to effectively facilitate the rate-determining CO 1,1-addition step, and a new CO activation model is proposed for organic catalysts.
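To see why a lower energy span suggests lower operating temperatures, one can plug the three reported spans into an Eyring-type expression, k ≈ (k_B T / h) exp(−δE / RT), which the energy span model uses as a rough estimate of turnover frequency. The sketch below is not a calculation from the paper: it assumes the spans can be treated as free-energy barriers, uses a transmission coefficient of 1, and the function name is hypothetical.

```python
import math

K_B = 1.380649e-23     # Boltzmann constant, J/K
H = 6.62607015e-34     # Planck constant, J*s
R_KCAL = 1.987204e-3   # gas constant, kcal/(mol*K)


def eyring_rate(energy_span_kcal, temperature_k):
    """Eyring-type rate estimate (1/s) for a given energy span and temperature."""
    prefactor = K_B * temperature_k / H
    return prefactor * math.exp(-energy_span_kcal / (R_KCAL * temperature_k))


# Energy spans reported in the abstract (kcal/mol), evaluated at 298.15 K.
spans = [25.6, 22.1, 20.6]
rates = [eyring_rate(span, 298.15) for span in spans]
for span, rate in zip(spans, rates):
    print(f"span = {span:4.1f} kcal/mol -> k ~ {rate:.2e} 1/s")
```

Because the span enters the exponent, a few kcal/mol of difference changes the estimated rate by orders of magnitude at a fixed temperature.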
Affiliation(s)
- Jiejing Zhang
  - College of Pharmacy, Key Laboratory of Pharmaceutical Quality Control of Hebei Province, Key Laboratory of Medicinal Chemistry and Molecular Diagnosis of Ministry of Education, Hebei University, Baoding 071002, Hebei, P. R. China
- Longfei Li
  - College of Pharmacy, Key Laboratory of Pharmaceutical Quality Control of Hebei Province, Key Laboratory of Medicinal Chemistry and Molecular Diagnosis of Ministry of Education, Hebei University, Baoding 071002, Hebei, P. R. China
- Xiaofeng Xie
  - College of Pharmacy, Key Laboratory of Pharmaceutical Quality Control of Hebei Province, Key Laboratory of Medicinal Chemistry and Molecular Diagnosis of Ministry of Education, Hebei University, Baoding 071002, Hebei, P. R. China
- Xue-Qing Song
  - College of Pharmacy, Key Laboratory of Pharmaceutical Quality Control of Hebei Province, Key Laboratory of Medicinal Chemistry and Molecular Diagnosis of Ministry of Education, Hebei University, Baoding 071002, Hebei, P. R. China
- Henry F. Schaefer
  - Center for Computational Quantum Chemistry, University of Georgia, Athens, Georgia 30602, United States
5. Pasquini M, Stenta M. LinChemIn: Route Arithmetic - Operations on Digital Synthetic Routes. J Chem Inf Model 2024; 64:1765-1771. [PMID: 38480486 DOI: 10.1021/acs.jcim.3c01819] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/26/2024]
Abstract
Computational tools are revolutionizing our understanding and prediction of chemical reactivity by combining traditional data analysis techniques with new predictive models. These tools extract additional value from the reaction data corpus, but to effectively convert this value into actionable knowledge, domain specialists need to interact easily with the computer-generated output. In this application note, we demonstrate the capabilities of the open-source Python toolkit LinChemIn, which simplifies the manipulation of reaction networks and provides advanced functionality for working with synthetic routes. LinChemIn ensures chemical consistency when merging, editing, mining, and analyzing reaction networks. Its flexible input interface can process routes from various sources, including predictive models and expert input. The toolkit also efficiently extracts individual routes from the combined synthetic tree, identifying alternative paths and reaction combinations. By reducing the operational barrier to accessing and analyzing synthetic routes from multiple sources, LinChemIn facilitates a constructive interplay between artificial intelligence and human expertise.
Affiliation(s)
- Marta Pasquini
  - Syngenta Crop Protection AG, Schaffhauserstrasse, 4332 Stein, AG, Switzerland
- Marco Stenta
  - Syngenta Crop Protection AG, Schaffhauserstrasse, 4332 Stein, AG, Switzerland
6. Gallarati S, van Gerwen P, Laplaza R, Brey L, Makaveev A, Corminboeuf C. A genetic optimization strategy with generality in asymmetric organocatalysis as a primary target. Chem Sci 2024; 15:3640-3660. [PMID: 38455002 PMCID: PMC10915838 DOI: 10.1039/d3sc06208b] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/20/2023] [Accepted: 01/30/2024] [Indexed: 03/09/2024] Open
Abstract
A catalyst possessing a broad substrate scope, in terms of both turnover and enantioselectivity, is sometimes called "general". Despite their great utility in asymmetric synthesis, truly general catalysts are difficult or expensive to discover via traditional high-throughput screening and are, therefore, rare. Existing computational tools accelerate the evaluation of reaction conditions from a pre-defined set of experiments to identify the most general ones, but cannot generate entirely new catalysts with enhanced substrate breadth. For these reasons, we report an inverse design strategy based on the open-source genetic algorithm NaviCatGA and on the OSCAR database of organocatalysts to simultaneously probe the catalyst and substrate scope and optimize generality as a primary target. We apply this strategy to the Pictet-Spengler condensation, for which we curate a database of 820 reactions, used to train statistical models of selectivity and activity. Starting from OSCAR, we define a combinatorial space of millions of catalyst possibilities, and perform evolutionary experiments on a diverse substrate scope that is representative of the whole chemical space of tetrahydro-β-carboline products. While privileged catalysts emerge, we show how genetic optimization can address the broader question of generality in asymmetric synthesis, extracting structure-performance relationships from the challenging areas of chemical space.
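The idea of treating generality as the optimization target can be illustrated with a toy genetic algorithm. Everything in the sketch below is hypothetical and deliberately simplistic: catalysts are tuples of fragment indices, substrates are single numbers, the `selectivity` surrogate is invented, and generality is taken as the worst-case score over the substrate panel. It does not use NaviCatGA or the OSCAR database, and it is not the scoring scheme of the paper.

```python
import random

N_SLOTS, N_CHOICES = 4, 6          # hypothetical catalyst: 4 positions, 6 fragments each
SUBSTRATES = [0.5, 1.5, 2.5, 3.5]  # hypothetical one-number substrate descriptors


def selectivity(catalyst, substrate):
    """Toy surrogate: best when the fragment sum matches a substrate-dependent target."""
    return -abs(sum(catalyst) - (8.0 + substrate))


def generality(catalyst):
    """Score a catalyst by its worst case over the whole substrate panel."""
    return min(selectivity(catalyst, s) for s in SUBSTRATES)


def evolve(pop_size=20, generations=30, mutation_rate=0.2, seed=0):
    """Evolve catalysts with truncation selection, one-point crossover, and mutation."""
    rng = random.Random(seed)
    population = [
        tuple(rng.randrange(N_CHOICES) for _ in range(N_SLOTS)) for _ in range(pop_size)
    ]
    history = [max(map(generality, population))]
    for _ in range(generations):
        ranked = sorted(population, key=generality, reverse=True)
        parents = ranked[: pop_size // 2]
        children = [ranked[0]]  # elitism: the current best always survives
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, N_SLOTS)
            child = list(a[:cut] + b[cut:])
            for i in range(N_SLOTS):
                if rng.random() < mutation_rate:
                    child[i] = rng.randrange(N_CHOICES)
            children.append(tuple(child))
        population = children
        history.append(max(map(generality, population)))
    return max(population, key=generality), history


best, history = evolve()
print("best catalyst:", best, "generality:", generality(best))
```

Scoring by the worst-performing substrate, rather than the average, pushes the search toward catalysts that work acceptably everywhere instead of excelling on a few substrates.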
Affiliation(s)
- Simone Gallarati
  - Laboratory for Computational Molecular Design, Institute of Chemical Sciences and Engineering, Ecole Polytechnique Fédérale de Lausanne (EPFL), 1015 Lausanne, Switzerland
- Puck van Gerwen
  - Laboratory for Computational Molecular Design, Institute of Chemical Sciences and Engineering, Ecole Polytechnique Fédérale de Lausanne (EPFL), 1015 Lausanne, Switzerland
  - National Center for Competence in Research - Catalysis (NCCR-Catalysis), Ecole Polytechnique Fédérale de Lausanne (EPFL), 1015 Lausanne, Switzerland
- Ruben Laplaza
  - Laboratory for Computational Molecular Design, Institute of Chemical Sciences and Engineering, Ecole Polytechnique Fédérale de Lausanne (EPFL), 1015 Lausanne, Switzerland
  - National Center for Competence in Research - Catalysis (NCCR-Catalysis), Ecole Polytechnique Fédérale de Lausanne (EPFL), 1015 Lausanne, Switzerland
- Lucien Brey
  - Laboratory for Computational Molecular Design, Institute of Chemical Sciences and Engineering, Ecole Polytechnique Fédérale de Lausanne (EPFL), 1015 Lausanne, Switzerland
- Alexander Makaveev
  - Laboratory for Computational Molecular Design, Institute of Chemical Sciences and Engineering, Ecole Polytechnique Fédérale de Lausanne (EPFL), 1015 Lausanne, Switzerland
- Clemence Corminboeuf
  - Laboratory for Computational Molecular Design, Institute of Chemical Sciences and Engineering, Ecole Polytechnique Fédérale de Lausanne (EPFL), 1015 Lausanne, Switzerland
  - National Center for Competence in Research - Catalysis (NCCR-Catalysis), Ecole Polytechnique Fédérale de Lausanne (EPFL), 1015 Lausanne, Switzerland
  - National Center for Computational Design and Discovery of Novel Materials (MARVEL), Ecole Polytechnique Fédérale de Lausanne (EPFL), 1015 Lausanne, Switzerland
7. Sakai M, Kaneshige M, Yasuda K. Learning organo-transition metal catalyzed reactions by graph neural networks. J Comput Chem 2024; 45:341-351. [PMID: 37877461 DOI: 10.1002/jcc.27243] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/05/2023] [Revised: 09/21/2023] [Accepted: 10/04/2023] [Indexed: 10/26/2023]
Abstract
Chemical reaction outcome prediction presents a fundamental challenge in synthetic chemistry. Most existing machine learning (ML) approaches focus on chemical reactions of typical elements. We developed a simple ML model focused on organo-transition metal-catalyzed reactions (OMCRs). Instead of overall reactions observed in experiments, we let the ML model learn the sequence of simplified elementary reactions. This drastically reduced the complexity of the model and helped it find common patterns from distinct reactions. We let a graph neural network learn the reactivity index of a pair of atoms. The model was able to learn a wide variety of OMCRs, and the accuracy of reaction prediction reached 97%, even though the model has far fewer learnable parameters than other standard models. The learned reactivity indices of bonds nicely summarize the knowledge of reactions in the dataset.
Affiliation(s)
- Motoji Sakai
  - Department of Informatics, Nagoya University, Nagoya, Japan
- Koji Yasuda
  - Department of Informatics, Nagoya University, Nagoya, Japan
  - Institute of Materials and Systems for Sustainability, Nagoya University, Nagoya, Japan
8. Nicolle A, Deng S, Ihme M, Kuzhagaliyeva N, Ibrahim EA, Farooq A. Mixtures Recomposition by Neural Nets: A Multidisciplinary Overview. J Chem Inf Model 2024; 64:597-620. [PMID: 38284618 DOI: 10.1021/acs.jcim.3c01633] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/30/2024]
Abstract
Artificial Neural Networks (ANNs) are transforming how we understand chemical mixtures, providing an expressive view of the chemical space and multiscale processes. Their hybridization with physical knowledge can bridge the gap between predictivity and understanding of the underlying processes. This overview explores recent progress in ANNs, particularly their potential in the 'recomposition' of chemical mixtures. Graph-based representations reveal patterns among mixture components, and deep learning models excel in capturing complexity and symmetries when compared to traditional Quantitative Structure-Property Relationship models. Key components, such as Hamiltonian networks and convolution operations, play a central role in representing multiscale mixtures. The integration of ANNs with Chemical Reaction Networks and Physics-Informed Neural Networks for inverse chemical kinetic problems is also examined. The combination of sensors with ANNs shows promise in optical and biomimetic applications. A common ground is identified in the context of statistical physics, where ANN-based methods iteratively adapt their models by blending their initial states with training data. The concept of mixture recomposition unveils a reciprocal inspiration between ANNs and reactive mixtures, highlighting learning behaviors influenced by the training environment.
Affiliation(s)
- Andre Nicolle
  - Aramco Fuel Research Center, Rueil-Malmaison 92852, France
- Sili Deng
  - Massachusetts Institute of Technology, Cambridge 02139, Massachusetts, United States
- Matthias Ihme
  - Stanford University, Stanford 94305, California, United States
- Emad Al Ibrahim
  - King Abdullah University of Science and Technology, Thuwal 23955, Saudi Arabia
- Aamir Farooq
  - King Abdullah University of Science and Technology, Thuwal 23955, Saudi Arabia
9. Voinarovska V, Kabeshov M, Dudenko D, Genheden S, Tetko IV. When Yield Prediction Does Not Yield Prediction: An Overview of the Current Challenges. J Chem Inf Model 2024; 64:42-56. [PMID: 38116926 PMCID: PMC10778086 DOI: 10.1021/acs.jcim.3c01524] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/21/2023] [Revised: 11/29/2023] [Accepted: 11/30/2023] [Indexed: 12/21/2023]
Abstract
Machine Learning (ML) techniques face significant challenges when predicting advanced chemical properties, such as yield, feasibility of chemical synthesis, and optimal reaction conditions. These challenges stem from the high-dimensional nature of the prediction task and the myriad essential variables involved, ranging from reactants and reagents to catalysts, temperature, and purification processes. Successfully developing a reliable predictive model not only holds the potential for optimizing high-throughput experiments but can also elevate existing retrosynthetic predictive approaches and bolster a plethora of applications within the field. In this review, we systematically evaluate the efficacy of current ML methodologies in chemoinformatics, shedding light on their milestones and inherent limitations. Additionally, a detailed examination of a representative case study provides insights into the prevailing issues related to data availability and transferability in the discipline.
Affiliation(s)
- Varvara Voinarovska
  - Molecular AI, Discovery Sciences R&D, AstraZeneca, 431 83 Gothenburg, Sweden
  - TUM Graduate School, Faculty of Chemistry, Technical University of Munich, 85748 Garching, Germany
- Mikhail Kabeshov
  - Molecular AI, Discovery Sciences R&D, AstraZeneca, 431 83 Gothenburg, Sweden
- Dmytro Dudenko
  - Enamine Ltd., 78 Chervonotkatska str., 02094 Kyiv, Ukraine
- Samuel Genheden
  - Molecular AI, Discovery Sciences R&D, AstraZeneca, 431 83 Gothenburg, Sweden
- Igor V. Tetko
  - Molecular Targets and Therapeutics Center, Helmholtz Munich - Deutsches Forschungszentrum für Gesundheit und Umwelt (GmbH), Institute of Structural Biology, 85764 Neuherberg, Germany
10. Sadeghi S, Bateni F, Kim T, Son DY, Bennett JA, Orouji N, Punati VS, Stark C, Cerra TD, Awad R, Delgado-Licona F, Xu J, Mukhin N, Dickerson H, Reyes KG, Abolhasani M. Autonomous nanomanufacturing of lead-free metal halide perovskite nanocrystals using a self-driving fluidic lab. Nanoscale 2024; 16:580-591. [PMID: 38116636 DOI: 10.1039/d3nr05034c] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/21/2023]
Abstract
Lead-based metal halide perovskite (MHP) nanocrystals (NCs) have emerged as a promising class of semiconducting nanomaterials for a wide range of optoelectronic and photoelectronic applications. However, the intrinsic lead toxicity of MHP NCs has significantly hampered their large-scale device applications. Copper-based MHP NCs with composition-tunable optical properties have emerged as a prominent lead-free MHP NC candidate. However, comprehensive synthesis space exploration, development, and synthesis science studies of copper-based MHP NCs have been limited by the manual nature of flask-based synthesis and characterization methods. In this study, we present an autonomous approach for the development of lead-free MHP NCs via seamless integration of a modular microfluidic platform with machine learning-assisted NC synthesis modeling and experiment selection to establish a self-driving fluidic lab for accelerated NC synthesis science studies. For the first time, a successful and reproducible in-flow synthesis of Cs3Cu2I5 NCs is presented. Autonomous experimentation is then employed for rapid in-flow synthesis science studies of Cs3Cu2I5 NCs. The autonomously generated experimental NC synthesis dataset is then utilized for fast-tracked synthetic route optimization of high-performing Cs3Cu2I5 NCs.
Affiliation(s)
- Sina Sadeghi
  - Department of Chemical and Biomolecular Engineering, North Carolina State University, Raleigh, NC 27695, USA
- Fazel Bateni
  - Department of Chemical and Biomolecular Engineering, North Carolina State University, Raleigh, NC 27695, USA
- Taekhoon Kim
  - Synthesis Technical Unit, Material Research Center, Samsung Advanced Institute of Technology, SEC, 130, Samsung-ro, Yeongtong-gu, Suwon-si, Gyeonggi-do, Republic of Korea
- Dae Yong Son
  - Synthesis Technical Unit, Material Research Center, Samsung Advanced Institute of Technology, SEC, 130, Samsung-ro, Yeongtong-gu, Suwon-si, Gyeonggi-do, Republic of Korea
- Jeffrey A Bennett
  - Department of Chemical and Biomolecular Engineering, North Carolina State University, Raleigh, NC 27695, USA
- Negin Orouji
  - Department of Chemical and Biomolecular Engineering, North Carolina State University, Raleigh, NC 27695, USA
- Venkat S Punati
  - Department of Chemical and Biomolecular Engineering, North Carolina State University, Raleigh, NC 27695, USA
- Christine Stark
  - Department of Chemical and Biomolecular Engineering, North Carolina State University, Raleigh, NC 27695, USA
- Teagan D Cerra
  - Department of Physics, Weber State University, Ogden, UT 84408, USA
- Rami Awad
  - Department of Chemical and Biomolecular Engineering, North Carolina State University, Raleigh, NC 27695, USA
- Fernando Delgado-Licona
  - Department of Chemical and Biomolecular Engineering, North Carolina State University, Raleigh, NC 27695, USA
- Jinge Xu
  - Department of Chemical and Biomolecular Engineering, North Carolina State University, Raleigh, NC 27695, USA
- Nikolai Mukhin
  - Department of Chemical and Biomolecular Engineering, North Carolina State University, Raleigh, NC 27695, USA
- Hannah Dickerson
  - Department of Chemical and Biomolecular Engineering, North Carolina State University, Raleigh, NC 27695, USA
- Kristofer G Reyes
  - Department of Materials Design and Innovation, University at Buffalo, Buffalo, NY 14260, USA
- Milad Abolhasani
  - Department of Chemical and Biomolecular Engineering, North Carolina State University, Raleigh, NC 27695, USA
11. Xie Z, Evangelopoulos X, Omar ÖH, Troisi A, Cooper AI, Chen L. Fine-tuning GPT-3 for machine learning electronic and functional properties of organic molecules. Chem Sci 2024; 15:500-510. [PMID: 38179524 PMCID: PMC10762956 DOI: 10.1039/d3sc04610a] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/31/2023] [Accepted: 12/04/2023] [Indexed: 01/06/2024] Open
Abstract
We evaluate the effectiveness of fine-tuning GPT-3 for the prediction of electronic and functional properties of organic molecules. Our findings show that fine-tuned GPT-3 can successfully identify and distinguish between chemically meaningful patterns, and discern subtle differences among them, exhibiting robust predictive performance for the prediction of molecular properties. We focus on assessing the fine-tuned models' resilience to information loss, resulting from the absence of atoms or chemical groups, and to noise that we introduce via random alterations in atomic identities. We discuss the challenges and limitations inherent to the use of GPT-3 in molecular machine-learning tasks and suggest potential directions for future research and improvements to address these issues.
Affiliation(s)
- Zikai Xie
  - Leverhulme Research Centre for Functional Materials Design, Materials Innovation Factory and Department of Chemistry, University of Liverpool, Liverpool L7 3NY, UK
- Xenophon Evangelopoulos
  - Leverhulme Research Centre for Functional Materials Design, Materials Innovation Factory and Department of Chemistry, University of Liverpool, Liverpool L7 3NY, UK
- Ömer H Omar
  - Department of Chemistry, University of Liverpool, Liverpool L69 3BX, UK
- Alessandro Troisi
  - Department of Chemistry, University of Liverpool, Liverpool L69 3BX, UK
- Andrew I Cooper
  - Leverhulme Research Centre for Functional Materials Design, Materials Innovation Factory and Department of Chemistry, University of Liverpool, Liverpool L7 3NY, UK
- Linjiang Chen
  - School of Chemistry, School of Computer Science, University of Birmingham, Birmingham B15 2TT, UK
12. Raghavan P, Haas BC, Ruos ME, Schleinitz J, Doyle AG, Reisman SE, Sigman MS, Coley CW. Dataset Design for Building Models of Chemical Reactivity. ACS Central Science 2023; 9:2196-2204. [PMID: 38161380 PMCID: PMC10755851 DOI: 10.1021/acscentsci.3c01163] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/20/2023] [Revised: 11/06/2023] [Accepted: 11/15/2023] [Indexed: 01/03/2024]
Abstract
Models can codify our understanding of chemical reactivity and serve a useful purpose in the development of new synthetic processes via, for example, evaluating hypothetical reaction conditions or in silico substrate tolerance. Perhaps the most determining factor is the composition of the training data and whether it is sufficient to train a model that can make accurate predictions over the full domain of interest. Here, we discuss the design of reaction datasets in ways that are conducive to data-driven modeling, emphasizing the idea that training set diversity and model generalizability rely on the choice of molecular or reaction representation. We additionally discuss the experimental constraints associated with generating common types of chemistry datasets and how these considerations should influence dataset design and model building.
Affiliation(s)
- Priyanka Raghavan
  - Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
- Brittany C. Haas
  - Department of Chemistry, University of Utah, Salt Lake City, Utah 84112, United States
- Madeline E. Ruos
  - Department of Chemistry & Biochemistry, University of California, Los Angeles, Los Angeles, California 90095, United States
- Jules Schleinitz
  - Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, California 91125, United States
- Abigail G. Doyle
  - Department of Chemistry & Biochemistry, University of California, Los Angeles, Los Angeles, California 90095, United States
- Sarah E. Reisman
  - Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, California 91125, United States
- Matthew S. Sigman
  - Department of Chemistry, University of Utah, Salt Lake City, Utah 84112, United States
- Connor W. Coley
  - Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
  - Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
13. Dolfus U, Briem H, Gutermuth T, Rarey M. Full Modification Control over Retrosynthetic Routes for Guided Optimization of Lead Structures. J Chem Inf Model 2023; 63:6587-6597. [PMID: 37910814 DOI: 10.1021/acs.jcim.3c01155] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/03/2023]
Abstract
Synthesizability is essential for compounds designed in silico. Regardless, synthetic accessibility is often considered only as an afterthought in the design and optimization process. In addition, the trend with modern computer-aided drug design methods is going toward full automation and away from the possibility of incorporating user knowledge. With this work, we present the second major release of our software tool, Synthesia, for synthesis-aware lead structure modification, where the user's expertise is now fully utilized. A provided retrosynthetic route is used as a pathway to guide structural modifications that introduce desired structural changes in the target compound. Moreover, the approach allows the user to define the exact position or component in the retrosynthetic route, which should be modified, further integrating the user's expert knowledge. This paper describes the functionality of Synthesia, its basic concepts, and several application scenarios ranging from simple examples to a comparison of the effects of the different exchange functions to an analysis of a set of bioisosteric linker structures, highlighting potential synthetically feasible replacements.
Affiliation(s)
- Uschi Dolfus
- Universität Hamburg, ZBH - Center for Bioinformatics, Bundesstraße 43, 20146 Hamburg, Germany
| | - Hans Briem
- Bayer AG, Research & Development, Pharmaceuticals, Computational Molecular Design Berlin, Building S110, 711, 13342 Berlin, Germany
| | - Torben Gutermuth
- Universität Hamburg, ZBH - Center for Bioinformatics, Bundesstraße 43, 20146 Hamburg, Germany
| | - Matthias Rarey
- Universität Hamburg, ZBH - Center for Bioinformatics, Bundesstraße 43, 20146 Hamburg, Germany
| |
|
14
|
Schrier J, Norquist AJ, Buonassisi T, Brgoch J. In Pursuit of the Exceptional: Research Directions for Machine Learning in Chemical and Materials Science. J Am Chem Soc 2023; 145:21699-21716. [PMID: 37754929 DOI: 10.1021/jacs.3c04783] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 09/28/2023]
Abstract
Exceptional molecules and materials with one or more extraordinary properties are both technologically valuable and fundamentally interesting, because they often involve new physical phenomena or new compositions that defy expectations. Historically, exceptionality has been achieved through serendipity, but recently, machine learning (ML) and automated experimentation have been widely proposed to accelerate target identification and synthesis planning. In this Perspective, we argue that the data-driven methods commonly used today are well-suited for optimization but not for the realization of new exceptional materials or molecules. Finding such outliers should be possible using ML, but only by shifting away from using traditional ML approaches that tweak the composition, crystal structure, or reaction pathway. We highlight case studies of high-Tc oxide superconductors and superhard materials to demonstrate the challenges of ML-guided discovery and discuss the limitations of automation for this task. We then provide six recommendations for the development of ML methods capable of exceptional materials discovery: (i) Avoid the tyranny of the middle and focus on extrema; (ii) When data are limited, qualitative predictions that provide direction are more valuable than interpolative accuracy; (iii) Sample what can be made and how to make it and defer optimization; (iv) Create room (and look) for the unexpected while pursuing your goal; (v) Try to fill-in-the-blanks of input and output space; (vi) Do not confuse human understanding with model interpretability. We conclude with a description of how these recommendations can be integrated into automated discovery workflows, which should enable the discovery of exceptional molecules and materials.
Affiliation(s)
- Joshua Schrier
- Department of Chemistry, Fordham University, The Bronx, New York 10458, United States
| | - Alexander J Norquist
- Department of Chemistry, Haverford College, Haverford, Pennsylvania 19041, United States
| | - Tonio Buonassisi
- Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
| | - Jakoah Brgoch
- Department of Chemistry and Texas Center for Superconductivity, University of Houston, Houston, Texas 77204, United States
| |
|
15
|
Hagg A, Kirschner KN. Open-Source Machine Learning in Computational Chemistry. J Chem Inf Model 2023; 63:4505-4532. [PMID: 37466636 PMCID: PMC10430767 DOI: 10.1021/acs.jcim.3c00643] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/04/2023] [Indexed: 07/20/2023]
Abstract
The field of computational chemistry has seen a significant increase in the integration of machine learning concepts and algorithms. In this Perspective, we surveyed 179 open-source software projects, with corresponding peer-reviewed papers published within the last 5 years, to better understand the topics within the field being investigated by machine learning approaches. For each project, we provide a short description, the link to the code, the accompanying license type, and whether the training data and resulting models are made publicly available. Based on those deposited in GitHub repositories, the most popular employed Python libraries are identified. We hope that this survey will serve as a resource to learn about machine learning or specific architectures thereof by identifying accessible codes with accompanying papers on a topic basis. To this end, we also include computational chemistry open-source software for generating training data and fundamental Python libraries for machine learning. Based on our observations and considering the three pillars of collaborative machine learning work, open data, open source (code), and open models, we provide some suggestions to the community.
Affiliation(s)
- Alexander Hagg
- Institute of Technology, Resource and Energy-Efficient Engineering (TREE), University of Applied Sciences Bonn-Rhein-Sieg, 53757 Sankt Augustin, Germany
- Department of Electrical Engineering, Mechanical Engineering and Technical Journalism, University of Applied Sciences Bonn-Rhein-Sieg, 53757 Sankt Augustin, Germany
| | - Karl N. Kirschner
- Institute of Technology, Resource and Energy-Efficient Engineering (TREE), University of Applied Sciences Bonn-Rhein-Sieg, 53757 Sankt Augustin, Germany
- Department of Computer Science, University of Applied Sciences Bonn-Rhein-Sieg, 53757 Sankt Augustin, Germany
| |
|
16
|
Karl TM, Bouayad-Gervais S, Hueffel JA, Sperger T, Wellig S, Kaldas SJ, Dabranskaya U, Ward JS, Rissanen K, Tizzard GJ, Schoenebeck F. Machine Learning-Guided Development of Trialkylphosphine Ni(I) Dimers and Applications in Site-Selective Catalysis. J Am Chem Soc 2023. [PMID: 37411044 DOI: 10.1021/jacs.3c03403] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 07/08/2023]
Abstract
Owing to the unknown correlation of a metal's ligand and its resulting preferred speciation in terms of oxidation state, geometry, and nuclearity, a rational design of multinuclear catalysts remains challenging. With the goal to accelerate the identification of suitable ligands that form trialkylphosphine-derived dihalogen-bridged Ni(I) dimers, we herein employed an assumption-based machine learning approach. The workflow offers guidance in ligand space for a desired speciation without (or only minimal) prior experimental data points. We experimentally verified the predictions and synthesized numerous novel Ni(I) dimers as well as explored their potential in catalysis. We demonstrate C-I selective arylations of polyhalogenated arenes bearing competing C-Br and C-Cl sites in under 5 min at room temperature using 0.2 mol % of the newly developed dimer, [Ni(I)(μ-Br)PAd2(n-Bu)]2, which is so far unmet with alternative dinuclear or mononuclear Ni or Pd catalysts.
Affiliation(s)
- Teresa M Karl
- Institute of Organic Chemistry, RWTH Aachen University, Landoltweg 1, 52074 Aachen, Germany
| | - Samir Bouayad-Gervais
- Institute of Organic Chemistry, RWTH Aachen University, Landoltweg 1, 52074 Aachen, Germany
| | - Julian A Hueffel
- Institute of Organic Chemistry, RWTH Aachen University, Landoltweg 1, 52074 Aachen, Germany
| | - Theresa Sperger
- Institute of Organic Chemistry, RWTH Aachen University, Landoltweg 1, 52074 Aachen, Germany
| | - Sebastian Wellig
- Institute of Organic Chemistry, RWTH Aachen University, Landoltweg 1, 52074 Aachen, Germany
| | - Sherif J Kaldas
- Institute of Organic Chemistry, RWTH Aachen University, Landoltweg 1, 52074 Aachen, Germany
| | | | - Jas S Ward
- Department of Chemistry, University of Jyväskylä, FIN40014 Jyväskylä, Finland
| | - Kari Rissanen
- Department of Chemistry, University of Jyväskylä, FIN40014 Jyväskylä, Finland
| | - Graham J Tizzard
- UK National Crystallography Service, School of Chemistry, University of Southampton, SO17 1BJ Southampton, U.K.
| | - Franziska Schoenebeck
- Institute of Organic Chemistry, RWTH Aachen University, Landoltweg 1, 52074 Aachen, Germany
| |
|
17
|
Shim E, Tewari A, Cernak T, Zimmerman PM. Machine Learning Strategies for Reaction Development: Toward the Low-Data Limit. J Chem Inf Model 2023; 63:3659-3668. [PMID: 37312524 PMCID: PMC11163943 DOI: 10.1021/acs.jcim.3c00577] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 06/15/2023]
Abstract
Machine learning models are increasingly being utilized to predict outcomes of organic chemical reactions. A large amount of reaction data is used to train these models, which is in stark contrast to how expert chemists discover and develop new reactions by leveraging information from a small number of relevant transformations. Transfer learning and active learning are two strategies that can operate in low-data situations, which may help fill this gap and promote the use of machine learning for tackling real-world challenges in organic synthesis. This Perspective introduces active and transfer learning and connects these to potential opportunities and directions for further research, especially in the area of prospective development of chemical transformations.
Affiliation(s)
- Eunjae Shim
- Department of Chemistry, University of Michigan, Ann Arbor, Michigan 48109, United States
| | - Ambuj Tewari
- Department of Statistics, University of Michigan, Ann Arbor, Michigan 48109, United States
- Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, Michigan 48109, United States
| | - Tim Cernak
- Department of Chemistry, University of Michigan, Ann Arbor, Michigan 48109, United States
- Department of Medicinal Chemistry, University of Michigan, Ann Arbor, Michigan 48109, United States
| | - Paul M Zimmerman
- Department of Chemistry, University of Michigan, Ann Arbor, Michigan 48109, United States
| |
|
18
|
Pasquini M, Stenta M. LinChemIn: SynGraph-a data model and a toolkit to analyze and compare synthetic routes. J Cheminform 2023; 15:41. [PMID: 37005691 PMCID: PMC10067316 DOI: 10.1186/s13321-023-00714-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/12/2022] [Accepted: 03/20/2023] [Indexed: 04/04/2023] Open
Abstract
BACKGROUND: The increasing amount of chemical reaction data makes traditional ways to navigate its corpus less effective, while the demand for novel approaches and instruments is rising. Recent data science and machine learning techniques support the development of new ways to extract value from the available reaction data. On the one side, Computer-Aided Synthesis Planning tools can predict synthetic routes in a model-driven approach; on the other side, experimental routes can be extracted from the Network of Organic Chemistry, in which reaction data are linked in a network. In this context, the need to combine, compare and analyze synthetic routes generated by different sources arises naturally.
RESULTS: Here we present LinChemIn, a Python toolkit that allows chemoinformatics operations on synthetic routes and reaction networks. Wrapping some third-party packages for handling graph arithmetic and chemoinformatics and implementing new data models and functionalities, LinChemIn allows the interconversion between data formats and data models and enables route-level analysis and operations, including route comparison and descriptors calculation. Object-Oriented Design principles inspire the software architecture, and the modules are structured to maximize code reusability and support code testing and refactoring. The code structure should facilitate external contributions, thus encouraging open and collaborative software development.
CONCLUSIONS: The current version of LinChemIn allows users to combine synthetic routes generated from various tools and analyze them, and constitutes an open and extensible framework capable of incorporating contributions from the community and fostering scientific discussion. Our roadmap envisages the development of sophisticated metrics for routes evaluation, a multi-parameter scoring system, and the implementation of an entire "ecosystem" of functionalities operating on synthetic routes.
LinChemIn is freely available at https://github.com/syngenta/linchemin.
Affiliation(s)
- Marta Pasquini
- Syngenta Crop Protection AG, Schaffhauserstrasse, 4332, Stein, AG, Switzerland
| | - Marco Stenta
- Syngenta Crop Protection AG, Schaffhauserstrasse, 4332, Stein, AG, Switzerland
| |
|