1. Zhang C, Zhai Y, Gong Z, Duan H, She YB, Yang YF, Su A. Transfer learning across different chemical domains: virtual screening of organic materials with deep learning models pretrained on small molecule and chemical reaction data. J Cheminform 2024; 16:89. [PMID: 39080777 PMCID: PMC11290278 DOI: 10.1186/s13321-024-00886-1] [Received: 03/05/2024] [Accepted: 07/21/2024] [Indexed: 08/02/2024]
Abstract
Machine learning is becoming a preferred method for the virtual screening of organic materials due to its cost-effectiveness over traditional computationally demanding techniques. However, the scarcity of labeled data for organic materials poses a significant challenge for training advanced machine learning models. This study showcases the potential of utilizing databases of drug-like small molecules and chemical reactions to pretrain the BERT model, enhancing its performance in the virtual screening of organic materials. By fine-tuning the BERT models with data from five virtual screening tasks, the version pretrained with the USPTO-SMILES dataset achieved R² scores exceeding 0.94 for three tasks and over 0.81 for two others. This performance surpasses that of models pretrained on the small molecule or organic materials databases and outperforms three traditional machine learning models trained directly on virtual screening data. The success of the USPTO-SMILES pretrained BERT model can be attributed to the diverse array of organic building blocks in the USPTO database, offering a broader exploration of the chemical space. The study further suggests that accessing a reaction database with a wider range of reactions than the USPTO could further enhance model performance. Overall, this research validates the feasibility of applying transfer learning across different chemical domains for the efficient virtual screening of organic materials.
Scientific contribution: This study verifies the feasibility of applying transfer learning with large language models across different chemical fields to support the virtual screening of organic materials. By comparing transfer learning from different chemical fields across a variety of organic material molecules, it achieves high-precision virtual screening of organic materials.
Affiliation(s)
- Chengwei Zhang
  - State Key Laboratory Breeding Base of Green Chemistry-Synthesis Technology, Key Laboratory of Green Chemistry-Synthesis Technology of Zhejiang Province, College of Chemical Engineering, Zhejiang University of Technology, Hangzhou, 310014, Zhejiang, China
- Yushuang Zhai
  - State Key Laboratory Breeding Base of Green Chemistry-Synthesis Technology, Key Laboratory of Green Chemistry-Synthesis Technology of Zhejiang Province, College of Chemical Engineering, Zhejiang University of Technology, Hangzhou, 310014, Zhejiang, China
- Ziyang Gong
  - Key Laboratory of Pharmaceutical Engineering of Zhejiang Province, Key Laboratory for Green Pharmaceutical Technologies and Related Equipment of Ministry of Education, Collaborative Innovation Center of Yangtze River Delta Region Green Pharmaceuticals, Zhejiang University of Technology, Hangzhou, 310014, People's Republic of China
- Hongliang Duan
  - Faculty of Applied Sciences, Macao Polytechnic University, Macao, 999078, China
- Yuan-Bin She
  - State Key Laboratory Breeding Base of Green Chemistry-Synthesis Technology, Key Laboratory of Green Chemistry-Synthesis Technology of Zhejiang Province, College of Chemical Engineering, Zhejiang University of Technology, Hangzhou, 310014, Zhejiang, China
- Yun-Fang Yang
  - State Key Laboratory Breeding Base of Green Chemistry-Synthesis Technology, Key Laboratory of Green Chemistry-Synthesis Technology of Zhejiang Province, College of Chemical Engineering, Zhejiang University of Technology, Hangzhou, 310014, Zhejiang, China
- An Su
  - State Key Laboratory Breeding Base of Green Chemistry-Synthesis Technology, Key Laboratory of Green Chemistry-Synthesis Technology of Zhejiang Province, College of Chemical Engineering, Zhejiang University of Technology, Hangzhou, 310014, Zhejiang, China
  - Key Laboratory of Pharmaceutical Engineering of Zhejiang Province, Key Laboratory for Green Pharmaceutical Technologies and Related Equipment of Ministry of Education, Collaborative Innovation Center of Yangtze River Delta Region Green Pharmaceuticals, Zhejiang University of Technology, Hangzhou, 310014, People's Republic of China
2. Wu T, Zhou M, Zou J, Chen Q, Qian F, Kurths J, Liu R, Tang Y. AI-guided few-shot inverse design of HDP-mimicking polymers against drug-resistant bacteria. Nat Commun 2024; 15:6288. [PMID: 39060236 PMCID: PMC11282099 DOI: 10.1038/s41467-024-50533-4] [Received: 09/26/2023] [Accepted: 07/11/2024] [Indexed: 07/28/2024]
Abstract
Host defense peptide (HDP)-mimicking polymers are promising therapeutic alternatives to antibiotics and have large-scale untapped potential. Artificial intelligence (AI) exhibits promising performance on large-scale chemical-content design; however, existing AI methods face difficulties with the scarcity of data in each family of HDP-mimicking polymers (<10²), much smaller than public polymer datasets (>10⁵), and with multiple constraints on properties and structures when exploring high-dimensional polymer space. Herein, we develop a universal AI-guided few-shot inverse design framework by designing multi-modal representations to enrich polymer information for predictions and creating a graph grammar distillation for chemical space restriction to improve the efficiency of multi-constrained polymer generation with reinforcement learning. Taking HDP-mimicking β-amino acid polymers as an example, we successfully simulate predictions of over 10⁵ polymers and identify 83 optimal polymers. Furthermore, we synthesize an optimal polymer DM0.8iPen0.2 and find that this polymer exhibits broad-spectrum and potent antibacterial activity against multiple clinically isolated antibiotic-resistant pathogens, validating the effectiveness of the AI-guided design strategy.
Affiliation(s)
- Tianyu Wu
  - Key Laboratory of Smart Manufacturing in Energy Chemical Process, East China University of Science and Technology, Shanghai, 200237, China
- Min Zhou
  - State Key Laboratory of Bioreactor Engineering, East China University of Science and Technology, Shanghai, 200237, China
- Jingcheng Zou
  - Shanghai Frontiers Science Center of Optogenetic Techniques for Cell Metabolism, Frontiers Science Center for Materiobiology and Dynamic Chemistry, Key Laboratory for Ultrafine Materials of Ministry of Education, Research Center for Biomedical Materials of Ministry of Education, School of Materials Science and Engineering, East China University of Science and Technology, Shanghai, 200237, China
- Qi Chen
  - Shanghai Frontiers Science Center of Optogenetic Techniques for Cell Metabolism, Frontiers Science Center for Materiobiology and Dynamic Chemistry, Key Laboratory for Ultrafine Materials of Ministry of Education, Research Center for Biomedical Materials of Ministry of Education, School of Materials Science and Engineering, East China University of Science and Technology, Shanghai, 200237, China
- Feng Qian
  - Key Laboratory of Smart Manufacturing in Energy Chemical Process, East China University of Science and Technology, Shanghai, 200237, China
- Jürgen Kurths
  - Potsdam Institute for Climate Impact Research (PIK), Potsdam, 14473, Germany
  - Institut für Physik, Humboldt-Universität zu Berlin, Berlin, 10115, Germany
  - The Research Institute of Intelligent Complex Systems, Fudan University, Shanghai, 200433, China
- Runhui Liu
  - State Key Laboratory of Bioreactor Engineering, East China University of Science and Technology, Shanghai, 200237, China
  - Shanghai Frontiers Science Center of Optogenetic Techniques for Cell Metabolism, Frontiers Science Center for Materiobiology and Dynamic Chemistry, Key Laboratory for Ultrafine Materials of Ministry of Education, Research Center for Biomedical Materials of Ministry of Education, School of Materials Science and Engineering, East China University of Science and Technology, Shanghai, 200237, China
- Yang Tang
  - Key Laboratory of Smart Manufacturing in Energy Chemical Process, East China University of Science and Technology, Shanghai, 200237, China
3. Wang S, Yue H, Yuan X. Accelerating Polymer Discovery with Uncertainty-Guided PGCNN: Explainable AI for Predicting Properties and Mechanistic Insights. J Chem Inf Model 2024; 64:5500-5509. [PMID: 38953249 DOI: 10.1021/acs.jcim.4c00555] [Indexed: 07/03/2024]
Abstract
Deep learning holds great potential for expediting the discovery of new polymers from the vast chemical space. However, accurately predicting polymer properties for practical applications based on their monomer composition has long been a challenge. The main obstacles include insufficient data, ineffective representation encoding, and lack of explainability. To address these issues, we propose an interpretable model called the Polymer Graph Convolutional Neural Network (PGCNN) that can accurately predict various polymer properties. This model is trained using the RadonPy data set and validated using experimental data samples. By integrating evidential deep learning with the model, we can quantify the uncertainty of predictions and enable sample-efficient training through uncertainty-guided active learning. Additionally, we demonstrate that the global attention of the graph embedding can aid in discovering underlying physical principles by identifying important functional groups within polymers and associating them with specific material attributes. Lastly, we explore the high-throughput screening capability of our model by rapidly identifying thousands of promising candidates with low and high thermal conductivity from a pool of one million hypothetical polymers. In summary, our research not only advances our mechanistic understanding of polymers using explainable AI but also paves the way for data-driven trustworthy discovery of polymer materials.
Affiliation(s)
- Shuyu Wang
  - Department of Control Engineering, Northeastern University at Qinhuangdao, Qinhuangdao, Hebei 066000, China
- Hongxing Yue
  - Department of Control Engineering, Northeastern University at Qinhuangdao, Qinhuangdao, Hebei 066000, China
- Xiaoming Yuan
  - Department of Computer Science and Engineering, Northeastern University at Qinhuangdao, Qinhuangdao, Hebei 066000, China
4. Huynh H, Le K, Vu L, Nguyen T, Holcomb M, Forli S, Phan H. Synergy of machine learning and density functional theory calculations for predicting experimental Lewis base affinity and Lewis polybase binding atoms. J Comput Chem 2024; 45:1552-1561. [PMID: 38500409 PMCID: PMC11099847 DOI: 10.1002/jcc.27329] [Received: 10/31/2023] [Revised: 01/26/2024] [Accepted: 01/31/2024] [Indexed: 03/20/2024]
Abstract
Investigation of Lewis acid-base interactions has been conducted by ab initio calculations and machine learning (ML) models. This study aims to resolve two critical tasks that have not been quantitatively investigated. First, ML models developed from density functional theory (DFT) calculations predict experimental BF₃ affinity with Pearson correlation coefficients around 0.9 and mean absolute errors around 10 kJ mol⁻¹. The ML models are trained by DFT-calculated BF₃ affinity of more than 3000 adducts, with input features readily obtained by RDKit. Second, the ML models have the capability of predicting the relative strength of Lewis base binding atoms in Lewis polybases, which is either an extremely challenging task to conduct experimentally or a computationally expensive task for ab initio methods. The study demonstrates and solidifies the potential of combining DFT calculations and ML models to predict experimental properties, especially those that are scarce and impractical to empirically acquire.
Affiliation(s)
- Hieu Huynh
  - Fulbright University Vietnam, Ho Chi Minh City 700000, Vietnam
- Khanh Le
  - Fulbright University Vietnam, Ho Chi Minh City 700000, Vietnam
- Linh Vu
  - Fulbright University Vietnam, Ho Chi Minh City 700000, Vietnam
- Trang Nguyen
  - Fulbright University Vietnam, Ho Chi Minh City 700000, Vietnam
- Matthew Holcomb
  - Department of Integrative Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA 92037, USA
- Stefano Forli
  - Department of Integrative Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA 92037, USA
- Hung Phan
  - Fulbright University Vietnam, Ho Chi Minh City 700000, Vietnam
  - Soka University of America, Aliso Viejo, CA 92656, United States
5. Chen J, Schwaller P. Molecular hypergraph neural networks. J Chem Phys 2024; 160:144307. [PMID: 38597317 DOI: 10.1063/5.0193557] [Received: 12/22/2023] [Accepted: 03/14/2024] [Indexed: 04/11/2024]
Abstract
Graph neural networks (GNNs) have demonstrated promising performance across various chemistry-related tasks. However, conventional graphs only model the pairwise connectivity in molecules, failing to adequately represent higher order connections, such as multi-center bonds and conjugated structures. To tackle this challenge, we introduce molecular hypergraphs and propose Molecular Hypergraph Neural Networks (MHNNs) to predict the optoelectronic properties of organic semiconductors, where hyperedges represent conjugated structures. A general algorithm is designed for irregular high-order connections, which can efficiently operate on molecular hypergraphs with hyperedges of various orders. The results show that MHNN outperforms all baseline models on most tasks of organic photovoltaic, OCELOT chromophore v1, and PCQM4Mv2 datasets. Notably, MHNN achieves this without any 3D geometric information, surpassing the baseline model that utilizes atom positions. Moreover, MHNN achieves better performance than pretrained GNNs under limited training data, underscoring its excellent data efficiency. This work provides a new strategy for more general molecular representations and property prediction tasks related to high-order connections.
Affiliation(s)
- Junwu Chen
  - Laboratory of Artificial Chemical Intelligence (LIAC), Institute of Chemical Sciences and Engineering, Ecole Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
  - National Centre of Competence in Research (NCCR) Catalysis, Ecole Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
- Philippe Schwaller
  - Laboratory of Artificial Chemical Intelligence (LIAC), Institute of Chemical Sciences and Engineering, Ecole Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
  - National Centre of Competence in Research (NCCR) Catalysis, Ecole Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
6. Sanchez Medina E, Kunchapu S, Sundmacher K. Gibbs-Helmholtz Graph Neural Network for the Prediction of Activity Coefficients of Polymer Solutions at Infinite Dilution. J Phys Chem A 2023; 127:9863-9873. [PMID: 37943172 PMCID: PMC10683018 DOI: 10.1021/acs.jpca.3c05892] [Received: 08/31/2023] [Revised: 10/18/2023] [Accepted: 10/25/2023] [Indexed: 11/10/2023]
Abstract
Machine learning models have gained prominence for predicting pure-component properties, yet their application to mixture property prediction remains relatively limited. However, the significance of mixtures in our daily lives is undeniable, particularly in industries such as polymer processing. This study presents a modification of the Gibbs-Helmholtz graph neural network (GH-GNN) model for predicting weight-based activity coefficients at infinite dilution ($\Omega_{ij}^{\infty}$) in polymer solutions. We evaluate various polymer representations, ranging from monomer, repeating unit, and periodic unit to oligomer, and observe that, in data-scarce scenarios of polymer-solvent mixtures, polymer representation specifics have a reduced impact compared to data-rich environments. Leveraging transfer learning, we harness richer activity coefficient data from small-size systems, enhancing model accuracy and reducing prediction variability. The modified GH-GNN model achieves remarkable prediction results in mixture interpolation and solvent extrapolation tasks, with an overall mean absolute error of 0.15, showcasing the potential of graph-neural-network-based models for property prediction of polymer solutions. Comparative analysis with the established models UNIFAC-ZM and Entropic-FV suggests a promising avenue for future research on the use of data-driven models for the prediction of the thermodynamic properties of polymer solutions.
Affiliation(s)
- Edgar Ivan Sanchez Medina
  - Chair for Process Systems Engineering, Otto-von-Guericke University, Universitätsplatz 2, Magdeburg 39106, Germany
- Sreekanth Kunchapu
  - Chair for Process Systems Engineering, Otto-von-Guericke University, Universitätsplatz 2, Magdeburg 39106, Germany
- Kai Sundmacher
  - Chair for Process Systems Engineering, Otto-von-Guericke University, Universitätsplatz 2, Magdeburg 39106, Germany
  - Process Systems Engineering, Max Planck Institute for Dynamics of Complex Technical Systems, Sandtorstraße 1, Magdeburg 39106, Germany
7. Hu J, Li Z, Lin J, Zhang L. Prediction and Interpretability of Glass Transition Temperature of Homopolymers by Data-Augmented Graph Convolutional Neural Networks. ACS Appl Mater Interfaces 2023; 15:54006-54017. [PMID: 37934171 DOI: 10.1021/acsami.3c13698] [Indexed: 11/08/2023]
Abstract
Establishing the structure-property relationship by machine learning (ML) models is extremely valuable for accelerating the molecular design of polymers. However, existing ML models for polymers suffer from scarce training data and limited variation in the graph structures of molecules. In addition, few works have explored the interpretability of ML models to infer latent knowledge in the field of polymer science that could inspire ML-assisted molecular design. In this contribution, we integrate graph convolutional neural networks (GCNs) with a data augmentation strategy to predict the glass transition temperature Tg of polymers. It is demonstrated that the data-augmented GCN model outperforms the conventional models and achieves a higher accuracy for the prediction of Tg despite a small amount of training data. Furthermore, taking advantage of molecular graph representations, the data-augmented GCN model can infer the importance of atoms or substructures for Tg, which generally agrees with the experimental findings in the field of polymer science. The inferred knowledge of the GCN model is used to advise on the design of functional polymers with specific Tg. The data-augmented GCN model offers clear advantages in establishing structure-property relationships and also provides an efficient way to accelerate the rational design of polymer molecules.
Affiliation(s)
- Junyang Hu
  - Shanghai Key Laboratory of Advanced Polymeric Materials, School of Materials Science and Engineering, East China University of Science and Technology, Shanghai 200237, China
- Zean Li
  - Shanghai Key Laboratory of Advanced Polymeric Materials, School of Materials Science and Engineering, East China University of Science and Technology, Shanghai 200237, China
- Jiaping Lin
  - Shanghai Key Laboratory of Advanced Polymeric Materials, School of Materials Science and Engineering, East China University of Science and Technology, Shanghai 200237, China
- Liangshun Zhang
  - Shanghai Key Laboratory of Advanced Polymeric Materials, School of Materials Science and Engineering, East China University of Science and Technology, Shanghai 200237, China
8. Ochiai T, Inukai T, Akiyama M, Furui K, Ohue M, Matsumori N, Inuki S, Uesugi M, Sunazuka T, Kikuchi K, Kakeya H, Sakakibara Y. Variational autoencoder-based chemical latent space for large molecular structures with 3D complexity. Commun Chem 2023; 6:249. [PMID: 37973971 PMCID: PMC10654724 DOI: 10.1038/s42004-023-01054-6] [Received: 06/16/2023] [Accepted: 11/06/2023] [Indexed: 11/19/2023]
Abstract
The structural diversity of chemical libraries, which are systematic collections of compounds that have potential to bind to biomolecules, can be represented by chemical latent space. A chemical latent space is a projection of a compound structure into a mathematical space based on several molecular features, and it can express structural diversity within a compound library in order to explore a broader chemical space and generate novel compound structures for drug candidates. In this study, we developed a deep-learning method, called NP-VAE (Natural Product-oriented Variational Autoencoder), based on variational autoencoder for managing hard-to-analyze datasets from DrugBank and large molecular structures such as natural compounds with chirality, an essential factor in the 3D complexity of compounds. NP-VAE was successful in constructing the chemical latent space from large compounds that existing methods could not handle, achieving higher reconstruction accuracy, and demonstrating stable performance as a generative model across various indices. Furthermore, by exploring the acquired latent space, we succeeded in comprehensively analyzing a compound library containing natural compounds and generating novel compound structures with optimized functions.
Grants
- 22H04901 Ministry of Education, Culture, Sports, Science and Technology (MEXT)
- 17H06410 Ministry of Education, Culture, Sports, Science and Technology (MEXT)
- 23H04885 Ministry of Education, Culture, Sports, Science and Technology (MEXT)
- 23H04880 Ministry of Education, Culture, Sports, Science and Technology (MEXT)
- 23H04881 Ministry of Education, Culture, Sports, Science and Technology (MEXT)
- 23H04887 Ministry of Education, Culture, Sports, Science and Technology (MEXT)
Affiliation(s)
- Toshiki Ochiai
  - Department of Biosciences and Informatics, Keio University, Yokohama, Kanagawa, 223-8522, Japan
- Tensei Inukai
  - Department of Biosciences and Informatics, Keio University, Yokohama, Kanagawa, 223-8522, Japan
- Manato Akiyama
  - Department of Biosciences and Informatics, Keio University, Yokohama, Kanagawa, 223-8522, Japan
- Kairi Furui
  - Department of Computer Science, School of Computing, Tokyo Institute of Technology, Yokohama, Kanagawa, 226-8501, Japan
- Masahito Ohue
  - Department of Computer Science, School of Computing, Tokyo Institute of Technology, Yokohama, Kanagawa, 226-8501, Japan
- Nobuaki Matsumori
  - Department of Chemistry, Graduate School of Science, Kyushu University, Fukuoka, Fukuoka, 819-0395, Japan
- Shinsuke Inuki
  - Division of Medicinal Frontier Sciences, Graduate School of Pharmaceutical Sciences, Kyoto University, Kyoto, Kyoto, 606-8501, Japan
- Motonari Uesugi
  - Institute for Chemical Research and WPI-iCeMS, Kyoto University, Uji, Kyoto, 611-0011, Japan
- Toshiaki Sunazuka
  - Omura Satoshi Memorial Institute and Graduate School of Infection Control Sciences, Kitasato University, Minato-ku, Tokyo, 108-8641, Japan
- Kazuya Kikuchi
  - Department of Applied Chemistry, Graduate School of Engineering, Osaka University, Suita, Osaka, 565-0871, Japan
  - Immunology Frontier Research Centre, Osaka University, Suita, Osaka, 565-0871, Japan
- Hideaki Kakeya
  - Division of Medicinal Frontier Sciences, Graduate School of Pharmaceutical Sciences, Kyoto University, Kyoto, Kyoto, 606-8501, Japan
- Yasubumi Sakakibara
  - Department of Biosciences and Informatics, Keio University, Yokohama, Kanagawa, 223-8522, Japan
  - Department of Data Science, Kitasato University School of Frontier Engineering, Sagamihara, Kanagawa, 252-0373, Japan
9. Wilson AN, St John PC, Marin DH, Hoyt CB, Rognerud EG, Nimlos MR, Cywar RM, Rorrer NA, Shebek KM, Broadbelt LJ, Beckham GT, Crowley MF. PolyID: Artificial Intelligence for Discovering Performance-Advantaged and Sustainable Polymers. Macromolecules 2023; 56:8547-8557. [PMID: 38024155 PMCID: PMC10653284 DOI: 10.1021/acs.macromol.3c00994] [Received: 05/21/2023] [Revised: 09/30/2023] [Indexed: 12/01/2023]
Abstract
A necessary transformation for a sustainable economy is the transition from fossil-derived plastics to polymers derived from biomass and waste resources. While renewable feedstocks can enhance material performance through unique chemical moieties, probing the vast material design space by experiment alone is not practically feasible. Here, we develop a machine-learning-based tool, PolyID, to reduce the design space of renewable feedstocks to enable efficient discovery of performance-advantaged, biobased polymers. PolyID is a multioutput, graph neural network specifically designed to increase accuracy and to enable quantitative structure-property relationship (QSPR) analysis for polymers. It includes a novel domain-of-validity method that was developed and applied to demonstrate how gaps in training data can be filled to improve accuracy. The model was benchmarked with both a 20% held-out subset of the original training data and 22 experimentally synthesized polymers. A mean absolute error for the glass transition temperatures of 19.8 and 26.4 °C was achieved for the test and experimental data sets, respectively. Predictions were made on polymers composed of monomers from four databases that contain biologically accessible small molecules: MetaCyc, MINEs, KEGG, and BiGG. From 1.4 × 10⁶ accessible biobased polymers, we identified five poly(ethylene terephthalate) (PET) analogues with predicted improvements to thermal and transport performance. Experimental validation for one of the PET analogues demonstrated a glass transition temperature between 85 and 112 °C, which is higher than PET and within the predicted range of the PolyID tool. In addition to accurate predictions, we show how the model's predictions are explainable through analysis of individual bond importance for a biobased nylon. Overall, PolyID can aid the biobased polymer practitioner to navigate the vast number of renewable polymers to discover sustainable materials with enhanced performance.
Affiliation(s)
- A. Nolan Wilson
  - Renewable Resources and Enabling Sciences Center, National Renewable Energy Laboratory, 15013 Denver West Parkway, Golden, Colorado 80401, United States
- Peter C. St John
  - Renewable Resources and Enabling Sciences Center, National Renewable Energy Laboratory, 15013 Denver West Parkway, Golden, Colorado 80401, United States
- Daniela H. Marin
  - Renewable Resources and Enabling Sciences Center, National Renewable Energy Laboratory, 15013 Denver West Parkway, Golden, Colorado 80401, United States
- Caroline B. Hoyt
  - Renewable Resources and Enabling Sciences Center, National Renewable Energy Laboratory, 15013 Denver West Parkway, Golden, Colorado 80401, United States
- Erik G. Rognerud
  - Renewable Resources and Enabling Sciences Center, National Renewable Energy Laboratory, 15013 Denver West Parkway, Golden, Colorado 80401, United States
- Mark R. Nimlos
  - Renewable Resources and Enabling Sciences Center, National Renewable Energy Laboratory, 15013 Denver West Parkway, Golden, Colorado 80401, United States
- Robin M. Cywar
  - Renewable Resources and Enabling Sciences Center, National Renewable Energy Laboratory, 15013 Denver West Parkway, Golden, Colorado 80401, United States
- Nicholas A. Rorrer
  - Renewable Resources and Enabling Sciences Center, National Renewable Energy Laboratory, 15013 Denver West Parkway, Golden, Colorado 80401, United States
- Kevin M. Shebek
  - Renewable Resources and Enabling Sciences Center, National Renewable Energy Laboratory, 15013 Denver West Parkway, Golden, Colorado 80401, United States
  - Department of Chemical and Biological Engineering and Center for Synthetic Biology, Northwestern University, Evanston, Illinois 60208, United States
  - Chemistry of Life Processes Institute, Northwestern University, Evanston, Illinois 60208, United States
- Linda J. Broadbelt
  - Department of Chemical and Biological Engineering and Center for Synthetic Biology, Northwestern University, Evanston, Illinois 60208, United States
- Gregg T. Beckham
  - Renewable Resources and Enabling Sciences Center, National Renewable Energy Laboratory, 15013 Denver West Parkway, Golden, Colorado 80401, United States
- Michael F. Crowley
  - Renewable Resources and Enabling Sciences Center, National Renewable Energy Laboratory, 15013 Denver West Parkway, Golden, Colorado 80401, United States
10. Ohno M, Hayashi Y, Zhang Q, Kaneko Y, Yoshida R. SMiPoly: Generation of a Synthesizable Polymer Virtual Library Using Rule-Based Polymerization Reactions. J Chem Inf Model 2023; 63:5539-5548. [PMID: 37604495 PMCID: PMC10498440 DOI: 10.1021/acs.jcim.3c00329] [Received: 03/02/2023] [Indexed: 08/23/2023]
Abstract
Recent advances in machine learning have led to the rapid adoption of various computational methods for de novo molecular design in polymer research, including high-throughput virtual screening and inverse molecular design. In such workflows, molecular generators play an essential role in creation or sequential modification of candidate polymer structures. Machine learning-assisted molecular design has made great technical progress over the past few years. However, the difficulty of identifying synthetic routes to such designed polymers remains unresolved. To address this technical limitation, we present Small Molecules into Polymers (SMiPoly), a Python library for virtual polymer generation that implements 22 chemical rules for commonly applied polymerization reactions. For given small organic molecules to form a candidate monomer set, the SMiPoly generator conducts possible polymerization reactions to generate an exhaustive list of potentially synthesizable polymers. In this study, using 1083 readily available monomers, we generated 169,347 unique polymers forming seven different molecular types: polyolefin, polyester, polyether, polyamide, polyimide, polyurethane, and polyoxazolidone. By comparing the distribution of the virtually created polymers with approximately 16,000 real polymers synthesized so far, it was found that the coverage and novelty of the SMiPoly-generated polymers can reach 48 and 53%, respectively. Incorporating the SMiPoly library into a molecular design workflow will accelerate the process of de novo polymer synthesis by shortening the step to select synthesizable candidate polymers.
Affiliation(s)
- Mitsuru Ohno
- Daicel Corporation, Kita-ku, 530-0011 Osaka, Japan
| | - Yoshihiro Hayashi
- The Institute of Statistical Mathematics, Research Organization of Information and Systems, Tachikawa, Tokyo 190-8562, Japan
- The Graduate University for Advanced Studies, SOKENDAI, Tachikawa, Tokyo 190-8562, Japan
| | - Qi Zhang
- The Institute of Statistical Mathematics, Research Organization of Information and Systems, Tachikawa, Tokyo 190-8562, Japan
| | - Yu Kaneko
- Daicel Corporation, Kita-ku, 530-0011 Osaka, Japan
| | - Ryo Yoshida
- The Institute of Statistical Mathematics, Research Organization of Information and Systems, Tachikawa, Tokyo 190-8562, Japan
- The Graduate University for Advanced Studies, SOKENDAI, Tachikawa, Tokyo 190-8562, Japan
- National Institute for Materials Science, 305-0047 Ibaraki, Japan
| |
|
11
|
McDonald SM, Augustine EK, Lanners Q, Rudin C, Brinson LC, Becker ML. Applied machine learning as a driver for polymeric biomaterials design. Nat Commun 2023; 14:4838. [PMID: 37563117 PMCID: PMC10415291 DOI: 10.1038/s41467-023-40459-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/08/2023] [Accepted: 07/24/2023] [Indexed: 08/12/2023] Open
Abstract
Polymers are ubiquitous to almost every aspect of modern society and their use in medical products is similarly pervasive. Despite this, the diversity in commercial polymers used in medicine is stunningly low. Considerable time and resources have been expended over the years towards the development of new polymeric biomaterials which address unmet needs left by the current generation of medical-grade polymers. Machine learning (ML) presents an unprecedented opportunity in this field to bypass the need for trial-and-error synthesis, thus reducing the time and resources invested into new discoveries critical for advancing medical treatments. Current efforts pioneering applied ML in polymer design have employed combinatorial and high throughput experimental design to address data availability concerns. However, the lack of available and standardized characterization of parameters relevant to medicine, including degradation time and biocompatibility, represents a nearly insurmountable obstacle to ML-aided design of biomaterials. Herein, we identify a gap at the intersection of applied ML and biomedical polymer design, highlight current works at this junction more broadly and provide an outlook on challenges and future directions.
Affiliation(s)
| | - Emily K Augustine
- Thomas Lord Department of Mechanical Engineering and Materials Science, Duke University, Durham, NC, USA
| | - Quinn Lanners
- Department of Biostatistics and Bioinformatics, Duke University, Durham, NC, USA
| | - Cynthia Rudin
- Department of Biostatistics and Bioinformatics, Duke University, Durham, NC, USA
| | - L Catherine Brinson
- Thomas Lord Department of Mechanical Engineering and Materials Science, Duke University, Durham, NC, USA
| | - Matthew L Becker
- Department of Chemistry, Duke University, Durham, NC, USA.
- Thomas Lord Department of Mechanical Engineering and Materials Science, Duke University, Durham, NC, USA.
| |
|
12
|
Shah HA, Liu J, Yang Z, Yang F, Zhang Q, Feng J. DeepRT: Predicting compounds presence in pathway modules and classifying into module classes using deep neural networks based on molecular properties. J Bioinform Comput Biol 2023; 21:2350017. [PMID: 37632195 DOI: 10.1142/s0219720023500178] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Indexed: 08/27/2023]
Abstract
Metabolic pathways play a crucial role in understanding the biochemistry of organisms. In metabolic pathways, modules refer to clusters of interconnected reactions or sub-networks representing specific functional units or biological processes within the overall pathway. In pathway modules, compounds are major elements and refer to the various molecules that participate in the biochemical reactions within the pathway modules. These molecules can include substrates, intermediates and final products. Determining the presence relation of compounds and pathway modules is essential for synthesizing new molecules and predicting hidden reactions. To date, several computational methods have been proposed to address this problem. However, all methods only predict the metabolic pathways and their types, not the pathway modules. To address this issue, we proposed a novel deep learning model, DeepRT, that integrates message passing neural networks (MPNNs) and a transformer encoder. This combination allows DeepRT to effectively extract global and local structure information from the molecular graph. The model is designed to perform two tasks: first, determining the presence relation of the compound with the pathway module, and second, predicting the relation between the query compound and module classes. The proposed DeepRT model was evaluated on a dataset comprising compounds and pathway modules, and it outperforms existing approaches.
Affiliation(s)
- Hayat Ali Shah
- Institute of Artificial Intelligence, School of Computer Science, Wuhan University, P. R. China
| | - Juan Liu
- Institute of Artificial Intelligence, School of Computer Science, Wuhan University, P. R. China
| | - Zhihui Yang
- Institute of Artificial Intelligence, School of Computer Science, Wuhan University, P. R. China
| | - Feng Yang
- Institute of Artificial Intelligence, School of Computer Science, Wuhan University, P. R. China
| | - Qiang Zhang
- Institute of Artificial Intelligence, School of Computer Science, Wuhan University, P. R. China
| | - Jing Feng
- Institute of Artificial Intelligence, School of Computer Science, Wuhan University, P. R. China
| |
|
13
|
Ngo NK, Hy TS, Kondor R. Multiresolution graph transformers and wavelet positional encoding for learning long-range and hierarchical structures. J Chem Phys 2023; 159:034109. [PMID: 37466225 DOI: 10.1063/5.0152833] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/02/2023] [Accepted: 06/29/2023] [Indexed: 07/20/2023] Open
Abstract
Contemporary graph learning algorithms are not well-suited for large molecules since they do not consider the hierarchical interactions among the atoms, which are essential to determining the molecular properties of macromolecules. In this work, we propose Multiresolution Graph Transformers (MGT), the first graph transformer architecture that can learn to represent large molecules at multiple scales. MGT can learn to produce representations for the atoms and group them into meaningful functional groups or repeating units. We also introduce Wavelet Positional Encoding (WavePE), a new positional encoding method that can guarantee localization in both spectral and spatial domains. Our proposed model achieves competitive results on three macromolecule datasets consisting of polymers, peptides, and protein-ligand complexes, along with one drug-like molecule dataset. Significantly, our model outperforms other state-of-the-art methods and achieves chemical accuracy in estimating molecular properties (e.g., highest occupied molecular orbital, lowest unoccupied molecular orbital, and their gap) calculated by Density Functional Theory in the polymers dataset. Furthermore, the visualizations, including clustering results on macromolecules and low-dimensional spaces of their representations, demonstrate the capability of our methodology in learning to represent long-range and hierarchical structures. Our PyTorch implementation is publicly available at https://github.com/HySonLab/Multires-Graph-Transformer.
Affiliation(s)
| | - Truong Son Hy
- Halıcıoğlu Data Science Institute, University of California San Diego, La Jolla, California 92093, USA
| | - Risi Kondor
- Department of Computer Science, University of Chicago, Chicago, Illinois 60637, USA
| |
|
14
|
Ryu JY, Elala E, Rhee JKK. Quantum Graph Neural Network Models for Materials Search. Materials (Basel) 2023; 16:4300. [PMID: 37374486 DOI: 10.3390/ma16124300] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/05/2023] [Revised: 06/03/2023] [Accepted: 06/05/2023] [Indexed: 06/29/2023]
Abstract
Inspired by classical graph neural networks, we discuss a novel quantum graph neural network (QGNN) model to predict the chemical and physical properties of molecules and materials. QGNNs were investigated to predict the energy gap between the highest occupied and lowest unoccupied molecular orbitals of small organic molecules. The models utilize the equivariantly diagonalizable unitary quantum graph circuit (EDU-QGC) framework to allow discrete link features and minimize quantum circuit embedding. The results show QGNNs can achieve lower test loss compared to classical models if a similar number of trainable variables are used, and converge faster in training. This paper also provides a review of classical graph neural network models for materials research and various QGNNs.
Affiliation(s)
- Ju-Young Ryu
- School of Electrical Engineering & ITRC of Quantum Computing for AI, KAIST, 291 Daehak-ro, Yuseong-gu, Daejeon 34141, Republic of Korea
- Qunova Computing, Incorporated, 193 Munji-ro, Yuseong-gu, Daejeon 34051, Republic of Korea
| | - Eyuel Elala
- School of Electrical Engineering & ITRC of Quantum Computing for AI, KAIST, 291 Daehak-ro, Yuseong-gu, Daejeon 34141, Republic of Korea
- Qunova Computing, Incorporated, 193 Munji-ro, Yuseong-gu, Daejeon 34051, Republic of Korea
| | - June-Koo Kevin Rhee
- School of Electrical Engineering & ITRC of Quantum Computing for AI, KAIST, 291 Daehak-ro, Yuseong-gu, Daejeon 34141, Republic of Korea
- Qunova Computing, Incorporated, 193 Munji-ro, Yuseong-gu, Daejeon 34051, Republic of Korea
| |
|
15
|
Ricci E, Vergadou N. Integrating Machine Learning in the Coarse-Grained Molecular Simulation of Polymers. J Phys Chem B 2023; 127:2302-2322. [PMID: 36888553 DOI: 10.1021/acs.jpcb.2c06354] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Indexed: 03/09/2023]
Abstract
Machine learning (ML) is having an increasing impact on the physical sciences, engineering, and technology and its integration into molecular simulation frameworks holds great potential to expand their scope of applicability to complex materials and facilitate fundamental knowledge and reliable property predictions, contributing to the development of efficient materials design routes. The application of ML in materials informatics in general, and polymer informatics in particular, has led to interesting results, however great untapped potential lies in the integration of ML techniques into the multiscale molecular simulation methods for the study of macromolecular systems, specifically in the context of Coarse Grained (CG) simulations. In this Perspective, we aim at presenting the pioneering recent research efforts in this direction and discussing how these new ML-based techniques can contribute to critical aspects of the development of multiscale molecular simulation methods for bulk complex chemical systems, especially polymers. Prerequisites for the implementation of such ML-integrated methods and open challenges that need to be met toward the development of general systematic ML-based coarse graining schemes for polymers are discussed.
Affiliation(s)
- Eleonora Ricci
- Institute of Nanoscience and Nanotechnology, National Center for Scientific Research "Demokritos", GR-15341 Agia Paraskevi, Athens, Greece
- Institute of Informatics and Telecommunications, National Center for Scientific Research "Demokritos", GR-15341 Agia Paraskevi, Athens, Greece
| | - Niki Vergadou
- Institute of Nanoscience and Nanotechnology, National Center for Scientific Research "Demokritos", GR-15341 Agia Paraskevi, Athens, Greece
| |
|
16
|
Sikorski EL, Cusentino MA, McCarthy MJ, Tranchida J, Wood MA, Thompson AP. Machine learned interatomic potential for dispersion strengthened plasma facing components. J Chem Phys 2023; 158:114101. [PMID: 36948804 DOI: 10.1063/5.0135269] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/17/2023] Open
Abstract
Tungsten (W) is a material of choice for the divertor material due to its high melting temperature, thermal conductivity, and sputtering threshold. However, W has a very high brittle-to-ductile transition temperature, and at fusion reactor temperatures (≥1000 K), it may undergo recrystallization and grain growth. Dispersion-strengthening W with zirconium carbide (ZrC) can improve ductility and limit grain growth, but much of the effects of the dispersoids on microstructural evolution and thermomechanical properties at high temperatures are still unknown. We present a machine learned Spectral Neighbor Analysis Potential for W-ZrC that can now be used to study these materials. In order to construct a potential suitable for large-scale atomistic simulations at fusion reactor temperatures, it is necessary to train on ab initio data generated for a diverse set of structures, chemical environments, and temperatures. Further accuracy and stability tests of the potential were achieved using objective functions for both material properties and high temperature stability. Validation of lattice parameters, surface energies, bulk moduli, and thermal expansion is confirmed on the optimized potential. Tensile tests of W/ZrC bicrystals show that although the W(110)-ZrC(111) C-terminated bicrystal has the highest ultimate tensile strength (UTS) at room temperature, observed strength decreases with increasing temperature. At 2500 K, the terminating C layer diffuses into the W, resulting in a weaker W-Zr interface. Meanwhile, the W(110)-ZrC(111) Zr-terminated bicrystal has the highest UTS at 2500 K.
Affiliation(s)
- E L Sikorski
- Center for Computing Research, Sandia National Laboratories, Albuquerque, New Mexico 87185, USA
| | - M A Cusentino
- Material, Physical, and Chemical Science Center, Sandia National Laboratories, Albuquerque, New Mexico 87185, USA
| | - M J McCarthy
- Center for Computing Research, Sandia National Laboratories, Albuquerque, New Mexico 87185, USA
| | - J Tranchida
- CEA, DES/IRESNE/DEC, 13018 Saint Paul Lès Durance, France
| | - M A Wood
- Center for Computing Research, Sandia National Laboratories, Albuquerque, New Mexico 87185, USA
| | - A P Thompson
- Center for Computing Research, Sandia National Laboratories, Albuquerque, New Mexico 87185, USA
| |
|
17
|
Nguyen T, Bavarian M. Machine learning approach to polymer reaction engineering: Determining monomers reactivity ratios. Polymer 2023. [DOI: 10.1016/j.polymer.2023.125866] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/31/2023]
|
18
|
Xie S. Perspectives on development of biomedical polymer materials in artificial intelligence age. J Biomater Appl 2023; 37:1355-1375. [PMID: 36629787 DOI: 10.1177/08853282231151822] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/12/2023]
Abstract
Polymer materials are widely used in biomedicine, chemistry and material science, whose traditional preparations are mainly based on experience, intuition and conceptual insight, having been applied to the development of many new materials, but facing great challenges due to the vast design space for biomedical polymers. So far, the best way to solve these problems is to accelerate material design through artificial intelligence, especially machine learning. Herein, this paper will introduce several successful cases, and analyze the latest progress of machine learning in the field of biomedical polymers, then discuss the opportunities of this novel method. In particular, this paper summarizes the material database, open-source determination tools, molecular generation methods and machine learning models that have been used for biopolymer synthesis and property prediction. Overall, machine learning could be more effectively deployed on the material design of biomedical polymers, and it is expected to become an extensive driving force to meet the huge demand for customized designs.
Affiliation(s)
- Shijin Xie
- The University of Melbourne, Melbourne, VIC, Australia
| |
|
19
|
Gurnani R, Kuenneth C, Toland A, Ramprasad R. Polymer Informatics at Scale with Multitask Graph Neural Networks. Chem Mater 2023; 35:1560-1567. [PMID: 36873627 PMCID: PMC9979603 DOI: 10.1021/acs.chemmater.2c02991] [Citation(s) in RCA: 6] [Impact Index Per Article: 6.0] [Received: 09/29/2022] [Revised: 02/03/2023] [Indexed: 06/18/2023]
Abstract
Artificial intelligence-based methods are becoming increasingly effective at screening libraries of polymers down to a selection that is manageable for experimental inquiry. The vast majority of presently adopted approaches for polymer screening rely on handcrafted chemostructural features extracted from polymer repeat units, a burdensome task as polymer libraries, which approximate the polymer chemical search space, progressively grow over time. Here, we demonstrate that directly "machine learning" important features from a polymer repeat unit is a cheap and viable alternative to extracting expensive features by hand. Our approach, based on graph neural networks, multitask learning, and other advanced deep learning techniques, speeds up feature extraction by 1-2 orders of magnitude relative to presently adopted handcrafted methods without compromising model accuracy for a variety of polymer property prediction tasks. We anticipate that our approach, which unlocks the screening of truly massive polymer libraries at scale, will enable more sophisticated and large scale screening technologies in the field of polymer informatics.
|
20
|
Volgin IV, Batyr PA, Matseevich AV, Dobrovskiy AY, Andreeva MV, Nazarychev VM, Larin SV, Goikhman MY, Vizilter YV, Askadskii AA, Lyulin SV. Machine Learning with Enormous "Synthetic" Data Sets: Predicting Glass Transition Temperature of Polyimides Using Graph Convolutional Neural Networks. ACS Omega 2022; 7:43678-43691. [PMID: 36506114 PMCID: PMC9730753 DOI: 10.1021/acsomega.2c04649] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 07/22/2022] [Accepted: 10/28/2022] [Indexed: 06/17/2023]
Abstract
In the present work, we address the problem of utilizing machine learning (ML) methods to predict the thermal properties of polymers by establishing "structure-property" relationships. Having focused on a particular class of heterocyclic polymers, namely polyimides (PIs), we developed a graph convolutional neural network (GCNN), being one of the most promising tools for working with big data, to predict the PI glass transition temperature Tg as an example of the fundamental property of polymers. To train the GCNN, we propose an original methodology based on using a "transfer learning" approach with an enormous "synthetic" data set for pretraining and a small experimental data set for its fine-tuning. The "synthetic" data set contains more than 6 million combinatorically generated repeating units of PIs and theoretical values of their Tg calculated using the well-established Askadskii's quantitative structure-property relationship (QSPR) computational scheme. Additionally, an experimental data set for 214 PIs was also collected from the literature for training, fine-tuning, and validation of the GCNN. Both "synthetic" and experimental data sets are included into a PolyAskInG database (Polymer Askadskii's Intelligent Gateway). By using the PolyAskInG database, we developed GCNN which allows estimation of Tg of PI with a mean absolute error (MAE) of about 20 K, which is 1.5 times lower than in the case of Askadskii QSPR analysis (33 K). To prove the efficiency and usability of the proposed GCNN architecture and training methodology for predicting polymer properties, we also employed "transfer learning" to develop alternative GCNN pretrained on proxy-characteristics taken from the popular quantum-chemical QM9 database for small compounds and fine-tuned on an experimental Tg data set from PolyAskInG database.
The obtained results indicate that pretraining of GCNN on the "synthetic" polymer data set provides MAE which is almost twice as low as that in the case of using the QM9 data set in the pretraining stage (∼41 K). Furthermore, we address the questions associated with the influence of the differences in the size of the experimental and "synthetic" data sets (so-called "reality gap" problem), as well as their chemical composition on the training quality. Our results state the overall priority of using polymer data sets for developing deep neural networks, and GCNN in particular, for efficient prediction of polymer properties. Moreover, our work opens up a challenge for the theoretically supported generation of large "synthetic" data sets of polymer properties for the training of the complex ML models. The proposed methodology is rather versatile and may be generalized for predicting other properties of different polymers and copolymers synthesized through the polycondensation reaction.
Affiliation(s)
- Igor V. Volgin
- Institute of Macromolecular Compounds of the Russian Academy of Sciences (IMC RAS), St. Petersburg 199004, Russian Federation
| | - Pavel A. Batyr
- Federal State Unitary Enterprise “State Research Institute of Aviation Systems” (GosNIIAS), Moscow 125167, Russian Federation
| | - Andrey V. Matseevich
- A.N. Nesmeyanov Institute of Organoelement Compounds of Russian Academy of Sciences (INEOS RAS), Moscow 119991, Russian Federation
| | - Alexey Yu. Dobrovskiy
- Institute of Macromolecular Compounds of the Russian Academy of Sciences (IMC RAS), St. Petersburg 199004, Russian Federation
| | - Maria V. Andreeva
- Institute of Macromolecular Compounds of the Russian Academy of Sciences (IMC RAS), St. Petersburg 199004, Russian Federation
| | - Victor M. Nazarychev
- Institute of Macromolecular Compounds of the Russian Academy of Sciences (IMC RAS), St. Petersburg 199004, Russian Federation
| | - Sergey V. Larin
- Institute of Macromolecular Compounds of the Russian Academy of Sciences (IMC RAS), St. Petersburg 199004, Russian Federation
| | - Mikhail Ya. Goikhman
- Institute of Macromolecular Compounds of the Russian Academy of Sciences (IMC RAS), St. Petersburg 199004, Russian Federation
| | - Yury V. Vizilter
- Federal State Unitary Enterprise “State Research Institute of Aviation Systems” (GosNIIAS), Moscow 125167, Russian Federation
| | - Andrey A. Askadskii
- A.N. Nesmeyanov Institute of Organoelement Compounds of Russian Academy of Sciences (INEOS RAS), Moscow 119991, Russian Federation
- Moscow State University of Civil Engineering (MGSU), Moscow 129337, Russian Federation
| | - Sergey V. Lyulin
- Institute of Macromolecular Compounds of the Russian Academy of Sciences (IMC RAS), St. Petersburg 199004, Russian Federation
| |
|
21
|
Antoniuk ER, Li P, Kailkhura B, Hiszpanski AM. Representing Polymers as Periodic Graphs with Learned Descriptors for Accurate Polymer Property Predictions. J Chem Inf Model 2022; 62:5435-5445. [PMID: 36315033 DOI: 10.1021/acs.jcim.2c00875] [Citation(s) in RCA: 11] [Impact Index Per Article: 5.5] [Indexed: 11/29/2022]
Abstract
Accurately predicting new polymers' properties with machine learning models a priori to synthesis has the potential to significantly accelerate new polymers' discovery and development. However, accurately and efficiently capturing polymers' complex, periodic structures in machine learning models remains a grand challenge for the polymer cheminformatics community. Specifically, there has yet to be an ideal solution for the problems of how to capture the periodicity of polymers, as well as how to optimally develop polymer descriptors without requiring human-based feature design. In this work, we tackle these problems by utilizing a periodic polymer graph representation that accounts for polymers' periodicity and coupling it with a message-passing neural network that leverages the power of graph deep learning to automatically learn chemically relevant polymer descriptors. Remarkably, this approach achieves state-of-the-art performance on 8 out of 10 distinct polymer property prediction tasks. These results highlight the advancement in predictive capability that is possible through learning descriptors that are specifically optimized for capturing the unique chemical structure of polymers.
Affiliation(s)
- Evan R Antoniuk
- Materials Science Division, Physical and Life Sciences Directorate, Lawrence Livermore National Laboratory, Livermore, California 94550-5507, United States
| | - Peggy Li
- Global Security Computing Applications Division, Computing Directorate, Lawrence Livermore National Laboratory, Livermore, California 94550-5507, United States
| | - Bhavya Kailkhura
- Machine Intelligence Group/Center for Applied Scientific Computing, Computing Directorate, Lawrence Livermore National Laboratory, Livermore, California 94550-5507, United States
| | - Anna M Hiszpanski
- Materials Science Division, Physical and Life Sciences Directorate, Lawrence Livermore National Laboratory, Livermore, California 94550-5507, United States
| |
|
22
|
Reiser P, Neubert M, Eberhard A, Torresi L, Zhou C, Shao C, Metni H, van Hoesel C, Schopmans H, Sommer T, Friederich P. Graph neural networks for materials science and chemistry. Commun Mater 2022; 3:93. [PMID: 36468086 PMCID: PMC9702700 DOI: 10.1038/s43246-022-00315-6] [Citation(s) in RCA: 68] [Impact Index Per Article: 34.0] [Received: 05/10/2022] [Accepted: 11/07/2022] [Indexed: 05/14/2023]
Abstract
Machine learning plays an increasingly important role in many areas of chemistry and materials science, being used to predict materials properties, accelerate simulations, design new structures, and predict synthesis routes of new materials. Graph neural networks (GNNs) are one of the fastest growing classes of machine learning models. They are of particular relevance for chemistry and materials science, as they directly work on a graph or structural representation of molecules and materials and therefore have full access to all relevant information required to characterize materials. In this Review, we provide an overview of the basic principles of GNNs, widely used datasets, and state-of-the-art architectures, followed by a discussion of a wide range of recent applications of GNNs in chemistry and materials science, and concluding with a road-map for the further development and application of GNNs.
Affiliation(s)
- Patrick Reiser
- Institute of Theoretical Informatics, Karlsruhe Institute of Technology, Am Fasanengarten 5, 76131 Karlsruhe, Germany
- Institute of Nanotechnology, Karlsruhe Institute of Technology, Hermann-von-Helmholtz-Platz 1, 76344 Eggenstein-Leopoldshafen, Germany
| | - Marlen Neubert
- Institute of Theoretical Informatics, Karlsruhe Institute of Technology, Am Fasanengarten 5, 76131 Karlsruhe, Germany
| | - André Eberhard
- Institute of Theoretical Informatics, Karlsruhe Institute of Technology, Am Fasanengarten 5, 76131 Karlsruhe, Germany
| | - Luca Torresi
- Institute of Theoretical Informatics, Karlsruhe Institute of Technology, Am Fasanengarten 5, 76131 Karlsruhe, Germany
| | - Chen Zhou
- Institute of Theoretical Informatics, Karlsruhe Institute of Technology, Am Fasanengarten 5, 76131 Karlsruhe, Germany
| | - Chen Shao
- Institute of Theoretical Informatics, Karlsruhe Institute of Technology, Am Fasanengarten 5, 76131 Karlsruhe, Germany
- Present Address: Institute for Applied Informatics and Formal Description Systems, Karlsruhe Institute of Technology, Kaiserstr. 89, 76133 Karlsruhe, Germany
| | - Houssam Metni
- Institute of Theoretical Informatics, Karlsruhe Institute of Technology, Am Fasanengarten 5, 76131 Karlsruhe, Germany
- ECPM, Université de Strasbourg, 25 Rue Becquerel, 67087 Strasbourg, France
| | - Clint van Hoesel
- Institute of Theoretical Informatics, Karlsruhe Institute of Technology, Am Fasanengarten 5, 76131 Karlsruhe, Germany
- Department of Applied Physics, Eindhoven University of Technology, Groene Loper 19, 5612 AP Eindhoven, The Netherlands
| | - Henrik Schopmans
- Institute of Theoretical Informatics, Karlsruhe Institute of Technology, Am Fasanengarten 5, 76131 Karlsruhe, Germany
- Institute of Nanotechnology, Karlsruhe Institute of Technology, Hermann-von-Helmholtz-Platz 1, 76344 Eggenstein-Leopoldshafen, Germany
| | - Timo Sommer
- Institute of Theoretical Informatics, Karlsruhe Institute of Technology, Am Fasanengarten 5, 76131 Karlsruhe, Germany
- Institute for Theory of Condensed Matter, Karlsruhe Institute of Technology, Wolfgang-Gaede-Str. 1, 76131 Karlsruhe, Germany
- Present Address: School of Chemistry, Trinity College Dublin, College Green, Dublin 2, Ireland
| | - Pascal Friederich
- Institute of Theoretical Informatics, Karlsruhe Institute of Technology, Am Fasanengarten 5, 76131 Karlsruhe, Germany
- Institute of Nanotechnology, Karlsruhe Institute of Technology, Hermann-von-Helmholtz-Platz 1, 76344 Eggenstein-Leopoldshafen, Germany
| |
|
23
|
Aldeghi M, Coley CW. A graph representation of molecular ensembles for polymer property prediction. Chem Sci 2022; 13:10486-10498. [PMID: 36277616 PMCID: PMC9473492 DOI: 10.1039/d2sc02839e] [Citation(s) in RCA: 27] [Impact Index Per Article: 13.5] [Received: 05/20/2022] [Accepted: 08/15/2022] [Indexed: 12/02/2022] Open
Abstract
Synthetic polymers are versatile and widely used materials. Similar to small organic molecules, a large chemical space of such materials is hypothetically accessible. Computational property prediction and virtual screening can accelerate polymer design by prioritizing candidates expected to have favorable properties. However, in contrast to organic molecules, polymers are often not well-defined single structures but an ensemble of similar molecules, which poses unique challenges to traditional chemical representations and machine learning approaches. Here, we introduce a graph representation of molecular ensembles and an associated graph neural network architecture that is tailored to polymer property prediction. We demonstrate that this approach captures critical features of polymeric materials, like chain architecture, monomer stoichiometry, and degree of polymerization, and achieves superior accuracy to off-the-shelf cheminformatics methodologies. While doing so, we built a dataset of simulated electron affinity and ionization potential values for >40k polymers with varying monomer composition, stoichiometry, and chain architecture, which may be used in the development of other tailored machine learning approaches. The dataset and machine learning models presented in this work pave the path toward new classes of algorithms for polymer informatics and, more broadly, introduce a framework for the modeling of molecular ensembles.
Affiliation(s)
- Matteo Aldeghi
- Department of Chemical Engineering, Massachusetts Institute of Technology Cambridge MA 02139 USA
- Connor W Coley
- Department of Chemical Engineering, Massachusetts Institute of Technology Cambridge MA 02139 USA
- Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology Cambridge MA 02139 USA
24
|
Xie T, France-Lanord A, Wang Y, Lopez J, Stolberg MA, Hill M, Leverick GM, Gomez-Bombarelli R, Johnson JA, Shao-Horn Y, Grossman JC. Accelerating amorphous polymer electrolyte screening by learning to reduce errors in molecular dynamics simulated properties. Nat Commun 2022; 13:3415. [PMID: 35701416 PMCID: PMC9197847 DOI: 10.1038/s41467-022-30994-1] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 05/14/2021] [Accepted: 03/02/2022] [Indexed: 12/03/2022] Open
Abstract
Polymer electrolytes are promising candidates for next-generation lithium-ion battery technology. Large-scale screening of polymer electrolytes is hindered by the significant cost of molecular dynamics (MD) simulation in amorphous systems: the amorphous structure of polymers requires multiple, repeated sampling to reduce noise, and the slow relaxation requires long simulation times for convergence. Here, we accelerate the screening with a multi-task graph neural network that learns from a large amount of noisy, unconverged, short MD data and a small number of converged, long MD data. We achieve accurate predictions of 4 different converged properties and screen a space of 6247 polymers that is orders of magnitude larger than previous computational studies. Further, we extract several design principles for polymer electrolytes and provide an open dataset for the community. Our approach could be applicable to a broad class of material discovery problems that involve the simulation of complex, amorphous materials.
Screening polymer electrolytes for batteries is extremely expensive due to the complex structures and slow dynamics. Here the authors develop a machine learning scheme to accelerate the screening and explore a space much larger than past studies.
Affiliation(s)
- Tian Xie
- Department of Materials Science and Engineering, Massachusetts Institute of Technology, Cambridge, MA, 02139, USA
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, 02139, USA
- Arthur France-Lanord
- Department of Materials Science and Engineering, Massachusetts Institute of Technology, Cambridge, MA, 02139, USA
- Research Laboratory of Electronics, Massachusetts Institute of Technology, Cambridge, MA, 02139, USA
- Yanming Wang
- Department of Materials Science and Engineering, Massachusetts Institute of Technology, Cambridge, MA, 02139, USA
- Research Laboratory of Electronics, Massachusetts Institute of Technology, Cambridge, MA, 02139, USA
- Jeffrey Lopez
- Research Laboratory of Electronics, Massachusetts Institute of Technology, Cambridge, MA, 02139, USA
- Michael A Stolberg
- Department of Materials Science and Engineering, Massachusetts Institute of Technology, Cambridge, MA, 02139, USA
- Department of Chemistry, Massachusetts Institute of Technology, Cambridge, MA, 02139, USA
- Megan Hill
- Department of Chemistry, Massachusetts Institute of Technology, Cambridge, MA, 02139, USA
- Graham Michael Leverick
- Department of Mechanical Engineering, Massachusetts Institute of Technology, Cambridge, MA, 02139, USA
- Rafael Gomez-Bombarelli
- Department of Materials Science and Engineering, Massachusetts Institute of Technology, Cambridge, MA, 02139, USA
- Jeremiah A Johnson
- Department of Chemistry, Massachusetts Institute of Technology, Cambridge, MA, 02139, USA
- Yang Shao-Horn
- Department of Materials Science and Engineering, Massachusetts Institute of Technology, Cambridge, MA, 02139, USA
- Department of Mechanical Engineering, Massachusetts Institute of Technology, Cambridge, MA, 02139, USA
- Jeffrey C Grossman
- Department of Materials Science and Engineering, Massachusetts Institute of Technology, Cambridge, MA, 02139, USA
- Research Laboratory of Electronics, Massachusetts Institute of Technology, Cambridge, MA, 02139, USA
25
|
Flam-Shepherd D, Zhu K, Aspuru-Guzik A. Language models can learn complex molecular distributions. Nat Commun 2022; 13:3293. [PMID: 35672310 PMCID: PMC9174447 DOI: 10.1038/s41467-022-30839-x] [Citation(s) in RCA: 42] [Impact Index Per Article: 21.0] [Received: 12/06/2021] [Accepted: 05/16/2022] [Indexed: 11/09/2022] Open
Abstract
Deep generative models of molecules have grown immensely in popularity; trained on relevant datasets, these models are used to search through chemical space. The downstream utility of generative models for the inverse design of novel functional compounds depends on their ability to learn a training distribution of molecules. The simplest example is a language model that takes the form of a recurrent neural network and generates molecules using a string representation. Since their initial use, subsequent work has shown that language models are very capable; in particular, recent research has demonstrated their utility in the low-data regime. In this work, we investigate the capacity of simple language models to learn more complex distributions of molecules. For this purpose, we introduce several challenging generative modeling tasks by compiling larger, more complex distributions of molecules, and we evaluate the ability of language models on each task. The results demonstrate that language models are powerful generative models, capable of adeptly learning complex molecular distributions. Language models can accurately generate distributions of the highest-scoring penalized LogP molecules in ZINC15, multi-modal molecular distributions, and the largest molecules in PubChem. The results highlight the limitations of some of the most popular and recent graph generative models, many of which cannot scale to these molecular distributions.
Affiliation(s)
- Daniel Flam-Shepherd
- Department of Computer Science, University of Toronto, Toronto, ON, M5S 2E4, Canada
- Vector Institute for Artificial Intelligence, Toronto, ON, M5S 1M1, Canada
- Kevin Zhu
- Department of Computer Science, University of Toronto, Toronto, ON, M5S 2E4, Canada
- Alán Aspuru-Guzik
- Department of Computer Science, University of Toronto, Toronto, ON, M5S 2E4, Canada
- Vector Institute for Artificial Intelligence, Toronto, ON, M5S 1M1, Canada
- Department of Chemistry, University of Toronto, Toronto, ON, M5G 1Z8, Canada
- Canadian Institute for Advanced Research, Toronto, ON, M5G 1Z8, Canada
26
|
Mohapatra S, An J, Gómez-Bombarelli R. Chemistry-informed macromolecule graph representation for similarity computation, unsupervised and supervised learning. Machine Learning: Science and Technology 2022. [DOI: 10.1088/2632-2153/ac545e] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Indexed: 12/17/2022] Open
Abstract
The near-infinite chemical diversity of natural and artificial macromolecules arises from the vast range of possible component monomers, linkages, and polymer topologies. This enormous variety contributes to the ubiquity and indispensability of macromolecules but hinders the development of general machine learning methods with macromolecules as input. To address this, we developed a chemistry-informed graph representation of macromolecules that enables quantification of structural similarity and interpretable supervised learning for macromolecules. Our work enables quantitative chemistry-informed decision-making and iterative design in the macromolecular chemical space.
27
|
Miyake Y, Saeki A. Machine Learning-Assisted Development of Organic Solar Cell Materials: Issues, Analyses, and Outlooks. J Phys Chem Lett 2021; 12:12391-12401. [PMID: 34939806 DOI: 10.1021/acs.jpclett.1c03526] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 06/14/2023]
Abstract
Nonfullerene acceptors (NFAs), a class of small-molecule electron acceptors, have substantially improved the power conversion efficiency of organic photovoltaics (OPVs). However, the large structural freedom of π-conjugated polymers and molecules makes their chemical space difficult to explore with limited resources. Machine learning (ML), which is based on rapidly growing artificial intelligence technology, is a high-throughput method to accelerate material design and process optimization; however, it suffers from limitations in terms of prediction accuracy, interpretability, data collection, and available data (particularly, experimental data). This recognition motivates the present Perspective, which focuses on utilizing experimental data sets for ML to efficiently aid OPV research. This Perspective discusses the trends in ML-OPV publications, the NFA category, and the effects of data size and explanatory variables (fingerprints or Mordred descriptors) on the prediction accuracy and explainability, which broadens the scope of ML and would be useful for the development of next-generation solar cell materials.
Affiliation(s)
- Yuta Miyake
- Department of Applied Chemistry, Graduate School of Engineering, Osaka University, 2-1 Yamadaoka, Suita, Osaka 565-0871, Japan
- Akinori Saeki
- Department of Applied Chemistry, Graduate School of Engineering, Osaka University, 2-1 Yamadaoka, Suita, Osaka 565-0871, Japan
- Innovative Catalysis Science Division, Institute for Open and Transdisciplinary Research Initiatives (ICS-OTRI), Osaka University, 1-1 Yamadaoka, Suita, Osaka 565-0871, Japan
28
|
Werner M. Decoding Interaction Patterns from the Chemical Sequence of Polymers Using Neural Networks. ACS Macro Lett 2021; 10:1333-1338. [PMID: 35549009 DOI: 10.1021/acsmacrolett.1c00325] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 12/16/2022]
Abstract
The relation between chemical sequences and the properties of polymers is considered using artificial neural networks with a low-dimensional bottleneck layer of neurons. These encoder-decoder architectures may compress the input information into a meaningful set of physical variables that describe the correlation between distinct types of data. In this work, neural networks were trained to translate a sequence of hydrophilic and hydrophobic segments into the effective free energy landscape of a copolymer interacting with a lipid membrane. The training data were obtained by the sampling of coarse-grained polymer conformations in a given membrane density field. Neural networks that were split into separate channels have learned to decompose the free energy into independent components that are explainable by known concepts from polymer physics. The semantic information in the hidden layers was employed to predict polymer translocation events through a membrane for a more detailed dynamic model via a transfer learning procedure. The search for minimal translocation times in the compressed chemical space underlined that nontrivial sequence motifs may lead to optimal properties.
Affiliation(s)
- Marco Werner
- Leibniz-Institut für Polymerforschung Dresden e.V., Hohe Straße 6, 01069 Dresden, Germany
29
|
Joshi RP, Kumar N. Artificial Intelligence for Autonomous Molecular Design: A Perspective. Molecules 2021; 26:6761. [PMID: 34833853 PMCID: PMC8619999 DOI: 10.3390/molecules26226761] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 08/16/2021] [Revised: 10/23/2021] [Accepted: 10/29/2021] [Indexed: 11/23/2022] Open
Abstract
Domain-aware artificial intelligence has been increasingly adopted in recent years to expedite molecular design in various applications, including drug design and discovery. Recent advances in areas such as physics-informed machine learning and reasoning, software engineering, high-end hardware development, and computing infrastructures are providing opportunities to build scalable and explainable AI molecular discovery systems. This could improve a design hypothesis through feedback analysis, data integration that can provide a basis for the introduction of end-to-end automation for compound discovery and optimization, and enable more intelligent searches of chemical space. Several state-of-the-art ML architectures are predominantly and independently used for predicting the properties of small molecules, their high throughput synthesis, and screening, iteratively identifying and optimizing lead therapeutic candidates. However, such deep learning and ML approaches also raise considerable conceptual, technical, scalability, and end-to-end error quantification challenges, as well as skepticism about the current AI hype to build automated tools. To this end, synergistically and intelligently using these individual components along with robust quantum physics-based molecular representation and data generation tools in a closed-loop holds enormous promise for accelerated therapeutic design to critically analyze the opportunities and challenges for their more widespread application. This article aims to identify the most recent technology and breakthrough achieved by each of the components and discusses how such autonomous AI and ML workflows can be integrated to radically accelerate the protein target or disease model-based probe design that can be iteratively validated experimentally. Taken together, this could significantly reduce the timeline for end-to-end therapeutic discovery and optimization upon the arrival of any novel zoonotic transmission event. 
Our article serves as a guide for medicinal, computational chemistry and biology, analytical chemistry, and the ML community to practice autonomous molecular design in precision medicine and drug discovery.
Affiliation(s)
- Neeraj Kumar
- Computational Biology Group, Biological Science Division, Pacific Northwest National Laboratory, 902 Battelle Blvd, Richland, WA 99352, USA
30
|
Guan Y, Shree Sowndarya SV, Gallegos LC, St John PC, Paton RS. Real-time prediction of 1H and 13C chemical shifts with DFT accuracy using a 3D graph neural network. Chem Sci 2021; 12:12012-12026. [PMID: 34667567 PMCID: PMC8457395 DOI: 10.1039/d1sc03343c] [Citation(s) in RCA: 51] [Impact Index Per Article: 17.0] [Received: 06/20/2021] [Accepted: 07/19/2021] [Indexed: 11/23/2022] Open
Abstract
Nuclear magnetic resonance (NMR) is one of the primary techniques used to elucidate the chemical structure, bonding, stereochemistry, and conformation of organic compounds. The distinct chemical shifts in an NMR spectrum depend upon each atom's local chemical environment and are influenced by both through-bond and through-space interactions with other atoms and functional groups. The in silico prediction of NMR chemical shifts using quantum mechanical (QM) calculations is now commonplace in aiding organic structural assignment since spectra can be computed for several candidate structures and then compared with experimental values to find the best possible match. However, the computational demands of calculating multiple structural- and stereo-isomers, each of which may typically exist as an ensemble of rapidly-interconverting conformations, are expensive. Additionally, the QM predictions themselves may lack sufficient accuracy to identify a correct structure. In this work, we address both of these shortcomings by developing a rapid machine learning (ML) protocol to predict 1H and 13C chemical shifts through an efficient graph neural network (GNN) using 3D structures as input. Transfer learning with experimental data is used to improve the final prediction accuracy of a model trained using QM calculations. When tested on the CHESHIRE dataset, the proposed model predicts observed 13C chemical shifts with comparable accuracy to the best-performing DFT functionals (1.5 ppm) in around 1/6000 of the CPU time. An automated prediction webserver and graphical interface are accessible online at http://nova.chem.colostate.edu/cascade/. 
We further demonstrate the model in three applications: first, we use the model to decide the correct organic structure from candidates through experimental spectra, including complex stereoisomers; second, we automatically detect and revise incorrect chemical shift assignments in a popular NMR database, the NMRShiftDB; and third, we use NMR chemical shifts as descriptors for determination of the sites of electrophilic aromatic substitution.
From quantum chemical and experimental NMR data, a 3D graph neural network, CASCADE, has been developed to predict carbon and proton chemical shifts. Stereoisomers and conformers of organic molecules can be correctly distinguished.
Affiliation(s)
- Yanfei Guan
- Department of Chemistry, Colorado State University Fort Collins CO 80523 USA
- S V Shree Sowndarya
- Department of Chemistry, Colorado State University Fort Collins CO 80523 USA
- Liliana C Gallegos
- Department of Chemistry, Colorado State University Fort Collins CO 80523 USA
- Peter C St John
- Biosciences Center, National Renewable Energy Laboratory Golden CO 80401 USA
- Robert S Paton
- Department of Chemistry, Colorado State University Fort Collins CO 80523 USA
31
|
Sattari K, Xie Y, Lin J. Data-driven algorithms for inverse design of polymers. Soft Matter 2021; 17:7607-7622. [PMID: 34397078 DOI: 10.1039/d1sm00725d] [Citation(s) in RCA: 22] [Impact Index Per Article: 7.3] [Indexed: 06/13/2023]
Abstract
The ever-increasing demand for novel polymers with superior properties requires a deeper understanding and exploration of the chemical space. Recently, data-driven approaches to explore the chemical space for polymer design have emerged. Among them, inverse design strategies for designing polymers with specific properties have evolved to be a significant materials informatics platform by learning hidden knowledge from materials data as well as smartly navigating the chemical space in an optimized way. In this review, we first summarize the progress in the representation of polymers, a prerequisite step for the inverse design of polymers. Then, we systematically introduce three data-driven strategies implemented for the inverse design of polymers, i.e., high-throughput virtual screening, global optimization, and generative models. Finally, we discuss the challenges and opportunities of the data-driven strategies as well as optimization algorithms employed in the inverse design of polymers.
Affiliation(s)
- Kianoosh Sattari
- Department of Mechanical and Aerospace Engineering, University of Missouri, Columbia, MO 65211, USA
32
|
Ward L, Dandu N, Blaiszik B, Narayanan B, Assary RS, Redfern PC, Foster I, Curtiss LA. Graph-Based Approaches for Predicting Solvation Energy in Multiple Solvents: Open Datasets and Machine Learning Models. J Phys Chem A 2021; 125:5990-5998. [PMID: 34191512 DOI: 10.1021/acs.jpca.1c01960] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 11/28/2022]
Abstract
The solvation properties of molecules, often estimated using quantum chemical simulations, are important in the synthesis of energy storage materials, drugs, and industrial chemicals. Here, we develop machine learning models of solvation energies to replace expensive quantum chemistry calculations with inexpensive-to-compute message-passing neural network models that require only the molecular graph as inputs. Our models are trained on a new database of solvation energies for 130,258 molecules taken from the QM9 dataset computed in five solvents (acetone, ethanol, acetonitrile, dimethyl sulfoxide, and water) via an implicit solvent model. Our best model achieves a mean absolute error of 0.5 kcal/mol for molecules with nine or fewer non-hydrogen atoms and 1 kcal/mol for molecules with between 10 and 14 non-hydrogen atoms. We make the entire dataset of 651,290 computed entries openly available and provide simple web and programmatic interfaces to enable others to run our solvation energy model on new molecules. This model calculates the solvation energies for molecules using only the SMILES string and also provides an estimate of whether each molecule is within the domain of applicability of our model. We envision that the dataset and models will provide the functionality needed for the rapid screening of large chemical spaces to discover improved molecules for many applications.
Affiliation(s)
- Logan Ward
- Data Science and Learning Division, Argonne National Laboratory, Lemont, Illinois 60439, United States
- Naveen Dandu
- Materials Science Division, Argonne National Laboratory, Lemont, Illinois 60439, United States
- Ben Blaiszik
- Data Science and Learning Division, Argonne National Laboratory, Lemont, Illinois 60439, United States
- Globus, University of Chicago, Chicago, Illinois 60637, United States
- Badri Narayanan
- Materials Science Division, Argonne National Laboratory, Lemont, Illinois 60439, United States
- Department of Mechanical Engineering, University of Louisville, Louisville, Kentucky 40292, United States
- Rajeev S Assary
- Materials Science Division, Argonne National Laboratory, Lemont, Illinois 60439, United States
- Paul C Redfern
- Materials Science Division, Argonne National Laboratory, Lemont, Illinois 60439, United States
- Ian Foster
- Data Science and Learning Division, Argonne National Laboratory, Lemont, Illinois 60439, United States
- Globus, University of Chicago, Chicago, Illinois 60637, United States
- Department of Computer Science, University of Chicago, Chicago, Illinois 60637, United States
- Larry A Curtiss
- Materials Science Division, Argonne National Laboratory, Lemont, Illinois 60439, United States
33
|
Bertoni M, Duran-Frigola M, Badia-I-Mompel P, Pauls E, Orozco-Ruiz M, Guitart-Pla O, Alcalde V, Diaz VM, Berenguer-Llergo A, Brun-Heath I, Villegas N, de Herreros AG, Aloy P. Bioactivity descriptors for uncharacterized chemical compounds. Nat Commun 2021; 12:3932. [PMID: 34168145 PMCID: PMC8225676 DOI: 10.1038/s41467-021-24150-4] [Citation(s) in RCA: 31] [Impact Index Per Article: 10.3] [Received: 09/25/2020] [Accepted: 05/27/2021] [Indexed: 01/20/2023] Open
Abstract
Chemical descriptors encode the physicochemical and structural properties of small molecules, and they are at the core of chemoinformatics. The broad release of bioactivity data has prompted enriched representations of compounds, reaching beyond chemical structures and capturing their known biological properties. Unfortunately, bioactivity descriptors are not available for most small molecules, which limits their applicability to a few thousand well characterized compounds. Here we present a collection of deep neural networks able to infer bioactivity signatures for any compound of interest, even when little or no experimental information is available for them. Our signaturizers relate to bioactivities of 25 different types (including target profiles, cellular response and clinical outcomes) and can be used as drop-in replacements for chemical descriptors in day-to-day chemoinformatics tasks. Indeed, we illustrate how inferred bioactivity signatures are useful to navigate the chemical space in a biologically relevant manner, unveiling higher-order organization in natural product collections, and to enrich mostly uncharacterized chemical libraries for activity against the drug-orphan target Snail1. Moreover, we implement a battery of signature-activity relationship (SigAR) models and show a substantial improvement in performance, with respect to chemistry-based classifiers, across a series of biophysics and physiology activity prediction benchmarks.
Affiliation(s)
- Martino Bertoni
- Joint IRB-BSC-CRG Programme in Computational Biology, Institute for Research in Biomedicine (IRB Barcelona), The Barcelona Institute of Science and Technology, Barcelona, Catalonia, Spain
- Miquel Duran-Frigola
- Joint IRB-BSC-CRG Programme in Computational Biology, Institute for Research in Biomedicine (IRB Barcelona), The Barcelona Institute of Science and Technology, Barcelona, Catalonia, Spain
- Ersilia Open Source Initiative, Cambridge, UK
- Pau Badia-I-Mompel
- Joint IRB-BSC-CRG Programme in Computational Biology, Institute for Research in Biomedicine (IRB Barcelona), The Barcelona Institute of Science and Technology, Barcelona, Catalonia, Spain
- Eduardo Pauls
- Joint IRB-BSC-CRG Programme in Computational Biology, Institute for Research in Biomedicine (IRB Barcelona), The Barcelona Institute of Science and Technology, Barcelona, Catalonia, Spain
- Modesto Orozco-Ruiz
- Joint IRB-BSC-CRG Programme in Computational Biology, Institute for Research in Biomedicine (IRB Barcelona), The Barcelona Institute of Science and Technology, Barcelona, Catalonia, Spain
- Oriol Guitart-Pla
- Joint IRB-BSC-CRG Programme in Computational Biology, Institute for Research in Biomedicine (IRB Barcelona), The Barcelona Institute of Science and Technology, Barcelona, Catalonia, Spain
- Víctor Alcalde
- Joint IRB-BSC-CRG Programme in Computational Biology, Institute for Research in Biomedicine (IRB Barcelona), The Barcelona Institute of Science and Technology, Barcelona, Catalonia, Spain
- Víctor M Diaz
- Programa de Recerca en Càncer, Institut Hospital del Mar d'Investigacions Mèdiques (IMIM) and Departament de Ciències de la Salut, Universitat Pompeu Fabra (UPF), Barcelona, Catalonia, Spain
- Faculty of Medicine and Health Sciences, International University of Catalonia, Barcelona, Catalonia, Spain
- Antoni Berenguer-Llergo
- Joint IRB-BSC-CRG Programme in Computational Biology, Institute for Research in Biomedicine (IRB Barcelona), The Barcelona Institute of Science and Technology, Barcelona, Catalonia, Spain
- Isabelle Brun-Heath
- Joint IRB-BSC-CRG Programme in Computational Biology, Institute for Research in Biomedicine (IRB Barcelona), The Barcelona Institute of Science and Technology, Barcelona, Catalonia, Spain
- Núria Villegas
- Joint IRB-BSC-CRG Programme in Computational Biology, Institute for Research in Biomedicine (IRB Barcelona), The Barcelona Institute of Science and Technology, Barcelona, Catalonia, Spain
- Antonio García de Herreros
- Programa de Recerca en Càncer, Institut Hospital del Mar d'Investigacions Mèdiques (IMIM) and Departament de Ciències de la Salut, Universitat Pompeu Fabra (UPF), Barcelona, Catalonia, Spain
- Patrick Aloy
- Joint IRB-BSC-CRG Programme in Computational Biology, Institute for Research in Biomedicine (IRB Barcelona), The Barcelona Institute of Science and Technology, Barcelona, Catalonia, Spain
- Institució Catalana de Recerca i Estudis Avançats (ICREA), Barcelona, Catalonia, Spain
34
|
Westermayr J, Gastegger M, Schütt KT, Maurer RJ. Perspective on integrating machine learning into computational chemistry and materials science. J Chem Phys 2021; 154:230903. [PMID: 34241249 DOI: 10.1063/5.0047760] [Citation(s) in RCA: 67] [Impact Index Per Article: 22.3] [Indexed: 01/12/2023] Open
Abstract
Machine learning (ML) methods are being used in almost every conceivable area of electronic structure theory and molecular simulation. In particular, ML has become firmly established in the construction of high-dimensional interatomic potentials. Not a day goes by without another proof of principle being published on how ML methods can represent and predict quantum mechanical properties, be they observable (such as molecular polarizabilities) or not (such as atomic charges). As ML is becoming pervasive in electronic structure theory and molecular simulation, we provide an overview of how atomistic computational modeling is being transformed by the incorporation of ML approaches. From the perspective of the practitioner in the field, we assess how common workflows to predict structure, dynamics, and spectroscopy are affected by ML. Finally, we discuss how a tighter and lasting integration of ML methods with computational chemistry and materials science can be achieved and what it will mean for research practice, software development, and postgraduate training.
Affiliation(s)
- Julia Westermayr
- Department of Chemistry, University of Warwick, Gibbet Hill Road, Coventry CV4 7AL, United Kingdom
- Michael Gastegger
- Machine Learning Group, Technische Universität Berlin, 10587 Berlin, Germany
- Kristof T Schütt
- Machine Learning Group, Technische Universität Berlin, 10587 Berlin, Germany
- Reinhard J Maurer
- Department of Chemistry, University of Warwick, Gibbet Hill Road, Coventry CV4 7AL, United Kingdom
35
|
Deng D, Chen X, Zhang R, Lei Z, Wang X, Zhou F. XGraphBoost: Extracting Graph Neural Network-Based Features for a Better Prediction of Molecular Properties. J Chem Inf Model 2021; 61:2697-2705. [PMID: 34009965 DOI: 10.1021/acs.jcim.0c01489] [Citation(s) in RCA: 14] [Impact Index Per Article: 4.7] [Indexed: 01/02/2023]
Abstract
Determining the properties of chemical molecules is essential for screening candidates similar to a specific drug. These candidate molecules are further evaluated for their target binding affinities, side effects, target missing probabilities, etc. Conventional machine learning algorithms demonstrated satisfying prediction accuracies of molecular properties. A molecule cannot be directly loaded into a machine learning model, and a set of engineered features needs to be designed and calculated from a molecule. Such hand-crafted features rely heavily on the experiences of the investigating researchers. The concept of graph neural networks (GNNs) was recently introduced to describe the chemical molecules. The features may be automatically and objectively extracted from the molecules through various types of GNNs, e.g., GCN (graph convolution network), GGNN (gated graph neural network), DMPNN (directed message passing neural network), etc. However, the training of a stable GNN model requires a huge number of training samples and a large amount of computing power, compared with the conventional machine learning strategies. This study proposed the integrated framework XGraphBoost to extract the features using a GNN and build an accurate prediction model of molecular properties using the classifier XGBoost. The proposed framework XGraphBoost fully inherits the merits of the GNN-based automatic molecular feature extraction and XGBoost-based accurate prediction performance. Both classification and regression problems were evaluated using the framework XGraphBoost. The experimental results strongly suggest that XGraphBoost may facilitate the efficient and accurate predictions of various molecular properties. The source code is freely available to academic users at https://github.com/chenxiaowei-vincent/XGraphBoost.git.
Affiliation(s)
- Daiguo Deng
  - Fermion Technology Co., Ltd., Guangzhou, Guangdong 510000, P.R. China
- Xiaowei Chen
  - Fermion Technology Co., Ltd., Guangzhou, Guangdong 510000, P.R. China
- Ruochi Zhang
  - Fermion Technology Co., Ltd., Guangzhou, Guangdong 510000, P.R. China
  - College of Computer Science and Technology, and Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, Jilin 130012, P.R. China
- Zengrong Lei
  - Fermion Technology Co., Ltd., Guangzhou, Guangdong 510000, P.R. China
- Xiaojian Wang
  - State Key Laboratory of Bioactive Substances and Functions of Natural Medicines, Institute of Materia Medica, Peking Union Medical College and Chinese Academy of Medical Sciences, Beijing 100050, P.R. China
- Fengfeng Zhou
  - College of Computer Science and Technology, and Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, Jilin 130012, P.R. China
|
36
|
Mercado R, Rastemo T, Lindelöf E, Klambauer G, Engkvist O, Chen H, Bjerrum EJ. Practical notes on building molecular graph generative models. Applied AI Letters 2021. [DOI: 10.1002/ail2.18] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Indexed: 01/24/2023]
Affiliation(s)
- Rocío Mercado
  - Molecular AI, Discovery Sciences, BioPharmaceuticals R&D, AstraZeneca, Gothenburg, Sweden
- Tobias Rastemo
  - Molecular AI, Discovery Sciences, BioPharmaceuticals R&D, AstraZeneca, Gothenburg, Sweden
  - Chalmers University of Technology, Gothenburg, Sweden
- Edvard Lindelöf
  - Molecular AI, Discovery Sciences, BioPharmaceuticals R&D, AstraZeneca, Gothenburg, Sweden
  - Chalmers University of Technology, Gothenburg, Sweden
- Günter Klambauer
  - Institute of Bioinformatics, Johannes Kepler University, Linz, Austria
- Ola Engkvist
  - Molecular AI, Discovery Sciences, BioPharmaceuticals R&D, AstraZeneca, Gothenburg, Sweden
- Hongming Chen
  - Centre of Chemistry and Chemical Biology, Guangzhou Regenerative Medicine and Health, Guangdong Laboratory, Guangzhou, China
- Esben Jannik Bjerrum
  - Molecular AI, Discovery Sciences, BioPharmaceuticals R&D, AstraZeneca, Gothenburg, Sweden
|
37
|
Xu P, Lu T, Ju L, Tian L, Li M, Lu W. Machine Learning Aided Design of Polymer with Targeted Band Gap Based on DFT Computation. J Phys Chem B 2021; 125:601-611. [DOI: 10.1021/acs.jpcb.0c08674] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Indexed: 12/16/2022]
Affiliation(s)
- Pengcheng Xu
  - Materials Genome Institute, Shanghai University, and Shanghai Materials Genome Institute, Shanghai 200444, China
- Tian Lu
  - Materials Genome Institute, Shanghai University, and Shanghai Materials Genome Institute, Shanghai 200444, China
- Lifei Ju
  - Department of Chemistry, College of Sciences, Shanghai University, Shanghai 200444, China
- Lumin Tian
  - Department of Chemistry, College of Sciences, Shanghai University, Shanghai 200444, China
- Minjie Li
  - Department of Chemistry, College of Sciences, Shanghai University, Shanghai 200444, China
- Wencong Lu
  - Materials Genome Institute, Shanghai University, and Shanghai Materials Genome Institute, Shanghai 200444, China
  - Department of Chemistry, College of Sciences, Shanghai University, Shanghai 200444, China
|
38
|
Kim Y, Thomas AE, Robichaud DJ, Iisa K, St John PC, Etz BD, Fioroni GM, Dutta A, McCormick RL, Mukarakate C, Kim S. A perspective on biomass-derived biofuels: From catalyst design principles to fuel properties. J Hazard Mater 2020; 400:123198. [PMID: 32585513 DOI: 10.1016/j.jhazmat.2020.123198] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 04/22/2020] [Revised: 06/05/2020] [Accepted: 06/09/2020] [Indexed: 05/24/2023]
Abstract
The hazards to health and the environment associated with the transportation sector include smog, particulate matter, and greenhouse gas emissions. Conversion of lignocellulosic biomass into biofuels has the potential to provide significant amounts of infrastructure-compatible liquid transportation fuels that reduce those hazardous materials. However, the development of these technologies is inefficient, due to: (i) the lack of a priori fuel property consideration, (ii) poor shared vocabulary between process chemists and fuel engineers, and (iii) modern and future engines operating outside the range of traditional autoignition metrics such as octane or cetane numbers. In this perspective, we describe an approach where we follow a "fuel-property first" design methodology with a sequence of (i) identifying the desirable fuel properties for modern engines, (ii) defining molecules capable of delivering those properties, and (iii) designing catalysts and processes that can produce those molecules from a candidate feedstock in a specific conversion process. Computational techniques need to be leveraged to minimize expenses and experimental efforts on low-promise options. This concept is illustrated with current research information available for biomass conversion to fuels via catalytic fast pyrolysis and hydrotreating; outstanding challenges and research tools necessary for a successful outcome are presented.
Affiliation(s)
- Yeonjoon Kim
  - National Renewable Energy Laboratory, Golden, CO 80401, United States
- Anna E Thomas
  - National Renewable Energy Laboratory, Golden, CO 80401, United States
- David J Robichaud
  - National Renewable Energy Laboratory, Golden, CO 80401, United States
- Kristiina Iisa
  - National Renewable Energy Laboratory, Golden, CO 80401, United States
- Peter C St John
  - National Renewable Energy Laboratory, Golden, CO 80401, United States
- Brian D Etz
  - National Renewable Energy Laboratory, Golden, CO 80401, United States
- Gina M Fioroni
  - National Renewable Energy Laboratory, Golden, CO 80401, United States
- Abhijit Dutta
  - National Renewable Energy Laboratory, Golden, CO 80401, United States
- Calvin Mukarakate
  - National Renewable Energy Laboratory, Golden, CO 80401, United States
- Seonah Kim
  - National Renewable Energy Laboratory, Golden, CO 80401, United States
|
39
|
Webb MA, Jackson NE, Gil PS, de Pablo JJ. Targeted sequence design within the coarse-grained polymer genome. Sci Adv 2020; 6:eabc6216. [PMID: 33087352 PMCID: PMC7577717 DOI: 10.1126/sciadv.abc6216] [Citation(s) in RCA: 66] [Impact Index Per Article: 16.5] [Received: 05/04/2020] [Accepted: 09/02/2020] [Indexed: 05/05/2023]
Abstract
The chemical design of polymers with target structural and/or functional properties represents a grand challenge in materials science. While data-driven design approaches are promising, success with polymers has been limited, largely due to limitations in data availability. Here, we demonstrate the targeted sequence design of single-chain structure in polymers by combining coarse-grained modeling, machine learning, and model optimization. Nearly 2000 unique coarse-grained polymers are simulated to construct and analyze machine learning models. We find that deep neural networks inexpensively and reliably predict structural properties with limited sequence information as input. By coupling trained ML models with sequential model-based optimization, polymer sequences are proposed to exhibit globular, swollen, or rod-like behaviors, which are verified by explicit simulations. This work highlights the promising integration of coarse-grained modeling with data-driven design and represents a necessary and crucial step toward more complex polymer design efforts.
Affiliation(s)
- Michael A Webb
  - Pritzker School of Molecular Engineering, University of Chicago, Chicago, IL 60615, USA
- Nicholas E Jackson
  - Pritzker School of Molecular Engineering, University of Chicago, Chicago, IL 60615, USA
  - Center for Molecular Engineering and Materials Science Division, Argonne National Laboratory, Lemont, IL 06349, USA
- Phwey S Gil
  - Pritzker School of Molecular Engineering, University of Chicago, Chicago, IL 60615, USA
- Juan J de Pablo
  - Pritzker School of Molecular Engineering, University of Chicago, Chicago, IL 60615, USA
  - Center for Molecular Engineering and Materials Science Division, Argonne National Laboratory, Lemont, IL 06349, USA
|
40
|
Prediction of organic homolytic bond dissociation enthalpies at near chemical accuracy with sub-second computational cost. Nat Commun 2020; 11:2328. [PMID: 32393773 PMCID: PMC7214445 DOI: 10.1038/s41467-020-16201-z] [Citation(s) in RCA: 90] [Impact Index Per Article: 22.5] [Received: 11/24/2019] [Accepted: 04/15/2020] [Indexed: 12/31/2022] Open
Abstract
Bond dissociation enthalpies (BDEs) of organic molecules play a fundamental role in determining chemical reactivity and selectivity. However, BDE computations at sufficiently high levels of quantum mechanical theory require substantial computing resources. In this paper, we develop a machine learning model capable of accurately predicting BDEs for organic molecules in a fraction of a second. We perform automated density functional theory (DFT) calculations at the M06-2X/def2-TZVP level of theory for 42,577 small organic molecules, resulting in 290,664 BDEs. A graph neural network trained on a subset of these results achieves a mean absolute error of 0.58 kcal mol⁻¹ (vs DFT) for BDEs of unseen molecules. We further demonstrate the model on two applications: first, we rapidly and accurately predict major sites of hydrogen abstraction in the metabolism of drug-like molecules, and second, we determine the dominant molecular fragmentation pathways during soot formation.
|
41
|
Lambard G, Gracheva E. SMILES-X: autonomous molecular compounds characterization for small datasets without descriptors. Mach Learn Sci Technol 2020. [DOI: 10.1088/2632-2153/ab57f3] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 12/18/2022] Open
Abstract
There is more and more evidence that machine learning can be successfully applied in materials science and related fields. However, datasets in these fields are often quite small (from tens to several thousands of samples). This means the most advanced machine learning techniques remain neglected, as they are considered to be applicable to big data only. Moreover, materials informatics methods often rely on human-engineered descriptors, that should be carefully chosen, or even created, to fit the physicochemical property that one intends to predict. In this article, we propose a new method that tackles both the issue of small datasets and the difficulty of developing task-specific descriptors. The SMILES-X is an autonomous pipeline for molecular compounds characterisation based on a {Embed-Encode-Attend-Predict} neural architecture with a data-specific Bayesian hyper-parameters optimisation. The only input to the architecture, the SMILES strings, are de-canonicalised in order to efficiently augment the data. One of the key features of the architecture is the attention mechanism, which enables the interpretation of output predictions without extra computational cost. The SMILES-X achieves state-of-the-art results in the inference of aqueous solubility ($\overline{\mathrm{RMSE}}_{\mathrm{test}} \simeq 0.57 \pm 0.07$ mols/L), hydration free energy ($\overline{\mathrm{RMSE}}_{\mathrm{test}} \simeq 0.81 \pm 0.22$ kcal/mol, which is ∼24.5% better than molecular dynamics simulations), and octanol/water distribution coefficient ($\overline{\mathrm{RMSE}}_{\mathrm{test}} \simeq 0.59 \pm 0.02$ for LogD at pH 7.4) of molecular compounds. The SMILES-X is intended to become an important asset in the toolkit of materials scientists and chemists. The source code for the SMILES-X is available at github.com/GLambard/SMILES-X.
|
42
|
Swanson K, Trivedi S, Lequieu J, Swanson K, Kondor R. Deep learning for automated classification and characterization of amorphous materials. Soft Matter 2020; 16:435-446. [PMID: 31803878 DOI: 10.1039/c9sm01903k] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.8] [Indexed: 05/27/2023]
Abstract
It is difficult to quantify structure-property relationships and to identify structural features of complex materials. The characterization of amorphous materials is especially challenging because their lack of long-range order makes it difficult to define structural metrics. In this work, we apply deep learning algorithms to accurately classify amorphous materials and characterize their structural features. Specifically, we show that convolutional neural networks and message passing neural networks can classify two-dimensional liquids and liquid-cooled glasses from molecular dynamics simulations with greater than 0.98 AUC, with no a priori assumptions about local particle relationships, even when the liquids and glasses are prepared at the same inherent structure energy. Furthermore, we demonstrate that message passing neural networks surpass convolutional neural networks in this context in both accuracy and interpretability. We extract a clear interpretation of how message passing neural networks evaluate liquid and glass structures by using a self-attention mechanism. Using this interpretation, we derive three novel structural metrics that accurately characterize glass formation. The methods presented here provide a procedure to identify important structural features in materials that could be missed by standard techniques and give unique insight into how these neural networks process data.
Affiliation(s)
- Kirk Swanson
  - Department of Computer Science, The University of Chicago, Chicago, IL 60637, USA
|
43
|
Peng SP, Zhao Y. Convolutional Neural Networks for the Design and Analysis of Non-Fullerene Acceptors. J Chem Inf Model 2019; 59:4993-5001. [DOI: 10.1021/acs.jcim.9b00732] [Citation(s) in RCA: 25] [Impact Index Per Article: 5.0] [Indexed: 11/29/2022]
Affiliation(s)
- Shi-Ping Peng
  - State Key Laboratory for Physical Chemistry of Solid Surfaces, Collaborative Innovation Center of Chemistry for Energy Materials, Fujian Provincial Key Lab of Theoretical and Computational Chemistry, and College of Chemistry and Chemical Engineering, Xiamen University, Xiamen 361005, China
- Yi Zhao
  - State Key Laboratory for Physical Chemistry of Solid Surfaces, Collaborative Innovation Center of Chemistry for Energy Materials, Fujian Provincial Key Lab of Theoretical and Computational Chemistry, and College of Chemistry and Chemical Engineering, Xiamen University, Xiamen 361005, China
|