1
Zhang Y, Shen C, Xia K. Multi-Cover Persistence (MCP)-based machine learning for polymer property prediction. Brief Bioinform 2024; 25:bbae465. [PMID: 39323091; PMCID: PMC11424509; DOI: 10.1093/bib/bbae465] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/29/2024] [Revised: 08/07/2024] [Accepted: 09/05/2024] [Indexed: 09/27/2024] Open
Abstract
Accurate and efficient prediction of polymer properties is crucial for polymer design. Recently, data-driven artificial intelligence (AI) models have demonstrated great promise in polymer property analysis. Despite this progress, a pivotal challenge for all AI-driven models remains the effective representation of molecules. Here we introduce Multi-Cover Persistence (MCP)-based molecular representation and featurization for the first time. Our MCP-based polymer descriptors are combined with machine learning models, in particular Gradient Boosting Tree (GBT) models, for polymer property prediction. Unlike all previous molecular representations, polymer molecular structure and interactions are represented as MCP, which uses Delaunay slices at different dimensions and Rhomboid tiling to characterize the complicated geometric and topological information within the data. Statistical features from the generated persistent barcodes are used as polymer descriptors and combined with the GBT model. Our model has been extensively validated on polymer benchmark datasets. We find that our models can outperform traditional fingerprint-based models and achieve accuracy similar to that of geometric deep learning models. In particular, our model tends to be more effective on large-sized monomer structures, demonstrating the great potential of MCP in characterizing more complicated polymer data. This work underscores the potential of MCP in polymer informatics, presenting a novel perspective on molecular representation and its application in polymer science.
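As a rough illustration of the step in which statistical features from persistent barcodes become polymer descriptors, the sketch below turns a barcode (a list of birth-death intervals) into a fixed-length feature dictionary. The function name `barcode_features`, the choice of statistics, and the example intervals are assumptions made for illustration; the abstract does not say which statistics the authors use, and computing the MCP barcodes themselves is not shown.

```python
from statistics import mean, pstdev


def barcode_features(bars):
    """Summarize a persistence barcode as a fixed-length feature dictionary.

    bars: list of (birth, death) pairs with death >= birth.
    The statistics chosen here are illustrative, not taken from the paper.
    """
    lengths = [death - birth for birth, death in bars]
    if not lengths:
        # An empty barcode still has to map to a vector of the same shape.
        return {"count": 0, "sum": 0.0, "mean": 0.0, "std": 0.0, "max": 0.0, "min": 0.0}
    return {
        "count": len(lengths),
        "sum": sum(lengths),
        "mean": mean(lengths),
        "std": pstdev(lengths),  # population standard deviation of bar lengths
        "max": max(lengths),
        "min": min(lengths),
    }


# Hypothetical barcode with three bars of lengths 1, 3, and 1.
example_bars = [(0.0, 1.0), (0.0, 3.0), (1.0, 2.0)]
features = barcode_features(example_bars)
```

Feature dictionaries of this kind, one per barcode and homology dimension, could be concatenated into a single vector and passed to any gradient boosting regressor.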
Affiliation(s)
- Yipeng Zhang
  - Division of Mathematical Sciences, School of Physical and Mathematical Sciences, Nanyang Technological University, Singapore 637371, Singapore
- Cong Shen
  - Department of Mathematics, National University of Singapore, Singapore 119076, Singapore
- Kelin Xia
  - Division of Mathematical Sciences, School of Physical and Mathematical Sciences, Nanyang Technological University, Singapore 637371, Singapore
2
Yang JH, Lee J, Kwon H, Sohn EH, Chang H, Jang S. High Glass Transition Temperature Fluorinated Polymers Based on Transfer Learning with Small Experimental Data. Macromol Rapid Commun 2024; 45:e2400161. [PMID: 38794832; DOI: 10.1002/marc.202400161] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/19/2024] [Revised: 05/21/2024] [Indexed: 05/26/2024]
Abstract
Machine learning can be used to predict the properties of polymers and explore vast chemical spaces. However, the limited number of available experimental datasets hinders the enhancement of the predictive performance of a model. This study proposes a machine learning approach that leverages transfer learning and ensemble modeling to efficiently predict the glass transition temperature (Tg) of fluorinated polymers and guide the design of high Tg copolymers. Initially, the quantum machine 9 (QM9) dataset is employed for model pretraining, thus providing robust molecular representations for the subsequent fine-tuning of a specialized copolymer dataset. Ensemble modeling is used to further enhance prediction robustness and reliability, effectively addressing the problems owing to the limited and unevenly distributed nature of the copolymer dataset. Finally, a fine-tuned ensemble model is used to navigate a vast chemical space comprising 61 monomers and identify promising candidates for high Tg fluorinated polymers. The model predicts 247 entries capable of achieving a Tg over 390 K, of which 14 are experimentally validated. This study demonstrates the potential of machine learning in material design and discovery, highlighting the effectiveness of transfer learning and ensemble modeling strategies for overcoming the challenges posed by small datasets in complex copolymer systems.
Affiliation(s)
- Jin-Hoon Yang
  - Chemical Data-Driven Research Center, Korea Research Institute of Chemical Technology, Daejeon, 34114, Republic of Korea
- Jiyoung Lee
  - Interface Materials and Engineering Laboratory, Korea Research Institute of Chemical Technology, Daejeon, 34114, Republic of Korea
- Hajin Kwon
  - Interface Materials and Engineering Laboratory, Korea Research Institute of Chemical Technology, Daejeon, 34114, Republic of Korea
- Eun-Ho Sohn
  - Interface Materials and Engineering Laboratory, Korea Research Institute of Chemical Technology, Daejeon, 34114, Republic of Korea
- Hyunju Chang
  - Chemical Data-Driven Research Center, Korea Research Institute of Chemical Technology, Daejeon, 34114, Republic of Korea
- Seunghun Jang
  - Chemical Data-Driven Research Center, Korea Research Institute of Chemical Technology, Daejeon, 34114, Republic of Korea
3
Wu T, Zhou M, Zou J, Chen Q, Qian F, Kurths J, Liu R, Tang Y. AI-guided few-shot inverse design of HDP-mimicking polymers against drug-resistant bacteria. Nat Commun 2024; 15:6288. [PMID: 39060236; PMCID: PMC11282099; DOI: 10.1038/s41467-024-50533-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/26/2023] [Accepted: 07/11/2024] [Indexed: 07/28/2024] Open
Abstract
Host defense peptide (HDP)-mimicking polymers are promising therapeutic alternatives to antibiotics and have large-scale untapped potential. Artificial intelligence (AI) exhibits promising performance in large-scale chemical-content design; however, existing AI methods struggle with the scarcity of data in each family of HDP-mimicking polymers (<10²), which is much smaller than public polymer datasets (>10⁵), and with multiple constraints on properties and structures when exploring high-dimensional polymer space. Herein, we develop a universal AI-guided few-shot inverse design framework by designing multi-modal representations to enrich polymer information for predictions and creating a graph grammar distillation for chemical space restriction to improve the efficiency of multi-constrained polymer generation with reinforcement learning. Using HDP-mimicking β-amino acid polymers as an example, we successfully simulate predictions of over 10⁵ polymers and identify 83 optimal polymers. Furthermore, we synthesize an optimal polymer, DM0.8iPen0.2, and find that it exhibits broad-spectrum and potent antibacterial activity against multiple clinically isolated antibiotic-resistant pathogens, validating the effectiveness of the AI-guided design strategy.
Affiliation(s)
- Tianyu Wu
  - Key Laboratory of Smart Manufacturing in Energy Chemical Process, East China University of Science and Technology, Shanghai, 200237, China
- Min Zhou
  - State Key Laboratory of Bioreactor Engineering, East China University of Science and Technology, Shanghai, 200237, China
- Jingcheng Zou
  - Shanghai Frontiers Science Center of Optogenetic Techniques for Cell Metabolism, Frontiers Science Center for Materiobiology and Dynamic Chemistry, Key Laboratory for Ultrafine Materials of Ministry of Education, Research Center for Biomedical Materials of Ministry of Education, School of Materials Science and Engineering, East China University of Science and Technology, Shanghai, 200237, China
- Qi Chen
  - Shanghai Frontiers Science Center of Optogenetic Techniques for Cell Metabolism, Frontiers Science Center for Materiobiology and Dynamic Chemistry, Key Laboratory for Ultrafine Materials of Ministry of Education, Research Center for Biomedical Materials of Ministry of Education, School of Materials Science and Engineering, East China University of Science and Technology, Shanghai, 200237, China
- Feng Qian
  - Key Laboratory of Smart Manufacturing in Energy Chemical Process, East China University of Science and Technology, Shanghai, 200237, China
- Jürgen Kurths
  - Potsdam Institute for Climate Impact Research (PIK), Potsdam, 14473, Germany
  - Institut für Physik, Humboldt-Universität zu Berlin, Berlin, 10115, Germany
  - The Research Institute of Intelligent Complex Systems, Fudan University, Shanghai, 200433, China
- Runhui Liu
  - State Key Laboratory of Bioreactor Engineering, East China University of Science and Technology, Shanghai, 200237, China
  - Shanghai Frontiers Science Center of Optogenetic Techniques for Cell Metabolism, Frontiers Science Center for Materiobiology and Dynamic Chemistry, Key Laboratory for Ultrafine Materials of Ministry of Education, Research Center for Biomedical Materials of Ministry of Education, School of Materials Science and Engineering, East China University of Science and Technology, Shanghai, 200237, China
- Yang Tang
  - Key Laboratory of Smart Manufacturing in Energy Chemical Process, East China University of Science and Technology, Shanghai, 200237, China
4
Gurnani R, Shukla S, Kamal D, Wu C, Hao J, Kuenneth C, Aklujkar P, Khomane A, Daniels R, Deshmukh AA, Cao Y, Sotzing G, Ramprasad R. AI-assisted discovery of high-temperature dielectrics for energy storage. Nat Commun 2024; 15:6107. [PMID: 39030220; PMCID: PMC11271506; DOI: 10.1038/s41467-024-50413-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/07/2024] [Accepted: 07/01/2024] [Indexed: 07/21/2024] Open
Abstract
Electrostatic capacitors play a crucial role as energy storage devices in modern electrical systems. Energy density, the figure of merit for electrostatic capacitors, is primarily determined by the choice of dielectric material. Most industry-grade polymer dielectrics are flexible polyolefins or rigid aromatics, possessing high energy density or high thermal stability, but not both. Here, we employ artificial intelligence (AI), established polymer chemistry, and molecular engineering to discover a suite of dielectrics in the polynorbornene and polyimide families. Many of the discovered dielectrics exhibit high thermal stability and high energy density over a broad temperature range. One such dielectric displays an energy density of 8.3 J cc⁻¹ at 200 °C, a value 11× that of any commercially available polymer dielectric at this temperature. We also evaluate pathways to further enhance the polynorbornene and polyimide families, enabling these capacitors to perform well in demanding applications (e.g., aerospace) while being environmentally sustainable. These findings expand the potential applications of electrostatic capacitors within the 85-200 °C temperature range, at which there is presently no good commercial solution. More broadly, this research demonstrates the impact of AI on chemical structure generation and property prediction, highlighting the potential for materials design advancement beyond electrostatic capacitors.
Affiliation(s)
- Rishi Gurnani
  - School of Materials Science and Engineering, Georgia Institute of Technology, Atlanta, GA, USA
  - Polymer Program, Institute of Materials Science, University of Connecticut, Storrs, 06296, CT, USA
- Stuti Shukla
  - Materials Science Program, Institute of Materials Science, University of Connecticut, Storrs, CT, USA
- Deepak Kamal
  - School of Materials Science and Engineering, Georgia Institute of Technology, Atlanta, GA, USA
- Chao Wu
  - Electrical Insulation Research Center, Institute of Materials Science, University of Connecticut, Storrs, CT, USA
  - Department of Electrical Engineering, Tsinghua University, Beijing, China
- Jing Hao
  - Electrical Insulation Research Center, Institute of Materials Science, University of Connecticut, Storrs, CT, USA
- Christopher Kuenneth
  - School of Materials Science and Engineering, Georgia Institute of Technology, Atlanta, GA, USA
  - Faculty of Engineering Science, University of Bayreuth, Bayreuth, Germany
- Pritish Aklujkar
  - Polymer Program, Institute of Materials Science, University of Connecticut, Storrs, CT, USA
- Ashish Khomane
  - Materials Science Program, Institute of Materials Science, University of Connecticut, Storrs, CT, USA
- Robert Daniels
  - Materials Science Program, Institute of Materials Science, University of Connecticut, Storrs, CT, USA
- Ajinkya A Deshmukh
  - Polymer Program, Institute of Materials Science, University of Connecticut, Storrs, 06296, CT, USA
- Yang Cao
  - Electrical Insulation Research Center, Institute of Materials Science, University of Connecticut, Storrs, CT, USA
- Gregory Sotzing
  - Materials Science Program, Institute of Materials Science, University of Connecticut, Storrs, CT, USA
  - Polymer Program, Institute of Materials Science, University of Connecticut, Storrs, CT, USA
- Rampi Ramprasad
  - School of Materials Science and Engineering, Georgia Institute of Technology, Atlanta, GA, USA
5
Wang R, Fu T, Yang YJ, Song X, Wang XL, Wang YZ. Scientific Discovery Framework Accelerating Advanced Polymeric Materials Design. Research (Washington, D.C.) 2024; 7:0406. [PMID: 38979514; PMCID: PMC11228074; DOI: 10.34133/research.0406] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/01/2024] [Accepted: 05/22/2024] [Indexed: 07/10/2024]
Abstract
Organic polymer materials, the most abundantly produced materials, are inherently flammable, which makes them a potential cause of human casualties and property losses. Targeted polymer design is still hindered by the lack of a scientific foundation. Herein, we present a robust, generalizable, and intelligent polymer discovery framework that combines an in situ burning analyzer, a virtual reaction generator, and a material genomic model to achieve results that surpass the sum of the individual parts. Notably, the high-throughput analyzer, created for the first time and grounded in multiple spectroscopic principles, enables in situ capture of massive numbers of combustion intermediates; the physical apparatus is then transformed into the virtual reaction generator, which acquires exponentially more intermediate information; further, the proposed feature engineering tool, which embeds both polymer hierarchical structures and massive intermediate data, yields a generalizable genomic model with excellent universality (adapting to over 20 kinds of polymers) and high accuracy (88.8%), and succeeds in discovering a series of novel polymers. This emerging approach addresses targeted polymer design for flame-retardant applications and underscores its pivotal role in accelerating polymeric materials discovery.
Affiliation(s)
- Ran Wang
  - The Collaborative Innovation Center for Eco-Friendly and Fire-Safety Polymeric Materials (MoE), National Engineering Laboratory of Eco-Friendly Polymeric Materials (Sichuan), State Key Laboratory of Polymer Materials Engineering, College of Chemistry, Sichuan University, Chengdu 610064, China
- Teng Fu
  - The Collaborative Innovation Center for Eco-Friendly and Fire-Safety Polymeric Materials (MoE), National Engineering Laboratory of Eco-Friendly Polymeric Materials (Sichuan), State Key Laboratory of Polymer Materials Engineering, College of Chemistry, Sichuan University, Chengdu 610064, China
- Ya-Jie Yang
  - The Collaborative Innovation Center for Eco-Friendly and Fire-Safety Polymeric Materials (MoE), National Engineering Laboratory of Eco-Friendly Polymeric Materials (Sichuan), State Key Laboratory of Polymer Materials Engineering, College of Chemistry, Sichuan University, Chengdu 610064, China
- Xuan Song
  - The Collaborative Innovation Center for Eco-Friendly and Fire-Safety Polymeric Materials (MoE), National Engineering Laboratory of Eco-Friendly Polymeric Materials (Sichuan), State Key Laboratory of Polymer Materials Engineering, College of Chemistry, Sichuan University, Chengdu 610064, China
- Xiu-Li Wang
  - The Collaborative Innovation Center for Eco-Friendly and Fire-Safety Polymeric Materials (MoE), National Engineering Laboratory of Eco-Friendly Polymeric Materials (Sichuan), State Key Laboratory of Polymer Materials Engineering, College of Chemistry, Sichuan University, Chengdu 610064, China
- Yu-Zhong Wang
  - The Collaborative Innovation Center for Eco-Friendly and Fire-Safety Polymeric Materials (MoE), National Engineering Laboratory of Eco-Friendly Polymeric Materials (Sichuan), State Key Laboratory of Polymer Materials Engineering, College of Chemistry, Sichuan University, Chengdu 610064, China
6
Kehrein J, Bunker A, Luxenhofer R. POxload: Machine Learning Estimates Drug Loadings of Polymeric Micelles. Mol Pharm 2024; 21:3356-3374. [PMID: 38805643; PMCID: PMC11394009; DOI: 10.1021/acs.molpharmaceut.4c00086] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/30/2024]
Abstract
Block copolymers, composed of poly(2-oxazoline)s and poly(2-oxazine)s, can serve as drug delivery systems; they form micelles that carry poorly water-soluble drugs. Many recent studies have investigated the effects of structural changes of the polymer and the hydrophobic cargo on drug loading. In this work, we combine these data to establish an extended formulation database. Different molecular properties and fingerprints are tested for their applicability to serve as formulation-specific mixture descriptors. A variety of classification and regression models are built for different descriptor subsets and thresholds of loading efficiency and loading capacity, with the best models achieving overall good statistics for both cross- and external validation (balanced accuracies of 0.8). Subsequently, important features are dissected for interpretation, and the DrugBank is screened for potential therapeutic use cases where these polymers could be used to develop novel formulations of hydrophobic drugs. The most promising models are provided as an open-source software tool for other researchers to test the applicability of these delivery systems for potential new drug candidates.
Affiliation(s)
- Josef Kehrein
  - Soft Matter Chemistry, Department of Chemistry, Faculty of Science, University of Helsinki, A. I. Virtasen aukio 1, 00014 Helsinki, Finland
  - Drug Research Program, Division of Pharmaceutical Biosciences Faculty of Pharmacy, University of Helsinki, Viikinkaari 5 E, 00014 Helsinki, Finland
- Alex Bunker
  - Drug Research Program, Division of Pharmaceutical Biosciences Faculty of Pharmacy, University of Helsinki, Viikinkaari 5 E, 00014 Helsinki, Finland
- Robert Luxenhofer
  - Soft Matter Chemistry, Department of Chemistry, Faculty of Science, University of Helsinki, A. I. Virtasen aukio 1, 00014 Helsinki, Finland
7
Luong KD, Singh A. Application of Transformers in Cheminformatics. J Chem Inf Model 2024; 64:4392-4409. [PMID: 38815246; PMCID: PMC11167597; DOI: 10.1021/acs.jcim.3c02070] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/28/2023] [Revised: 04/05/2024] [Accepted: 05/06/2024] [Indexed: 06/01/2024]
Abstract
By accelerating time-consuming processes with high efficiency, computing has become an essential part of many modern chemical pipelines. Machine learning is a class of computing methods that can discover patterns within chemical data and utilize this knowledge for a wide variety of downstream tasks, such as property prediction or substance generation. The complex and diverse chemical space requires complex machine learning architectures with great learning power. Recently, learning models based on transformer architectures have revolutionized multiple domains of machine learning, including natural language processing and computer vision. Naturally, there have been ongoing endeavors to adapt these techniques to the chemical domain, resulting in a surge of publications within a short period. The diversity of chemical structures, use cases, and learning models necessitates a comprehensive summary of existing work. In this paper, we review recent innovations in adapting transformers to solve learning problems in chemistry. Because chemical data is diverse and complex, we structure our discussion based on chemical representations. Specifically, we highlight the strengths and weaknesses of each representation, the current progress of adapting transformer architectures, and future directions.
Affiliation(s)
- Kha-Dinh Luong
  - Department of Computer Science, University of California Santa Barbara, Santa Barbara, CA 93106, United States
- Ambuj Singh
  - Department of Computer Science, University of California Santa Barbara, Santa Barbara, CA 93106, United States
8
Choi S, Lee J, Seo J, Han SW, Lee SH, Seo JH, Seok J. Automated BigSMILES conversion workflow and dataset for homopolymeric macromolecules. Sci Data 2024; 11:371. [PMID: 38605036; PMCID: PMC11009387; DOI: 10.1038/s41597-024-03212-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/09/2023] [Accepted: 04/02/2024] [Indexed: 04/13/2024] Open
Abstract
The simplified molecular-input line-entry system (SMILES) has been utilized in a variety of artificial intelligence analyses owing to its capability of representing chemical structures using line notation. However, its ease of representation is limited, which has led to the proposal of BigSMILES as an alternative method suitable for the representation of macromolecules. Nevertheless, research on BigSMILES remains limited due to its preprocessing requirements. Thus, this study proposes a conversion workflow for BigSMILES, focusing on its automated generation from SMILES representations of homopolymers. BigSMILES representations for 4,927,181 records are provided, thereby enabling their immediate use for various research and development applications. Our study presents a detailed description of the validation process used to ensure the accuracy, interchangeability, and robustness of the conversion. Additionally, a systematic overview of the codes and functions used, emphasizing their relevance in the context of BigSMILES generation, is provided. This advancement is anticipated to significantly aid researchers and facilitate further studies in BigSMILES representation, including potential applications in deep learning and further extension to complex structures such as copolymers.
Affiliation(s)
- Sunho Choi
  - School of Electrical Engineering, Korea University, Seoul, South Korea
- Joonbum Lee
  - Department of Materials Science and Engineering, Korea University, Seoul, South Korea
- Jangwon Seo
  - School of Electrical Engineering, Korea University, Seoul, South Korea
- Sung Won Han
  - School of Industrial Management Engineering, Korea University, Seoul, South Korea
- Sang Hyun Lee
  - School of Electrical Engineering, Korea University, Seoul, South Korea
- Ji-Hun Seo
  - Department of Materials Science and Engineering, Korea University, Seoul, South Korea
- Junhee Seok
  - School of Electrical Engineering, Korea University, Seoul, South Korea
9
Han S, Kang Y, Park H, Yi J, Park G, Kim J. Multimodal Transformer for Property Prediction in Polymers. ACS Applied Materials & Interfaces 2024; 16:16853-16860. [PMID: 38501934; DOI: 10.1021/acsami.4c01207] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/20/2024]
Abstract
In this work, we designed a multimodal transformer that combines both the Simplified Molecular Input Line Entry System (SMILES) and molecular graph representations to enhance the prediction of polymer properties. Three models with different embeddings (SMILES, SMILES + monomer, and SMILES + dimer) were employed to assess the performance of incorporating multimodal features into transformer architectures. Fine-tuning results across five properties (i.e., density, glass-transition temperature (Tg), melting temperature (Tm), volume resistivity, and conductivity) demonstrated that the multimodal transformer with both the SMILES and the dimer configuration as inputs outperformed the transformer using only SMILES across all five properties. Furthermore, our model facilitates in-depth analysis by examining attention scores, providing deeper insights into the relationship between the deep learning model and the polymer attributes. We believe that our work, shedding light on the potential of multimodal transformers in predicting polymer properties, paves a new direction for understanding and refining polymer properties.
Affiliation(s)
- Seunghee Han
  - Department of Chemical and Biomolecular Engineering, Korea Advanced Institute of Science and Technology, Daejeon 34141, Republic of Korea
- Yeonghun Kang
  - Department of Chemical and Biomolecular Engineering, Korea Advanced Institute of Science and Technology, Daejeon 34141, Republic of Korea
- Hyunsoo Park
  - Department of Materials, Imperial College London, Exhibition Road, London SW7 2AZ, United Kingdom
- Jeesung Yi
  - KOLON One&Only TOWER, 110, Magokdong-ro, Gangseo-gu, Seoul 07793, Republic of Korea
- Geunyeong Park
  - KOLON One&Only TOWER, 110, Magokdong-ro, Gangseo-gu, Seoul 07793, Republic of Korea
- Jihan Kim
  - Department of Chemical and Biomolecular Engineering, Korea Advanced Institute of Science and Technology, Daejeon 34141, Republic of Korea
10
Chang J, Ye JC. Bidirectional generation of structure and properties through a single molecular foundation model. Nat Commun 2024; 15:2323. [PMID: 38485914; PMCID: PMC10940637; DOI: 10.1038/s41467-024-46440-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/12/2023] [Accepted: 02/27/2024] [Indexed: 03/18/2024] Open
Abstract
Recent successes of foundation models in artificial intelligence have prompted the emergence of large-scale chemical pre-trained models. Despite the growing interest in large molecular pre-trained models that provide informative representations for downstream tasks, attempts at multimodal pre-training in the molecular domain have been limited. To address this, here we present a multimodal molecular pre-trained model that incorporates the modalities of structure and biochemical properties, drawing inspiration from recent advances in multimodal learning techniques. Our proposed model pipeline of data handling and training objectives aligns the structure and property features in a common embedding space, which enables the model to capture bidirectional information between the molecules' structure and properties. These contributions yield synergistic knowledge, allowing us to tackle both multimodal and unimodal downstream tasks through a single model. Through extensive experiments, we demonstrate that our model can solve various meaningful chemical challenges, including conditional molecule generation, property prediction, molecule classification, and reaction prediction.
Affiliation(s)
- Jinho Chang
  - Graduate School of AI, KAIST, Daejeon, South Korea
- Jong Chul Ye
  - Graduate School of AI, KAIST, Daejeon, South Korea
11
Helal H, Firoz J, Bilbrey JA, Sprueill H, Herman KM, Krell MM, Murray T, Roldan ML, Kraus M, Li A, Das P, Xantheas SS, Choudhury S. Acceleration of Graph Neural Network-Based Prediction Models in Chemistry via Co-Design Optimization on Intelligence Processing Units. J Chem Inf Model 2024; 64:1568-1580. [PMID: 38382011; DOI: 10.1021/acs.jcim.3c01312] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/23/2024]
Abstract
Atomic structure prediction and associated property calculations are the bedrock of chemical physics. Since high-fidelity ab initio modeling techniques for computing the structure and properties can be prohibitively expensive, this motivates the development of machine-learning (ML) models that make these predictions more efficiently. Training graph neural networks over large atomistic databases introduces unique computational challenges, such as the need to process millions of small graphs with variable size and support communication patterns that are distinct from learning over large graphs, such as social networks. We demonstrate a novel hardware-software codesign approach to scale up the training of atomistic graph neural networks (GNN) for structure and property prediction. First, to eliminate redundant computation and memory associated with alternative padding techniques and to improve throughput via minimizing communication, we formulate the effective coalescing of the batches of variable-size atomistic graphs as the bin packing problem and introduce a hardware-agnostic algorithm to pack these batches. In addition, we propose hardware-specific optimizations, including a planner and vectorization for the gather-scatter operations targeted for Graphcore's Intelligence Processing Unit (IPU), as well as model-specific optimizations such as merged communication collectives and optimized softplus. Putting these all together, we demonstrate the effectiveness of the proposed codesign approach by providing an implementation of a well-established atomistic GNN on the Graphcore IPUs. We evaluate the training performance on multiple atomistic graph databases with varying degrees of graph counts, sizes, and sparsity. We demonstrate that such a codesign approach can reduce the training time of atomistic GNNs and can improve their performance by up to 1.5× compared to the baseline implementation of the model on the IPUs. 
Additionally, we compare our IPU implementation with an Nvidia GPU-based implementation and show that our atomistic GNN implementation on the IPUs can run 1.8× faster on average compared to the execution time on the GPUs.
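To make the bin-packing view of batching concrete, here is a minimal sketch that packs variable-size graphs into fixed-capacity batches with the classic first-fit-decreasing heuristic. This is a generic textbook heuristic and not necessarily the packing algorithm the authors introduce; the function name `pack_graphs`, the node-count capacity, and the example sizes are assumptions made for illustration.

```python
def pack_graphs(node_counts, max_nodes_per_batch):
    """First-fit-decreasing bin packing of graphs into fixed-capacity batches.

    node_counts: number of nodes in each graph.
    Returns a list of batches, each a list of graph indices.
    """
    # Visit graphs from largest to smallest.
    order = sorted(range(len(node_counts)), key=lambda i: node_counts[i], reverse=True)
    batches = []    # graph indices per batch
    remaining = []  # free node slots per batch
    for i in order:
        size = node_counts[i]
        if size > max_nodes_per_batch:
            raise ValueError(f"graph {i} has {size} nodes, exceeding the batch capacity")
        # Place the graph in the first batch that still has room for it.
        for b, free in enumerate(remaining):
            if size <= free:
                batches[b].append(i)
                remaining[b] -= size
                break
        else:
            # No existing batch fits, so open a new one.
            batches.append([i])
            remaining.append(max_nodes_per_batch - size)
    return batches


# Hypothetical node counts for six small molecular graphs.
sizes = [9, 3, 12, 5, 7, 4]
packed = pack_graphs(sizes, max_nodes_per_batch=16)
```

With these sizes the 40 nodes fit into three batches of capacity 16, so 8 slots are padding. Giving each graph its own slot sized for the largest graph (12 nodes) would instead need 72 slots.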
Collapse
Affiliation(s)
- Hatem Helal
- Graphcore, Kett House, Station Rd, Cambridge CB1 2JH, U.K.
- Jesun Firoz
- Advanced Computing, Mathematics and Data Division, Pacific Northwest National Laboratory, 1100 Dexter Ave N, Seattle, Washington 98109, United States
- Jenna A Bilbrey
- Artificial Intelligence and Data Analytics Division, Pacific Northwest National Laboratory, 902 Battelle Boulevard, Richland, Washington 99352, United States
- Henry Sprueill
- Artificial Intelligence and Data Analytics Division, Pacific Northwest National Laboratory, 902 Battelle Boulevard, Richland, Washington 99352, United States
- Kristina M Herman
- Department of Chemistry, University of Washington, Seattle, Washington 98185, United States
- Tom Murray
- Graphcore, Kett House, Station Rd, Cambridge CB1 2JH, U.K.
- Mike Kraus
- Graphcore, Kett House, Station Rd, Cambridge CB1 2JH, U.K.
- Ang Li
- Advanced Computing, Mathematics and Data Division, Pacific Northwest National Laboratory, 902 Battelle Boulevard, Richland, Washington 99352, United States
- Payel Das
- IBM Research, Yorktown Heights, New York 10598, United States
- Sotiris S Xantheas
- Department of Chemistry, University of Washington, Seattle, Washington 98185, United States
- Advanced Computing, Mathematics and Data Division, Pacific Northwest National Laboratory, 902 Battelle Boulevard, Richland, Washington 99352, United States
- Sutanay Choudhury
- Advanced Computing, Mathematics and Data Division, Pacific Northwest National Laboratory, 902 Battelle Boulevard, Richland, Washington 99352, United States
12
Shi J, Walsh D, Zou W, Rebello NJ, Deagen ME, Fransen KA, Gao X, Olsen BD, Audus DJ. Calculating Pairwise Similarity of Polymer Ensembles via Earth Mover's Distance. ACS Polymers Au 2024; 4:66-76. [PMID: 38371731 PMCID: PMC10870752 DOI: 10.1021/acspolymersau.3c00029] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/21/2023] [Revised: 11/28/2023] [Accepted: 11/29/2023] [Indexed: 02/20/2024]
Abstract
Synthetic polymers, in contrast to small molecules and deterministic biomacromolecules, are typically ensembles composed of polymer chains with varying numbers, lengths, sequences, chemistry, and topologies. While numerous approaches exist for measuring pairwise similarity among small molecules and sequence-defined biomacromolecules, accurately determining the pairwise similarity between two polymer ensembles remains challenging. This work proposes the earth mover's distance (EMD) metric to calculate the pairwise similarity score between two polymer ensembles. EMD offers a greater resolution of chemical differences between polymer ensembles than the averaging method and provides a quantitative numeric value representing the pairwise similarity between polymer ensembles in alignment with chemical intuition. The EMD approach for assessing polymer similarity enhances the development of accurate chemical search algorithms within polymer databases and can improve machine learning techniques for polymer design, optimization, and property prediction.
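The contrast drawn in this abstract between EMD and an averaging method can be made concrete with a toy calculation. The abstract does not state the ground distance between individual chains, so the sketch below assumes each chain is summarized by a single scalar (chain length) and computes the one-dimensional EMD from the difference of the two cumulative weight distributions; `emd_1d` and the example ensembles are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Sequence


def _normalize(values: Sequence[float], weights: Sequence[float]) -> Dict[float, float]:
    """Map each distinct value to its share of the total weight."""
    total = float(sum(weights))
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    mass: Dict[float, float] = defaultdict(float)
    for v, w in zip(values, weights):
        mass[v] += w / total
    return mass


def emd_1d(xs: Sequence[float], wx: Sequence[float],
           ys: Sequence[float], wy: Sequence[float]) -> float:
    """Earth mover's distance between two weighted sets of scalars.

    In one dimension the EMD equals the area between the two cumulative
    distribution functions, which is what the loop accumulates.
    """
    px, py = _normalize(xs, wx), _normalize(ys, wy)
    points = sorted(set(px) | set(py))
    cdf_x = cdf_y = 0.0
    cost = 0.0
    for left, right in zip(points, points[1:]):
        cdf_x += px.get(left, 0.0)
        cdf_y += py.get(left, 0.0)
        cost += abs(cdf_x - cdf_y) * (right - left)
    return cost


if __name__ == "__main__":
    # Ensemble A: every chain has length 20.
    # Ensemble B: half the chains have length 10, half have length 30.
    print(emd_1d([20], [1.0], [10, 30], [0.5, 0.5]))
```

Both toy ensembles have the same mean chain length (20), so comparing averages reports no difference, while the EMD is 10 because half of the weight must move 10 units in each direction.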
Affiliation(s)
- Jiale Shi
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
- Dylan Walsh
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
- Weizhong Zou
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
- Nathan J. Rebello
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
- Michael E. Deagen
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
- Katharina A. Fransen
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
- Xian Gao
- Department of Chemical and Biomolecular Engineering, University of Notre Dame, Notre Dame, Indiana 46556, United States
- Bradley D. Olsen
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
- Debra J. Audus
- Materials Science and Engineering Division, National Institute of Standards and Technology, Gaithersburg, Maryland 20899, United States
13
Qiu H, Liu L, Qiu X, Dai X, Ji X, Sun ZY. PolyNC: a natural and chemical language model for the prediction of unified polymer properties. Chem Sci 2024; 15:534-544. [PMID: 38179518 PMCID: PMC10763023 DOI: 10.1039/d3sc05079c] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/27/2023] [Accepted: 12/04/2023] [Indexed: 01/06/2024] Open
Abstract
Language models exhibit a profound aptitude for addressing multimodal and multidomain challenges, a competency that eludes the majority of off-the-shelf machine learning models. Consequently, language models hold great potential for comprehending the intricate interplay between material compositions and diverse properties, thereby accelerating material design, particularly in the realm of polymers. While past limitations in polymer data hindered the use of data-intensive language models, the growing availability of standardized polymer data and effective data augmentation techniques now opens doors to previously uncharted territories. Here, we present a revolutionary model to enable rapid and precise prediction of Polymer properties via the power of Natural language and Chemical language (PolyNC). To showcase the efficacy of PolyNC, we have meticulously curated a labeled prompt-structure-property corpus encompassing 22 970 polymer data points on a series of essential polymer properties. Through the use of natural language prompts, PolyNC gains a comprehensive understanding of polymer properties, while employing chemical language (SMILES) to describe polymer structures. In a unified text-to-text manner, PolyNC consistently demonstrates exceptional performance on both regression tasks (such as property prediction) and the classification task (polymer classification). Simultaneous and interactive multitask learning enables PolyNC to holistically grasp the structure-property relationships of polymers. Through a combination of experiments and characterizations, the generalization ability of PolyNC has been demonstrated, with attention analysis further indicating that PolyNC effectively learns structural information about polymers from multimodal inputs. 
This work provides compelling evidence of the potential for deploying end-to-end language models in polymer research, representing a significant advancement in the AI community's dedicated pursuit of advancing polymer science.
Affiliation(s)
- Haoke Qiu
- State Key Laboratory of Polymer Physics and Chemistry, Changchun Institute of Applied Chemistry, Chinese Academy of Sciences, Changchun 130022, China
- School of Applied Chemistry and Engineering, University of Science and Technology of China, Hefei 230026, China
- Lunyang Liu
- State Key Laboratory of Polymer Physics and Chemistry, Changchun Institute of Applied Chemistry, Chinese Academy of Sciences, Changchun 130022, China
- Xuepeng Qiu
- School of Applied Chemistry and Engineering, University of Science and Technology of China, Hefei 230026, China
- CAS Key Laboratory of High-Performance Synthetic Rubber and its Composite Materials, Changchun Institute of Applied Chemistry, Chinese Academy of Sciences, Changchun 130022, China
- Xuemin Dai
- CAS Key Laboratory of High-Performance Synthetic Rubber and its Composite Materials, Changchun Institute of Applied Chemistry, Chinese Academy of Sciences, Changchun 130022, China
- Xiangling Ji
- State Key Laboratory of Polymer Physics and Chemistry, Changchun Institute of Applied Chemistry, Chinese Academy of Sciences, Changchun 130022, China
- School of Applied Chemistry and Engineering, University of Science and Technology of China, Hefei 230026, China
- Zhao-Yan Sun
- State Key Laboratory of Polymer Physics and Chemistry, Changchun Institute of Applied Chemistry, Chinese Academy of Sciences, Changchun 130022, China
- School of Applied Chemistry and Engineering, University of Science and Technology of China, Hefei 230026, China
14
Zhang P, Kearney L, Bhowmik D, Fox Z, Naskar AK, Gounley J. Transferring a Molecular Foundation Model for Polymer Property Predictions. J Chem Inf Model 2023; 63:7689-7698. [PMID: 38055952 DOI: 10.1021/acs.jcim.3c01650] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/08/2023]
Abstract
Transformer-based large language models have remarkable potential to accelerate design optimization for applications such as drug development and material discovery. Self-supervised pretraining of transformer models requires large-scale data sets, which are often sparsely populated in topical areas such as polymer science. State-of-the-art approaches for polymers conduct data augmentation to generate additional samples but unavoidably incur extra computational costs. In contrast, large-scale open-source data sets are available for small molecules and provide a potential solution to data scarcity through transfer learning. In this work, we show that using transformers pretrained on small molecules and fine-tuned on polymer properties achieves comparable accuracy to those trained on augmented polymer data sets for a series of benchmark prediction tasks.
Affiliation(s)
- Pei Zhang
- Computational Sciences and Engineering Division, Oak Ridge National Laboratory, Oak Ridge, Tennessee 37831, United States
- Logan Kearney
- Chemical Sciences Division, Oak Ridge National Laboratory, Oak Ridge, Tennessee 37831, United States
- Debsindhu Bhowmik
- Computational Sciences and Engineering Division, Oak Ridge National Laboratory, Oak Ridge, Tennessee 37831, United States
- Zachary Fox
- Computational Sciences and Engineering Division, Oak Ridge National Laboratory, Oak Ridge, Tennessee 37831, United States
- Amit K Naskar
- Chemical Sciences Division, Oak Ridge National Laboratory, Oak Ridge, Tennessee 37831, United States
- John Gounley
- Computational Sciences and Engineering Division, Oak Ridge National Laboratory, Oak Ridge, Tennessee 37831, United States
15
Day EC, Chittari SS, Bogen MP, Knight AS. Navigating the Expansive Landscapes of Soft Materials: A User Guide for High-Throughput Workflows. ACS Polymers Au 2023; 3:406-427. [PMID: 38107416 PMCID: PMC10722570 DOI: 10.1021/acspolymersau.3c00025] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 09/15/2023] [Revised: 11/02/2023] [Accepted: 11/07/2023] [Indexed: 12/19/2023]
Abstract
Synthetic polymers are highly customizable with tailored structures and functionality, yet this versatility generates challenges in the design of advanced materials due to the size and complexity of the design space. Thus, exploration and optimization of polymer properties using combinatorial libraries has become increasingly common, which requires careful selection of synthetic strategies, characterization techniques, and rapid processing workflows to obtain fundamental principles from these large data sets. Herein, we provide guidelines for strategic design of macromolecule libraries and workflows to efficiently navigate these high-dimensional design spaces. We describe synthetic methods for multiple library sizes and structures as well as characterization methods to rapidly generate data sets, including tools that can be adapted from biological workflows. We further highlight relevant insights from statistics and machine learning to aid in data featurization, representation, and analysis. This Perspective acts as a "user guide" for researchers interested in leveraging high-throughput screening toward the design of multifunctional polymers and predictive modeling of structure-property relationships in soft materials.
Affiliation(s)
- Matthew P. Bogen
- Department of Chemistry, The University of North Carolina at Chapel Hill, Chapel Hill, North Carolina 27599, United States
- Abigail S. Knight
- Department of Chemistry, The University of North Carolina at Chapel Hill, Chapel Hill, North Carolina 27599, United States
16
Ting JM, Tamayo-Mendoza T, Petersen SR, Van Reet J, Ahmed UA, Snell NJ, Fisher JD, Stern M, Oviedo F. Frontiers in nonviral delivery of small molecule and genetic drugs, driven by polymer chemistry and machine learning for materials informatics. Chem Commun (Camb) 2023; 59:14197-14209. [PMID: 37955165 DOI: 10.1039/d3cc04705a] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/14/2023]
Abstract
Materials informatics (MI) has immense potential to accelerate the pace of innovation and new product development in biotechnology. Close collaborations between skilled physical and life scientists with data scientists are being established in pursuit of leveraging MI tools in automation and artificial intelligence (AI) to predict material properties in vitro and in vivo. However, the scarcity of large, standardized, and labeled materials data for connecting structure-function relationships represents one of the largest hurdles to overcome. In this Highlight, focus is brought to emerging developments in polymer-based therapeutic delivery platforms, where teams generate large experimental datasets around specific therapeutics and successfully establish a design-to-deployment cycle of specialized nanocarriers. Three select collaborations demonstrate how custom-built polymers protect and deliver small molecules, nucleic acids, and proteins, representing ideal use-cases for machine learning to understand how molecular-level interactions impact drug stabilization and release. We conclude with our perspectives on how MI innovations in automation efficiencies and digitalization of data-coupled with fundamental insight and creativity from the polymer science community-can accelerate translation of more gene therapies into lifesaving medicines.
17
Hu J, Li Z, Lin J, Zhang L. Prediction and Interpretability of Glass Transition Temperature of Homopolymers by Data-Augmented Graph Convolutional Neural Networks. ACS Applied Materials & Interfaces 2023; 15:54006-54017. [PMID: 37934171 DOI: 10.1021/acsami.3c13698] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/08/2023]
Abstract
Establishing the structure-property relationship by machine learning (ML) models is extremely valuable for accelerating the molecular design of polymers. However, existing ML models for the polymers are subject to scarcity issues of training data and fewer variations of graph structures of molecules. In addition, limited works have explored the interpretability of ML models to infer the latent knowledge in the field of polymer science that could inspire ML-assisted molecular design. In this contribution, we integrate graph convolutional neural networks (GCNs) with data augmentation strategy to predict the glass transition temperature Tg of polymers. It is demonstrated that the data-augmented GCN model outperforms the conventional models and achieves a higher accuracy for the prediction of Tg despite a small amount of training data. Furthermore, taking advantage of molecular graph representations, the data-augmented GCN model has the capability to infer the importance of atoms or substructures from the understanding of Tg, which generally agrees with the experimental findings in the field of polymer science. The inferred knowledge of the GCN model is used to advise on the design of functional polymers with specific Tg. The data-augmented GCN model possesses prominent superiorities in the establishment of structure-property relationship and also provides an efficient way for accelerating the rational design of polymer molecules.
Affiliation(s)
- Junyang Hu
- Shanghai Key Laboratory of Advanced Polymeric Materials, School of Materials Science and Engineering, East China University of Science and Technology, Shanghai 200237, China
- Zean Li
- Shanghai Key Laboratory of Advanced Polymeric Materials, School of Materials Science and Engineering, East China University of Science and Technology, Shanghai 200237, China
- Jiaping Lin
- Shanghai Key Laboratory of Advanced Polymeric Materials, School of Materials Science and Engineering, East China University of Science and Technology, Shanghai 200237, China
- Liangshun Zhang
- Shanghai Key Laboratory of Advanced Polymeric Materials, School of Materials Science and Engineering, East China University of Science and Technology, Shanghai 200237, China
18
Liu Y, Wu F, Liu Z, Wang K, Wang F, Qu X. Can language models be used for real-world urban-delivery route optimization? Innovation (N Y) 2023; 4:100520. [PMID: 37869471 PMCID: PMC10587631 DOI: 10.1016/j.xinn.2023.100520] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 06/22/2023] [Accepted: 09/27/2023] [Indexed: 10/24/2023] Open
Abstract
Language models have contributed to breakthroughs in interdisciplinary research, such as protein design and molecular dynamics understanding. In this study, we reveal that beyond language, representations of other entities, such as human behaviors, that are mappable to learnable sequences can be learned by language models. One compelling example is the real-world delivery route optimization problem. We here propose a novel approach based on the language model to optimize delivery routes on the basis of drivers' historical experiences. Although a broad range of optimization-based approaches have been designed to optimize delivery routes, they do not capture the implicit knowledge of complex delivery operating environments. The model we propose integrates this knowledge in the route optimization process by learning from driving behaviors in experienced drivers. A real-world delivery route that preserves drivers' implicit behavioral patterns is first analogized to a sentence in natural language. Through unsupervised learning, we then learn the vector representations of words and infer the drivers' delivery chains on the basis of the tailored chain-reaction-based algorithm. We also provide insights into the fusion of language models and operations research methods. In our approach, language models are applied to learn drivers' delivery behaviors and infer new deliveries at the delivery zone level, while the classic traveling salesman problem (TSP) model is embedded into the hybrid framework for intra-zone optimization. Numerical experiments performed on real-world data from Amazon's delivery service demonstrate that the proposed approach outperforms pure optimization, supporting the effectiveness, efficiency, and extensibility of our model. As a versatile approach, the proposed framework can easily be extended to various disciplines in which the data follow certain grammar rules. 
We anticipate that our work will serve as a stepping stone toward the understanding and application of language models in tackling interdisciplinary research problems.
Affiliation(s)
- Yang Liu
- State Key Laboratory of Intelligent Green Vehicle and Mobility, Tsinghua University, Beijing 100084, China
- Fanyou Wu
- State Key Laboratory of Intelligent Green Vehicle and Mobility, Tsinghua University, Beijing 100084, China
- Zhiyuan Liu
- Jiangsu Key Laboratory of Urban ITS, Jiangsu Province Collaborative Innovation Center of Modern Urban Traffic Technologies, School of Transportation, Southeast University, Nanjing 211189, China
- Kai Wang
- School of Vehicle and Mobility, Tsinghua University, Beijing 100084, China
- Feiyue Wang
- Institute of Automation, State Key Laboratory for Management and Control of Complex Systems, Chinese Academy of Sciences, Beijing 100190, China
- Xiaobo Qu
- School of Vehicle and Mobility, Tsinghua University, Beijing 100084, China