1. Song S, Wang Y, Tian X, He W, Chen F, Wu J, Zhang Q. Predicting the Melting Point of Energetic Molecules Using a Learnable Graph Neural Fingerprint Model. J Phys Chem A 2023; 127:4328-4337. [PMID: 37141395 DOI: 10.1021/acs.jpca.3c00112] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Indexed: 05/06/2023]
Abstract
Melting point prediction for organic molecules has drawn widespread attention from both academic and industrial communities. In this work, a learnable graph neural fingerprint (GNF) was employed to develop a melting point prediction model using a dataset of over 90,000 organic molecules. The GNF model exhibited a significant advantage, with a mean absolute error (MAE) of 25.0 K, when compared to other featurization methods. Furthermore, by integrating prior knowledge through a customized descriptor set (i.e., CDS) into GNF, the accuracy of the resulting model, GNF_CDS, improved to 24.7 K, surpassing the performance of previously reported models for a wide range of structurally diverse organic compounds. Moreover, the generalizability of the GNF_CDS model was significantly improved with a decreased MAE of 17 K for an independent dataset containing melt-castable energetic molecules. This work clearly demonstrates that prior knowledge is still beneficial for modeling molecular properties despite the powerful learning capability of graph neural networks, especially in specific fields where chemical data are lacking.
Affiliation(s)
- Siwei Song
- Institute of Chemical Materials, China Academy of Engineering Physics, Mianyang, Sichuan 621000, China
- Yi Wang
- Institute of Chemical Materials, China Academy of Engineering Physics, Mianyang, Sichuan 621000, China
- Xiaolan Tian
- Institute of Chemical Materials, China Academy of Engineering Physics, Mianyang, Sichuan 621000, China
- Wei He
- School of Aeronautics and Astronautics, Sichuan University, Chengdu, Sichuan 610065, China
- Fang Chen
- Institute of Chemical Materials, China Academy of Engineering Physics, Mianyang, Sichuan 621000, China
- Junnan Wu
- Institute of Chemical Materials, China Academy of Engineering Physics, Mianyang, Sichuan 621000, China
- Qinghua Zhang
- School of Astronautics, Northwestern Polytechnic University, Xi'an, Shaanxi 710072, China
2. Zhu X, Polyakov VR, Bajjuri K, Hu H, Maderna A, Tovee CA, Ward SC. Building Machine Learning Small Molecule Melting Points and Solubility Models Using CCDC Melting Points Dataset. J Chem Inf Model 2023; 63:2948-2959. [PMID: 37125691 DOI: 10.1021/acs.jcim.3c00308] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/02/2023]
Abstract
Predicting solubility of small molecules is a very difficult undertaking due to the lack of reliable and consistent experimental solubility data. It is well known that for a molecule in a crystal lattice to be dissolved, it must, first, dissociate from the lattice and then, second, be solvated. The melting point of a compound is proportional to the lattice energy, and the octanol-water partition coefficient (log P) is a measure of the compound's solvation efficiency. The CCDC's melting point dataset of almost one hundred thousand compounds was utilized to create widely applicable machine learning models of small molecule melting points. Using the general solubility equation, the aqueous thermodynamic solubilities of the same compounds can be predicted. The global model could be easily localized by adding additional melting point measurements for a chemical series of interest.
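The abstract above refers to the general solubility equation without writing it out. As a minimal sketch, assuming the commonly cited Jain and Yalkowsky form, log10 S = 0.5 - 0.01 (MP - 25) - log P, with the melting point MP in degrees Celsius and S in mol/L, the calculation looks as follows. The function name `gse_log_s` is hypothetical, and the abstract does not state which form of the equation the authors applied.

```python
def gse_log_s(melting_point_c: float, log_p: float) -> float:
    """Estimate log10 of molar aqueous solubility (mol/L) from MP and log P.

    Assumes the Jain-Yalkowsky form of the general solubility equation;
    this is an illustrative sketch, not the cited authors' code.
    """
    # Compounds that are liquid at 25 degrees C carry no crystal-lattice
    # penalty, so the melting point term is clamped at zero.
    mp_term = max(melting_point_c - 25.0, 0.0)
    return 0.5 - 0.01 * mp_term - log_p


if __name__ == "__main__":
    # Toy inputs, not data from the cited study.
    print(gse_log_s(125.0, 2.0))
```

Under this form, every 100 °C of melting point above 25 °C lowers the predicted solubility by one log unit, which is why an error in a predicted melting point carries through to the solubility estimate.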
Affiliation(s)
- Xiangwei Zhu
- Sutro Biopharma, 111 Oyster Point Blvd, South San Francisco, California 94080, United States
- Valery R Polyakov
- Sutro Biopharma, 111 Oyster Point Blvd, South San Francisco, California 94080, United States
- Krishna Bajjuri
- Sutro Biopharma, 111 Oyster Point Blvd, South San Francisco, California 94080, United States
- Huiyong Hu
- Sutro Biopharma, 111 Oyster Point Blvd, South San Francisco, California 94080, United States
- Andreas Maderna
- Sutro Biopharma, 111 Oyster Point Blvd, South San Francisco, California 94080, United States
- Clare A Tovee
- Cambridge Crystallographic Data Centre, 12 Union Road, Cambridge CB2 1EZ, U.K.
- Suzanna C Ward
- Cambridge Crystallographic Data Centre, 12 Union Road, Cambridge CB2 1EZ, U.K.
3. Zheng S, Guo W, Li C, Sun Y, Zhao Q, Lu H, Si Q, Wang H. Application of machine learning and deep learning methods for hydrated electron rate constant prediction. Environmental Research 2023; 231:115996. [PMID: 37105290 DOI: 10.1016/j.envres.2023.115996] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 08/06/2022] [Revised: 04/19/2023] [Accepted: 04/24/2023] [Indexed: 05/08/2023]
Abstract
Accurately determining the second-order rate constant with the hydrated electron, e_aq⁻ (k_eaq⁻), for organic compounds (OCs) is crucial in e_aq⁻-induced advanced reduction processes (ARPs). In this study, we collected 867 k_eaq⁻ values at different pHs from peer-reviewed publications and applied a machine learning (ML) algorithm, XGBoost, and a deep learning (DL) algorithm, a convolutional neural network (CNN), to predict k_eaq⁻. Our results demonstrated that the CNN model with transfer learning and data augmentation (CNN-TL&DA) greatly improved the prediction results and overcame over-fitting. Furthermore, we compared the ML/DL modeling methods and found that the CNN-TL&DA, which combined molecular images (MI), achieved the best overall performance (R²_test = 0.896, RMSE_test = 0.362, MAE_test = 0.261) when compared to the XGBoost algorithm combined with Mordred descriptors (MD) (R²_test = 0.692, RMSE_test = 0.622, MAE_test = 0.399) and Morgan fingerprint (MF) (R²_test = 0.512, RMSE_test = 0.783, MAE_test = 0.520). Moreover, the interpretation of the MD-XGBoost and MF-XGBoost models using the SHAP method revealed the significance of MDs (e.g., molecular size, branching, electron distribution, polarizability, and bond types), MFs (e.g., aromatic carbon, carbonyl oxygen, nitrogen, and halogen) and environmental conditions (e.g., pH) that effectively influence the k_eaq⁻ prediction. The interpretation of the 2D molecular image-CNN (MI-CNN) models using the Grad-CAM method showed that they correctly identified key functional groups such as -CN, -NO2, and -X functional groups that can increase the k_eaq⁻ values. Additionally, almost all electron-withdrawing groups and a small part of electron-donating groups for the MI-CNN model can be highlighted for estimating k_eaq⁻. Overall, our results suggest that the CNN approach has smaller errors when compared to ML algorithms, making it a promising candidate for predicting other rate constants.
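The three test-set figures quoted in the abstract above (R², RMSE, MAE) follow standard definitions. A minimal Python sketch of those definitions is given here for reference; the function name `regression_metrics` and the toy numbers are hypothetical and are not data or code from the cited study.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return R^2, RMSE and MAE for paired observed and predicted values."""
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]

    # Mean absolute error: average magnitude of the residuals.
    mae = sum(abs(r) for r in residuals) / n

    # Root-mean-square error: penalizes large residuals more strongly.
    ss_res = sum(r * r for r in residuals)
    rmse = math.sqrt(ss_res / n)

    # Coefficient of determination: 1 minus residual over total variance.
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot

    return {"r2": r2, "rmse": rmse, "mae": mae}


if __name__ == "__main__":
    # Toy values chosen only to exercise the formulas.
    print(regression_metrics([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 6.0]))
```

Because RMSE weights large residuals more heavily than MAE, a model can rank differently under the two metrics, which is one reason abstracts such as this one report both.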
Affiliation(s)
- Shanshan Zheng
- State Key Laboratory of Urban Water Resource and Environment, Harbin Institute of Technology, Harbin 150090, China
- Wanqian Guo
- State Key Laboratory of Urban Water Resource and Environment, Harbin Institute of Technology, Harbin 150090, China
- Chao Li
- State Environmental Protection Key Laboratory of Wetland Ecology and Vegetation Restoration, School of Environment, Northeast Normal University, 2555 Jingyue St., Changchun 130117, Jilin Province, China
- Yongbin Sun
- School of Chemistry and Pharmaceutical Engineering, Shandong First Medical University and Shandong Academy of Medical Sciences, Taian 271016, Shandong, People's Republic of China
- Qi Zhao
- State Key Laboratory of Urban Water Resource and Environment, Harbin Institute of Technology, Harbin 150090, China
- Hao Lu
- State Key Laboratory of Urban Water Resource and Environment, Harbin Institute of Technology, Harbin 150090, China
- Qishi Si
- State Key Laboratory of Urban Water Resource and Environment, Harbin Institute of Technology, Harbin 150090, China
- Huazhe Wang
- State Key Laboratory of Urban Water Resource and Environment, Harbin Institute of Technology, Harbin 150090, China
4. Xiouras C, Cameli F, Quilló GL, Kavousanakis ME, Vlachos DG, Stefanidis GD. Applications of Artificial Intelligence and Machine Learning Algorithms to Crystallization. Chem Rev 2022; 122:13006-13042. [PMID: 35759465 DOI: 10.1021/acs.chemrev.2c00141] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 12/28/2022]
Abstract
Artificial intelligence and specifically machine learning applications are nowadays used in a variety of scientific applications and cutting-edge technologies, where they have a transformative impact. Such an assembly of statistical and linear algebra methods making use of large data sets is becoming more and more integrated into chemistry and crystallization research workflows. This review aims to present, for the first time, a holistic overview of machine learning and cheminformatics applications as a novel, powerful means to accelerate the discovery of new crystal structures, predict key properties of organic crystalline materials, simulate, understand, and control the dynamics of complex crystallization process systems, as well as contribute to high throughput automation of chemical process development involving crystalline materials. We critically review the advances in these new, rapidly emerging research areas, raising awareness in issues such as the bridging of machine learning models with first-principles mechanistic models, data set size, structure, and quality, as well as the selection of appropriate descriptors. At the same time, we propose future research at the interface of applied mathematics, chemistry, and crystallography. Overall, this review aims to increase the adoption of such methods and tools by chemists and scientists across industry and academia.
Affiliation(s)
- Christos Xiouras
- Chemical Process R&D, Crystallization Technology Unit, Janssen R&D, Turnhoutseweg 30, 2340 Beerse, Belgium
- Fabio Cameli
- Department of Chemical and Biomolecular Engineering, University of Delaware, 150 Academy Street, Newark, Delaware 19716, United States
- Gustavo Lunardon Quilló
- Chemical Process R&D, Crystallization Technology Unit, Janssen R&D, Turnhoutseweg 30, 2340 Beerse, Belgium; Chemical and BioProcess Technology and Control, Department of Chemical Engineering, Faculty of Engineering Technology, KU Leuven, Gebroeders de Smetstraat 1, 9000 Ghent, Belgium
- Mihail E Kavousanakis
- School of Chemical Engineering, National Technical University of Athens, Heroon Polytechniou 9, 15780 Zografou, Greece
- Dionisios G Vlachos
- Department of Chemical and Biomolecular Engineering, University of Delaware, 150 Academy Street, Newark, Delaware 19716, United States
- Georgios D Stefanidis
- School of Chemical Engineering, National Technical University of Athens, Heroon Polytechniou 9, 15780 Zografou, Greece; Laboratory for Chemical Technology, Ghent University, Tech Lane Ghent Science Park 125, B-9052 Ghent, Belgium
5. Hadipour H, Liu C, Davis R, Cardona ST, Hu P. Deep clustering of small molecules at large-scale via variational autoencoder embedding and K-means. BMC Bioinformatics 2022; 23:132. [PMID: 35428173 PMCID: PMC9011935 DOI: 10.1186/s12859-022-04667-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/23/2022] [Accepted: 04/04/2022] [Indexed: 11/13/2022]
Abstract
Background: Converting molecules into computer-interpretable features with rich molecular information is a core problem of data-driven machine learning applications in chemical and drug-related tasks. Generally speaking, there are global and local features to represent a given molecule. As most algorithms have been developed based on one type of feature, a remaining bottleneck is to combine both feature sets for advanced molecule-based machine learning analysis. Here, we explored a novel analytical framework to make embeddings of the molecular features and apply them in the clustering of a large number of small molecules.
Results: In this novel framework, we first introduced a principal component analysis method encoding the molecule-specific atom and bond information. We then used a variational autoencoder (AE)-based method to make embeddings of the global chemical properties and the local atom and bond features. Next, using the embeddings from the encoded local and global features, we implemented and compared several unsupervised clustering algorithms to group the molecule-specific embeddings. The number of clusters was treated as a hyper-parameter and determined by the Silhouette method. Finally, we evaluated the corresponding results using three internal indices. Applying the analysis framework to a large chemical library of more than 47,000 molecules, we successfully identified 50 molecular clusters using the K-means method with 32 embeddings based on the AE method. We visualized the clustering result via t-SNE for the overall distribution of molecules and the similarity maps for the structural analysis of randomly selected cluster-specific molecules.
Conclusions: This study developed a novel analytical framework that comprises a feature engineering scheme for molecule-specific atomic and bonding features and a deep learning-based embedding strategy for different molecular features. By applying the identified embeddings, we show their usefulness for clustering a large molecule dataset. Our novel analytic algorithms can be applied to any virtual library of chemical compounds with diverse molecular structures. Hence, these tools have the potential of optimizing drug discovery, as they can decrease the number of compounds to be screened in any drug screening campaign.
6. Bejagam KK, Lalonde J, Iverson CN, Marrone BL, Pilania G. Machine Learning for Melting Temperature Predictions and Design in Polyhydroxyalkanoate-Based Biopolymers. J Phys Chem B 2022; 126:934-945. [DOI: 10.1021/acs.jpcb.1c08354] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 11/29/2022]
Affiliation(s)
- Karteek K. Bejagam
- Materials Science and Technology Division, Los Alamos National Laboratory, Los Alamos, New Mexico 87545, United States
- Jessica Lalonde
- Bioscience Division, Los Alamos National Laboratory, Los Alamos, New Mexico 87545, United States
- Center for Biomolecular and Tissue Engineering, Duke University, Durham, North Carolina 27708, United States
- Carl N. Iverson
- Chemistry Division, Los Alamos National Laboratory, Los Alamos, New Mexico 87545, United States
- Babetta L. Marrone
- Bioscience Division, Los Alamos National Laboratory, Los Alamos, New Mexico 87545, United States
- Ghanshyam Pilania
- Materials Science and Technology Division, Los Alamos National Laboratory, Los Alamos, New Mexico 87545, United States
7. Park J, Shim Y, Lee F, Rammohan A, Goyal S, Shim M, Jeong C, Kim DS. Prediction and Interpretation of Polymer Properties Using the Graph Convolutional Network. ACS Polymers Au 2022; 2:213-222. [PMID: 36855563 PMCID: PMC9954297 DOI: 10.1021/acspolymersau.1c00050] [Citation(s) in RCA: 14] [Impact Index Per Article: 7.0] [Indexed: 12/15/2022]
Abstract
We present machine learning models for the prediction of thermal and mechanical properties of polymers based on the graph convolutional network (GCN). GCN-based models provide reliable prediction performances for the glass transition temperature (Tg), melting temperature (Tm), density (ρ), and elastic modulus (E) with substantial dependence on the dataset, which is the best for Tg (R² ∼ 0.9) and worst for E (R² ∼ 0.5). It is found that the GCN representations for polymers provide prediction performances of their properties comparable to the popular extended-connectivity circular fingerprint (ECFP) representation. Notably, the GCN combined with the neural network regression (GCN-NN) slightly outperforms the ECFP. It is investigated how the GCN captures important structural features of polymers to learn their properties. Using the dimensionality reduction, we demonstrate that the polymers are organized in the principal subspace of the GCN representation spaces with respect to the backbone rigidity. The organization in the representation space adaptively changes with the training and through the NN layers, which might facilitate a subsequent prediction of target properties based on the relationships between the structure and the property. The GCN models are found to provide an advantage to automatically extract a backbone rigidity, strongly correlated with Tg, as well as a potential transferability to predict other properties associated with a backbone rigidity. Our results indicate both the capability and limitations of the GCN in learning to describe polymer systems depending on the property.
Affiliation(s)
- Jaehong Park
- Innovation Center, Samsung Electronics Co., Ltd., 1 Samsungjeonja-ro, Hwaseong-si, Gyeonggi-do 18448, Korea
- Youngseon Shim
- Innovation Center, Samsung Electronics Co., Ltd., 1 Samsungjeonja-ro, Hwaseong-si, Gyeonggi-do 18448, Korea
- Franklin Lee
- Science and Technology Division, Corning Incorporated, Corning, New York 14831, United States
- Aravind Rammohan
- Science and Technology Division, Corning Incorporated, Corning, New York 14831, United States
- Sushmit Goyal
- Science and Technology Division, Corning Incorporated, Corning, New York 14831, United States
- Munbo Shim
- Innovation Center, Samsung Electronics Co., Ltd., 1 Samsungjeonja-ro, Hwaseong-si, Gyeonggi-do 18448, Korea
- Changwook Jeong
- Innovation Center, Samsung Electronics Co., Ltd., 1 Samsungjeonja-ro, Hwaseong-si, Gyeonggi-do 18448, Korea
- Dae Sin Kim
- Innovation Center, Samsung Electronics Co., Ltd., 1 Samsungjeonja-ro, Hwaseong-si, Gyeonggi-do 18448, Korea
8. Feinstein J, Sivaraman G, Picel K, Peters B, Vázquez-Mayagoitia Á, Ramanathan A, MacDonell M, Foster I, Yan E. Uncertainty-Informed Deep Transfer Learning of Perfluoroalkyl and Polyfluoroalkyl Substance Toxicity. J Chem Inf Model 2021; 61:5793-5803. [PMID: 34905348 DOI: 10.1021/acs.jcim.1c01204] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Indexed: 11/28/2022]
Abstract
Perfluoroalkyl and polyfluoroalkyl substances (PFAS) pose a significant hazard because of their widespread industrial uses, environmental persistence, and bioaccumulation. A growing, increasingly diverse inventory of PFAS, including 8163 chemicals, has recently been updated by the U.S. Environmental Protection Agency. However, with the exception of a handful of well-studied examples, little is known about their human toxicity potential because of the substantial resources required for in vivo toxicity experiments. We tackle the problem of expensive in vivo experiments by evaluating multiple machine learning (ML) methods, including random forests, deep neural networks (DNN), graph convolutional networks, and Gaussian processes, for predicting acute toxicity (e.g., median lethal dose, or LD50) of PFAS compounds. To address the scarcity of toxicity information for PFAS, publicly available datasets of oral rat LD50 for all organic compounds are aggregated and used to develop state-of-the-art ML source models for transfer learning. A total of 519 fluorinated compounds containing two or more C-F bonds with known toxicity are used for knowledge transfer to ensembles of the best-performing source model, DNN, to generate the target models for the PFAS domain with access to uncertainty. This study predicts toxicity for PFAS with a defined chemical structure. To further inform prediction confidence, the transfer-learned model is embedded within a SelectiveNet architecture, where the model is allowed to identify regions of prediction with greater confidence and abstain from those with high uncertainty using a calibrated cutoff rate.
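The idea of abstaining on high-uncertainty predictions, described in the abstract above, can be illustrated with a much simpler stand-in than the SelectiveNet architecture the authors use. The sketch below assumes an ensemble whose member outputs are already available and rejects any compound whose ensemble standard deviation exceeds a cutoff. The function name `selective_predict`, the `max_std` parameter, and the toy numbers are all hypothetical.

```python
from statistics import mean, pstdev


def selective_predict(ensemble_predictions, max_std):
    """Return the ensemble mean per compound, or None to abstain.

    ensemble_predictions: one list of member outputs per compound.
    max_std: hypothetical cutoff on the ensemble standard deviation.
    This thresholding rule is a simplified stand-in, not SelectiveNet.
    """
    results = []
    for member_outputs in ensemble_predictions:
        # Disagreement between ensemble members serves as the
        # uncertainty estimate for this compound.
        spread = pstdev(member_outputs)
        if spread <= max_std:
            results.append(mean(member_outputs))
        else:
            results.append(None)  # abstain: members disagree too much
    return results


if __name__ == "__main__":
    # Two toy compounds: the members agree on the first, not the second.
    print(selective_predict([[2.0, 2.0, 2.0], [1.0, 3.0, 5.0]], max_std=0.5))
```

Raising `max_std` increases coverage (fewer abstentions) at the cost of admitting less certain predictions, which is the trade-off a calibrated cutoff is meant to control.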
Affiliation(s)
- Jeremy Feinstein
- Environmental Science Division, Argonne National Laboratory, Lemont, Illinois 60439, United States
- Ganesh Sivaraman
- Data Science and Learning Division, Argonne National Laboratory, Lemont, Illinois 60439, United States
- Kurt Picel
- Environmental Science Division, Argonne National Laboratory, Lemont, Illinois 60439, United States
- Brian Peters
- Environmental Science Division, Argonne National Laboratory, Lemont, Illinois 60439, United States
- Arvind Ramanathan
- Data Science and Learning Division, Argonne National Laboratory, Lemont, Illinois 60439, United States
- Margaret MacDonell
- Environmental Science Division, Argonne National Laboratory, Lemont, Illinois 60439, United States
- Ian Foster
- Data Science and Learning Division, Argonne National Laboratory, Lemont, Illinois 60439, United States
- Eugene Yan
- Environmental Science Division, Argonne National Laboratory, Lemont, Illinois 60439, United States
9. Cencer MM, Moore JS, Assary RS. Machine learning for polymeric materials: an introduction. Polym Int 2021. [DOI: 10.1002/pi.6345] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 01/11/2023]
Affiliation(s)
- Morgan M Cencer
- Department of Chemistry, University of Illinois at Urbana‐Champaign, Urbana, IL, USA
- Materials Science Division, Argonne National Laboratory, Lemont, IL, USA
- Beckman Institute for Advanced Science and Technology, University of Illinois at Urbana‐Champaign, Urbana, IL, USA
- Jeffrey S Moore
- Department of Chemistry, University of Illinois at Urbana‐Champaign, Urbana, IL, USA
- Beckman Institute for Advanced Science and Technology, University of Illinois at Urbana‐Champaign, Urbana, IL, USA
- Rajeev S Assary
- Materials Science Division, Argonne National Laboratory, Lemont, IL, USA
10. Thomas M, Boardman A, Garcia-Ortegon M, Yang H, de Graaf C, Bender A. Applications of Artificial Intelligence in Drug Design: Opportunities and Challenges. Methods in Molecular Biology (Clifton, N.J.) 2021; 2390:1-59. [PMID: 34731463 DOI: 10.1007/978-1-0716-1787-8_1] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Indexed: 10/19/2022]
Abstract
Artificial intelligence (AI) has undergone rapid development in recent years and has been successfully applied to real-world problems such as drug design. In this chapter, we review recent applications of AI to problems in drug design including virtual screening, computer-aided synthesis planning, and de novo molecule generation, with a focus on the limitations of the application of AI therein and opportunities for improvement. Furthermore, we discuss the broader challenges imposed by AI in translating theoretical practice to real-world drug design, including quantifying prediction uncertainty and explaining model behavior.
Affiliation(s)
- Morgan Thomas
- Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, Cambridge, UK
- Andrew Boardman
- Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, Cambridge, UK
- Miguel Garcia-Ortegon
- Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, Cambridge, UK; Department of Pure Mathematics and Mathematical Statistics, University of Cambridge, Cambridge, UK
- Hongbin Yang
- Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, Cambridge, UK
- Andreas Bender
- Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, Cambridge, UK
11. Wheatle BK, Fuentes EF, Lynd NA, Ganesan V. Design of Polymer Blend Electrolytes through a Machine Learning Approach. Macromolecules 2020. [DOI: 10.1021/acs.macromol.0c01547] [Citation(s) in RCA: 16] [Impact Index Per Article: 4.0] [Indexed: 01/11/2023]
Affiliation(s)
- Bill K. Wheatle
- McKetta Department of Chemical Engineering, The University of Texas at Austin, Austin 78712, Texas, United States
- Erick F. Fuentes
- McKetta Department of Chemical Engineering, The University of Texas at Austin, Austin 78712, Texas, United States
- Nathaniel A. Lynd
- McKetta Department of Chemical Engineering, The University of Texas at Austin, Austin 78712, Texas, United States
- Venkat Ganesan
- McKetta Department of Chemical Engineering, The University of Texas at Austin, Austin 78712, Texas, United States