1
Zhu Y, Li M, Xu C, Lan Z. Quantum Chemistry Dataset with Ground- and Excited-state Properties of 450 Kilo Molecules. Sci Data 2024; 11:948. [PMID: 39209851 PMCID: PMC11362161 DOI: 10.1038/s41597-024-03788-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/04/2024] [Accepted: 08/15/2024] [Indexed: 09/04/2024] Open
Abstract
Due to rapid advancements in deep learning techniques, the demand for large-volume, high-quality datasets is growing significantly in chemical research. We developed a quantum-chemistry database that includes 443,106 small organic molecules with sizes up to 10 heavy atoms including C, N, O, and F. Ground-state geometry optimizations and frequency calculations of all compounds were performed at the B3LYP/6-31G* level with the BJD3 dispersion correction, while the excited-state single-point calculations were conducted at the ωB97X-D/6-31G* level. In total, twenty-seven molecular properties, covering geometric, thermodynamic, electronic, and energetic properties, were gathered from these calculations. We also established a comprehensive protocol for the construction of a high-volume quantum-chemistry dataset. Our QCDGE (Quantum Chemistry Dataset with Ground- and Excited-State Properties) dataset contains a substantial volume of data, exhibits high chemical diversity, and, most importantly, includes excited-state information. This dataset, along with its construction protocol, is expected to have a significant impact on the broad applications of machine learning studies across different fields of chemistry, especially in the area of excited-state research.
Affiliation(s)
- Yifei Zhu
  - SCNU Environmental Research Institute, Guangdong Provincial Key Laboratory of Chemical Pollution and Environmental Safety, MOE Key Laboratory of Environmental Theoretical Chemistry, South China Normal University, Guangzhou, 510006, P. R. China
  - School of Environment, South China Normal University, Guangzhou, 510006, P. R. China
- Mengge Li
  - School of Environment, South China Normal University, Guangzhou, 510006, P. R. China
- Chao Xu
  - SCNU Environmental Research Institute, Guangdong Provincial Key Laboratory of Chemical Pollution and Environmental Safety, MOE Key Laboratory of Environmental Theoretical Chemistry, South China Normal University, Guangzhou, 510006, P. R. China
  - School of Environment, South China Normal University, Guangzhou, 510006, P. R. China
- Zhenggang Lan
  - SCNU Environmental Research Institute, Guangdong Provincial Key Laboratory of Chemical Pollution and Environmental Safety, MOE Key Laboratory of Environmental Theoretical Chemistry, South China Normal University, Guangzhou, 510006, P. R. China
  - School of Environment, South China Normal University, Guangzhou, 510006, P. R. China

2
Sarangi R, Maity S, Acharya A. Machine Learning Approach to Vertical Energy Gap in Redox Processes. J Chem Theory Comput 2024; 20:6747-6755. [PMID: 39044422 PMCID: PMC11325558 DOI: 10.1021/acs.jctc.4c00715] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/25/2024]
Abstract
A straightforward approach to calculating the free energy change (ΔG) and reorganization energy of a redox process is linear response approximation (LRA). However, accurate prediction of redox properties is still challenging due to difficulties in conformational sampling and vertical energy-gap sampling. Expensive hybrid quantum mechanical/molecular mechanical (QM/MM) calculations are typically employed in sampling energy gaps using conformations from simulations. To alleviate the computational cost associated with the expensive QM method in the QM/MM calculation, we propose machine learning (ML) methods to predict the vertical energy gaps (VEGs). We tested several ML models to predict the VEGs and observed that simple models like linear regression show excellent performance (mean absolute error ∼0.1 eV) in predicting VEGs in all test systems, even when using features extracted from cheaper semiempirical methods. Our best ML model (extra trees regressor) shows a mean absolute error of around 0.1 eV while using features from the cheapest QM method. We anticipate our approach can be generalized to larger macromolecular systems with more complex redox centers.
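
For readers unfamiliar with the linear response approximation named in this abstract, the sketch below shows the textbook LRA estimate: the free energy change is taken as the average of the mean vertical energy gaps sampled in the two redox states, and the reorganization energy as half their difference. The function name, variable names, and gap values are illustrative assumptions rather than the authors' code, and sign conventions differ between papers.

```python
from statistics import mean


def lra_estimates(gaps_on_initial_state, gaps_on_final_state):
    """Textbook LRA estimate from sampled vertical energy gaps (eV).

    Each gap is E(final state) - E(initial state) evaluated on one
    conformation. The first list uses conformations sampled in the
    initial redox state, the second uses conformations sampled in the
    final redox state.
    """
    avg_initial = mean(gaps_on_initial_state)
    avg_final = mean(gaps_on_final_state)
    delta_g = 0.5 * (avg_initial + avg_final)          # free energy change
    reorganization = 0.5 * (avg_initial - avg_final)   # reorganization energy
    return delta_g, reorganization


# Made-up gap values, for illustration only.
gaps_initial = [1.30, 1.10, 1.20]
gaps_final = [0.20, 0.40, 0.30]
delta_g, reorganization = lra_estimates(gaps_initial, gaps_final)
print(f"dG ~ {delta_g:.2f} eV, lambda ~ {reorganization:.2f} eV")
```

In a workflow like the one the abstract describes, the gap lists would come from an ML model's predictions instead of QM/MM single points; the averaging step stays the same.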
Affiliation(s)
- Ronit Sarangi
  - Department of Chemistry, Syracuse University, Syracuse, New York 13244, United States
- Suman Maity
  - Department of Chemistry, Syracuse University, Syracuse, New York 13244, United States
- Atanu Acharya
  - Department of Chemistry, Syracuse University, Syracuse, New York 13244, United States
  - BioInspired Syracuse, Syracuse University, Syracuse, New York 13244, United States

3
Terrones GG, Huang SP, Rivera MP, Yue S, Hernandez A, Kulik HJ. Metal-Organic Framework Stability in Water and Harsh Environments from Data-Driven Models Trained on the Diverse WS24 Data Set. J Am Chem Soc 2024; 146:20333-20348. [PMID: 38984798 DOI: 10.1021/jacs.4c05879] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/11/2024]
Abstract
Metal-organic frameworks (MOFs) are porous materials with applications in gas separations and catalysis, but a lack of water stability often limits their practical use given the ubiquity of water. Consequently, it is useful to predict whether a MOF is water-stable before investing time and resources into synthesis. Existing heuristics for designing water-stable MOFs lack generality and limit the diversity of explored chemistry due to narrowly defined criteria. Machine learning (ML) models offer the promise to improve the generality of predictions but require data. In an improvement on previous efforts, we enlarge the available training data for MOF water stability prediction by over 400%, adding 911 MOFs with water stability labels assigned through semiautomated manuscript analysis to curate the new data set WS24. The additional data are shown to improve ML model performance (test ROC-AUC > 0.8) over diverse chemistry for the prediction of both water stability and stability in harsher acidic conditions. We illustrate how the expanded data set and models can be used with a previously developed activation stability model in combination with genetic algorithms to quickly screen ∼10,000 MOFs from a space of hundreds of thousands for candidates with multivariate stability (upon activation, in water, and in acid). We uncover metal- and geometry-specific design rules for robust MOFs. The data set and ML models developed in this work, which we disseminate through an easy-to-use web interface, are expected to contribute toward the accelerated discovery of novel, water-stable MOFs for applications such as direct air gas capture and water treatment.
Affiliation(s)
- Gianmarco G Terrones
  - Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
- Shih-Peng Huang
  - Department of Chemistry, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
- Matthew P Rivera
  - Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
- Shuwen Yue
  - Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
- Alondra Hernandez
  - Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
- Heather J Kulik
  - Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
  - Department of Chemistry, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States

4
Tempke R, Musho T. Autonomous generation of single photon emitting materials. Nanoscale 2024; 16:10239-10249. [PMID: 38726673 DOI: 10.1039/d3nr04944b] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/31/2024]
Abstract
The utilization of machine learning in Materials Science underscores the critical importance of the quality and quantity of data in training models effectively. Unlike fields such as image processing and natural language processing, there is limited availability of atomistic datasets, leading to biases in training data. Particularly in the domain of materials discovery, there exists an issue of continuity in atomistic datasets. Experimental data sourced from literature and patents is usually only available for favorable data, resulting in bias in the training dataset. This study focuses on developing a SMILES-based model for generating synthetic datasets of quantum materials using a variational autoencoder. This study centers on the generation of a synthetic dataset of quantum materials specifically for quantum sensing applications, with a focus on two-level quantum molecules that exhibit a dipole blockade. The proposed technique offers an improved sampling algorithm by incorporating newly generated data into the sampling algorithm to create a more normally distributed dataset. Through this technique, the study was able to generate over 1 000 000 candidate quantum materials from a small dataset of only 8000 materials. The generated dataset identified several iodine-containing molecules as promising single photon emitting materials for potential quantum sensing applications.
Affiliation(s)
- Robert Tempke
  - Department of Mechanical, Materials and Aerospace Engineering, West Virginia University, P.O. Box 6106, Morgantown, WV, USA
- Terence Musho
  - Department of Mechanical, Materials and Aerospace Engineering, West Virginia University, P.O. Box 6106, Morgantown, WV, USA

5
Raush E, Abagyan R, Totrov M. Efficient Generation of Conformer Ensembles Using Internal Coordinates and a Generative Directional Graph Convolution Neural Network. J Chem Theory Comput 2024; 20:4054-4063. [PMID: 38669307 DOI: 10.1021/acs.jctc.4c00280] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/28/2024]
Abstract
We present a neural-network-based high-throughput molecular conformer-generation algorithm. A chemical graph-convolutional network is trained to predict low-energy conformers in internal coordinate representation (bond lengths, bond, and torsion angles), starting from two-dimensional (2D) chemical topology. Generative neural network (NN) architecture performs denoising from torsion space, producing conformer ensembles with populations that are well correlated with torsion energy profiles. Short force-field-based energy minimization is applied to refine final conformers. All computation-intensive stages of the algorithm are GPU-optimized. The procedure (termed GINGER) is benchmarked on a commonly used test set of bioactive three-dimensional (3D) conformers from the PDB. We demonstrate highly competitive results in conformer recovery and throughput rates suitable for giga-scale compound library processing. A web server that allows interactive conformer ensemble generation by GINGER and their viewing is made freely available at https://www.molsoft.com/gingerdemo.html.
Affiliation(s)
- Eugene Raush
  - Molsoft L.L.C., 11199 Sorrento Valley Road, S209, San Diego, California 92121, United States
- Ruben Abagyan
  - Skaggs School of Pharmacy and Pharmaceutical Sciences, University of California San Diego, La Jolla, California 92093, United States
- Maxim Totrov
  - Molsoft L.L.C., 11199 Sorrento Valley Road, S209, San Diego, California 92121, United States

6
Lee AS, Elliott S, Harb H, Ward L, Foster I, Curtiss L, Assary RS. Emin: A First-Principles Thermochemical Descriptor for Predicting Molecular Synthesizability. J Chem Inf Model 2024; 64:1277-1289. [PMID: 38359461 DOI: 10.1021/acs.jcim.3c01583] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/17/2024]
Abstract
Predicting the synthesizability of a new molecule remains an unsolved challenge that chemists have long tackled with heuristic approaches. Here, we report a new method for predicting synthesizability using a simple yet accurate thermochemical descriptor. We introduce Emin, the energy difference between a molecule and its lowest energy constitutional isomer, as a synthesizability predictor that is accurate, physically meaningful, and first-principles based. We apply Emin to 134,000 molecules in the QM9 data set and find that Emin is accurate when used alone and reduces incorrect predictions of "synthesizable" by up to 52% when used to augment commonly used prediction methods. Our work illustrates how first-principles thermochemistry and heuristic approximations for molecular stability are complementary, opening a new direction for synthesizability prediction methods.
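
The descriptor defined in this abstract is simple enough to express directly. The sketch below computes, for each molecule, the energy difference to the lowest-energy entry that shares its molecular formula. Grouping by formula string as a stand-in for "constitutional isomers", the record layout, and the energy values are all assumptions made for illustration; the paper's actual workflow is not described in the abstract.

```python
def emin_by_molecule(records):
    """Return {name: E(molecule) - E(lowest-energy isomer)}.

    `records` is a list of (name, molecular_formula, energy) tuples.
    Molecules with the same formula are treated as constitutional
    isomers of one another (an illustrative simplification).
    """
    lowest = {}
    for _, formula, energy in records:
        if formula not in lowest or energy < lowest[formula]:
            lowest[formula] = energy
    return {name: energy - lowest[formula] for name, formula, energy in records}


# Hypothetical energies in arbitrary units, not taken from QM9.
records = [
    ("ethanol", "C2H6O", -154.00),
    ("dimethyl ether", "C2H6O", -153.98),
    ("methane", "CH4", -40.50),
]
emin = emin_by_molecule(records)
print(emin)
```

A molecule that is itself the lowest-energy isomer gets a value of zero; larger values indicate a less stable isomer relative to the best alternative with the same composition.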
Affiliation(s)
- Andrew S Lee
  - Department of Materials Science and Engineering, Northwestern University, Evanston, Illinois 60208, United States
- Sarah Elliott
  - Chemical Sciences and Engineering Division, Argonne National Laboratory, Lemont, Illinois 60439, United States
- Hassan Harb
  - Materials Science Division, Argonne National Laboratory, Lemont, Illinois 60439, United States
- Logan Ward
  - Data Science and Learning Division, Argonne National Laboratory, Lemont, Illinois 60439, United States
- Ian Foster
  - Data Science and Learning Division, Argonne National Laboratory, Lemont, Illinois 60439, United States
- Larry Curtiss
  - Materials Science Division, Argonne National Laboratory, Lemont, Illinois 60439, United States
- Rajeev S Assary
  - Materials Science Division, Argonne National Laboratory, Lemont, Illinois 60439, United States

7
Viswanathan K, Goel M, Laghuvarapu S, Varma G, Priyakumar UD. Streamlining pipeline efficiency: a novel model-agnostic technique for accelerating conditional generative and virtual screening pipelines. Sci Rep 2023; 13:21069. [PMID: 38030689 PMCID: PMC10686981 DOI: 10.1038/s41598-023-42952-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/07/2023] [Accepted: 09/16/2023] [Indexed: 12/01/2023] Open
Abstract
The discovery of potential therapeutic agents for life-threatening diseases has become a significant problem. There is a requirement for fast and accurate methods to identify drug-like molecules that can be used as potential candidates for novel targets. Existing techniques like high-throughput screening and virtual screening are time-consuming and inefficient. Traditional molecule generation pipelines are more efficient than virtual screening but use time-consuming docking software. Such docking functions can be emulated using Machine Learning models with comparable accuracy and faster execution times. However, we find that when pre-trained machine learning models are employed in generative pipelines as oracles, they suffer from model degradation in areas where data is scarce. In this study, we propose an active learning-based model that can be added as a supplement to enhanced molecule generation architectures. The proposed method uses uncertainty sampling on the molecules created by the generator model and dynamically learns as the generator samples molecules from different regions of the chemical space. The proposed framework can generate molecules with high binding affinity with [Formula: see text]a 70% improvement in runtime compared to the baseline model by labeling only [Formula: see text]30% of molecules compared to the baseline oracle.
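
The uncertainty-sampling step mentioned in this abstract can be illustrated with a small sketch: among generated molecules, only those where the surrogate model is least certain are sent to the expensive oracle for labeling. Measuring uncertainty as the disagreement (standard deviation) across an ensemble of surrogate predictions is an assumption made here for illustration; the abstract does not say how the authors quantify uncertainty, and all names and scores below are hypothetical.

```python
from statistics import pstdev


def select_for_labeling(ensemble_predictions, budget):
    """Pick the `budget` molecules with the largest ensemble disagreement.

    `ensemble_predictions` maps a molecule id to the list of scores
    predicted for it by each surrogate model in the ensemble.
    """
    ranked = sorted(
        ensemble_predictions,
        key=lambda mol: pstdev(ensemble_predictions[mol]),
        reverse=True,  # most uncertain first
    )
    return ranked[:budget]


# Hypothetical docking-score predictions from three surrogate models.
predictions = {
    "mol_a": [-7.1, -7.0, -7.2],
    "mol_b": [-5.0, -8.5, -6.7],
    "mol_c": [-9.0, -8.9, -9.1],
    "mol_d": [-6.0, -7.5, -6.6],
}
to_label = select_for_labeling(predictions, budget=2)
print(to_label)
```

Molecules outside the selected set keep their surrogate-predicted scores, which is where the runtime saving in such a scheme would come from.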
Affiliation(s)
- Karthik Viswanathan
  - Center for Computational Natural Sciences and Bioinformatics, International Institute of Information Technology, Hyderabad, 500032, India
- Manan Goel
  - Center for Computational Natural Sciences and Bioinformatics, International Institute of Information Technology, Hyderabad, 500032, India
- Siddhartha Laghuvarapu
  - Center for Computational Natural Sciences and Bioinformatics, International Institute of Information Technology, Hyderabad, 500032, India
- Girish Varma
  - Center for Computational Natural Sciences and Bioinformatics, International Institute of Information Technology, Hyderabad, 500032, India
- U Deva Priyakumar
  - Center for Computational Natural Sciences and Bioinformatics, International Institute of Information Technology, Hyderabad, 500032, India

8
Xia S, Chen E, Zhang Y. Integrated Molecular Modeling and Machine Learning for Drug Design. J Chem Theory Comput 2023; 19:7478-7495. [PMID: 37883810 PMCID: PMC10653122 DOI: 10.1021/acs.jctc.3c00814] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 07/26/2023] [Revised: 10/10/2023] [Accepted: 10/11/2023] [Indexed: 10/28/2023]
Abstract
Modern therapeutic development often involves several stages that are interconnected, and multiple iterations are usually required to bring a new drug to the market. Computational approaches have increasingly become an indispensable part of helping reduce the time and cost of the research and development of new drugs. In this Perspective, we summarize our recent efforts on integrating molecular modeling and machine learning to develop computational tools for modulator design, including a pocket-guided rational design approach based on AlphaSpace to target protein-protein interactions, delta machine learning scoring functions for protein-ligand docking as well as virtual screening, and state-of-the-art deep learning models to predict calculated and experimental molecular properties based on molecular mechanics optimized geometries. Meanwhile, we discuss remaining challenges and promising directions for further development and use a retrospective example of FDA approved kinase inhibitor Erlotinib to demonstrate the use of these newly developed computational tools.
Affiliation(s)
- Song Xia
  - Department of Chemistry, New York University, New York, New York 10003, United States
- Eric Chen
  - Department of Chemistry, New York University, New York, New York 10003, United States
- Yingkai Zhang
  - Department of Chemistry, New York University, New York, New York 10003, United States
  - Simons Center for Computational Physical Chemistry at New York University, New York, New York 10003, United States
  - NYU-ECNU Center for Computational Chemistry at NYU Shanghai, Shanghai 200062, China

9
Li CH, Tabor DP. Generative organic electronic molecular design informed by quantum chemistry. Chem Sci 2023; 14:11045-11055. [PMID: 37860647 PMCID: PMC10583709 DOI: 10.1039/d3sc03781a] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/22/2023] [Accepted: 09/11/2023] [Indexed: 10/21/2023] Open
Abstract
Generative molecular design strategies have emerged as promising alternatives to trial-and-error approaches for exploring and optimizing within large chemical spaces. To date, generative models with reinforcement learning approaches have frequently used low-cost methods to evaluate the quality of the generated molecules, enabling many loops through the generative model. However, for functional molecular materials tasks, such low-cost methods are either not available or would require the generation of large amounts of training data to train surrogate machine learning models. In this work, we develop a framework that connects the REINVENT reinforcement learning framework with excited state quantum chemistry calculations to discover molecules with specified molecular excited state energy levels, specifically molecules with excited state landscapes that would serve as promising singlet fission or triplet-triplet annihilation materials. We employ a two-step curriculum strategy to first find a set of diverse promising molecules, then demonstrate the framework's ability to exploit a more focused chemical space with anthracene derivatives. Under this protocol, we show that the framework can find desired molecules and improve Pareto fronts for targeted properties versus synthesizability. Moreover, we are able to find several different design principles used by chemists for the design of singlet fission and triplet-triplet annihilation molecules.
Affiliation(s)
- Cheng-Han Li
  - Department of Chemistry, Texas A&M University, College Station, TX 77842, USA
- Daniel P Tabor
  - Department of Chemistry, Texas A&M University, College Station, TX 77842, USA

10
Nakata M, Maeda T. PubChemQC B3LYP/6-31G*//PM6 Data Set: The Electronic Structures of 86 Million Molecules Using B3LYP/6-31G* Calculations. J Chem Inf Model 2023; 63:5734-5754. [PMID: 37677147 DOI: 10.1021/acs.jcim.3c00899] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/09/2023]
Abstract
The presented "PubChemQC B3LYP/6-31G*//PM6" data set is composed of the electronic properties of 85,938,443 molecules, encompassing a broad spectrum of molecules from essential compounds to biomolecules with a molecular weight up to 1000. These molecules account for 94.0% of the original PubChem Compound catalog as of August 29, 2016. The electronic properties, including orbitals, orbital energies, total energies, dipole moments, and other pertinent properties, were computed by using the B3LYP/6-31G* and PM6 methods. The data set, available in three formats, namely, GAMESS quantum chemistry program files, selected JSON output files, and a PostgreSQL database, provides researchers with the ability to query molecular properties. It is further subdivided into five subdata sets for more specific data. The first two subsets encompass molecules with carbon, hydrogen, oxygen, and nitrogen with molecular weights under 300 and 500, respectively. The third and fourth subsets incorporate molecules with carbon, hydrogen, nitrogen, oxygen, phosphorus, sulfur, fluorine, and chlorine, with molecular weights under 300 and 500, respectively. The fifth subset comprises molecules with carbon, hydrogen, nitrogen, oxygen, phosphorus, sulfur, fluorine, chlorine, sodium, potassium, magnesium, and calcium, with a molecular weight of under 500. The coefficients of determination for the highest occupied molecular orbital-lowest unoccupied molecular orbital energy gap range from 0.892 (for CHON500) to 0.803 (for the whole data set). These comprehensive results pave the way for applications in drug discovery and materials science, among others. The data sets can be accessed under the Creative Commons Attribution 4.0 International license at the following web address: https://nakatamaho.riken.jp/pubchemqc.riken.jp/b3lyp_pm6_datasets.html.
Affiliation(s)
- Maho Nakata
  - RIKEN Cluster for Pioneering Research, 2-1 Hirosawa, Wako, Saitama 351-0198, Japan
- Toshiyuki Maeda
  - Software Technology and Artificial Intelligence Research Laboratory, Chiba Institute of Technology, 2-17-1 Tsudanuma, Narashino, Chiba 275-0016, Japan

11
Morgan JP, Paiement A, Klinke C. Domain-informed graph neural networks: A quantum chemistry case study. Neural Netw 2023; 165:938-952. [PMID: 37453397 DOI: 10.1016/j.neunet.2023.06.030] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/24/2022] [Revised: 05/05/2023] [Accepted: 06/24/2023] [Indexed: 07/18/2023]
Abstract
We explore different strategies to integrate prior domain knowledge into the design of graph neural networks (GNN). Our study is supported by a use-case of estimating the potential energy of chemical systems (molecules and crystals) represented as graphs. We integrate two elements of domain knowledge into the design of the GNN to constrain and regularise its learning, towards higher accuracy and generalisation. First, knowledge on the existence of different types of relations/graph edges (e.g. chemical bonds in our case study) between nodes of the graph is used to modulate their interactions. We formulate and compare two strategies, namely specialised message production and specialised update of internal states. Second, knowledge of the relevance of some physical quantities is used to constrain the learnt features towards a higher physical relevance using a simple multi-task learning (MTL) paradigm. We explore the potential of MTL to better capture the underlying mechanisms behind the studied phenomenon. We demonstrate the general applicability of our two knowledge integrations by applying them to three architectures that rely on different mechanisms to propagate information between nodes and to update node states. Our implementations are made publicly available. To support these experiments, we release three new datasets of out-of-equilibrium molecules and crystals of various complexities.
Affiliation(s)
- Jay Paul Morgan
  - Université de Toulon, Aix Marseille Univ, CNRS, LIS, Marseille, France
  - Department of Computer Science, Swansea University, Swansea, SA2 8PP, United Kingdom
- Adeline Paiement
  - Université de Toulon, Aix Marseille Univ, CNRS, LIS, Marseille, France
  - Department of Computer Science, Swansea University, Swansea, SA2 8PP, United Kingdom
- Christian Klinke
  - Institute of Physics, University of Rostock, Rostock, 18059, Germany
  - Department "Life, Light & Matter", University of Rostock, Rostock, 18059, Germany
  - Department of Chemistry, Swansea University, Swansea, SA2 8PP, United Kingdom

12
Yan X, Yue T, Winkler DA, Yin Y, Zhu H, Jiang G, Yan B. Converting Nanotoxicity Data to Information Using Artificial Intelligence and Simulation. Chem Rev 2023. [PMID: 37262026 DOI: 10.1021/acs.chemrev.3c00070] [Citation(s) in RCA: 9] [Impact Index Per Article: 9.0] [Indexed: 06/03/2023]
Abstract
Decades of nanotoxicology research have generated extensive and diverse data sets. However, data is not equal to information. The question is how to extract critical information buried in vast data streams. Here we show that artificial intelligence (AI) and molecular simulation play key roles in transforming nanotoxicity data into critical information, i.e., constructing the quantitative nanostructure (physicochemical properties)-toxicity relationships, and elucidating the toxicity-related molecular mechanisms. For AI and molecular simulation to realize their full impacts in this mission, several obstacles must be overcome. These include the paucity of high-quality nanomaterials (NMs) and standardized nanotoxicity data, the lack of model-friendly databases, the scarcity of specific and universal nanodescriptors, and the inability to simulate NMs at realistic spatial and temporal scales. This review provides a comprehensive and representative, but not exhaustive, summary of the current capability gaps and tools required to fill these formidable gaps. Specifically, we discuss the applications of AI and molecular simulation, which can address the large-scale data challenge for nanotoxicology research. The need for model-friendly nanotoxicity databases, powerful nanodescriptors, new modeling approaches, molecular mechanism analysis, and design of the next-generation NMs are also critically discussed. Finally, we provide a perspective on future trends and challenges.
Affiliation(s)
- Xiliang Yan
  - Institute of Environmental Research at the Greater Bay Area, Key Laboratory for Water Quality and Conservation of the Pearl River Delta, Ministry of Education, Guangzhou University, Guangzhou 510006, China
- Tongtao Yue
  - Key Laboratory of Marine Environment and Ecology, Ministry of Education, Institute of Coastal Environmental Pollution Control, Ocean University of China, Qingdao 266100, China
- David A Winkler
  - Monash Institute of Pharmaceutical Sciences, Monash University, Parkville, Victoria 3052, Australia
  - School of Pharmacy, University of Nottingham, Nottingham NG7 2QL, U.K.
  - Department of Biochemistry and Chemistry, La Trobe Institute for Molecular Science, La Trobe University, Melbourne, Victoria 3086, Australia
- Yongguang Yin
  - State Key Laboratory of Environmental Chemistry and Ecotoxicology, Research Center for Eco-Environmental Sciences, Chinese Academy of Sciences, Beijing 100085, China
- Hao Zhu
  - Department of Chemistry and Biochemistry, Rowan University, Glassboro, New Jersey 08028, United States
- Guibin Jiang
  - State Key Laboratory of Environmental Chemistry and Ecotoxicology, Research Center for Eco-Environmental Sciences, Chinese Academy of Sciences, Beijing 100085, China
- Bing Yan
  - Institute of Environmental Research at the Greater Bay Area, Key Laboratory for Water Quality and Conservation of the Pearl River Delta, Ministry of Education, Guangzhou University, Guangzhou 510006, China

13
Zhao Q, Vaddadi SM, Woulfe M, Ogunfowora LA, Garimella SS, Isayev O, Savoie BM. Comprehensive exploration of graphically defined reaction spaces. Sci Data 2023; 10:145. [PMID: 36935430 PMCID: PMC10025260 DOI: 10.1038/s41597-023-02043-z] [Citation(s) in RCA: 6] [Impact Index Per Article: 6.0] [Received: 11/17/2022] [Accepted: 02/27/2023] [Indexed: 03/21/2023] Open
Abstract
Existing reaction transition state (TS) databases are comparatively small and lack chemical diversity. Here, this data gap has been addressed using the concept of a graphically-defined model reaction to comprehensively characterize a reaction space associated with C, H, O, and N containing molecules with up to 10 heavy (non-hydrogen) atoms. The resulting dataset is composed of 176,992 organic reactions possessing at least one validated TS, activation energy, heat of reaction, reactant and product geometries, frequencies, and atom-mapping. For 33,032 reactions, more than one TS was discovered by conformational sampling, allowing conformational errors in TS prediction to be assessed. Data is supplied at the GFN2-xTB and B3LYP-D3/TZVP levels of theory. A subset of reactions were recalculated at the CCSD(T)-F12/cc-pVDZ-F12 and ωB97X-D2/def2-TZVP levels to establish relative errors. The resulting collection of reactions and properties are called the Reaction Graph Depth 1 (RGD1) dataset. RGD1 represents the largest and most chemically diverse TS dataset published to date and should find immediate use in developing novel machine learning models for predicting reaction properties.
Affiliation(s)
- Qiyuan Zhao
  - Davidson School of Chemical Engineering, Purdue University, West Lafayette, IN, 47906, USA
- Sai Mahit Vaddadi
  - Davidson School of Chemical Engineering, Purdue University, West Lafayette, IN, 47906, USA
- Michael Woulfe
  - Davidson School of Chemical Engineering, Purdue University, West Lafayette, IN, 47906, USA
- Lawal A Ogunfowora
  - Department of Chemistry, Purdue University, West Lafayette, IN, 47906, USA
- Sanjay S Garimella
  - Davidson School of Chemical Engineering, Purdue University, West Lafayette, IN, 47906, USA
- Olexandr Isayev
  - Department of Chemistry, Carnegie Mellon University, Pittsburgh, PA, 15213, USA
- Brett M Savoie
  - Davidson School of Chemical Engineering, Purdue University, West Lafayette, IN, 47906, USA
| |
Collapse
|
14
|
Belenahalli Shekarappa S, Kandagalla S, Lee J. Development of machine learning models based on molecular fingerprints for selection of small molecule inhibitors against JAK2 protein. J Comput Chem 2023; 44:1493-1504. [PMID: 36929511 DOI: 10.1002/jcc.27103] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/28/2022] [Revised: 02/18/2023] [Accepted: 02/24/2023] [Indexed: 03/18/2023]
Abstract
Janus kinase 2 (JAK2) is emerging as a potential therapeutic target for many inflammatory diseases such as myeloproliferative disorders (MPD), cancer and rheumatoid arthritis (RA). In this study, we have collected experimental data of JAK2 protein containing 6021 unique inhibitors. We then characterized them based on Morgan (ECFP6) fingerprints followed by clustering into training and test set based on their molecular scaffolds. These data were used to build the classification models with various supervised machine learning (ML) algorithms that could prioritize novel inhibitors for future drug development against JAK2 protein. The best model built by Random Forest (RF) and Morgan fingerprints achieved the G-mean value of 0.84 on the external test set. As an application of our classification model, virtual screening was performed against Drugbank molecules in order to identify the potential inhibitors based on the confidence score by RF model. Nine potential molecules were identified, which were further subject to molecular docking studies to evaluate the virtual screening results of the best RF model. This proposed method can prove useful for developing novel target-specific JAK2 inhibitors.
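The G-mean quoted in the abstract above is not defined there. The following is a minimal sketch that assumes the common definition for binary classifiers, the geometric mean of sensitivity and specificity. The function name and the confusion-matrix counts are hypothetical and are not taken from the paper.

```python
import math


def g_mean(tp: int, fn: int, tn: int, fp: int) -> float:
    """Geometric mean of sensitivity and specificity (assumed definition)."""
    positives = tp + fn
    negatives = tn + fp
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative sample")
    sensitivity = tp / positives  # true-positive rate
    specificity = tn / negatives  # true-negative rate
    return math.sqrt(sensitivity * specificity)


# Hypothetical counts, for illustration only (not results from the study).
example = g_mean(tp=80, fn=20, tn=90, fp=10)
print(f"G-mean = {example:.3f}")
```

Unlike plain accuracy, this score falls to zero when either class is entirely misclassified, which makes it a common choice for imbalanced active/inactive sets.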
Affiliation(s)
- Sharath Belenahalli Shekarappa
  - School of Systems Biomedical Science and Department of Bioinformatics and Life Science, Soongsil University, Seoul, South Korea
- Shivananda Kandagalla
  - Laboratory of Computational Modeling of Drugs, Higher Medical & Biological School, South Ural State University, Chelyabinsk, Russia
- Julian Lee
  - School of Systems Biomedical Science and Department of Bioinformatics and Life Science, Soongsil University, Seoul, South Korea
|
15
|
Guo J, Sun M, Zhao X, Shi C, Su H, Guo Y, Pu X. General Graph Neural Network-Based Model To Accurately Predict Cocrystal Density and Insight from Data Quality and Feature Representation. J Chem Inf Model 2023; 63:1143-1156. [PMID: 36734616 DOI: 10.1021/acs.jcim.2c01538] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/04/2023]
Abstract
Cocrystal engineering as an effective way to modify solid-state properties has inspired great interest from diverse material fields while cocrystal density is an important property closely correlated with the material function. In order to accurately predict the cocrystal density, we develop a graph neural network (GNN)-based deep learning framework by considering three key factors of machine learning (data quality, feature presentation, and model architecture). The result shows that different stoichiometric ratios of molecules in cocrystals can significantly influence the prediction performances, highlighting the importance of data quality. In addition, the feature complementary is not suitable for augmenting the molecular graph representation in the cocrystal density prediction, suggesting that the complementary strategy needs to consider whether extra features can sufficiently supplement the lacked information in the original representation. Based on these results, 4144 cocrystals with 1:1 stoichiometry ratio are selected as the dataset, supplemented by the data augmentation of exchanging a pair of coformers. The molecular graph is determined to learn feature representation to train the GNN-based model. Global attention is introduced to further optimize the feature space and identify important atoms to realize the interpretability of the model. Benefited from the advantages, our model significantly outperforms three competitive models and exhibits high prediction accuracy for unseen cocrystals, showcasing its robustness and generality. Overall, our work not only provides a general cocrystal density prediction tool for experimental investigations but also provides useful guidelines for the machine learning application. All source codes are freely available at https://github.com/Xiao-Gua00/CCPGraph.
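The augmentation step mentioned in the abstract, exchanging the pair of coformers, can be pictured as follows. This is a minimal sketch that assumes each 1:1 cocrystal is stored as an ordered pair of coformer identifiers with one density label. The record layout, the function name, and the sample values are hypothetical and are not taken from the CCPGraph code.

```python
from typing import List, Tuple

# (coformer_a, coformer_b, density): an assumed record layout
Record = Tuple[str, str, float]


def augment_by_swapping(records: List[Record]) -> List[Record]:
    """Add the (b, a) ordering of every (a, b) pair with the same label."""
    seen = set()
    augmented: List[Record] = []
    for a, b, density in records:
        for pair in ((a, b), (b, a)):
            if pair not in seen:  # skip duplicates, e.g. when a == b
                seen.add(pair)
                augmented.append((pair[0], pair[1], density))
    return augmented


# Placeholder identifiers and density values, for illustration only.
data = [("coformer_A", "coformer_B", 1.45), ("coformer_C", "coformer_C", 1.30)]
for row in augment_by_swapping(data):
    print(row)
```

Because the density of a cocrystal does not depend on the order in which its two coformers are listed, both orderings share one label.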
Affiliation(s)
- Jiali Guo
  - College of Chemistry, Sichuan University, Chengdu 610064, People's Republic of China
- Ming Sun
  - College of Chemistry, Sichuan University, Chengdu 610064, People's Republic of China
- Xueyan Zhao
  - Institute of Chemical Materials, China Academy of Engineering Physics, Mianyang 621900, China
- Chaojie Shi
  - College of Chemistry, Sichuan University, Chengdu 610064, People's Republic of China
- Haoming Su
  - College of Chemistry, Sichuan University, Chengdu 610064, People's Republic of China
- Yanzhi Guo
  - College of Chemistry, Sichuan University, Chengdu 610064, People's Republic of China
- Xuemei Pu
  - College of Chemistry, Sichuan University, Chengdu 610064, People's Republic of China
|
16
|
Kondratyev V, Dryzhakov M, Gimadiev T, Slutskiy D. Generative model based on junction tree variational autoencoder for HOMO value prediction and molecular optimization. J Cheminform 2023; 15:11. [PMID: 36732800 PMCID: PMC9893566 DOI: 10.1186/s13321-023-00681-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/22/2022] [Accepted: 01/06/2023] [Indexed: 02/04/2023] Open
Abstract
In this work, we provide further development of the junction tree variational autoencoder (JT VAE) architecture in terms of implementation and application of the internal feature space of the model. Pretraining of JT VAE on a large dataset and further optimization with a regression model led to a latent space that can solve several tasks simultaneously: prediction, generation, and optimization. We use the ZINC database as a source of molecules for the JT VAE pretraining and the QM9 dataset with its HOMO values to show the application case. We evaluate our model on multiple tasks such as property (value) prediction, generation of new molecules with predefined properties, and structure modification toward the property. Across these tasks, our model shows improvements in generation and optimization tasks while preserving the precision of state-of-the-art models.
Affiliation(s)
- Vladimir Kondratyev
  - Computer Science and Artificial Intelligence Laboratory, ENGIE Lab CRIGEN, 4 rue Josephine Baker, 93240 Stains, France
  - Telecom Paris, 19 Place Marguerite Perey, CS 20031, 91123 Palaiseau, France
- Marian Dryzhakov
  - Computer Science and Artificial Intelligence Laboratory, ENGIE Lab CRIGEN, 4 rue Josephine Baker, 93240 Stains, France
- Timur Gimadiev
  - Laboratory of Chemoinformatics and Molecular Modeling, Butlerov Institute of Chemistry, Kazan Federal University, 18 Kremlyovskaya str., 420008 Kazan, Russia
  - Federal Research Center “Kazan Scientific Center of Russian Academy of Sciences”, 420008 Kazan, Russia
  - JSC “BIOCAD”, Petrodvortsoviy District, Strelna, Svyazi St., Bld. 34, Liter A., 198515 St. Petersburg, Russia
- Dmitriy Slutskiy
  - Computer Science and Artificial Intelligence Laboratory, ENGIE Lab CRIGEN, 4 rue Josephine Baker, 93240 Stains, France
|
17
|
Kříž K, Schmidt L, Andersson AT, Walz MM, van der Spoel D. An Imbalance in the Force: The Need for Standardized Benchmarks for Molecular Simulation. J Chem Inf Model 2023; 63:412-431. [PMID: 36630710 PMCID: PMC9875315 DOI: 10.1021/acs.jcim.2c01127] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 09/06/2022] [Indexed: 01/12/2023]
Abstract
Force fields (FFs) for molecular simulation have been under development for more than half a century. As with any predictive model, rigorous testing and comparisons of models critically depends on the availability of standardized data sets and benchmarks. While such benchmarks are rather common in the fields of quantum chemistry, this is not the case for empirical FFs. That is, few benchmarks are reused to evaluate FFs, and development teams rather use their own training and test sets. Here we present an overview of currently available tests and benchmarks for computational chemistry, focusing on organic compounds, including halogens and common ions, as FFs for these are the most common ones. We argue that many of the benchmark data sets from quantum chemistry can in fact be reused for evaluating FFs, but new gas phase data is still needed for compounds containing phosphorus and sulfur in different valence states. In addition, more nonequilibrium interaction energies and forces, as well as molecular properties such as electrostatic potentials around compounds, would be beneficial. For the condensed phases there is a large body of experimental data available, and tools to utilize these data in an automated fashion are under development. If FF developers, as well as researchers in artificial intelligence, would adopt a number of these data sets, it would become easier to compare the relative strengths and weaknesses of different models and to, eventually, restore the balance in the force.
Affiliation(s)
- Kristian Kříž
  - Department of Cell and Molecular Biology, Uppsala University, Box 596, SE-75124 Uppsala, Sweden
- Lisa Schmidt
  - Faculty of Biosciences, University of Heidelberg, Heidelberg 69117, Germany
- Alfred T. Andersson
  - Department of Cell and Molecular Biology, Uppsala University, Box 596, SE-75124 Uppsala, Sweden
- Marie-Madeleine Walz
  - Department of Cell and Molecular Biology, Uppsala University, Box 596, SE-75124 Uppsala, Sweden
- David van der Spoel
  - Department of Cell and Molecular Biology, Uppsala University, Box 596, SE-75124 Uppsala, Sweden
|
18
|
Xia S, Zhang D, Zhang Y. Multitask Deep Ensemble Prediction of Molecular Energetics in Solution: From Quantum Mechanics to Experimental Properties. J Chem Theory Comput 2023; 19:10.1021/acs.jctc.2c01024. [PMID: 36607141 PMCID: PMC10323048 DOI: 10.1021/acs.jctc.2c01024] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/07/2023]
Abstract
The past few years have witnessed significant advances in developing machine learning methods for molecular energetics predictions, including calculated electronic energies with high-level quantum mechanical methods and experimental properties, such as solvation free energy and logP. Typically, task-specific machine learning models are developed for distinct prediction tasks. In this work, we present a multitask deep ensemble model, sPhysNet-MT-ens5, which can simultaneously and accurately predict electronic energies of molecules in gas, water, and octanol phases, as well as transfer free energies at both calculated and experimental levels. On the calculated data set Frag20-solv-678k, which is developed in this work and contains 678,916 molecular conformations, up to 20 heavy atoms, and their properties calculated at B3LYP/6-31G* level of theory with continuum solvent models, sPhysNet-MT-ens5 predicts density functional theory (DFT)-level electronic energies directly from force field-optimized geometry within chemical accuracy. On the experimental data sets, sPhysNet-MT-ens5 achieves state-of-the-art performances, which predict both experimental hydration free energy with a RMSE of 0.620 kcal/mol on the FreeSolv data set and experimental logP with a RMSE of 0.393 on the PHYSPROP data set. Furthermore, sPhysNet-MT-ens5 also provides a reasonable estimation of model uncertainty which shows correlations with prediction error. Finally, by analyzing the atomic contributions of its predictions, we find that the developed deep learning model is aware of the chemical environment of each atom by assigning reasonable atomic contributions consistent with our chemical knowledge.
Affiliation(s)
- Song Xia
  - Department of Chemistry, New York University, New York, New York 10003, United States
- Dongdong Zhang
  - Department of Chemistry, New York University, New York, New York 10003, United States
- Yingkai Zhang
  - Department of Chemistry, New York University, New York, New York 10003, United States
  - Simons Center for Computational Physical Chemistry at New York University, New York, New York 10003, United States
  - NYU-ECNU Center for Computational Chemistry at NYU Shanghai, Shanghai 200062, China
|
19
|
Alygizakis N, Giannakopoulos T, Thomaidis NS, Slobodnik J. Detecting the sources of chemicals in the Black Sea using non-target screening and deep learning convolutional neural networks. Sci Total Environ 2022; 847:157554. [PMID: 35878861 DOI: 10.1016/j.scitotenv.2022.157554] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 04/21/2022] [Revised: 07/17/2022] [Accepted: 07/18/2022] [Indexed: 06/15/2023]
Abstract
The Black Sea is an important ecosystem, which is affected by various anthropogenic pressures, such as shipping activities and wastewater inputs from large coastal cities. Significant loads of chemical pollutants are being continuously brought in by major European rivers. This study investigated the spatial distribution of chemicals in the Ukrainian shelf (the northwestern part of the Black Sea) and their main sources. Chemical occurrence data used in the study was generated within the Joint Black Sea Surveys (JBSS), which took place in 2016 and 2017 as a part of the EU/UNDP EMBLAS II project (www.emblasproject.org). During the JBSS, seawater samples were analyzed by a non-target screening workflow using liquid chromatography high-resolution mass spectrometry (LC-HRMS). Open-source algorithms were applied to generate a combined dataset of 30,489 detected chemical signals and their intensities. Out of these, 35 compounds were tentatively identified by the application of a non-target screening identification workflow based on automated matching of their mass spectra against those in available mass spectral libraries. The dataset was used to generate images, representing spatial distribution of each of the signals. These images were then used as an input to a deep learning convolutional neural network classification model. The study resulted in the development of an open-source end-to-end workflow for the estimation of the pollution load by chemicals contributed by the two major inflowing rivers (Danube and Dnieper) and other, so far unidentified, sources. A dedicated dashboard was built to facilitate data visualization per detected signal/compound. The presented model proved to be especially useful at the prioritization of signals of unknown compounds, which is of key importance for the follow up structure elucidation efforts of bulky non-target screening data. The deep learning approach for peak prioritization of unknown chemicals in the environment has been used for the first time.
Affiliation(s)
- Nikiforos Alygizakis
  - Laboratory of Analytical Chemistry, Department of Chemistry, University of Athens, Panepistimiopolis Zografou, 15771 Athens, Greece; Environmental Institute, Okružná 784/42, 97241 Koš, Slovak Republic
- Nikolaos S Thomaidis
  - Laboratory of Analytical Chemistry, Department of Chemistry, University of Athens, Panepistimiopolis Zografou, 15771 Athens, Greece
|
20
|
Rahman ASMZ, Liu C, Sturm H, Hogan AM, Davis R, Hu P, Cardona ST. A machine learning model trained on a high-throughput antibacterial screen increases the hit rate of drug discovery. PLoS Comput Biol 2022; 18:e1010613. [PMID: 36228001 PMCID: PMC9624395 DOI: 10.1371/journal.pcbi.1010613] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 05/10/2022] [Revised: 11/01/2022] [Accepted: 09/26/2022] [Indexed: 01/24/2023] Open
Abstract
Screening for novel antibacterial compounds in small molecule libraries has a low success rate. We applied machine learning (ML)-based virtual screening for antibacterial activity and evaluated its predictive power by experimental validation. We first binarized 29,537 compounds according to their growth inhibitory activity (hit rate 0.87%) against the antibiotic-resistant bacterium Burkholderia cenocepacia and described their molecular features with a directed-message passing neural network (D-MPNN). Then, we used the data to train an ML model that achieved a receiver operating characteristic (ROC) score of 0.823 on the test set. Finally, we predicted antibacterial activity in virtual libraries corresponding to 1,614 compounds from the Food and Drug Administration (FDA)-approved list and 224,205 natural products. Hit rates of 26% and 12%, respectively, were obtained when we tested the top-ranked predicted compounds for growth inhibitory activity against B. cenocepacia, which represents at least a 14-fold increase from the previous hit rate. In addition, more than 51% of the predicted antibacterial natural compounds inhibited ESKAPE pathogens showing that predictions expand beyond the organism-specific dataset to a broad range of bacteria. Overall, the developed ML approach can be used for compound prioritization before screening, increasing the typical hit rate of drug discovery.
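The fold increase quoted in the abstract follows from dividing each new hit rate by the baseline hit rate. The short check below uses only the rounded percentages given above, so the ratios are approximate; the variable names are illustrative.

```python
baseline_hit_rate = 0.87  # % of the 29,537 screened compounds
new_hit_rates = {"FDA-approved list": 26.0, "natural products": 12.0}  # %

# Ratio of each new hit rate to the original screening hit rate.
fold_increase = {
    library: rate / baseline_hit_rate for library, rate in new_hit_rates.items()
}

for library, fold in fold_increase.items():
    print(f"{library}: about {fold:.1f}-fold over baseline")
```

With these rounded inputs the smaller ratio comes out just under 14, which matches the abstract's figure after rounding.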
Affiliation(s)
- Chengyou Liu
  - Department of Electrical and Computer Engineering, University of Manitoba, Winnipeg, Manitoba, Canada
- Hunter Sturm
  - Department of Chemistry, University of Manitoba, Winnipeg, Manitoba, Canada
- Andrew M. Hogan
  - Department of Microbiology, University of Manitoba, Winnipeg, Manitoba, Canada
- Rebecca Davis
  - Department of Chemistry, University of Manitoba, Winnipeg, Manitoba, Canada
- Pingzhao Hu
  - Department of Electrical and Computer Engineering, University of Manitoba, Winnipeg, Manitoba, Canada
  - Department of Computer Science, University of Manitoba, Winnipeg, Manitoba, Canada
  - Department of Biochemistry and Medical Genetics, University of Manitoba, Winnipeg, Manitoba, Canada
- Silvia T. Cardona
  - Department of Microbiology, University of Manitoba, Winnipeg, Manitoba, Canada
  - Department of Medical Microbiology & Infectious Diseases, University of Manitoba, Winnipeg, Canada
|
21
|
Lim S, Lee S, Piao Y, Choi M, Bang D, Gu J, Kim S. On modeling and utilizing chemical compound information with deep learning technologies: A task-oriented approach. Comput Struct Biotechnol J 2022; 20:4288-4304. [PMID: 36051875 PMCID: PMC9399946 DOI: 10.1016/j.csbj.2022.07.049] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/02/2022] [Revised: 07/29/2022] [Accepted: 07/29/2022] [Indexed: 11/22/2022] Open
Abstract
A large number of chemical compounds are available in databases such as PubChem and ZINC. However, currently known compounds, though large, represent only a fraction of possible compounds, which is known as chemical space. Many of these compounds in the databases are annotated with properties and assay data that can be used for drug discovery efforts. For this goal, a number of machine learning algorithms have been developed and recent deep learning technologies can be effectively used to navigate chemical space, especially for unknown chemical compounds, in terms of drug-related tasks. In this article, we survey how deep learning technologies can model and utilize chemical compound information in a task-oriented way by exploiting annotated properties and assay data in the chemical compounds databases. We first compile what kind of tasks are trying to be accomplished by machine learning methods. Then, we survey deep learning technologies to show their modeling power and current applications for accomplishing drug related tasks. Next, we survey deep learning techniques to address the insufficiency issue of annotated data for more effective navigation of chemical space. Chemical compound information alone may not be powerful enough for drug related tasks, thus we survey what kind of information, such as assay and gene expression data, can be used to improve the prediction power of deep learning models. Finally, we conclude this survey with four important newly developed technologies that are yet to be fully incorporated into computational analysis of chemical information.
Affiliation(s)
- Sangsoo Lim
  - Bioinformatics Institute, Seoul National University, Gwanak-ro 1, Gwanak-gu, Seoul 08826, South Korea
- Sangseon Lee
  - Institute of Computer Technology, Seoul National University, Gwanak-ro 1, Gwanak-gu, Seoul 08826, South Korea
- Yinhua Piao
  - Department of Computer Science and Engineering, Seoul National University, Gwanak-ro 1, Gwanak-gu, Seoul 08826, South Korea
- MinGyu Choi
  - Department of Chemistry, Seoul National University, Gwanak-ro 1, Gwanak-gu, Seoul 08826, South Korea
  - AIGENDRUG Co., Ltd., Gwanak-ro 1, Gwanak-gu, Seoul 08826, South Korea
- Dongmin Bang
  - Interdisciplinary Program in Bioinformatics, Seoul National University, Gwanak-ro 1, Gwanak-gu, Seoul 08826, South Korea
- Jeonghyeon Gu
  - Interdisciplinary Program in Artificial Intelligence, Seoul National University, Gwanak-ro 1, Gwanak-gu, Seoul 08826, South Korea
- Sun Kim
  - Department of Computer Science and Engineering, Seoul National University, Gwanak-ro 1, Gwanak-gu, Seoul 08826, South Korea
  - Interdisciplinary Program in Artificial Intelligence, Seoul National University, Gwanak-ro 1, Gwanak-gu, Seoul 08826, South Korea
  - MOGAM Institute for Biomedical Research, Yong-in 16924, South Korea
  - AIGENDRUG Co., Ltd., Gwanak-ro 1, Gwanak-gu, Seoul 08826, South Korea
|
22
|
Singh K, Münchmeyer J, Weber L, Leser U, Bande A. Graph Neural Networks for Learning Molecular Excitation Spectra. J Chem Theory Comput 2022; 18:4408-4417. [PMID: 35671364 DOI: 10.1021/acs.jctc.2c00255] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Indexed: 11/29/2022]
Abstract
Machine learning (ML) approaches have demonstrated the ability to predict molecular spectra at a fraction of the computational cost of traditional theoretical chemistry methods while maintaining high accuracy. Graph neural networks (GNNs) are particularly promising in this regard, but different types of GNNs have not yet been systematically compared. In this work, we benchmark and analyze five different GNNs for the prediction of excitation spectra from the QM9 dataset of organic molecules. We compare the GNN performance in the obvious runtime measurements, prediction accuracy, and analysis of outliers in the test set. Moreover, through TMAP clustering and statistical analysis, we are able to highlight clear hotspots of high prediction errors as well as optimal spectra prediction for molecules with certain functional groups. This in-depth benchmarking and subsequent analysis protocol lays down a recipe for comparing different ML methods and evaluating dataset quality.
Affiliation(s)
- Kanishka Singh
  - Helmholtz-Zentrum Berlin für Materialien und Energie GmbH, Hahn-Meitner-Platz 1, Berlin 10409, Germany
  - Institute of Chemistry and Biochemistry, Freie Universität Berlin, Arnimallee 22, Berlin 14195, Germany
- Jannes Münchmeyer
  - Deutsches GeoForschungsZentrum GFZ, Telegrafenberg, 14473 Potsdam, Germany
  - Humboldt-Universität zu Berlin, Unter den Linden 6, 10117 Berlin, Germany
- Leon Weber
  - Humboldt-Universität zu Berlin, Unter den Linden 6, 10117 Berlin, Germany
  - Max Delbrück Center for Molecular Medicine in the Helmholtz Association, Robert-Rössle-Straße 10, Berlin 13125, Germany
- Ulf Leser
  - Humboldt-Universität zu Berlin, Unter den Linden 6, 10117 Berlin, Germany
- Annika Bande
  - Helmholtz-Zentrum Berlin für Materialien und Energie GmbH, Hahn-Meitner-Platz 1, Berlin 10409, Germany
|
23
|
Isert C, Atz K, Jiménez-Luna J, Schneider G. QMugs, quantum mechanical properties of drug-like molecules. Sci Data 2022; 9:273. [PMID: 35672335 PMCID: PMC9174255 DOI: 10.1038/s41597-022-01390-7] [Citation(s) in RCA: 39] [Impact Index Per Article: 19.5] [Received: 08/15/2021] [Accepted: 05/17/2022] [Indexed: 12/16/2022] Open
Abstract
Machine learning approaches in drug discovery, as well as in other areas of the chemical sciences, benefit from curated datasets of physical molecular properties. However, there currently is a lack of data collections featuring large bioactive molecules alongside first-principle quantum chemical information. The open-access QMugs (Quantum-Mechanical Properties of Drug-like Molecules) dataset fills this void. The QMugs collection comprises quantum mechanical properties of more than 665 k biologically and pharmacologically relevant molecules extracted from the ChEMBL database, totaling ~2 M conformers. QMugs contains optimized molecular geometries and thermodynamic data obtained via the semi-empirical method GFN2-xTB. Atomic and molecular properties are provided on both the GFN2-xTB and on the density-functional levels of theory (DFT, ωB97X-D/def2-SVP). QMugs features molecules of significantly larger size than previously-reported collections and comprises their respective quantum mechanical wave functions, including DFT density and orbital matrices. This dataset is intended to facilitate the development of models that learn from molecular data on different levels of theory while also providing insight into the corresponding relationships between molecular structure and biological activity.
Affiliation(s)
- Clemens Isert
  - Department of Chemistry and Applied Biosciences, RETHINK, ETH Zurich, 8093, Zurich, Switzerland
- Kenneth Atz
  - Department of Chemistry and Applied Biosciences, RETHINK, ETH Zurich, 8093, Zurich, Switzerland
- José Jiménez-Luna
  - Department of Chemistry and Applied Biosciences, RETHINK, ETH Zurich, 8093, Zurich, Switzerland
  - Department of Medicinal Chemistry, Boehringer Ingelheim Pharma GmbH & Co. KG, Birkendorfer Straße 65, 88397, Biberach an der Riss, Germany
- Gisbert Schneider
  - Department of Chemistry and Applied Biosciences, RETHINK, ETH Zurich, 8093, Zurich, Switzerland
  - ETH Singapore SEC Ltd, 1 CREATE Way, #06-01 CREATE Tower, Singapore, 138602, Singapore
|
24
|
Panapitiya G, Girard M, Hollas A, Sepulveda J, Murugesan V, Wang W, Saldanha E. Evaluation of Deep Learning Architectures for Aqueous Solubility Prediction. ACS Omega 2022; 7:15695-15710. [PMID: 35571767 PMCID: PMC9096921 DOI: 10.1021/acsomega.2c00642] [Citation(s) in RCA: 22] [Impact Index Per Article: 11.0] [Received: 01/31/2022] [Accepted: 04/11/2022] [Indexed: 05/17/2023]
Abstract
Determining the aqueous solubility of molecules is a vital step in many pharmaceutical, environmental, and energy storage applications. Despite efforts made over decades, there are still challenges associated with developing a solubility prediction model with satisfactory accuracy for many of these applications. The goals of this study are to assess current deep learning methods for solubility prediction, develop a general model capable of predicting the solubility of a broad range of organic molecules, and to understand the impact of data properties, molecular representation, and modeling architecture on predictive performance. Using the largest currently available solubility data set, we implement deep learning-based models to predict solubility from the molecular structure and explore several different molecular representations including molecular descriptors, simplified molecular-input line-entry system strings, molecular graphs, and three-dimensional atomic coordinates using four different neural network architectures-fully connected neural networks, recurrent neural networks, graph neural networks (GNNs), and SchNet. We find that models using molecular descriptors achieve the best performance, with GNN models also achieving good performance. We perform extensive error analysis to understand the molecular properties that influence model performance, perform feature analysis to understand which information about the molecular structure is most valuable for prediction, and perform a transfer learning and data size study to understand the impact of data availability on model performance.
Affiliation(s)
- Gihan Panapitiya
  - Pacific Northwest National Laboratory, Richland, Washington 99352, United States
- Michael Girard
  - Pacific Northwest National Laboratory, Richland, Washington 99352, United States
- Aaron Hollas
  - Pacific Northwest National Laboratory, Richland, Washington 99352, United States
- Jonathan Sepulveda
  - Pacific Northwest National Laboratory, Richland, Washington 99352, United States
- Wei Wang
  - Pacific Northwest National Laboratory, Richland, Washington 99352, United States
- Emily Saldanha
  - Pacific Northwest National Laboratory, Richland, Washington 99352, United States
|
25
|
Autonomous design of new chemical reactions using a variational autoencoder. Commun Chem 2022; 5:40. [PMID: 36697652 PMCID: PMC9814385 DOI: 10.1038/s42004-022-00647-x] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 07/28/2021] [Accepted: 02/16/2022] [Indexed: 01/28/2023] Open
Abstract
Artificial intelligence based chemistry models are a promising method of exploring chemical reaction design spaces. However, training datasets based on experimental synthesis are typically reported only for the optimal synthesis reactions. This leads to an inherited bias in the model predictions. Therefore, robust datasets that span the entirety of the solution space are necessary to remove inherited bias and permit complete training of the space. In this study, an artificial intelligence model based on a Variational AutoEncoder (VAE) has been developed and investigated to synthetically generate continuous datasets. The approach involves sampling the latent space to generate new chemical reactions. This developed technique is demonstrated by generating over 7,000,000 new reactions from a training dataset containing only 7,000 reactions. The generated reactions include molecular species that are larger and more diverse than the training set.
|
26
|
Jacobson LD, Stevenson JM, Ramezanghorbani F, Ghoreishi D, Leswing K, Harder ED, Abel R. Transferable Neural Network Potential Energy Surfaces for Closed-Shell Organic Molecules: Extension to Ions. J Chem Theory Comput 2022; 18:2354-2366. [PMID: 35290063 DOI: 10.1021/acs.jctc.1c00821] [Citation(s) in RCA: 17] [Impact Index Per Article: 8.5] [Indexed: 12/30/2022]
Abstract
Transferable high dimensional neural network potentials (HDNNPs) have shown great promise as an avenue to increase the accuracy and domain of applicability of existing atomistic force fields for organic systems relevant to life science. We have previously reported such a potential (Schrödinger-ANI) that has broad coverage of druglike molecules. We extend that work here to cover ionic and zwitterionic druglike molecules expected to be relevant to drug discovery research activities. We report a novel HDNNP architecture, which we call QRNN, that predicts atomic charges and uses these charges as descriptors in an energy model that delivers conformational energies within chemical accuracy when measured against the reference theory it is trained to. Further, we find that delta learning based on a semiempirical level of theory approximately halves the errors. We test the models on torsion energy profiles, relative conformational energies, geometric parameters, and relative tautomer errors.
Affiliation(s)
- Leif D Jacobson
- Schrödinger Inc., 1540 Broadway, 24th floor, New York, New York 10036, United States
| | - James M Stevenson
- Schrödinger Inc., 1540 Broadway, 24th floor, New York, New York 10036, United States
| | | | - Delaram Ghoreishi
- Schrödinger Inc., 1540 Broadway, 24th floor, New York, New York 10036, United States
| | - Karl Leswing
- Schrödinger Inc., 1540 Broadway, 24th floor, New York, New York 10036, United States
| | - Edward D Harder
- Schrödinger Inc., 1540 Broadway, 24th floor, New York, New York 10036, United States
| | - Robert Abel
- Schrödinger Inc., 1540 Broadway, 24th floor, New York, New York 10036, United States
|
27
|
Gebauer NWA, Gastegger M, Hessmann SSP, Müller KR, Schütt KT. Inverse design of 3d molecular structures with conditional generative neural networks. Nat Commun 2022; 13:973. [PMID: 35190542 PMCID: PMC8861047 DOI: 10.1038/s41467-022-28526-y] [Citation(s) in RCA: 38] [Impact Index Per Article: 19.0] [Received: 09/10/2021] [Accepted: 01/28/2022] [Indexed: 11/09/2022] Open
Abstract
The rational design of molecules with desired properties is a long-standing challenge in chemistry. Generative neural networks have emerged as a powerful approach to sample novel molecules from a learned distribution. Here, we propose a conditional generative neural network for 3d molecular structures with specified chemical and structural properties. This approach is agnostic to chemical bonding and enables targeted sampling of novel molecules from conditional distributions, even in domains where reference calculations are sparse. We demonstrate the utility of our method for inverse design by generating molecules with specified motifs or composition, discovering particularly stable molecules, and jointly targeting multiple electronic properties beyond the training regime.
Affiliation(s)
- Niklas W A Gebauer
- Machine Learning Group, Technische Universität Berlin, 10587, Berlin, Germany.
- Berlin Institute for the Foundations of Learning and Data, 10587, Berlin, Germany.
- BASLEARN-TU Berlin/BASF Joint Lab for Machine Learning, Technische Universität Berlin, 10587, Berlin, Germany.
| | - Michael Gastegger
- Machine Learning Group, Technische Universität Berlin, 10587, Berlin, Germany
- BASLEARN-TU Berlin/BASF Joint Lab for Machine Learning, Technische Universität Berlin, 10587, Berlin, Germany
| | - Stefaan S P Hessmann
- Machine Learning Group, Technische Universität Berlin, 10587, Berlin, Germany
- Berlin Institute for the Foundations of Learning and Data, 10587, Berlin, Germany
| | - Klaus-Robert Müller
- Machine Learning Group, Technische Universität Berlin, 10587, Berlin, Germany
- Berlin Institute for the Foundations of Learning and Data, 10587, Berlin, Germany
- Department of Artificial Intelligence, Korea University, Anam-dong, Seongbuk-gu, Seoul, 02841, Korea
- Max-Planck-Institut für Informatik, 66123, Saarbrücken, Germany
| | - Kristof T Schütt
- Machine Learning Group, Technische Universität Berlin, 10587, Berlin, Germany.
- Berlin Institute for the Foundations of Learning and Data, 10587, Berlin, Germany.
|
28
|
Shi X, Lin X, Luo R, Wu S, Li L, Zhao ZJ, Gong J. Dynamics of Heterogeneous Catalytic Processes at Operando Conditions. JACS Au 2021; 1:2100-2120. [PMID: 34977883 PMCID: PMC8715484 DOI: 10.1021/jacsau.1c00355] [Citation(s) in RCA: 19] [Impact Index Per Article: 6.3] [Received: 08/17/2021] [Indexed: 05/02/2023]
Abstract
The rational design of high-performance catalysts is hindered by the lack of knowledge of the structures of active sites and the reaction pathways under reaction conditions, which can be ideally addressed by an in situ/operando characterization. Besides the experimental insights, a theoretical investigation that simulates reaction conditions (so-called operando modeling) is necessary for a plausible understanding of a working catalyst system at the atomic scale. However, there is still a huge gap between the current widely used computational model and the concept of operando modeling, which should be achieved through multiscale computational modeling. This Perspective describes various modeling approaches and machine learning techniques that step toward operando modeling, followed by selected experimental examples that present an operando understanding in the thermo- and electrocatalytic processes. Finally, the remaining challenges in this area are outlined.
Affiliation(s)
- Xiangcheng Shi
- Key Laboratory for Green Chemical Technology of Ministry of Education, School of Chemical Engineering and Technology, Tianjin University, Tianjin 300072, China
- Collaborative Innovation Center of Chemical Science and Engineering, Tianjin 300072, China
- Joint School of National University of Singapore and Tianjin University, International Campus of Tianjin University, Fuzhou 350207, China
| | - Xiaoyun Lin
- Key Laboratory for Green Chemical Technology of Ministry of Education, School of Chemical Engineering and Technology, Tianjin University, Tianjin 300072, China
- Collaborative Innovation Center of Chemical Science and Engineering, Tianjin 300072, China
| | - Ran Luo
- Key Laboratory for Green Chemical Technology of Ministry of Education, School of Chemical Engineering and Technology, Tianjin University, Tianjin 300072, China
- Collaborative Innovation Center of Chemical Science and Engineering, Tianjin 300072, China
| | - Shican Wu
- Key Laboratory for Green Chemical Technology of Ministry of Education, School of Chemical Engineering and Technology, Tianjin University, Tianjin 300072, China
- Collaborative Innovation Center of Chemical Science and Engineering, Tianjin 300072, China
| | - Lulu Li
- Key Laboratory for Green Chemical Technology of Ministry of Education, School of Chemical Engineering and Technology, Tianjin University, Tianjin 300072, China
- Collaborative Innovation Center of Chemical Science and Engineering, Tianjin 300072, China
| | - Zhi-Jian Zhao
- Key Laboratory for Green Chemical Technology of Ministry of Education, School of Chemical Engineering and Technology, Tianjin University, Tianjin 300072, China
- Collaborative Innovation Center of Chemical Science and Engineering, Tianjin 300072, China
| | - Jinlong Gong
- Key Laboratory for Green Chemical Technology of Ministry of Education, School of Chemical Engineering and Technology, Tianjin University, Tianjin 300072, China
- Collaborative Innovation Center of Chemical Science and Engineering, Tianjin 300072, China
- Joint School of National University of Singapore and Tianjin University, International Campus of Tianjin University, Fuzhou 350207, China
|
29
|
Busk J, Bjørn Jørgensen P, Bhowmik A, Schmidt MN, Winther O, Vegge T. Calibrated uncertainty for molecular property prediction using ensembles of message passing neural networks. Machine Learning: Science and Technology 2021. [DOI: 10.1088/2632-2153/ac3eb3] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Indexed: 12/13/2022] Open
Abstract
Data-driven methods based on machine learning have the potential to accelerate computational analysis of atomic structures. In this context, reliable uncertainty estimates are important for assessing confidence in predictions and enabling decision making. However, machine learning models can produce badly calibrated uncertainty estimates and it is therefore crucial to detect and handle uncertainty carefully. In this work we extend a message passing neural network designed specifically for predicting properties of molecules and materials with a calibrated probabilistic predictive distribution. The method presented in this paper differs from previous work by considering both aleatoric and epistemic uncertainty in a unified framework, and by recalibrating the predictive distribution on unseen data. Through computer experiments, we show that our approach results in accurate models for predicting molecular formation energies with well calibrated uncertainty in and out of the training data distribution on two public molecular benchmark datasets, QM9 and PC9. The proposed method provides a general framework for training and evaluating neural network ensemble models that are able to produce accurate predictions of properties of molecules with well calibrated uncertainty estimates.
|
30
|
Meng F, Xi Y, Huang J, Ayers PW. A curated diverse molecular database of blood-brain barrier permeability with chemical descriptors. Sci Data 2021; 8:289. [PMID: 34716354 PMCID: PMC8556334 DOI: 10.1038/s41597-021-01069-5] [Citation(s) in RCA: 37] [Impact Index Per Article: 12.3] [Received: 07/28/2021] [Accepted: 09/22/2021] [Indexed: 01/31/2023] Open
Abstract
The highly-selective blood-brain barrier (BBB) prevents neurotoxic substances in blood from crossing into the extracellular fluid of the central nervous system (CNS). As such, the BBB has a close relationship with CNS disease development and treatment, so predicting whether a substance crosses the BBB is a key task in lead discovery for CNS drugs. Machine learning (ML) is a promising strategy for predicting the BBB permeability, but existing studies have been limited by small datasets with limited chemical diversity. To mitigate this issue, we present a large benchmark dataset, B3DB, compiled from 50 published resources and categorized based on experimental uncertainty. A subset of the molecules in B3DB has numerical log BB values (1058 compounds), while the whole dataset has categorical (BBB+ or BBB-) BBB permeability labels (7807). The dataset is freely available at https://github.com/theochem/B3DB and https://doi.org/10.6084/m9.figshare.15634230.v3 (version 3). We also provide some physicochemical properties of the molecules. By analyzing these properties, we can demonstrate some physicochemical similarities and differences between BBB+ and BBB- compounds.
Affiliation(s)
- Fanwang Meng
- Department of Chemistry and Chemical Biology, McMaster University, Hamilton, L8S 4L8 Canada
| | - Yang Xi
- Department of Chemistry and Chemical Biology, McMaster University, Hamilton, L8S 4L8 Canada
| | - Jinfeng Huang
- Department of Chemistry and Chemical Biology, McMaster University, Hamilton, L8S 4L8 Canada
| | - Paul W. Ayers
- Department of Chemistry and Chemical Biology, McMaster University, Hamilton, L8S 4L8 Canada
|
31
|
Leguy J, Glavatskikh M, Cauchy T, Da Mota B. Scalable estimator of the diversity for de novo molecular generation resulting in a more robust QM dataset (OD9) and a more efficient molecular optimization. J Cheminform 2021; 13:76. [PMID: 34600576 PMCID: PMC8487551 DOI: 10.1186/s13321-021-00554-8] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 06/18/2021] [Accepted: 09/15/2021] [Indexed: 01/21/2023] Open
Abstract
Chemical diversity is one of the key terms when dealing with machine learning and molecular generation. This is particularly true for quantum chemical datasets, whose composition should be done meticulously since the calculations are highly time-demanding. Previously we have seen that the best-known quantum chemical dataset, QM9, lacks chemical diversity. As a consequence, ML models trained on QM9 showed generalizability shortcomings. In this paper we present (i) a fast and generic method to evaluate chemical diversity, (ii) a new quantum chemical dataset of 435k molecules, OD9, that includes QM9 and new molecules generated with a diversity objective, (iii) an analysis of the diversity impact on unconstrained and goal-directed molecular generation on the example of QED optimization. Our innovative approach makes it possible to individually estimate the impact of a solution to the diversity of a set, allowing for effective incremental evaluation. In the first application, we will see how the diversity constraint allows us to generate more than a million molecules that would efficiently complete the reference datasets. The compounds were calculated with DFT thanks to a collaborative effort through the QuChemPedIA@home BOINC project. With regard to goal-directed molecular generation, getting a high QED score is not complicated, but adding a little diversity can cut the number of calls to the evaluation function by a factor of ten.
Affiliation(s)
- Jules Leguy
- Univ Angers, LERIA, SFR MATHSTIC, 49000, Angers, France
| | - Marta Glavatskikh
- Univ Angers, LERIA, SFR MATHSTIC, 49000, Angers, France
- Univ Angers, CNRS, MOLTECH-ANJOU, SFR MATRIX, 49000, Angers, France
| | - Thomas Cauchy
- Univ Angers, CNRS, MOLTECH-ANJOU, SFR MATRIX, 49000, Angers, France.
| | - Benoit Da Mota
- Univ Angers, LERIA, SFR MATHSTIC, 49000, Angers, France.
|
32
|
Sattari K, Xie Y, Lin J. Data-driven algorithms for inverse design of polymers. Soft Matter 2021; 17:7607-7622. [PMID: 34397078 DOI: 10.1039/d1sm00725d] [Citation(s) in RCA: 22] [Impact Index Per Article: 7.3] [Indexed: 06/13/2023]
Abstract
The ever-increasing demand for novel polymers with superior properties requires a deeper understanding and exploration of the chemical space. Recently, data-driven approaches to explore the chemical space for polymer design have emerged. Among them, inverse design strategies for designing polymers with specific properties have evolved to be a significant materials informatics platform by learning hidden knowledge from materials data as well as smartly navigating the chemical space in an optimized way. In this review, we first summarize the progress in the representation of polymers, a prerequisite step for the inverse design of polymers. Then, we systematically introduce three data-driven strategies implemented for the inverse design of polymers, i.e., high-throughput virtual screening, global optimization, and generative models. Finally, we discuss the challenges and opportunities of the data-driven strategies as well as optimization algorithms employed in the inverse design of polymers.
Affiliation(s)
- Kianoosh Sattari
- Department of Mechanical and Aerospace Engineering, University of Missouri, Columbia, MO 65211, USA.
|
33
|
Westermayr J, Marquetand P. Machine Learning for Electronically Excited States of Molecules. Chem Rev 2021; 121:9873-9926. [PMID: 33211478 PMCID: PMC8391943 DOI: 10.1021/acs.chemrev.0c00749] [Citation(s) in RCA: 171] [Impact Index Per Article: 57.0] [Received: 07/17/2020] [Indexed: 12/11/2022]
Abstract
Electronically excited states of molecules are at the heart of photochemistry, photophysics, as well as photobiology and also play a role in material science. Their theoretical description requires highly accurate quantum chemical calculations, which are computationally expensive. In this review, we focus on not only how machine learning is employed to speed up such excited-state simulations but also how this branch of artificial intelligence can be used to advance this exciting research field in all its aspects. Discussed applications of machine learning for excited states include excited-state dynamics simulations, static calculations of absorption spectra, as well as many others. In order to put these studies into context, we discuss the promises and pitfalls of the involved machine learning techniques. Since the latter are mostly based on quantum chemistry calculations, we also provide a short introduction into excited-state electronic structure methods and approaches for nonadiabatic dynamics simulations and describe tricks and problems when using them in machine learning for excited states of molecules.
Affiliation(s)
- Julia Westermayr
- Institute of Theoretical Chemistry, Faculty of Chemistry, University of Vienna, Währinger Strasse 17, 1090 Vienna, Austria
| | - Philipp Marquetand
- Institute of Theoretical Chemistry, Faculty of Chemistry, University of Vienna, Währinger Strasse 17, 1090 Vienna, Austria
- Vienna Research Platform on Accelerating Photoreaction Discovery, University of Vienna, Währinger Strasse 17, 1090 Vienna, Austria
- Data Science @ Uni Vienna, University of Vienna, Währinger Strasse 29, 1090 Vienna, Austria
|
34
|
Abstract
Chemical compound space (CCS), the set of all theoretically conceivable combinations of chemical elements and (meta-)stable geometries that make up matter, is colossal. The first-principles based virtual sampling of this space, for example, in search of novel molecules or materials which exhibit desirable properties, is therefore prohibitive for all but the smallest subsets and simplest properties. We review studies aimed at tackling this challenge using modern machine learning techniques based on (i) synthetic data, typically generated using quantum mechanics based methods, and (ii) model architectures inspired by quantum mechanics. Such Quantum mechanics based Machine Learning (QML) approaches combine the numerical efficiency of statistical surrogate models with an ab initio view on matter. They rigorously reflect the underlying physics in order to reach universality and transferability across CCS. While state-of-the-art approximations to quantum problems impose severe computational bottlenecks, recent QML based developments indicate the possibility of substantial acceleration without sacrificing the predictive power of quantum mechanics.
Affiliation(s)
- Bing Huang
- Faculty of Physics, University of Vienna, 1090 Vienna, Austria
| | - O. Anatole von Lilienfeld
- Faculty of Physics, University of Vienna, 1090 Vienna, Austria
- Institute of Physical Chemistry and National Center for Computational Design and Discovery of Novel Materials (MARVEL), Department of Chemistry, University of Basel, 4056 Basel, Switzerland
|
35
|
Abstract
Electronically excited states of molecules are at the heart of photochemistry, photophysics, as well as photobiology and also play a role in material science. Their theoretical description requires highly accurate quantum chemical calculations, which are computationally expensive. In this review, we focus on not only how machine learning is employed to speed up such excited-state simulations but also how this branch of artificial intelligence can be used to advance this exciting research field in all its aspects. Discussed applications of machine learning for excited states include excited-state dynamics simulations, static calculations of absorption spectra, as well as many others. In order to put these studies into context, we discuss the promises and pitfalls of the involved machine learning techniques. Since the latter are mostly based on quantum chemistry calculations, we also provide a short introduction into excited-state electronic structure methods and approaches for nonadiabatic dynamics simulations and describe tricks and problems when using them in machine learning for excited states of molecules.
Affiliation(s)
- Julia Westermayr
- Institute of Theoretical Chemistry, Faculty of Chemistry, University of Vienna, Währinger Strasse 17, 1090 Vienna, Austria
| | - Philipp Marquetand
- Institute of Theoretical Chemistry, Faculty of Chemistry, University of Vienna, Währinger Strasse 17, 1090 Vienna, Austria
- Vienna Research Platform on Accelerating Photoreaction Discovery, University of Vienna, Währinger Strasse 17, 1090 Vienna, Austria
- Data Science @ Uni Vienna, University of Vienna, Währinger Strasse 29, 1090 Vienna, Austria
|
36
|
Kerner J, Dogan A, von Recum H. Machine learning and big data provide crucial insight for future biomaterials discovery and research. Acta Biomater 2021; 130:54-65. [PMID: 34087445 DOI: 10.1016/j.actbio.2021.05.053] [Citation(s) in RCA: 19] [Impact Index Per Article: 6.3] [Received: 11/30/2020] [Revised: 05/24/2021] [Accepted: 05/25/2021] [Indexed: 02/06/2023]
Abstract
Machine learning has been widely adopted in a variety of fields including engineering, science, and medicine, revolutionizing how data is collected, used, and stored. Its implementation has led to a drastic increase in the number of computational models for the prediction of various numerical, categorical, or association events given input variables. We aim to examine recent advances in the use of machine learning when applied to the biomaterial field. Specifically, quantitative structure-property relationships offer the unique ability to correlate microscale molecular descriptors to larger macroscale material properties. These new models can be broken down further into four categories: regression, classification, association, and clustering. We examine recent approaches and new uses of machine learning in the three major categories of biomaterials: metals, polymers, and ceramics for rapid property prediction and trend identification. While current research is promising, limitations in the form of lack of standardized reporting and available databases complicate the implementation of described models. Herein, we hope to provide a snapshot of the current state of the field and a beginner's guide to navigating the intersection of biomaterials research and machine learning. STATEMENT OF SIGNIFICANCE: Machine learning and its methods have found a variety of uses beyond the field of computer science but have largely been neglected by those in the realm of biomaterials. Through the use of more computational methods, biomaterials development can be expedited while reducing the need for standard trial and error methods. Within, we introduce four basic models that readers can potentially apply to their current research as well as current applications within the field. Furthermore, we hope that this article may act as a "call to action" for readers to realize and address the current lack of implementation within the biomaterials field.
Affiliation(s)
- Jacob Kerner
- Case Western Reserve University; 10900 Euclid Ave., Cleveland Ohio 44106.
| | - Alan Dogan
- Case Western Reserve University; 10900 Euclid Ave., Cleveland Ohio 44106.
| | - Horst von Recum
- Case Western Reserve University; 10900 Euclid Ave., Cleveland Ohio 44106.
|
37
|
Gawriljuk VO, Foil DH, Puhl AC, Zorn KM, Lane TR, Riabova O, Makarov V, Godoy AS, Oliva G, Ekins S. Development of Machine Learning Models and the Discovery of a New Antiviral Compound against Yellow Fever Virus. J Chem Inf Model 2021; 61:3804-3813. [PMID: 34286575 DOI: 10.1021/acs.jcim.1c00460] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Indexed: 12/14/2022]
Abstract
Yellow fever (YF) is an acute viral hemorrhagic disease transmitted by infected mosquitoes. Large epidemics of YF occur when the virus is introduced into heavily populated areas with high mosquito density and low vaccination coverage. The lack of a specific small molecule drug treatment against YF as well as for homologous infections, such as zika and dengue, highlights the importance of these flaviviruses as a public health concern. With the advancement in computer hardware and bioactivity data availability, new tools based on machine learning methods have been introduced into drug discovery, as a means to utilize the growing high throughput screening (HTS) data generated to reduce costs and increase the speed of drug development. The use of predictive machine learning models using previously published data from HTS campaigns or data available in public databases, can enable the selection of compounds with desirable bioactivity and absorption, distribution, metabolism, and excretion profiles. In this study, we have collated cell-based assay data for yellow fever virus from the literature and public databases. The data were used to build predictive models with several machine learning methods that could prioritize compounds for in vitro testing. Five molecules were prioritized and tested in vitro from which we have identified a new pyrazolesulfonamide derivative with EC50 3.2 μM and CC50 24 μM, which represents a new scaffold suitable for hit-to-lead optimization that can expand the available drug discovery candidates for YF.
Affiliation(s)
- Victor O Gawriljuk
- São Carlos Institute of Physics, University of São Paulo, Av. João Dagnone, 1100 - Santa Angelina, São Carlos, São Paulo 13563-120, Brazil
| | - Daniel H Foil
- Collaborations Pharmaceuticals, Inc., 840 Main Campus Drive, Lab 3510, Raleigh, North Carolina 27606, United States
| | - Ana C Puhl
- Collaborations Pharmaceuticals, Inc., 840 Main Campus Drive, Lab 3510, Raleigh, North Carolina 27606, United States
| | - Kimberley M Zorn
- Collaborations Pharmaceuticals, Inc., 840 Main Campus Drive, Lab 3510, Raleigh, North Carolina 27606, United States
| | - Thomas R Lane
- Collaborations Pharmaceuticals, Inc., 840 Main Campus Drive, Lab 3510, Raleigh, North Carolina 27606, United States
| | - Olga Riabova
- Research Center of Biotechnology RAS, Leninsky Prospekt 33-2, 119071 Moscow, Russia
| | - Vadim Makarov
- Research Center of Biotechnology RAS, Leninsky Prospekt 33-2, 119071 Moscow, Russia
| | - Andre S Godoy
- São Carlos Institute of Physics, University of São Paulo, Av. João Dagnone, 1100 - Santa Angelina, São Carlos, São Paulo 13563-120, Brazil
| | - Glaucius Oliva
- São Carlos Institute of Physics, University of São Paulo, Av. João Dagnone, 1100 - Santa Angelina, São Carlos, São Paulo 13563-120, Brazil
| | - Sean Ekins
- Collaborations Pharmaceuticals, Inc., 840 Main Campus Drive, Lab 3510, Raleigh, North Carolina 27606, United States
|
38
|
Vazquez-Salazar LI, Boittier ED, Unke OT, Meuwly M. Impact of the Characteristics of Quantum Chemical Databases on Machine Learning Prediction of Tautomerization Energies. J Chem Theory Comput 2021; 17:4769-4785. [PMID: 34288675 DOI: 10.1021/acs.jctc.1c00363] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Indexed: 12/27/2022]
Abstract
An essential aspect for adequate predictions of chemical properties by machine learning models is the database used for training them. However, studies that analyze how the content and structure of the databases used for training impact the prediction quality are scarce. In this work, we analyze and quantify the relationships learned by a machine learning model (Neural Network) trained on five different reference databases (QM9, PC9, ANI-1E, ANI-1, and ANI-1x) to predict tautomerization energies from molecules in Tautobase. For this, the effects of characteristics such as the number of heavy atoms in a molecule, number of atoms of a given element, bond composition, or initial geometry on the quality of the predictions are considered. The results indicate that training on a chemically diverse database is crucial for obtaining good results and also that conformational sampling can partly compensate for limited coverage of chemical diversity. The overall best-performing reference database (ANI-1x) performs on average 1 kcal/mol better than PC9, which, however, contains about 2 orders of magnitude fewer reference structures. On the other hand, PC9 is chemically more diverse by a factor of ∼5 as quantified by the number of atom-in-molecule-based fragments (amons) it contains compared with the ANI family of databases. A quantitative measure for deficiencies is the Kullback-Leibler divergence between reference and target distributions. It is explicitly demonstrated that when certain types of bonds need to be covered in the target database (Tautobase) but are undersampled in the reference databases, the resulting predictions are poor. Examples of this include the poor performance of all databases analyzed to predict C(sp2)-C(sp2) double bonds close to heteroatoms and azoles containing N-N and N-O bonds.
Analysis of the results with a Tree MAP algorithm provides deeper understanding of specific deficiencies in predicting tautomerization energies by the reference datasets due to inadequate coverage of chemical space. Capitalizing on this information can be used to either improve existing databases or generate new databases of sufficient diversity for a range of machine learning (ML) applications in chemistry.
Affiliation(s)
| | - Eric D Boittier
- Department of Chemistry, University of Basel, Klingelbergstrasse 80, CH-4056 Basel, Switzerland
| | - Oliver T Unke
- Machine Learning Group, Technische Universität Berlin, 10587 Berlin, Germany
- DFG Cluster of Excellence "Unifying Systems in Catalysis" (UniSysCat), Technische Universität Berlin, 10623 Berlin, Germany
| | - Markus Meuwly
- Department of Chemistry, University of Basel, Klingelbergstrasse 80, CH-4056 Basel, Switzerland
- Department of Chemistry, Brown University, Providence, Rhode Island 02912, United States
|
39
|
Häse F, Aldeghi M, Hickman RJ, Roch LM, Christensen M, Liles E, Hein JE, Aspuru-Guzik A. Olympus: a benchmarking framework for noisy optimization and experiment planning. Machine Learning: Science and Technology 2021. [DOI: 10.1088/2632-2153/abedc8] [Citation(s) in RCA: 15] [Impact Index Per Article: 5.0] [Indexed: 12/21/2022] Open
Abstract
Research challenges encountered across science, engineering, and economics can frequently be formulated as optimization tasks. In chemistry and materials science, recent growth in laboratory digitization and automation has sparked interest in optimization-guided autonomous discovery and closed-loop experimentation. Experiment planning strategies based on off-the-shelf optimization algorithms can be employed in fully autonomous research platforms to achieve desired experimentation goals with the minimum number of trials. However, the experiment planning strategy that is most suitable to a scientific discovery task is a priori unknown while rigorous comparisons of different strategies are highly time and resource demanding. As optimization algorithms are typically benchmarked on low-dimensional synthetic functions, it is unclear how their performance would translate to noisy, higher-dimensional experimental tasks encountered in chemistry and materials science. We introduce Olympus, a software package that provides a consistent and easy-to-use framework for benchmarking optimization algorithms against realistic experiments emulated via probabilistic deep-learning models. Olympus includes a collection of experimentally derived benchmark sets from chemistry and materials science and a suite of experiment planning strategies that can be easily accessed via a user-friendly Python interface. Furthermore, Olympus facilitates the integration, testing, and sharing of custom algorithms and user-defined datasets. In brief, Olympus mitigates the barriers associated with benchmarking optimization algorithms on realistic experimental scenarios, promoting data sharing and the creation of a standard framework for evaluating the performance of experiment planning strategies.
|
40
|
Stuke A, Rinke P, Todorović M. Efficient hyperparameter tuning for kernel ridge regression with Bayesian optimization. Machine Learning: Science and Technology 2021. [DOI: 10.1088/2632-2153/abee59] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Indexed: 12/26/2022] Open
Abstract
Machine learning methods usually depend on internal parameters—so called hyperparameters—that need to be optimized for best performance. Such optimization poses a burden on machine learning practitioners, requiring expert knowledge, intuition or computationally demanding brute-force parameter searches. We here assess three different hyperparameter selection methods: grid search, random search and an efficient automated optimization technique based on Bayesian optimization (BO). We apply these methods to a machine learning problem based on kernel ridge regression in computational chemistry. Two different descriptors are employed to represent the atomic structure of organic molecules, one of which introduces its own set of hyperparameters to the method. We identify optimal hyperparameter configurations and infer entire prediction error landscapes in hyperparameter space that serve as visual guides for the hyperparameter performance. We further demonstrate that for an increasing number of hyperparameters, BO and random search become significantly more efficient in computational time than an exhaustive grid search, while delivering an equivalent or even better accuracy.
|
41
|
Lu J, Xia S, Lu J, Zhang Y. Dataset Construction to Explore Chemical Space with 3D Geometry and Deep Learning. J Chem Inf Model 2021; 61:1095-1104. [PMID: 33683885 DOI: 10.1021/acs.jcim.1c00007] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Indexed: 11/29/2022]
Abstract
A dataset is the basis of deep learning model development, and the success of deep learning models heavily relies on the quality and size of the dataset. In this work, we present a new data preparation protocol and build a large fragment-based dataset Frag20, which consists of optimized 3D geometries and calculated molecular properties from Merck molecular force field (MMFF) and DFT at the B3LYP/6-31G* level of theory for more than half a million molecules composed of H, B, C, O, N, F, P, S, Cl, and Br with no larger than 20 heavy atoms. Based on the new dataset, we develop robust molecular energy prediction models using a simplified PhysNet architecture for both DFT-optimized and MMFF-optimized geometries, which achieve better than or close to chemical accuracy (1 kcal/mol) on multiple test sets, including CSD20 and Plati20 based on experimental crystal structures.
Affiliation(s)
- Jianing Lu
- Department of Chemistry, New York University, New York, New York 10003, United States
- Song Xia
- Department of Chemistry, New York University, New York, New York 10003, United States
- Jieyu Lu
- Department of Chemistry, New York University, New York, New York 10003, United States
- Yingkai Zhang
- Department of Chemistry, New York University, New York, New York 10003, United States; NYU-ECNU Center for Computational Chemistry at NYU Shanghai, Shanghai 200062, China
|
42
|
Shen WX, Zeng X, Zhu F, Wang YL, Qin C, Tan Y, Jiang YY, Chen YZ. Out-of-the-box deep learning prediction of pharmaceutical properties by broadly learned knowledge-based molecular representations. Nat Mach Intell 2021. [DOI: 10.1038/s42256-021-00301-6] [Citation(s) in RCA: 26] [Impact Index Per Article: 8.7] [Indexed: 01/04/2023]
|
43
|
Koge D, Ono N, Huang M, Altaf‐Ul‐Amin M, Kanaya S. Embedding of Molecular Structure Using Molecular Hypergraph Variational Autoencoder with Metric Learning. Mol Inform 2021; 40:e2000203. [PMID: 33164295 PMCID: PMC7900996 DOI: 10.1002/minf.202000203] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 08/26/2020] [Accepted: 10/29/2020] [Indexed: 11/06/2022]
Abstract
Deep learning approaches are widely used to search molecular structures for a candidate drug/material. The basic approach in drug/material candidate structure discovery is to embed a relationship that holds between a molecular structure and the physical property into a low-dimensional vector space (chemical space) and search for a candidate molecular structure in that space based on a desired physical property value. Deep learning simplifies the structure search by efficiently modeling the structure of the chemical space with greater detail and lower dimensions than the original input space. In our research, we propose an effective method for molecular embedding learning that combines variational autoencoders (VAEs) and metric learning using any physical property. Our method enables molecular structures and physical properties to be embedded locally and continuously into VAEs' latent space while maintaining the consistency of the relationship between the structural features and the physical properties of molecules to yield better predictions.
Affiliation(s)
- Daiki Koge
- Division of Information Science, Graduate School of Science and Technology, Nara Institute of Science and Technology, 8916-5 Takayama, Ikoma, Nara 630-0192, Japan
- Naoaki Ono
- Division of Information Science, Graduate School of Science and Technology, Nara Institute of Science and Technology, 8916-5 Takayama, Ikoma, Nara 630-0192, Japan
- Data Science Center, Graduate School of Science and Technology, Nara Institute of Science and Technology, 8916-5 Takayama, Ikoma, Nara 630-0192, Japan
- Ming Huang
- Division of Information Science, Graduate School of Science and Technology, Nara Institute of Science and Technology, 8916-5 Takayama, Ikoma, Nara 630-0192, Japan
- Md. Altaf‐Ul‐Amin
- Division of Information Science, Graduate School of Science and Technology, Nara Institute of Science and Technology, 8916-5 Takayama, Ikoma, Nara 630-0192, Japan
- Shigehiko Kanaya
- Division of Information Science, Graduate School of Science and Technology, Nara Institute of Science and Technology, 8916-5 Takayama, Ikoma, Nara 630-0192, Japan
- Data Science Center, Graduate School of Science and Technology, Nara Institute of Science and Technology, 8916-5 Takayama, Ikoma, Nara 630-0192, Japan
|
44
|
Leguy J, Cauchy T, Glavatskikh M, Duval B, Da Mota B. EvoMol: a flexible and interpretable evolutionary algorithm for unbiased de novo molecular generation. J Cheminform 2020; 12:55. [PMID: 33431049 PMCID: PMC7494000 DOI: 10.1186/s13321-020-00458-z] [Citation(s) in RCA: 16] [Impact Index Per Article: 4.0] [Received: 06/19/2020] [Accepted: 08/31/2020] [Indexed: 11/24/2022] Open
Abstract
The objective of this work is to design a molecular generator capable of exploring known as well as unfamiliar areas of the chemical space. Our method must be flexible to adapt to very different problems. Therefore, it has to be able to work with or without the influence of prior data and knowledge. Moreover, regardless of the success, it should be as interpretable as possible to allow for diagnosis and improvement. We propose here a new open source generation method using an evolutionary algorithm to sequentially build molecular graphs. It is independent of starting data and can generate totally unseen compounds. To be able to search a large part of the chemical space, we define an original set of 7 generic mutations close to the atomic level. Our method achieves excellent performances and even records on the QED, penalised logP, SAscore, CLscore as well as the set of goal-directed functions defined in GuacaMol. To demonstrate its flexibility, we tackle a very different objective issued from the organic molecular materials domain. We show that EvoMol can generate sets of optimised molecules having high energy HOMO or low energy LUMO, starting only from methane. We can also set constraints on a synthesizability score and structural features. Finally, the interpretability of EvoMol allows for the visualisation of its exploration process as a chemically relevant tree.
Affiliation(s)
- Jules Leguy
- Laboratoire LERIA, UNIV Angers, SFR MathSTIC, 2 Bd Lavoisier, 49045, Angers, France
- Thomas Cauchy
- Laboratoire MOLTECH-Anjou, UMR CNRS 6200, UNIV Angers, SFR MATRIX, 2 Bd Lavoisier, 49045, Angers, France
- Marta Glavatskikh
- Laboratoire LERIA, UNIV Angers, SFR MathSTIC, 2 Bd Lavoisier, 49045, Angers, France; Laboratoire MOLTECH-Anjou, UMR CNRS 6200, UNIV Angers, SFR MATRIX, 2 Bd Lavoisier, 49045, Angers, France
- Béatrice Duval
- Laboratoire LERIA, UNIV Angers, SFR MathSTIC, 2 Bd Lavoisier, 49045, Angers, France
- Benoit Da Mota
- Laboratoire LERIA, UNIV Angers, SFR MathSTIC, 2 Bd Lavoisier, 49045, Angers, France
|
45
|
Smith DGA, Altarawy D, Burns LA, Welborn M, Naden LN, Ward L, Ellis S, Pritchard BP, Crawford TD. The MolSSI QCArchive project: An open‐source platform to compute, organize, and share quantum chemistry data. Wiley Interdisciplinary Reviews: Computational Molecular Science 2020. [DOI: 10.1002/wcms.1491] [Citation(s) in RCA: 18] [Impact Index Per Article: 4.5] [Indexed: 11/06/2022]
Affiliation(s)
- Doaa Altarawy
- Molecular Sciences Software Institute, Blacksburg, Virginia, USA
- Department of Computer and Systems Engineering, Alexandria University, Alexandria, Egypt
- Lori A. Burns
- Center for Computational Molecular Science and Technology, School of Chemistry and Biochemistry, Georgia Institute of Technology, Atlanta, Georgia, USA
- Matthew Welborn
- Molecular Sciences Software Institute, Blacksburg, Virginia, USA
- Levi N. Naden
- Molecular Sciences Software Institute, Blacksburg, Virginia, USA
- Logan Ward
- Data Science and Learning Division, Argonne National Laboratory, Lemont, Illinois, USA
- Sam Ellis
- Molecular Sciences Software Institute, Blacksburg, Virginia, USA
- T. Daniel Crawford
- Molecular Sciences Software Institute, Blacksburg, Virginia, USA
- Department of Chemistry, Virginia Tech, Blacksburg, Virginia, USA
|
46
|
Mancuso JL, Mroz AM, Le KN, Hendon CH. Electronic Structure Modeling of Metal-Organic Frameworks. Chem Rev 2020; 120:8641-8715. [PMID: 32672939 DOI: 10.1021/acs.chemrev.0c00148] [Citation(s) in RCA: 97] [Impact Index Per Article: 24.3] [Indexed: 12/18/2022]
Abstract
Owing to their molecular building blocks, yet highly crystalline nature, metal-organic frameworks (MOFs) sit at the interface between molecule and material. Their diverse structures and compositions enable them to be useful materials as catalysts in heterogeneous reactions, electrical conductors in energy storage and transfer applications, chromophores in photoenabled chemical transformations, and beyond. In all cases, density functional theory (DFT) and higher-level methods for electronic structure determination provide valuable quantitative information about the electronic properties that underpin the functions of these frameworks. However, there are only two general modeling approaches in conventional electronic structure software packages: those that treat materials as extended, periodic solids, and those that treat materials as discrete molecules. Each approach has features and benefits; both have been widely employed to understand the emergent chemistry that arises from the formation of the metal-organic interface. This Review canvases these approaches to date, with emphasis placed on the application of electronic structure theory to explore reactivity and electron transfer using periodic, molecular, and embedded models. This includes (i) computational chemistry considerations such as how functional, k-grid, and other model variables are selected to enable insights into MOF properties, (ii) extended solid models that treat MOFs as materials rather than molecules, (iii) the mechanics of cluster extraction and subsequent chemistry enabled by these molecular models, (iv) catalytic studies using both solids and clusters thereof, and (v) embedded, mixed-method approaches, which simulate a fraction of the material using one level of theory and the remainder of the material using another dissimilar theoretical implementation.
Affiliation(s)
- Jenna L Mancuso
- Department of Chemistry and Biochemistry, University of Oregon, Eugene, Oregon 97405, United States
- Austin M Mroz
- Department of Chemistry and Biochemistry, University of Oregon, Eugene, Oregon 97405, United States
- Khoa N Le
- Department of Chemistry and Biochemistry, University of Oregon, Eugene, Oregon 97405, United States
- Christopher H Hendon
- Department of Chemistry and Biochemistry, University of Oregon, Eugene, Oregon 97405, United States
|
47
|
Rauer C, Bereau T. Hydration free energies from kernel-based machine learning: Compound-database bias. J Chem Phys 2020; 153:014101. [DOI: 10.1063/5.0012230] [Citation(s) in RCA: 16] [Impact Index Per Article: 4.0] [Indexed: 11/14/2022] Open
Affiliation(s)
- Clemens Rauer
- Max Planck Institute for Polymer Research, 55128 Mainz, Germany
- Tristan Bereau
- Max Planck Institute for Polymer Research, 55128 Mainz, Germany
- Van ’t Hoff Institute for Molecular Sciences and Informatics Institute, University of Amsterdam, Amsterdam 1098 XH, The Netherlands
|
48
|
Abstract
As the quantum chemistry (QC) community embraces machine learning (ML), the number of new methods and applications based on the combination of QC and ML is surging. In this Perspective, a view of the current state of affairs in this new and exciting research field is offered, challenges of using machine learning in quantum chemistry applications are described, and potential future developments are outlined. Specifically, examples of how machine learning is used to improve the accuracy and accelerate quantum chemical research are shown. Generalization and classification of existing techniques are provided to ease the navigation in the sea of literature and to guide researchers entering the field. The emphasis of this Perspective is on supervised machine learning.
Affiliation(s)
- Pavlo O Dral
- State Key Laboratory of Physical Chemistry of Solid Surfaces, Fujian Provincial Key Laboratory of Theoretical and Computational Chemistry, Department of Chemistry, and College of Chemistry and Chemical Engineering, Xiamen University, Xiamen 361005, China
|
49
|
Schindl A, Hawker RR, Schaffarczyk McHale KS, Liu KTC, Morris DC, Hsieh AY, Gilbert A, Prescott SW, Haines RS, Croft AK, Harper JB, Jäger CM. Controlling the outcome of SN2 reactions in ionic liquids: from rational data set design to predictive linear regression models. Phys Chem Chem Phys 2020; 22:23009-23018. [PMID: 33043942 DOI: 10.1039/d0cp04224b] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Indexed: 01/19/2023]
Abstract
Rate constants for a bimolecular nucleophilic substitution (SN2) process in a range of ionic liquids are correlated with calculated parameters associated with the charge localisation on the cation of the ionic liquid (including the molecular electrostatic potential). Simple linear regression models proved effective, though the interdependency of the descriptors needs to be taken into account when considering generality. A series of ionic liquids were then prepared and evaluated as solvents for the same process; this data set was rationally chosen to incorporate homologous series (to evaluate systematic variation) and functionalities not available in the original data set. These new data were used to evaluate and refine the original models, which were expanded to include simple artificial neural networks. Along with showing the importance of an appropriate data set and the perils of overfitting, the work demonstrates that such models can be used to reliably predict ionic liquid solvent effects on an organic process, within the limits of the data set.
Affiliation(s)
- Alexandra Schindl
- Department of Chemical and Environmental Engineering, University of Nottingham, Nottingham NG7 2RD, UK
- Rebecca R Hawker
- School of Chemistry, University of New South Wales, UNSW Sydney, 2052, Australia
- Kenny T-C Liu
- School of Chemistry, University of New South Wales, UNSW Sydney, 2052, Australia
- Daniel C Morris
- School of Chemistry, University of New South Wales, UNSW Sydney, 2052, Australia, and School of Chemical Engineering, University of New South Wales, UNSW Sydney, 2052, Australia
- Andrew Y Hsieh
- School of Chemistry, University of New South Wales, UNSW Sydney, 2052, Australia
- Alyssa Gilbert
- School of Chemistry, University of New South Wales, UNSW Sydney, 2052, Australia
- Stuart W Prescott
- School of Chemical Engineering, University of New South Wales, UNSW Sydney, 2052, Australia
- Ronald S Haines
- School of Chemistry, University of New South Wales, UNSW Sydney, 2052, Australia
- Anna K Croft
- Department of Chemical and Environmental Engineering, University of Nottingham, Nottingham NG7 2RD, UK
- Jason B Harper
- School of Chemistry, University of New South Wales, UNSW Sydney, 2052, Australia
- Christof M Jäger
- Department of Chemical and Environmental Engineering, University of Nottingham, Nottingham NG7 2RD, UK
|