1. Tang MJ, Zhu TC, Zhang SQ, Hong X. QM9star, two Million DFT-computed Equilibrium Structures for Ions and Radicals with Atomic Information. Sci Data 2024; 11:1158. [PMID: 39433783 PMCID: PMC11494049 DOI: 10.1038/s41597-024-03933-6] [Received: 06/19/2024] [Accepted: 09/24/2024] [Indexed: 10/23/2024]
Abstract
Ions and radicals serve as key intermediates in molecular transformations, and their chemical properties are essential for understanding and predicting reactivity and selectivity. In this data descriptor, we report a quantum chemical dataset named QM9star, comprising cations, anions, and radicals. The dataset is derived from the molecular structures of the QM9 dataset by removing terminal hydrogens and then optimizing the resulting structures at the B3LYP-D3(BJ)/6-311+G(d,p) level of density functional theory. QM9star includes approximately 1.9 million cations, anions, and radicals, along with about 120 thousand neutral molecules prior to hydrogen removal. Each entry contains both molecular and atomic information: representative global properties include orbital energies and vibrational frequencies, while local properties include charges and spin densities at each atomic site. The QM9star dataset not only serves as a comprehensive source of quantum chemical information for intermediates but also offers insights into how atomic properties are distributed. We anticipate that these data will aid machine learning studies of chemical intermediates and contribute to molecular representation learning.
Affiliation(s)
- Miao-Jiong Tang
  - Center of Chemistry for Frontier Technologies, Department of Chemistry, Zhejiang University, Hangzhou, 310027, P. R. China
- Tian-Cheng Zhu
  - Center of Chemistry for Frontier Technologies, Department of Chemistry, Zhejiang University, Hangzhou, 310027, P. R. China
- Shuo-Qing Zhang
  - Center of Chemistry for Frontier Technologies, Department of Chemistry, Zhejiang University, Hangzhou, 310027, P. R. China
- Xin Hong
  - Center of Chemistry for Frontier Technologies, Department of Chemistry, Zhejiang University, Hangzhou, 310027, P. R. China
  - School of Chemistry and Chemical Engineering, Henan Normal University, Xinxiang, 453007, P. R. China
2. Shirani H, Hashemianzadeh SM. Quantum-level machine learning calculations of Levodopa. Comput Biol Chem 2024; 112:108146. [PMID: 39067350 DOI: 10.1016/j.compbiolchem.2024.108146] [Received: 02/14/2024] [Revised: 06/20/2024] [Accepted: 07/08/2024] [Indexed: 07/30/2024]
Abstract
Many drug molecules contain functional groups, resulting in a torsional barrier corresponding to rotation around the bond linking the fragments. In medicinal chemistry and pharmaceutical sciences, including drug design studies, accurate calculation of the potential energy surface (PES) of these molecular torsions is extremely important. Machine learning (ML), including deep learning (DL), is currently one of the most rapidly evolving tools in computer-aided drug discovery and molecular simulations. In this work, we used the ANI-1x neural network potential as a quantum-level ML method to predict the PESs of the antiparkinsonian drug L-3,4-dihydroxyphenylalanine (Levodopa). The electronic energies and structural parameters calculated by density functional theory (DFT) with the wB97X functional and all possible Pople basis sets indicated that the 6-31G(d) basis set, when used with wB97X, exhibits behavior similar to that of the ANI-1x model. The investigation of vibrational frequencies showed a linear correlation between DFT and ML data. All ANI-1x calculations were completed in a very short computing time. From this perspective, we expect the ANI-1x dataset applied in this work to be efficient and effective in computational structure-based drug design studies.
Affiliation(s)
- Hossein Shirani
  - Molecular Simulation Research Laboratory, Department of Chemistry, Iran University of Science and Technology, P.O. Box 16846-13114, Tehran, Iran
- Seyed Majid Hashemianzadeh
  - Molecular Simulation Research Laboratory, Department of Chemistry, Iran University of Science and Technology, P.O. Box 16846-13114, Tehran, Iran
3. Charry J, Tkatchenko A. van der Waals Radii of Free and Bonded Atoms from Hydrogen (Z = 1) to Oganesson (Z = 118). J Chem Theory Comput 2024; 20:7469-7478. [PMID: 39208255 PMCID: PMC11391583 DOI: 10.1021/acs.jctc.4c00784] [Indexed: 09/04/2024]
Abstract
Reliable numerical values of van der Waals (vdW) radii are required for constructing empirical force fields, vdW-inclusive density functional, and quantum-chemical methods, as well as for implicit solvent models. However, multiple definitions exist for vdW radii, involving either equilibrium or the closest contact distances between free or bonded atoms within molecules or crystals. For the paradigmatic case of the hydrogen atom, its reported vdW radius fluctuates between 2.15 and 3.70 Bohr depending on the definition, leading to a high uncertainty in calculations and different conceptual interpretations of noncovalent interactions. In this work, we systematically review different definitions and methodologies to establish the free and bonded vdW radii for hydrogen, based on equilibrium vdW distances in noncovalently bonded molecules, enveloping electron density cutoffs, noncovalent positron bonds in hydrogen anion dimer, vacuum virtual photon cloud caused by the hydrogen atom, and atomic dipole polarizability. By doing so, we show that the vdW radius of the free hydrogen atom is 3.16 ± 0.06 Bohr. By employing the most general and elegant definition of atomic vdW radius as a function of the atomic polarizability, we tabulate consistent values of vdW radii for all atoms in the periodic table up to Z = 118.
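The polarizability-based definition mentioned at the end of this abstract can be illustrated numerically. The sketch below assumes a relation of the form R_vdW = 2.54 α^(1/7) in atomic units, as proposed in earlier work from the same group; the abstract itself does not state the formula, so the prefactor, the exponent, and the function name `vdw_radius_bohr` are assumptions made here for illustration. The input of 4.5 a.u. is the exact nonrelativistic static dipole polarizability of the hydrogen atom.

```python
def vdw_radius_bohr(alpha_au: float, prefactor: float = 2.54) -> float:
    """Estimate a vdW radius (Bohr) from a static dipole polarizability (a.u.).

    Assumed relation (not stated in the abstract): R_vdW = prefactor * alpha**(1/7).
    """
    if alpha_au <= 0:
        raise ValueError("polarizability must be positive")
    return prefactor * alpha_au ** (1.0 / 7.0)


# Free hydrogen atom: alpha = 4.5 a.u. (exact nonrelativistic value).
r_h = vdw_radius_bohr(4.5)
print(f"Estimated vdW radius of free H: {r_h:.2f} Bohr")
```

Under these assumptions the estimate for hydrogen is about 3.15 Bohr, which lies inside the 3.16 ± 0.06 Bohr interval quoted in the abstract.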
Affiliation(s)
- Jorge Charry
  - Department of Physics and Materials Science, University of Luxembourg, L-1511 Luxembourg City, Luxembourg
- Alexandre Tkatchenko
  - Department of Physics and Materials Science, University of Luxembourg, L-1511 Luxembourg City, Luxembourg
4. Góger S, Karimpour MR, Tkatchenko A. Four-Dimensional Scaling of Dipole Polarizability: From Single-Particle Models to Atoms and Molecules. J Chem Theory Comput 2024; 20:6621-6631. [PMID: 39015013 PMCID: PMC11325554 DOI: 10.1021/acs.jctc.4c00582] [Indexed: 07/18/2024]
Abstract
Scaling laws enable the determination of physicochemical properties of molecules and materials as a function of their size, density, number of electrons or other easily accessible descriptors. Such relations can be counterintuitive and nonlinear, and ultimately yield much needed insight into quantum mechanics of many-particle systems. In this work, we show on the basis of single-particle models, multielectron atoms and molecules that the dipole polarizability of quantum systems is generally proportional to the fourth power of a characteristic length, computed from the ground-state wave function. This four-dimensional (4D) scaling is independent of the ratio of bound-to-bound and bound-to-continuum electronic transitions and applies to many-electron atoms when a correlated length metric is used. Finally, this scaling law is applied to predict the polarizability of molecules by electrostatically coupled atoms-in-molecules approach, obtaining approximately 8% absolute and relative accuracy with respect to hybrid density functional theory (DFT) on the QM7-X data set of organic molecules, providing an efficient and scalable model for the anisotropic polarizability tensors of extended (bio)molecules.
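The fourth-power scaling can be checked by hand for the simplest single-particle model. For a charged harmonic oscillator the static polarizability is α = q²/(mω²), and the ground-state position spread is ⟨x²⟩ = ħ/(2mω). The sketch below works in atomic units and takes L = sqrt(2⟨x²⟩) as the characteristic length; this choice of L and the function names are assumptions made for illustration and need not match the length metric used in the paper.

```python
import math


def qho_polarizability(omega: float, q: float = 1.0, m: float = 1.0) -> float:
    """Static dipole polarizability of a charged harmonic oscillator."""
    return q * q / (m * omega * omega)


def qho_length(omega: float, m: float = 1.0, hbar: float = 1.0) -> float:
    """Characteristic length L = sqrt(2 <x^2>) of the oscillator ground state."""
    x2 = hbar / (2.0 * m * omega)  # ground-state <x^2>
    return math.sqrt(2.0 * x2)


# If alpha scales as L**4, the ratio alpha / L**4 is the same for every frequency.
for omega in (0.25, 0.5, 1.0, 2.0):
    alpha = qho_polarizability(omega)
    length = qho_length(omega)
    print(f"omega={omega:5.2f}  alpha={alpha:8.4f}  L^4={length**4:8.4f}  "
          f"ratio={alpha / length**4:.6f}")
```

With q = m = ħ = 1 the ratio stays at 1 for every frequency, because α = q²mL⁴/ħ² follows directly from the two expressions above.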
Affiliation(s)
- Szabolcs Góger
  - Department of Physics and Materials Science, University of Luxembourg, L-1511 Luxembourg City, Luxembourg
- Mohammad Reza Karimpour
  - Department of Physics and Materials Science, University of Luxembourg, L-1511 Luxembourg City, Luxembourg
- Alexandre Tkatchenko
  - Department of Physics and Materials Science, University of Luxembourg, L-1511 Luxembourg City, Luxembourg
5. Frank JT, Unke OT, Müller KR, Chmiela S. A Euclidean transformer for fast and stable machine learned force fields. Nat Commun 2024; 15:6539. [PMID: 39107296 PMCID: PMC11303804 DOI: 10.1038/s41467-024-50620-6] [Received: 08/25/2023] [Accepted: 07/10/2024] [Indexed: 08/10/2024]
Abstract
Recent years have seen vast progress in the development of machine learned force fields (MLFFs) based on ab-initio reference calculations. Despite achieving low test errors, the reliability of MLFFs in molecular dynamics (MD) simulations is facing growing scrutiny due to concerns about instability over extended simulation timescales. Our findings suggest a potential connection between robustness to cumulative inaccuracies and the use of equivariant representations in MLFFs, but the computational cost associated with these representations can limit this advantage in practice. To address this, we propose a transformer architecture called SO3KRATES that combines sparse equivariant representations (Euclidean variables) with a self-attention mechanism that separates invariant and equivariant information, eliminating the need for expensive tensor products. SO3KRATES achieves a unique combination of accuracy, stability, and speed that enables insightful analysis of quantum properties of matter on extended time and system size scales. To showcase this capability, we generate stable MD trajectories for flexible peptides and supra-molecular structures with hundreds of atoms. Furthermore, we investigate the PES topology for medium-sized chainlike molecules (e.g., small peptides) by exploring thousands of minima. Remarkably, SO3KRATES demonstrates the ability to strike a balance between the conflicting demands of stability and the emergence of new minimum-energy conformations beyond the training data, which is crucial for realistic exploration tasks in the field of biochemistry.
Affiliation(s)
- J Thorben Frank
  - Machine Learning Group, TU Berlin, Berlin, Germany
  - BIFOLD, Berlin Institute for the Foundations of Learning and Data, Berlin, Germany
- Klaus-Robert Müller
  - Machine Learning Group, TU Berlin, Berlin, Germany
  - BIFOLD, Berlin Institute for the Foundations of Learning and Data, Berlin, Germany
  - Google DeepMind, Berlin, Germany
  - Department of Artificial Intelligence, Korea University, Seoul, Korea
  - Max Planck Institut für Informatik, Saarbrücken, Germany
- Stefan Chmiela
  - Machine Learning Group, TU Berlin, Berlin, Germany
  - BIFOLD, Berlin Institute for the Foundations of Learning and Data, Berlin, Germany
6. Schwarting M, Seifert NA, Davis MJ, Blaiszik B, Foster I, Prozument K. Twins in rotational spectroscopy: Does a rotational spectrum uniquely identify a molecule? J Chem Phys 2024; 161:044309. [PMID: 39051838 DOI: 10.1063/5.0212632] [Received: 04/05/2024] [Accepted: 07/03/2024] [Indexed: 07/27/2024]
Abstract
Rotational spectroscopy is the most accurate method for determining structures of molecules in the gas phase. It is often assumed that a rotational spectrum is a unique "fingerprint" of a molecule. The availability of large molecular databases and the development of artificial intelligence methods for spectroscopy make the testing of this assumption timely. In this paper, we pose the determination of molecular structures from rotational spectra as an inverse problem. Within this framework, we adopt a funnel-based approach to search for molecular twins, that is, two or more molecules that have similar rotational spectra but distinctly different molecular structures. We demonstrate that there are twins within standard levels of computational accuracy by generating rotational constants for many molecules from several large molecular databases, indicating that the inverse problem is ill-posed. However, some twins can be distinguished by increasing the accuracy of the theoretical methods or by performing additional experiments.
Affiliation(s)
- Marcus Schwarting
  - Department of Computer Science, University of Chicago, Chicago, Illinois 60637, USA
- Nathan A Seifert
  - Department of Chemistry and Chemical and Biomedical Engineering, University of New Haven, West Haven, Connecticut 06516, USA
- Michael J Davis
  - Chemical Sciences and Engineering Division, Argonne National Laboratory, Lemont, Illinois 60439, USA
- Ben Blaiszik
  - Data Science and Learning Division, Argonne National Laboratory, Lemont, Illinois 60439, USA
- Ian Foster
  - Department of Computer Science, University of Chicago, Chicago, Illinois 60637, USA
  - Data Science and Learning Division, Argonne National Laboratory, Lemont, Illinois 60439, USA
- Kirill Prozument
  - Chemical Sciences and Engineering Division, Argonne National Laboratory, Lemont, Illinois 60439, USA
7. Plé T, Adjoua O, Lagardère L, Piquemal JP. FeNNol: An efficient and flexible library for building force-field-enhanced neural network potentials. J Chem Phys 2024; 161:042502. [PMID: 39051830 DOI: 10.1063/5.0217688] [Received: 05/06/2024] [Accepted: 06/28/2024] [Indexed: 07/27/2024]
Abstract
Neural network interatomic potentials (NNPs) have recently proven to be powerful tools to accurately model complex molecular systems while bypassing the high numerical cost of ab initio molecular dynamics simulations. In recent years, numerous advances in model architectures as well as the development of hybrid models combining machine learning (ML) with more traditional, physically motivated, force-field interactions have considerably increased the design space of ML potentials. In this paper, we present FeNNol, a new library for building, training, and running force-field-enhanced neural network potentials. It provides a flexible and modular system for building hybrid models, allowing us to easily combine state-of-the-art embeddings with ML-parameterized physical interaction terms without the need for explicit programming. Furthermore, FeNNol leverages the automatic differentiation and just-in-time compilation features of the JAX Python library to enable fast evaluation of NNPs, shrinking the performance gap between ML potentials and standard force fields. This is demonstrated with the popular ANI-2x model reaching simulation speeds nearly on par with the AMOEBA polarizable force field on commodity GPUs (graphics processing units). We hope that FeNNol will facilitate the development and application of new hybrid NNP architectures for a wide range of molecular simulation problems.
Affiliation(s)
- Thomas Plé
  - Sorbonne Université, LCT, UMR 7616 CNRS, 75005 Paris, France
- Olivier Adjoua
  - Sorbonne Université, LCT, UMR 7616 CNRS, 75005 Paris, France
- Louis Lagardère
  - Sorbonne Université, LCT, UMR 7616 CNRS, 75005 Paris, France
8. Fallani A, Medrano Sandonas L, Tkatchenko A. Inverse mapping of quantum properties to structures for chemical space of small organic molecules. Nat Commun 2024; 15:6061. [PMID: 39025883 PMCID: PMC11258234 DOI: 10.1038/s41467-024-50401-1] [Received: 09/14/2023] [Accepted: 07/01/2024] [Indexed: 07/20/2024]
Abstract
Computer-driven molecular design combines the principles of chemistry, physics, and artificial intelligence to identify chemical compounds with tailored properties. While quantum-mechanical (QM) methods, coupled with machine learning, already offer a direct mapping from 3D molecular structures to their properties, effective methodologies for the inverse mapping in chemical space remain elusive. We address this challenge by demonstrating the possibility of parametrizing a chemical space with a finite set of QM properties. Our proof-of-concept implementation achieves an approximate property-to-structure mapping, the QIM model (which stands for "Quantum Inverse Mapping"), by forcing a variational auto-encoder with a property encoder to obtain a common internal representation for both structures and properties. After validating this mapping for small drug-like molecules, we illustrate its capabilities with an explainability study as well as by the generation of de novo molecular structures with targeted properties and transition pathways between conformational isomers. Our findings thus provide a proof-of-principle demonstration aiming to enable the inverse property-to-structure design in diverse chemical spaces.
Affiliation(s)
- Alessio Fallani
  - Department of Physics and Materials Science, University of Luxembourg, L-1511, Luxembourg City, Luxembourg
- Leonardo Medrano Sandonas
  - Department of Physics and Materials Science, University of Luxembourg, L-1511, Luxembourg City, Luxembourg
  - Institute for Materials Science and Max Bergmann Center of Biomaterials, TU Dresden, 01062, Dresden, Germany
- Alexandre Tkatchenko
  - Department of Physics and Materials Science, University of Luxembourg, L-1511, Luxembourg City, Luxembourg
9. Cheng Z, Bi H, Liu S, Chen J, Misquitta AJ, Yu K. Developing a Differentiable Long-Range Force Field for Proteins with E(3) Neural Network-Predicted Asymptotic Parameters. J Chem Theory Comput 2024; 20:5598-5608. [PMID: 38888427 DOI: 10.1021/acs.jctc.4c00337] [Indexed: 06/20/2024]
Abstract
Accurately describing long-range interactions is a significant challenge in molecular dynamics (MD) simulations of proteins. A high-quality long-range potential is also an important component of range-separated machine learning force fields. This study introduces a comprehensive asymptotic parameter database encompassing atomic multipole moments, polarizabilities, and dispersion coefficients. Leveraging active learning, our database comprehensively represents protein fragments with up to 8 heavy atoms, capturing their conformational diversity with merely 78,000 data points. Additionally, the E(3) neural network (E3NN) is employed to predict the asymptotic parameters directly from the local geometry. The E3NN models demonstrate exceptional accuracy and transferability across all asymptotic parameters, achieving an R² of 0.999 for both protein fragments and 20 amino acid dipeptide test sets. The long-range electrostatic and dispersion energies can be obtained using the E3NN-predicted parameters, with errors of 0.07 and 0.02 kcal/mol, respectively, when compared to symmetry-adapted perturbation theory (SAPT). Therefore, our force fields demonstrate the capability to accurately describe long-range interactions in proteins, paving the way for next-generation protein force fields.
Affiliation(s)
- Zheng Cheng
  - School of Mathematical Sciences, Peking University, Beijing 100871, China
  - AI for Science Institute, Beijing 100084, P. R. China
- Hangrui Bi
  - School of Mathematical Sciences, Peking University, Beijing 100871, China
  - DP Technology, Beijing 100080, P. R. China
- Siyuan Liu
  - DP Technology, Beijing 100080, P. R. China
- Junmin Chen
  - Tsinghua-Berkeley Shenzhen Institute, Shenzhen 518055, Guangdong, P. R. China
  - Tsinghua Shenzhen International Graduate School, Shenzhen 518055, Guangdong, P. R. China
- Alston J Misquitta
  - School of Physics and Astronomy, Queen Mary, University of London, London E1 4NS, U.K.
- Kuang Yu
  - Tsinghua-Berkeley Shenzhen Institute, Shenzhen 518055, Guangdong, P. R. China
  - Tsinghua Shenzhen International Graduate School, Shenzhen 518055, Guangdong, P. R. China
10. Medrano Sandonas L, Van Rompaey D, Fallani A, Hilfiker M, Hahn D, Perez-Benito L, Verhoeven J, Tresadern G, Kurt Wegner J, Ceulemans H, Tkatchenko A. Dataset for quantum-mechanical exploration of conformers and solvent effects in large drug-like molecules. Sci Data 2024; 11:742. [PMID: 38972891 PMCID: PMC11228031 DOI: 10.1038/s41597-024-03521-8] [Received: 03/18/2024] [Accepted: 06/13/2024] [Indexed: 07/09/2024]
Abstract
We here introduce the Aquamarine (AQM) dataset, an extensive quantum-mechanical (QM) dataset that contains the structural and electronic information of 59,783 low- and high-energy conformers of 1,653 molecules with a total number of atoms ranging from 2 to 92 (mean: 50.9), and containing up to 54 (mean: 28.2) non-hydrogen atoms. To gain insights into the solvent effects as well as collective dispersion interactions for drug-like molecules, we have performed QM calculations supplemented with a treatment of many-body dispersion (MBD) interactions of structures and properties in the gas phase and implicit water. Thus, AQM contains over 40 global and local physicochemical properties (including ground-state and response properties) per conformer computed at the tightly converged PBE0+MBD level of theory for gas-phase molecules, whereas PBE0+MBD with the modified Poisson-Boltzmann (MPB) model of water was used for solvated molecules. By addressing both molecule-solvent and dispersion interactions, the AQM dataset can serve as a challenging benchmark for state-of-the-art machine learning methods for property modeling and de novo generation of large (solvated) molecules with pharmaceutical and biological relevance.
Affiliation(s)
- Leonardo Medrano Sandonas
  - Department of Physics and Materials Science, University of Luxembourg, L-1511, Luxembourg City, Luxembourg
  - Institute for Materials Science and Max Bergmann Center of Biomaterials, TU Dresden, 01062, Dresden, Germany
- Dries Van Rompaey
  - Drug Discovery Data Sciences (D3S), Janssen Pharmaceutica NV, Turnhoutseweg 30, 2340, Beerse, Belgium
- Alessio Fallani
  - Department of Physics and Materials Science, University of Luxembourg, L-1511, Luxembourg City, Luxembourg
  - Drug Discovery Data Sciences (D3S), Janssen Pharmaceutica NV, Turnhoutseweg 30, 2340, Beerse, Belgium
- Mathias Hilfiker
  - Department of Physics and Materials Science, University of Luxembourg, L-1511, Luxembourg City, Luxembourg
- David Hahn
  - Computational Chemistry, Janssen Pharmaceutica NV, Turnhoutseweg 30, 2340, Beerse, Belgium
- Laura Perez-Benito
  - Computational Chemistry, Janssen Pharmaceutica NV, Turnhoutseweg 30, 2340, Beerse, Belgium
- Jonas Verhoeven
  - Drug Discovery Data Sciences (D3S), Janssen Pharmaceutica NV, Turnhoutseweg 30, 2340, Beerse, Belgium
- Gary Tresadern
  - Computational Chemistry, Janssen Pharmaceutica NV, Turnhoutseweg 30, 2340, Beerse, Belgium
- Joerg Kurt Wegner
  - Drug Discovery Data Sciences (D3S), Janssen Pharmaceutica NV, Turnhoutseweg 30, 2340, Beerse, Belgium
  - Drug Discovery Data Sciences (D3S), Johnson & Johnson Innovative Medicine, 301 Binney Street, MA 02142, Cambridge, USA
- Hugo Ceulemans
  - Drug Discovery Data Sciences (D3S), Janssen Pharmaceutica NV, Turnhoutseweg 30, 2340, Beerse, Belgium
- Alexandre Tkatchenko
  - Department of Physics and Materials Science, University of Luxembourg, L-1511, Luxembourg City, Luxembourg
11. Cui M, Reuter K, Margraf JT. Obtaining Robust Density Functional Tight-Binding Parameters for Solids across the Periodic Table. J Chem Theory Comput 2024; 20:5276-5290. [PMID: 38865478 DOI: 10.1021/acs.jctc.4c00228] [Indexed: 06/14/2024]
Abstract
The density functional tight-binding (DFTB) approach allows electronic structure-based simulations at length and time scales far beyond what is possible with first-principles methods. This is achieved by using minimal basis sets and empirical approximations. Unfortunately, the sparse availability of parameters across the periodic table is a significant barrier to the use of DFTB in many cases. We therefore propose a workflow that allows the robust and consistent parametrization of DFTB across the periodic table. Importantly, our approach requires no element-pairwise parameters so that the parameters can be used for all element combinations and are readily extendable. This is achieved by parametrizing all elements on a consistent set of artificial homoelemental crystals, spanning a wide range of coordination environments. The transferability of the resulting periodic table baseline parameters to multielement systems and unknown structures is explored and the model is extensively benchmarked against previous specialized and general DFTB parametrizations.
Affiliation(s)
- Mengnan Cui
  - Fritz Haber Institute of the Max Planck Society, 14195 Berlin, Germany
  - University of Bayreuth, Bavarian Center for Battery Technology (BayBatt), 95448 Bayreuth, Germany
- Karsten Reuter
  - Fritz Haber Institute of the Max Planck Society, 14195 Berlin, Germany
- Johannes T Margraf
  - University of Bayreuth, Bavarian Center for Battery Technology (BayBatt), 95448 Bayreuth, Germany
12. Wan K, He J, Shi X. Construction of High Accuracy Machine Learning Interatomic Potential for Surface/Interface of Nanomaterials-A Review. Adv Mater 2024; 36:e2305758. [PMID: 37640376 DOI: 10.1002/adma.202305758] [Received: 06/15/2023] [Revised: 08/24/2023] [Indexed: 08/31/2023]
Abstract
The inherent discontinuity and unique dimensional attributes of nanomaterial surfaces and interfaces bestow them with various exceptional properties. These properties, however, also introduce difficulties for both experimental and computational studies. The advent of machine learning interatomic potential (MLIP) addresses some of the limitations associated with empirical force fields, presenting a valuable avenue for accurate simulations of these surfaces/interfaces of nanomaterials. Central to this approach is the idea of capturing the relationship between system configuration and potential energy, leveraging the proficiency of machine learning (ML) to precisely approximate high-dimensional functions. This review offers an in-depth examination of MLIP principles and their execution and elaborates on their applications in the realm of nanomaterial surface and interface systems. The prevailing challenges faced by this potent methodology are also discussed.
Affiliation(s)
- Kaiwei Wan
  - Laboratory of Theoretical and Computational Nanoscience, National Center for Nanoscience and Technology, Chinese Academy of Sciences, Beijing, 100190, China
  - University of Chinese Academy of Sciences, No. 19A Yuquan Road, Beijing, 100049, China
- Jianxin He
  - Laboratory of Theoretical and Computational Nanoscience, National Center for Nanoscience and Technology, Chinese Academy of Sciences, Beijing, 100190, China
  - University of Chinese Academy of Sciences, No. 19A Yuquan Road, Beijing, 100049, China
- Xinghua Shi
  - Laboratory of Theoretical and Computational Nanoscience, National Center for Nanoscience and Technology, Chinese Academy of Sciences, Beijing, 100190, China
  - University of Chinese Academy of Sciences, No. 19A Yuquan Road, Beijing, 100049, China
13. Chen M, Jiang X, Zhang L, Chen X, Wen Y, Gu Z, Li X, Zheng M. The emergence of machine learning force fields in drug design. Med Res Rev 2024; 44:1147-1182. [PMID: 38173298 DOI: 10.1002/med.22008] [Received: 08/19/2023] [Revised: 11/29/2023] [Accepted: 12/05/2023] [Indexed: 01/05/2024]
Abstract
In the field of molecular simulation for drug design, traditional molecular mechanic force fields and quantum chemical theories have been instrumental but limited in terms of scalability and computational efficiency. To overcome these limitations, machine learning force fields (MLFFs) have emerged as a powerful tool capable of balancing accuracy with efficiency. MLFFs rely on the relationship between molecular structures and potential energy, bypassing the need for a preconceived notion of interaction representations. Their accuracy depends on the machine learning models used, and the quality and volume of training data sets. With recent advances in equivariant neural networks and high-quality datasets, MLFFs have significantly improved their performance. This review explores MLFFs, emphasizing their potential in drug design. It elucidates MLFF principles, provides development and validation guidelines, and highlights successful MLFF implementations. It also addresses potential challenges in developing and applying MLFFs. The review concludes by illuminating the path ahead for MLFFs, outlining the challenges to be overcome and the opportunities to be harnessed. This inspires researchers to embrace MLFFs in their investigations as a new tool to perform molecular simulations in drug design.
Affiliation(s)
- Mingan Chen
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai, China
- School of Physical Science and Technology, ShanghaiTech University, Shanghai, China
- Lingang Laboratory, Shanghai, China
| | - Xinyu Jiang
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai, China
- School of Pharmacy, University of Chinese Academy of Sciences, Beijing, China
| | - Lehan Zhang
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai, China
- School of Pharmacy, University of Chinese Academy of Sciences, Beijing, China
| | - Xiaoxu Chen
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai, China
- School of Pharmacy, University of Chinese Academy of Sciences, Beijing, China
- School of Pharmaceutical Science and Technology, Hangzhou Institute for Advanced Study, UCAS, Hangzhou, China
| | - Yiming Wen
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai, China
- School of Pharmacy, University of Chinese Academy of Sciences, Beijing, China
- School of Pharmaceutical Science and Technology, Hangzhou Institute for Advanced Study, UCAS, Hangzhou, China
| | - Zhiyong Gu
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai, China
- School of Pharmacy, University of Chinese Academy of Sciences, Beijing, China
- School of Pharmaceutical Science and Technology, Hangzhou Institute for Advanced Study, UCAS, Hangzhou, China
| | - Xutong Li
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai, China
- School of Pharmacy, University of Chinese Academy of Sciences, Beijing, China
| | - Mingyue Zheng
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai, China
- School of Pharmacy, University of Chinese Academy of Sciences, Beijing, China
- School of Pharmaceutical Science and Technology, Hangzhou Institute for Advanced Study, UCAS, Hangzhou, China
|
14
|
Jin H, Merz KM. Modeling Zinc Complexes Using Neural Networks. J Chem Inf Model 2024; 64:3140-3148. [PMID: 38587510 PMCID: PMC11040731 DOI: 10.1021/acs.jcim.4c00095] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/17/2024] [Revised: 03/04/2024] [Accepted: 03/28/2024] [Indexed: 04/09/2024]
Abstract
Understanding the energetic landscapes of large molecules is necessary for the study of chemical and biological systems. Recently, deep learning has greatly accelerated the development of models based on quantum chemistry, making it possible to build potential energy surfaces and explore chemical space. However, most of this work has focused on organic molecules due to the simplicity of their electronic structures as well as the availability of data sets. In this work, we build a deep learning architecture to model the energetics of zinc organometallic complexes. To achieve this, we have compiled a configurationally and conformationally diverse data set of zinc complexes using metadynamics to overcome the limitations of traditional sampling methods. In terms of the neural network potentials, our results indicate that for zinc complexes, partial charges play an important role in modeling the long-range interactions with a neural network. Our developed model outperforms semiempirical methods in predicting the relative energy of zinc conformers, yielding a mean absolute error (MAE) of 1.32 kcal/mol with reference to the double-hybrid PWPB95 method.
Affiliation(s)
- Hongni Jin
- Department of Chemistry, Michigan State University, East Lansing, Michigan 48824, United States
| | - Kenneth M. Merz
- Department of Chemistry, Michigan State University, East Lansing, Michigan 48824, United States
- Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, Michigan 48824, United States
|
15
|
Buterez D, Janet JP, Kiddle SJ, Oglic D, Liò P. Transfer learning with graph neural networks for improved molecular property prediction in the multi-fidelity setting. Nat Commun 2024; 15:1517. [PMID: 38409255 PMCID: PMC11258334 DOI: 10.1038/s41467-024-45566-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/28/2023] [Accepted: 01/25/2024] [Indexed: 02/28/2024] Open
Abstract
We investigate the potential of graph neural networks for transfer learning and improving molecular property prediction on sparse and expensive to acquire high-fidelity data by leveraging low-fidelity measurements as an inexpensive proxy for a targeted property of interest. This problem arises in discovery processes that rely on screening funnels for trading off the overall costs against throughput and accuracy. Typically, individual stages in these processes are loosely connected and each one generates data at different scale and fidelity. We consider this setup holistically and demonstrate empirically that existing transfer learning techniques for graph neural networks are generally unable to harness the information from multi-fidelity cascades. Here, we propose several effective transfer learning strategies and study them in transductive and inductive settings. Our analysis involves a collection of more than 28 million unique experimental protein-ligand interactions across 37 targets from drug discovery by high-throughput screening and 12 quantum properties from the dataset QMugs. The results indicate that transfer learning can improve the performance on sparse tasks by up to eight times while using an order of magnitude less high-fidelity training data. Moreover, the proposed methods consistently outperform existing transfer learning strategies for graph-structured data on drug discovery and quantum mechanics datasets.
Affiliation(s)
- David Buterez
- Department of Computer Science and Technology, University of Cambridge, Cambridge, UK.
| | - Jon Paul Janet
- Molecular AI, BioPharmaceuticals R&D, AstraZeneca, Gothenburg, Sweden
| | - Steven J Kiddle
- Data Science & Advanced Analytics, Data Science & AI, R&D, AstraZeneca, Cambridge, UK
| | - Dino Oglic
- Centre for AI, BioPharmaceuticals R&D, AstraZeneca, Cambridge, UK
| | - Pietro Liò
- Department of Computer Science and Technology, University of Cambridge, Cambridge, UK
|
16
|
Fonseca G, Poltavsky I, Tkatchenko A. Force Field Analysis Software and Tools (FFAST): Assessing Machine Learning Force Fields under the Microscope. J Chem Theory Comput 2023; 19:8706-8717. [PMID: 38011895 PMCID: PMC10720330 DOI: 10.1021/acs.jctc.3c00985] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/08/2023] [Revised: 11/06/2023] [Accepted: 11/07/2023] [Indexed: 11/29/2023]
Abstract
As the sophistication of machine learning force fields (MLFF) increases to match the complexity of extended molecules and materials, so does the need for tools to properly analyze and assess the practical performance of MLFFs. To go beyond average error metrics and into a complete picture of a model's applicability and limitations, we developed FFAST (force field analysis software and tools): a cross-platform software package designed to gain detailed insights into a model's performance and limitations, complete with an easy-to-use graphical user interface. The software allows the user to gauge the performance of any molecular force field, such as popular state-of-the-art MLFF models, on various popular data set types, providing general prediction error overviews, outlier detection mechanisms, atom-projected errors, and more. It has a 3D visualizer to find and picture problematic configurations, atoms, or clusters in a large data set. In this paper, the example of the MACE and NequIP models is used on two data sets of interest [stachyose and docosahexaenoic acid (DHA)] to illustrate the use cases of the software. With this, it was found that carbons and oxygens involved in or near glycosidic bonds inside the stachyose molecule present increased prediction errors. In addition, prediction errors on DHA rise as the molecule folds, especially for the carboxylic group at the edge of the molecule. We emphasize the need for a systematic assessment of MLFF models for ensuring their successful application to the study of dynamics of molecules and materials.
Affiliation(s)
- Gregory Fonseca
- Department of Physics and Materials Science, University of Luxembourg, Luxembourg City L-1511, Luxembourg
| | - Igor Poltavsky
- Department of Physics and Materials Science, University of Luxembourg, Luxembourg City L-1511, Luxembourg
| | - Alexandre Tkatchenko
- Department of Physics and Materials Science, University of Luxembourg, Luxembourg City L-1511, Luxembourg
|
17
|
Burns JW, Rogers DM. QuantumScents: Quantum-Mechanical Properties for 3.5k Olfactory Molecules. J Chem Inf Model 2023; 63:7330-7337. [PMID: 37988325 DOI: 10.1021/acs.jcim.3c01338] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/23/2023]
Abstract
Quantitative structure-odor relationships are critically important for studies related to the function of olfaction. Current literature data sets contain expert-labeled molecules but lack feature data. This paper introduces QuantumScents, a quantum mechanics augmented derivative of the Leffingwell data set. QuantumScents contains 3.5k structurally and chemically diverse molecules ranging from 2 to 30 heavy atoms (CNOS) and their corresponding 3D coordinates, total PBE0 energy, molecular dipole moment, and per-atom Hirshfeld charges, dipoles, and ratios. The authors demonstrate that Hirshfeld charges and ratios contain sufficient information to perform molecular classification by training a Message Passing Neural Network with chemprop (Heid, E.; et al. ChemRxiv, 2023, DOI: 10.26434/chemrxiv-2023-3zcfl) to predict scent labels. The QuantumScents data set is freely available on Zenodo along with the authors' code, example models, and data set generation workflow (https://zenodo.org/doi/10.5281/zenodo.8239853).
Affiliation(s)
- Jackson W Burns
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
| | - David M Rogers
- National Center for Computational Sciences, Oak Ridge National Laboratory, Oak Ridge, Tennessee 37831, United States
|
18
|
Stylianakis I, Zervos N, Lii JH, Pantazis DA, Kolocouris A. Conformational energies of reference organic molecules: benchmarking of common efficient computational methods against coupled cluster theory. J Comput Aided Mol Des 2023; 37:607-656. [PMID: 37597063 PMCID: PMC10618395 DOI: 10.1007/s10822-023-00513-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/09/2023] [Accepted: 06/03/2023] [Indexed: 08/21/2023]
Abstract
We selected 145 reference organic molecules that include model fragments used in computer-aided drug design. We calculated 158 conformational energies and barriers using force fields that are widely available in commercial and free software and extensively applied to the conformational energies of organic molecules (e.g. the UFF and DREIDING force fields, Allinger's force fields MM3-96, MM3-00 and MM4-8, the MM2-91 clones MMX and MM+, the MMFF94 force field, and MM4), as well as ab initio Hartree-Fock (HF) theory with different basis sets, the standard density functional theory B3LYP, the second-order post-HF MP2 theory and the Domain-based Local Pair Natural Orbital Coupled Cluster DLPNO-CCSD(T) theory, with the latter used for accurate reference values. The data set of the organic molecules includes hydrocarbons, haloalkanes, conjugated compounds, and oxygen-, nitrogen-, phosphorus- and sulphur-containing compounds. We reviewed in detail the conformational aspects of these model organic molecules, providing the current understanding of the steric and electronic factors that determine the stability of low-energy conformers, together with the literature, including previous experimental observations and calculated findings. While progress in computer hardware allows the calculation of thousands of conformations for later use in drug design projects, this study updates previous classical studies that used, as reference values, experimental ones obtained with a variety of methods and in different environments. The lowest mean error against the DLPNO-CCSD(T) reference was calculated for MP2 (0.35 kcal mol⁻¹), followed by B3LYP (0.69 kcal mol⁻¹) and the HF theories (0.81-1.0 kcal mol⁻¹).
As regards the force fields, the lowest errors were observed for Allinger's force fields MM3-00 (1.28 kcal mol⁻¹) and MM3-96 (1.40 kcal mol⁻¹) and Halgren's MMFF94 force field (1.30 kcal mol⁻¹), and then for the MM2-91 clones MMX (1.77 kcal mol⁻¹) and MM+ (2.01 kcal mol⁻¹) and for MM4 (2.05 kcal mol⁻¹). The DREIDING (3.63 kcal mol⁻¹) and UFF (3.77 kcal mol⁻¹) force fields have the lowest performance. The model organic molecules we used are often present as fragments in drug-like molecules. The values calculated using DLPNO-CCSD(T) make up a valuable data set for further comparisons and for improved force field parameterization.
Affiliation(s)
- Ioannis Stylianakis
- Department of Medicinal Chemistry, Faculty of Pharmacy, National and Kapodistrian University of Athens, Panepistimioupolis Zografou, 15771, Athens, Greece
| | - Nikolaos Zervos
- Department of Medicinal Chemistry, Faculty of Pharmacy, National and Kapodistrian University of Athens, Panepistimioupolis Zografou, 15771, Athens, Greece
| | - Jenn-Huei Lii
- Department of Chemistry, National Changhua University of Education, Changhua City, Taiwan
| | - Dimitrios A Pantazis
- Max-Planck-Institut für Kohlenforschung, Kaiser-Wilhelm-Platz 1, 45470, Mülheim an der Ruhr, Germany
| | - Antonios Kolocouris
- Department of Medicinal Chemistry, Faculty of Pharmacy, National and Kapodistrian University of Athens, Panepistimioupolis Zografou, 15771, Athens, Greece.
- Laboratory of Medicinal Chemistry, Section of Pharmaceutical Chemistry, Department of Pharmacy, National and Kapodistrian University of Athens, Panepistimiopolis-Zografou, 15771, Athens, Greece.
|
19
|
Buterez D, Janet JP, Kiddle SJ, Oglic D, Liò P. Modelling local and general quantum mechanical properties with attention-based pooling. Commun Chem 2023; 6:262. [PMID: 38030692 PMCID: PMC10686994 DOI: 10.1038/s42004-023-01045-7] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 06/08/2023] [Accepted: 10/27/2023] [Indexed: 12/01/2023] Open
Abstract
Atom-centred neural networks represent the state-of-the-art for approximating the quantum chemical properties of molecules, such as internal energies. While the design of machine learning architectures that respect chemical principles has continued to advance, the final atom pooling operation that is necessary to convert from atomic to molecular representations in most models remains relatively undeveloped. The most common choices, sum and average pooling, compute molecular representations that are naturally a good fit for many physical properties, while satisfying properties such as permutation invariance which are desirable from a geometric deep learning perspective. However, there are growing concerns that such simplistic functions might have limited representational power, while also being suboptimal for physical properties that are highly localised or intensive. Based on recent advances in graph representation learning, we investigate the use of a learnable pooling function that leverages an attention mechanism to model interactions between atom representations. The proposed pooling operation is a drop-in replacement requiring no changes to any of the other architectural components. Using SchNet and DimeNet++ as starting models, we demonstrate consistent uplifts in performance compared to sum and mean pooling and a recent physics-aware pooling operation designed specifically for orbital energies, on several datasets, properties, and levels of theory, with up to 85% improvements depending on the specific task.
Affiliation(s)
- David Buterez
- Department of Computer Science and Technology, University of Cambridge, Cambridge, CB3 0FD, UK.
| | - Jon Paul Janet
- Molecular AI, Discovery Sciences, R&D, AstraZeneca, Gothenburg, 431 50, Sweden
| | - Steven J Kiddle
- Data Science & Advanced Analytics, Data Science & AI, R&D, AstraZeneca, Cambridge, CB2 8PA, UK
| | - Dino Oglic
- Center for AI, Data Science & AI, R&D, AstraZeneca, Cambridge, CB2 8PA, UK
| | - Pietro Liò
- Department of Computer Science and Technology, University of Cambridge, Cambridge, CB3 0FD, UK
|
20
|
Khabibrakhmanov A, Fedorov DV, Tkatchenko A. Universal Pairwise Interatomic van der Waals Potentials Based on Quantum Drude Oscillators. J Chem Theory Comput 2023; 19:7895-7907. [PMID: 37875419 PMCID: PMC10653113 DOI: 10.1021/acs.jctc.3c00797] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/21/2023] [Revised: 09/30/2023] [Accepted: 10/05/2023] [Indexed: 10/26/2023]
Abstract
Repulsive short-range and attractive long-range van der Waals (vdW) forces play an appreciable role in the behavior of extended molecular systems. When using empirical force fields, the most popular computational methods applied to such systems, vdW forces are typically described by Lennard-Jones-like potentials, which unfortunately have a limited predictive power. Here, we present a universal parameterization of a quantum-mechanical vdW potential, which requires only two free-atom properties: the static dipole polarizability α₁ and the dipole-dipole C₆ dispersion coefficient. This is achieved by deriving the functional form of the potential from the quantum Drude oscillator (QDO) model, employing scaling laws for the equilibrium distance and the binding energy, and applying the microscopic law of corresponding states. The vdW-QDO potential is shown to be accurate for vdW binding energy curves, as demonstrated by comparing to the ab initio binding curves of 21 noble-gas dimers. The functional form of the vdW-QDO potential has the correct asymptotic behavior at both zero and infinite distances. In addition, it is shown that the damped vdW-QDO potential can accurately describe vdW interactions in dimers consisting of group II elements. Finally, we demonstrate the applicability of the atom-in-molecule vdW-QDO model for predicting accurate dispersion energies for molecular systems. The present work makes an important step toward constructing universal vdW potentials, which could benefit (bio)molecular computational studies.
Affiliation(s)
- Almaz Khabibrakhmanov
- Department of Physics and Materials Science, University of Luxembourg, L-1511 Luxembourg City, Luxembourg
| | - Dmitry V. Fedorov
- Department of Physics and Materials Science, University of Luxembourg, L-1511 Luxembourg City, Luxembourg
| | - Alexandre Tkatchenko
- Department of Physics and Materials Science, University of Luxembourg, L-1511 Luxembourg City, Luxembourg
|
21
|
Zou Z, Zhang Y, Liang L, Wei M, Leng J, Jiang J, Luo Y, Hu W. A deep learning model for predicting selected organic molecular spectra. Nat Comput Sci 2023; 3:957-964. [PMID: 38177591 DOI: 10.1038/s43588-023-00550-y] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 05/27/2023] [Accepted: 10/06/2023] [Indexed: 01/06/2024]
Abstract
Accurate and efficient molecular spectra simulations are crucial for substance discovery and structure identification. However, the conventional approach of relying on quantum chemistry is cost-intensive, which hampers efficiency. Here we develop DetaNet, a deep-learning model combining E(3)-equivariance group and self-attention mechanism to predict molecular spectra with improved efficiency and accuracy. By passing high-order geometric tensorial messages, DetaNet is able to generate a wide variety of molecular properties, including scalars, vectors, and second- and third-order tensors, all at the accuracy of quantum chemistry calculations. Based on this, we developed generalized modules to predict four important types of molecular spectra, namely infrared, Raman, ultraviolet-visible, and ¹H and ¹³C nuclear magnetic resonance, taking the QM9S dataset containing 130,000 molecular species as an example. By speeding up the prediction of molecular spectra at quantum chemical accuracy, DetaNet could help progress toward real-time structural identification using spectroscopic measurements.
Affiliation(s)
- Zihan Zou
- School of Chemistry and Chemical Engineering, Qilu University of Technology (Shandong Academy of Science), Jinan, China
| | - Yujin Zhang
- School of Chemistry and Chemical Engineering, Qilu University of Technology (Shandong Academy of Science), Jinan, China.
| | - Lijun Liang
- College of Automation, Hangzhou Dianzi University, Hangzhou, China
| | - Mingzhi Wei
- School of Chemistry and Chemical Engineering, Qilu University of Technology (Shandong Academy of Science), Jinan, China
| | - Jiancai Leng
- School of Chemistry and Chemical Engineering, Qilu University of Technology (Shandong Academy of Science), Jinan, China
| | - Jun Jiang
- Key Laboratory of Precision and Intelligent Chemistry, School of Chemistry and Materials Science, University of Science and Technology of China, Hefei, China.
- Hefei National Laboratory, University of Science and Technology of China, Hefei, China.
| | - Yi Luo
- Hefei National Laboratory, University of Science and Technology of China, Hefei, China.
- Hefei National Research Center for Physical Sciences at the Microscale, University of Science and Technology of China, Hefei, China.
| | - Wei Hu
- School of Chemistry and Chemical Engineering, Qilu University of Technology (Shandong Academy of Science), Jinan, China.
|
22
|
Medrano Sandonas L, Hoja J, Ernst BG, Vázquez-Mayagoitia Á, DiStasio RA, Tkatchenko A. "Freedom of design" in chemical compound space: towards rational in silico design of molecules with targeted quantum-mechanical properties. Chem Sci 2023; 14:10702-10717. [PMID: 37829035 PMCID: PMC10566466 DOI: 10.1039/d3sc03598k] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/13/2023] [Accepted: 08/17/2023] [Indexed: 10/14/2023] Open
Abstract
The rational design of molecules with targeted quantum-mechanical (QM) properties requires an advanced understanding of the structure-property/property-property relationships (SPR/PPR) that exist across chemical compound space (CCS). In this work, we analyze these fundamental relationships in the sector of CCS spanned by small (primarily organic) molecules using the recently developed QM7-X dataset, a systematic, extensive, and tightly converged collection of 42 QM properties corresponding to ≈4.2M equilibrium and non-equilibrium molecular structures containing up to seven heavy/non-hydrogen atoms (including C, N, O, S, and Cl). By characterizing and enumerating progressively more complex manifolds of molecular property space (the corresponding high-dimensional space defined by the properties of each molecule in this sector of CCS), our analysis reveals that one has a substantial degree of flexibility or "freedom of design" when searching for a single molecule with a desired pair of properties or a set of distinct molecules sharing an array of properties. To explore how this intrinsic flexibility manifests in the molecular design process, we used multi-objective optimization to search for molecules with simultaneously large polarizabilities and HOMO-LUMO gaps; analysis of the resulting Pareto fronts identified non-trivial paths through CCS consisting of sequential structural and/or compositional changes that yield molecules with optimal combinations of these properties.
Affiliation(s)
- Leonardo Medrano Sandonas
- Department of Physics and Materials Science, University of Luxembourg L-1511 Luxembourg City Luxembourg
| | - Johannes Hoja
- Department of Physics and Materials Science, University of Luxembourg L-1511 Luxembourg City Luxembourg
- Institute of Chemistry, University of Graz 8010 Graz Austria
| | - Brian G Ernst
- Department of Chemistry and Chemical Biology, Cornell University Ithaca NY 14853 USA
| | | | - Robert A DiStasio
- Department of Chemistry and Chemical Biology, Cornell University Ithaca NY 14853 USA
| | - Alexandre Tkatchenko
- Department of Physics and Materials Science, University of Luxembourg L-1511 Luxembourg City Luxembourg
|
23
|
Góger S, Sandonas LM, Müller C, Tkatchenko A. Data-driven tailoring of molecular dipole polarizability and frontier orbital energies in chemical compound space. Phys Chem Chem Phys 2023; 25:22211-22222. [PMID: 37566426 PMCID: PMC10445328 DOI: 10.1039/d3cp02256k] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/17/2023] [Accepted: 07/27/2023] [Indexed: 08/12/2023]
Abstract
Understanding correlations - or lack thereof - between molecular properties is crucial for enabling fast and accurate molecular design strategies. In this contribution, we explore the relation between two key quantities describing the electronic structure and chemical properties of molecular systems: the energy gap between the frontier orbitals and the dipole polarizability. Based on the recently introduced QM7-X dataset, augmented with accurate molecular polarizability calculations as well as analysis of functional group compositions, we show that polarizability and HOMO-LUMO gap are uncorrelated when considering sufficiently extended subsets of the chemical compound space. The relation between these two properties is further analyzed on specific examples of molecules with similar composition as well as homooligomers. Remarkably, the freedom brought by the lack of correlation between molecular polarizability and HOMO-LUMO gap enables the design of novel materials, as we demonstrate on the example of organic photodetector candidates.
Affiliation(s)
- Szabolcs Góger
- Department of Physics and Materials Science, University of Luxembourg, L-1511 Luxembourg City, Luxembourg.
| | - Leonardo Medrano Sandonas
- Department of Physics and Materials Science, University of Luxembourg, L-1511 Luxembourg City, Luxembourg.
| | - Carolin Müller
- Department of Physics and Materials Science, University of Luxembourg, L-1511 Luxembourg City, Luxembourg.
| | - Alexandre Tkatchenko
- Department of Physics and Materials Science, University of Luxembourg, L-1511 Luxembourg City, Luxembourg.
|
24
|
Zhang P, Yang W. Toward a general neural network force field for protein simulations: Refining the intramolecular interaction in protein. J Chem Phys 2023; 159:024118. [PMID: 37431910 PMCID: PMC10481389 DOI: 10.1063/5.0142280] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 01/12/2023] [Accepted: 06/22/2023] [Indexed: 07/12/2023] Open
Abstract
Molecular dynamics (MD) is an extremely powerful, highly effective, and widely used approach to understanding the nature of chemical processes in atomic details for proteins. The accuracy of results from MD simulations is highly dependent on force fields. Currently, molecular mechanical (MM) force fields are mainly utilized in MD simulations because of their low computational cost. Quantum mechanical (QM) calculation has high accuracy, but it is exceedingly time consuming for protein simulations. Machine learning (ML) provides the capability for generating accurate potential at the QM level without increasing much computational effort for specific systems that can be studied at the QM level. However, the construction of general machine learned force fields, needed for broad applications and large and complex systems, is still challenging. Here, general and transferable neural network (NN) force fields based on CHARMM force fields, named CHARMM-NN, are constructed for proteins by training NN models on 27 fragments partitioned from the residue-based systematic molecular fragmentation (rSMF) method. The NN for each fragment is based on atom types and uses new input features that are similar to MM inputs, including bonds, angles, dihedrals, and non-bonded terms, which enhance the compatibility of CHARMM-NN to MM MD and enable the implementation of CHARMM-NN force fields in different MD programs. While the main part of the energy of the protein is based on rSMF and NN, the nonbonded interactions between the fragments and with water are taken from the CHARMM force field through mechanical embedding. The validations of the method for dipeptides on geometric data, relative potential energies, and structural reorganization energies demonstrate that the CHARMM-NN local minima on the potential energy surface are very accurate approximations to QM, showing the success of CHARMM-NN for bonded interactions. 
However, the MD simulations on peptides and proteins indicate that more accurate methods to represent protein-water interactions in fragments and non-bonded interactions between fragments should be considered in the future improvement of CHARMM-NN, which can increase the accuracy of approximation beyond the current mechanical embedding QM/MM level.
Affiliation(s)
- Pan Zhang
- Department of Chemistry, Duke University, Durham, North Carolina 27708, USA
| | - Weitao Yang
- Department of Chemistry, Duke University, Durham, North Carolina 27708, USA
|
25
|
Huang B, von Rudorff GF, von Lilienfeld OA. The central role of density functional theory in the AI age. Science 2023; 381:170-175. [PMID: 37440654 DOI: 10.1126/science.abn3445] [Citation(s) in RCA: 13] [Impact Index Per Article: 13.0] [Received: 04/27/2023] [Accepted: 05/30/2023] [Indexed: 07/15/2023]
Abstract
Density functional theory (DFT) plays a pivotal role in chemical and materials science because of its relatively high predictive power, applicability, versatility, and computational efficiency. We review recent progress in machine learning (ML) model developments, which have relied heavily on DFT for synthetic data generation and for the design of model architectures. The general relevance of these developments is placed in a broader context for chemical and materials sciences. DFT-based ML models have reached high efficiency, accuracy, scalability, and transferability and pave the way to the routine use of successful experimental planning software within self-driving laboratories.
Affiliation(s)
- Bing Huang
- University of Vienna, Faculty of Physics, AT1090 Wien, Austria
| | - Guido Falk von Rudorff
- University Kassel, Department of Chemistry, 34132 Kassel, Germany
- Center for Interdisciplinary Nanostructure Science and Technology (CINSaT), 34132 Kassel, Germany
| | - O Anatole von Lilienfeld
- Vector Institute for Artificial Intelligence, Toronto, Ontario M5S 1M1, Canada
- Department of Chemistry, University of Toronto, St. George Campus, Toronto, Ontario M5S 3H6, Canada
- Department of Materials Science and Engineering, University of Toronto, St. George Campus, Toronto, Ontario M5S 3E4, Canada
- Department of Physics, University of Toronto, St. George Campus, Toronto, Ontario M5S 1A7, Canada
- Machine Learning Group, Technische Universität Berlin and Berlin Institute for the Foundations of Learning and Data, 10587 Berlin, Germany
| |
Collapse
|
26
|
Chigaev M, Smith JS, Anaya S, Nebgen B, Bettencourt M, Barros K, Lubbers N. Lightweight and effective tensor sensitivity for atomistic neural networks. J Chem Phys 2023; 158:2889493. [PMID: 37158328 DOI: 10.1063/5.0142127] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 01/11/2023] [Accepted: 04/20/2023] [Indexed: 05/10/2023] Open
Abstract
Atomistic machine learning focuses on the creation of models that obey fundamental symmetries of atomistic configurations, such as permutation, translation, and rotation invariances. In many of these schemes, translation and rotation invariance are achieved by building on scalar invariants, e.g., distances between atom pairs. There is growing interest in molecular representations that work internally with higher rank rotational tensors, e.g., vector displacements between atoms, and tensor products thereof. Here, we present a framework for extending the Hierarchically Interacting Particle Neural Network (HIP-NN) with Tensor Sensitivity information (HIP-NN-TS) from each local atomic environment. Crucially, the method employs a weight tying strategy that allows direct incorporation of many-body information while adding very few model parameters. We show that HIP-NN-TS is more accurate than HIP-NN, with negligible increase in parameter count, for several datasets and network sizes. As the dataset becomes more complex, tensor sensitivities provide greater improvements to model accuracy. In particular, HIP-NN-TS achieves a record mean absolute error of 0.927 kcal/mol for conformational energy variation on the challenging COMP6 benchmark, which includes a broad set of organic molecules. We also compare the computational performance of HIP-NN-TS to HIP-NN and other models in the literature.
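The symmetry argument in this abstract can be illustrated with a small, generic sketch. It is not the HIP-NN or HIP-NN-TS method; it only shows that a sorted list of interatomic distances stays the same when a geometry is rotated, translated, and its atoms are reordered. The three-atom geometry and all function names are hypothetical.

```python
import math

def pairwise_distances(coords):
    """Sorted interatomic distances: a simple scalar-invariant descriptor."""
    n = len(coords)
    return sorted(
        math.dist(coords[i], coords[j])
        for i in range(n)
        for j in range(i + 1, n)
    )

def rotate_z(point, angle):
    """Rotate a 3D point about the z axis by `angle` radians."""
    x, y, z = point
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y, z)

def translate(point, shift):
    """Shift a 3D point by the vector `shift`."""
    return tuple(p + d for p, d in zip(point, shift))

# Hypothetical three-atom geometry in arbitrary units (not from any dataset).
atoms = [(0.0, 0.0, 0.0), (0.96, 0.0, 0.0), (-0.24, 0.93, 0.0)]

# Rotate, translate, then reverse the atom order (a permutation).
moved = [translate(rotate_z(p, 0.7), (1.5, -2.0, 0.3)) for p in atoms][::-1]

original = pairwise_distances(atoms)
transformed = pairwise_distances(moved)
invariant = all(
    math.isclose(a, b, abs_tol=1e-12) for a, b in zip(original, transformed)
)
print(invariant)
```

Sorting is what makes this toy descriptor insensitive to atom ordering. The displacement vectors themselves do change under rotation, which is why models that use them internally need additional care to keep their outputs invariant.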
Affiliation(s)
- Michael Chigaev
  - Theoretical Division, Los Alamos National Laboratory, Los Alamos, New Mexico 87545, USA
  - Center for Nonlinear Studies, Los Alamos National Laboratory, Los Alamos, New Mexico 87545, USA
- Justin S Smith
  - Theoretical Division, Los Alamos National Laboratory, Los Alamos, New Mexico 87545, USA
  - Center for Nonlinear Studies, Los Alamos National Laboratory, Los Alamos, New Mexico 87545, USA
  - NVIDIA, 2788 San Tomas Expy, Santa Clara, California 95051, USA
- Steven Anaya
  - High Performance Computing Division, Los Alamos National Laboratory, Los Alamos, New Mexico 87545, USA
  - Computer, Computational, and Statistical Sciences Division, Los Alamos National Laboratory, Los Alamos, New Mexico 87545, USA
- Benjamin Nebgen
  - Theoretical Division, Los Alamos National Laboratory, Los Alamos, New Mexico 87545, USA
- Kipton Barros
  - Theoretical Division, Los Alamos National Laboratory, Los Alamos, New Mexico 87545, USA
  - Center for Nonlinear Studies, Los Alamos National Laboratory, Los Alamos, New Mexico 87545, USA
- Nicholas Lubbers
  - Computer, Computational, and Statistical Sciences Division, Los Alamos National Laboratory, Los Alamos, New Mexico 87545, USA
27
|
Pinheiro M, Zhang S, Dral PO, Barbatti M. WS22 database, Wigner Sampling and geometry interpolation for configurationally diverse molecular datasets. Sci Data 2023; 10:95. [PMID: 36792601 PMCID: PMC9931705 DOI: 10.1038/s41597-023-01998-3] [Citation(s) in RCA: 7] [Impact Index Per Article: 7.0] [Received: 08/31/2022] [Accepted: 02/01/2023] [Indexed: 02/17/2023] Open
Abstract
Multidimensional surfaces of quantum chemical properties, such as potential energies and dipole moments, are common targets for machine learning, requiring the development of robust and diverse databases extensively exploring molecular configurational spaces. Here we composed the WS22 database covering several quantum mechanical (QM) properties (including potential energies, forces, dipole moments, polarizabilities, HOMO, and LUMO energies) for ten flexible organic molecules of increasing complexity and with up to 22 atoms. This database consists of 1.18 million equilibrium and non-equilibrium geometries carefully sampled from Wigner distributions centered at different equilibrium conformations (either at the ground or excited electronic states) and further augmented with interpolated structures. The diversity of our datasets is demonstrated by visualizing the geometries distribution with dimensionality reduction as well as via comparison of statistical features of the QM properties with those available in existing datasets. Our sampling targets broader quantum mechanical distribution of the configurational space than provided by commonly used sampling through classical molecular dynamics, upping the challenge for machine learning models.
Affiliation(s)
- Max Pinheiro
  - Aix Marseille University, CNRS, ICR, Marseille, France.
- Shuang Zhang
  - State Key Laboratory of Physical Chemistry of Solid Surfaces, Fujian Provincial Key Laboratory of Theoretical and Computational Chemistry, Department of Chemistry, and College of Chemistry and Chemical Engineering, Xiamen University, Xiamen, China
- Pavlo O Dral
  - State Key Laboratory of Physical Chemistry of Solid Surfaces, Fujian Provincial Key Laboratory of Theoretical and Computational Chemistry, Department of Chemistry, and College of Chemistry and Chemical Engineering, Xiamen University, Xiamen, China
- Mario Barbatti
  - Aix Marseille University, CNRS, ICR, Marseille, France.
  - Institut Universitaire de France, 75231, Paris, France.
28
|
Käser S, Vazquez-Salazar LI, Meuwly M, Töpfer K. Neural network potentials for chemistry: concepts, applications and prospects. Digital Discovery 2023; 2:28-58. [PMID: 36798879 PMCID: PMC9923808 DOI: 10.1039/d2dd00102k] [Citation(s) in RCA: 17] [Impact Index Per Article: 17.0] [Received: 09/23/2022] [Accepted: 12/20/2022] [Indexed: 12/24/2022]
Abstract
Artificial Neural Networks (NN) are already heavily involved in methods and applications for frequent tasks in the field of computational chemistry such as representation of potential energy surfaces (PES) and spectroscopic predictions. This perspective provides an overview of the foundations of neural network-based full-dimensional potential energy surfaces, their architectures, underlying concepts, their representation and applications to chemical systems. Methods for data generation and training procedures for PES construction are discussed and means for error assessment and refinement through transfer learning are presented. A selection of recent results illustrates the latest improvements regarding accuracy of PES representations and system size limitations in dynamics simulations, but also NN application enabling direct prediction of physical results without dynamics simulations. The aim is to provide an overview for the current state-of-the-art NN approaches in computational chemistry and also to point out the current challenges in enhancing reliability and applicability of NN methods on a larger scale.
Affiliation(s)
- Silvan Käser
  - Department of Chemistry, University of Basel Klingelbergstrasse 80 CH-4056 Basel Switzerland
- Markus Meuwly
  - Department of Chemistry, University of Basel Klingelbergstrasse 80 CH-4056 Basel Switzerland
- Kai Töpfer
  - Department of Chemistry, University of Basel Klingelbergstrasse 80 CH-4056 Basel Switzerland
29
|
Kříž K, Schmidt L, Andersson AT, Walz MM, van der Spoel D. An Imbalance in the Force: The Need for Standardized Benchmarks for Molecular Simulation. J Chem Inf Model 2023; 63:412-431. [PMID: 36630710 PMCID: PMC9875315 DOI: 10.1021/acs.jcim.2c01127] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 09/06/2022] [Indexed: 01/12/2023]
Abstract
Force fields (FFs) for molecular simulation have been under development for more than half a century. As with any predictive model, rigorous testing and comparisons of models critically depends on the availability of standardized data sets and benchmarks. While such benchmarks are rather common in the fields of quantum chemistry, this is not the case for empirical FFs. That is, few benchmarks are reused to evaluate FFs, and development teams rather use their own training and test sets. Here we present an overview of currently available tests and benchmarks for computational chemistry, focusing on organic compounds, including halogens and common ions, as FFs for these are the most common ones. We argue that many of the benchmark data sets from quantum chemistry can in fact be reused for evaluating FFs, but new gas phase data is still needed for compounds containing phosphorus and sulfur in different valence states. In addition, more nonequilibrium interaction energies and forces, as well as molecular properties such as electrostatic potentials around compounds, would be beneficial. For the condensed phases there is a large body of experimental data available, and tools to utilize these data in an automated fashion are under development. If FF developers, as well as researchers in artificial intelligence, would adopt a number of these data sets, it would become easier to compare the relative strengths and weaknesses of different models and to, eventually, restore the balance in the force.
Affiliation(s)
- Kristian Kříž
  - Department of Cell and Molecular Biology, Uppsala University, Box 596, SE-75124 Uppsala, Sweden
- Lisa Schmidt
  - Faculty of Biosciences, University of Heidelberg, Heidelberg 69117, Germany
- Alfred T. Andersson
  - Department of Cell and Molecular Biology, Uppsala University, Box 596, SE-75124 Uppsala, Sweden
- Marie-Madeleine Walz
  - Department of Cell and Molecular Biology, Uppsala University, Box 596, SE-75124 Uppsala, Sweden
- David van der Spoel
  - Department of Cell and Molecular Biology, Uppsala University, Box 596, SE-75124 Uppsala, Sweden
30
|
Chmiela S, Vassilev-Galindo V, Unke OT, Kabylda A, Sauceda HE, Tkatchenko A, Müller KR. Accurate global machine learning force fields for molecules with hundreds of atoms. Science Advances 2023; 9:eadf0873. [PMID: 36630510 PMCID: PMC9833674 DOI: 10.1126/sciadv.adf0873] [Citation(s) in RCA: 40] [Impact Index Per Article: 40.0] [Received: 09/29/2022] [Accepted: 11/28/2022] [Indexed: 05/25/2023]
Abstract
Global machine learning force fields, with the capacity to capture collective interactions in molecular systems, now scale up to a few dozen atoms due to considerable growth of model complexity with system size. For larger molecules, locality assumptions are introduced, with the consequence that nonlocal interactions are not described. Here, we develop an exact iterative approach to train global symmetric gradient domain machine learning (sGDML) force fields (FFs) for several hundred atoms, without resorting to any potentially uncontrolled approximations. All atomic degrees of freedom remain correlated in the global sGDML FF, allowing the accurate description of complex molecules and materials that present phenomena with far-reaching characteristic correlation lengths. We assess the accuracy and efficiency of sGDML on a newly developed MD22 benchmark dataset containing molecules from 42 to 370 atoms. The robustness of our approach is demonstrated in nanosecond path-integral molecular dynamics simulations for supramolecular complexes in the MD22 dataset.
Affiliation(s)
- Stefan Chmiela
  - Machine Learning Group, Technische Universität Berlin, 10587 Berlin, Germany
  - Berlin Institute for the Foundations of Learning and Data – BIFOLD, Germany
- Valentin Vassilev-Galindo
  - Department of Physics and Materials Science, University of Luxembourg, L-1511 Luxembourg City, Luxembourg
- Oliver T. Unke
  - Machine Learning Group, Technische Universität Berlin, 10587 Berlin, Germany
  - Google Research, Brain Team, Berlin, Germany
- Adil Kabylda
  - Department of Physics and Materials Science, University of Luxembourg, L-1511 Luxembourg City, Luxembourg
- Huziel E. Sauceda
  - Machine Learning Group, Technische Universität Berlin, 10587 Berlin, Germany
  - Berlin Institute for the Foundations of Learning and Data – BIFOLD, Germany
  - Departamento de Materia Condensada, Instituto de Física, Universidad Nacional Autónoma de México, Cd. de México C.P. 04510, Mexico
  - BASLEARN - TU Berlin/BASF Joint Lab for Machine Learning, Technische Universität Berlin, 10587 Berlin, Germany
- Alexandre Tkatchenko
  - Department of Physics and Materials Science, University of Luxembourg, L-1511 Luxembourg City, Luxembourg
- Klaus-Robert Müller
  - Machine Learning Group, Technische Universität Berlin, 10587 Berlin, Germany
  - Berlin Institute for the Foundations of Learning and Data – BIFOLD, Germany
  - Google Research, Brain Team, Berlin, Germany
  - Max Planck Institute for Informatics, Stuhlsatzenhausweg, 66123 Saarbrücken, Germany
  - Department of Artificial Intelligence, Korea University, Anam-dong, Seongbuk-gu, Seoul 02841, Korea
31
|
Eastman P, Behara PK, Dotson DL, Galvelis R, Herr JE, Horton JT, Mao Y, Chodera JD, Pritchard BP, Wang Y, De Fabritiis G, Markland TE. SPICE, A Dataset of Drug-like Molecules and Peptides for Training Machine Learning Potentials. Sci Data 2023; 10:11. [PMID: 36599873 PMCID: PMC9813265 DOI: 10.1038/s41597-022-01882-6] [Citation(s) in RCA: 20] [Impact Index Per Article: 20.0] [Received: 10/05/2022] [Accepted: 12/01/2022] [Indexed: 01/05/2023] Open
Abstract
Machine learning potentials are an important tool for molecular simulation, but their development is held back by a shortage of high quality datasets to train them on. We describe the SPICE dataset, a new quantum chemistry dataset for training potentials relevant to simulating drug-like small molecules interacting with proteins. It contains over 1.1 million conformations for a diverse set of small molecules, dimers, dipeptides, and solvated amino acids. It includes 15 elements, charged and uncharged molecules, and a wide range of covalent and non-covalent interactions. It provides both forces and energies calculated at the ωB97M-D3(BJ)/def2-TZVPPD level of theory, along with other useful quantities such as multipole moments and bond orders. We train a set of machine learning potentials on it and demonstrate that they can achieve chemical accuracy across a broad region of chemical space. It can serve as a valuable resource for the creation of transferable, ready to use potential functions for use in molecular simulations.
Affiliation(s)
- Peter Eastman
  - Department of Chemistry, Stanford University, Stanford, CA, 94305, USA.
- Pavan Kumar Behara
  - Department of Pharmaceutical Sciences, University of California, Irvine, CA, 92697, USA
- David L Dotson
  - The Open Force Field Initiative, Open Molecular Software Foundation, Davis, CA, 95616, USA
- John E Herr
  - Department of Chemistry and Biochemistry, University of Notre Dame, Notre Dame, IN, 46556, USA
- Josh T Horton
  - School of Natural and Environmental Sciences, Newcastle University, Newcastle upon Tyne, NE1 7RU, United Kingdom
- Yuezhi Mao
  - Department of Chemistry, Stanford University, Stanford, CA, 94305, USA
- John D Chodera
  - Computational and Systems Biology Program, Sloan Kettering Institute, Memorial Sloan Kettering Cancer Center, New York, NY, 10065, USA
- Benjamin P Pritchard
  - Molecular Sciences Software Institute, Virginia Polytechnic Institute and State University, Blacksburg, VA, 24060, USA
- Yuanqing Wang
  - Computational and Systems Biology Program, Sloan Kettering Institute, Memorial Sloan Kettering Cancer Center, New York, NY, 10065, USA
  - Graduate Program in Physiology, Biophysics, and Systems Biology, Weill Cornell Graduate School of Medical Sciences, New York, NY, 10065, USA
- Gianni De Fabritiis
  - Acellera Labs, Doctor Trueta 183, 08005, Barcelona, Spain
  - Computational Science Laboratory, Universitat Pompeu Fabra, Barcelona Biomedical Research Park (PRBB), Carrer Dr. Aiguader 88, 08003, Barcelona, Spain and ICREA, Passeig Lluis Companys 23, 08010, Barcelona, Spain
- Thomas E Markland
  - Department of Chemistry, Stanford University, Stanford, CA, 94305, USA
32
|
Heinen S, von Rudorff GF, von Lilienfeld OA. Transition state search and geometry relaxation throughout chemical compound space with quantum machine learning. J Chem Phys 2022; 157:221102. [PMID: 36546806 DOI: 10.1063/5.0112856] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Indexed: 12/24/2022] Open
Abstract
We use energies and forces predicted within response operator based quantum machine learning (OQML) to perform geometry optimization and transition state search calculations with legacy optimizers but without the need for subsequent re-optimization with quantum chemistry methods. For randomly sampled initial coordinates of small organic query molecules, we report systematic improvement of equilibrium and transition state geometry output as training set sizes increase. Out-of-sample SN2 reactant complexes and transition state geometries have been predicted using the LBFGS and the QST2 algorithms with a root-mean-square deviation (RMSD) of 0.16 and 0.4 Å after training on up to 200 reactant complex relaxations and transition state search trajectories from the QMrxn20 dataset, respectively. For geometry optimizations, we have also considered relaxation paths up to 5595 constitutional isomers with sum formula C7H10O2 from the QM9-database. Using the resulting OQML models with an LBFGS optimizer reproduces the minimum geometry with an RMSD of 0.14 Å, only using ∼6000 training points obtained from normal mode sampling along the optimization paths of the training compounds without the need for active learning. For converged equilibrium and transition state geometries, subsequent vibrational normal mode frequency analysis indicates deviation from MP2 reference results by on average 14 and 26 cm⁻¹, respectively. While the numerical cost for OQML predictions is negligible in comparison to density functional theory or MP2, the number of steps until convergence is typically larger in either case. The success rate for reaching convergence, however, improves systematically with training set size, underscoring OQML's potential for universal applicability.
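For readers unfamiliar with the RMSD figures quoted in this abstract, the following is a minimal sketch of a root-mean-square deviation between two geometries with matching atom order. It assumes the structures are already superimposed; whether and how the cited work aligns structures before computing RMSD is not stated in the abstract. The coordinates are made up for illustration.

```python
import math

def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally ordered geometries.

    Assumes both structures are already superimposed (no alignment is done).
    """
    if len(coords_a) != len(coords_b):
        raise ValueError("geometries must contain the same number of atoms")
    squared = sum(math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))

# Hypothetical reference and predicted geometries (in Å), for illustration only.
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
predicted = [(0.1, 0.0, 0.0), (1.0, 0.1, 0.0), (0.0, 1.0, 0.1)]

value = rmsd(reference, predicted)
print(f"RMSD = {value:.3f} Å")
```

Each atom in this toy example is displaced by 0.1 Å, so the RMSD is 0.1 Å.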
Affiliation(s)
- Stefan Heinen
  - University of Vienna, Faculty of Physics, Kolingasse 14-16, AT-1090 Wien, Austria
33
|
Zhang L, Zhang S, Owens A, Yurchenko SN, Dral PO. VIB5 database with accurate ab initio quantum chemical molecular potential energy surfaces. Sci Data 2022; 9:84. [PMID: 35277513 PMCID: PMC8917215 DOI: 10.1038/s41597-022-01185-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/02/2021] [Accepted: 01/19/2022] [Indexed: 11/09/2022] Open
Abstract
High-level ab initio quantum chemical (QC) molecular potential energy surfaces (PESs) are crucial for accurately simulating molecular rotation-vibration spectra. Machine learning (ML) can help alleviate the cost of constructing such PESs, but requires access to the original ab initio PES data, namely potential energies computed on high-density grids of nuclear geometries. In this work, we present a new structured PES database called VIB5, which contains high-quality ab initio data on 5 small polyatomic molecules of astrophysical significance (CH3Cl, CH4, SiH4, CH3F, and NaOH). The VIB5 database is based on previously used PESs, which, however, are either publicly unavailable or lacking key information to make them suitable for ML applications. The VIB5 database provides tens of thousands of grid points for each molecule with theoretical best estimates of potential energies along with their constituent energy correction terms and a data-extraction script. In addition, new complementary QC calculations of energies and energy gradients have been performed to provide a consistent database, which, e.g., can be used for gradient-based ML methods.
Measurement(s): potential energy surfaces. Technology Type(s): quantum chemistry computational methods.
Affiliation(s)
- Lina Zhang
  - State Key Laboratory of Physical Chemistry of Solid Surfaces, Fujian Provincial Key Laboratory of Theoretical and Computational Chemistry, Department of Chemistry, and College of Chemistry and Chemical Engineering, Xiamen University, Xiamen, 361005, China
- Shuang Zhang
  - State Key Laboratory of Physical Chemistry of Solid Surfaces, Fujian Provincial Key Laboratory of Theoretical and Computational Chemistry, Department of Chemistry, and College of Chemistry and Chemical Engineering, Xiamen University, Xiamen, 361005, China
- Alec Owens
  - Department of Physics and Astronomy, University College London, Gower Street, WC1E 6BT, London, United Kingdom.
- Sergei N Yurchenko
  - Department of Physics and Astronomy, University College London, Gower Street, WC1E 6BT, London, United Kingdom
- Pavlo O Dral
  - State Key Laboratory of Physical Chemistry of Solid Surfaces, Fujian Provincial Key Laboratory of Theoretical and Computational Chemistry, Department of Chemistry, and College of Chemistry and Chemical Engineering, Xiamen University, Xiamen, 361005, China.
34
|
Oliveira AF, Da Silva JLF, Quiles MG. Molecular Property Prediction and Molecular Design Using a Supervised Grammar Variational Autoencoder. J Chem Inf Model 2022; 62:817-828. [PMID: 35174705 DOI: 10.1021/acs.jcim.1c01573] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/29/2022]
Abstract
Some of the most common applications of machine learning (ML) algorithms dealing with small molecules usually fall within two distinct domains, namely, the prediction of molecular properties and the design of novel molecules with some desirable property. Here we unite these applications under a single molecular representation and ML algorithm by modifying the grammar variational autoencoder (GVAE) model with the incorporation of property information into its training procedure, thus creating a supervised GVAE (SGVAE). Results indicate that the biased latent space generated by this approach can successfully be used to predict the molecular properties of the input molecules, produce novel and unique molecules with some desired property and also estimate the properties of randomly sampled molecules. We illustrate these possibilities by sampling novel molecules from the latent space with specific values of the lowest unoccupied molecular orbital (LUMO) energy after training the model using the QM9 data set. Furthermore, the trained model is also used to predict the properties of a hold-out set and the resulting mean absolute error (MAE) shows values close to chemical accuracy for the dipole moment and atomization energies, even outperforming ML models designed exclusively to predict molecular properties using the SMILES as molecular representation. Therefore, these results show that the proposed approach is a viable way to provide generative ML models with molecular property information in a way that the generation of novel molecules is likely to achieve better results, with the benefit that these new molecules can also have their molecular properties accurately predicted.
Affiliation(s)
- André F Oliveira
  - Associate Laboratory for Computing and Applied Mathematics, National Institute for Space Research, P.O. Box 515, 12227-010, São José dos Campos, SP, Brazil
- Juarez L F Da Silva
  - São Carlos Institute of Chemistry, University of São Paulo, P.O. Box 780, 13560-970, São Carlos, SP, Brazil
- Marcos G Quiles
  - Institute of Science and Technology, Federal University of São Paulo, 12247-014, São José dos Campos, SP, Brazil
35
|
Han Y, Li M, Zhao X. Effects of orbital angles on the modeling of conjugated systems with curvature. Phys Chem Chem Phys 2022; 24:27467-27473. [DOI: 10.1039/d2cp03549a] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/09/2022]
Abstract
Models with angle corrections give good predictions for both neutral and charged fullerenes. The integrals of nonparallel orbitals explain why angle features of designed and deep-learning models are necessary to describe conjugated systems.
Affiliation(s)
- Yanbo Han
  - Institute of Molecular Science and Applied Chemistry, School of Chemistry, Xi’an Jiaotong University, Xi’an 710049, China
- Mengyang Li
  - School of Physics, Xidian University, Xi’an 710071, China
- Xiang Zhao
  - Institute of Molecular Science and Applied Chemistry, School of Chemistry, Xi’an Jiaotong University, Xi’an 710049, China
36
|
Unke OT, Chmiela S, Gastegger M, Schütt KT, Sauceda HE, Müller KR. SpookyNet: Learning force fields with electronic degrees of freedom and nonlocal effects. Nat Commun 2021; 12:7273. [PMID: 34907176 PMCID: PMC8671403 DOI: 10.1038/s41467-021-27504-0] [Citation(s) in RCA: 87] [Impact Index Per Article: 29.0] [Received: 07/12/2021] [Accepted: 11/16/2021] [Indexed: 01/12/2023] Open
Abstract
Machine-learned force fields combine the accuracy of ab initio methods with the efficiency of conventional force fields. However, current machine-learned force fields typically ignore electronic degrees of freedom, such as the total charge or spin state, and assume chemical locality, which is problematic when molecules have inconsistent electronic states, or when nonlocal effects play a significant role. This work introduces SpookyNet, a deep neural network for constructing machine-learned force fields with explicit treatment of electronic degrees of freedom and nonlocality, modeled via self-attention in a transformer architecture. Chemically meaningful inductive biases and analytical corrections built into the network architecture allow it to properly model physical limits. SpookyNet improves upon the current state-of-the-art (or achieves similar performance) on popular quantum chemistry data sets. Notably, it is able to generalize across chemical and conformational space and can leverage the learned chemical insights, e.g. by predicting unknown spin states, thus helping to close a further important remaining gap for today's machine learning models in quantum chemistry.
Affiliation(s)
- Oliver T Unke
  - Machine Learning Group, Technische Universität Berlin, 10587, Berlin, Germany.
  - DFG Cluster of Excellence "Unifying Systems in Catalysis" (UniSysCat), Technische Universität Berlin, 10623, Berlin, Germany.
- Stefan Chmiela
  - Machine Learning Group, Technische Universität Berlin, 10587, Berlin, Germany
- Michael Gastegger
  - Machine Learning Group, Technische Universität Berlin, 10587, Berlin, Germany
  - DFG Cluster of Excellence "Unifying Systems in Catalysis" (UniSysCat), Technische Universität Berlin, 10623, Berlin, Germany
- Kristof T Schütt
  - Machine Learning Group, Technische Universität Berlin, 10587, Berlin, Germany
- Huziel E Sauceda
  - Machine Learning Group, Technische Universität Berlin, 10587, Berlin, Germany
  - BASLEARN, BASF-TU joint Lab, Technische Universität Berlin, 10587, Berlin, Germany
- Klaus-Robert Müller
  - Machine Learning Group, Technische Universität Berlin, 10587, Berlin, Germany.
  - Department of Artificial Intelligence, Korea University, Anam-dong, Seongbuk-gu, Seoul, 02841, Korea.
  - Max Planck Institute for Informatics, Stuhlsatzenhausweg, 66123, Saarbrücken, Germany.
  - BIFOLD-Berlin Institute for the Foundations of Learning and Data, Berlin, Germany.
  - Google Research, Brain team, Berlin, Germany.
37
|
Sparrow ZM, Ernst BG, Joo PT, Lao KU, DiStasio RA. NENCI-2021. I. A large benchmark database of non-equilibrium non-covalent interactions emphasizing close intermolecular contacts. J Chem Phys 2021; 155:184303. [PMID: 34773949 DOI: 10.1063/5.0068862] [Citation(s) in RCA: 15] [Impact Index Per Article: 5.0] [Indexed: 11/14/2022] Open
Abstract
In this work, we present NENCI-2021, a benchmark database of ∼8000 Non-Equilibrium Non-Covalent Interaction energies for a large and diverse selection of intermolecular complexes of biological and chemical relevance. To meet the growing demand for large and high-quality quantum mechanical data in the chemical sciences, NENCI-2021 starts with the 101 molecular dimers in the widely used S66 and S101 databases and extends the scope of these works by (i) including 40 cation-π and anion-π complexes, a fundamentally important class of non-covalent interactions that are found throughout nature and pose a substantial challenge to theory, and (ii) systematically sampling all 141 intermolecular potential energy surfaces (PESs) by simultaneously varying the intermolecular distance and intermolecular angle in each dimer. Designed with an emphasis on close contacts, the complexes in NENCI-2021 were generated by sampling seven intermolecular distances along each PES (ranging from 0.7× to 1.1× the equilibrium separation) and nine intermolecular angles per distance (five for each ion-π complex), yielding an extensive database of 7763 benchmark intermolecular interaction energies (Eint) obtained at the coupled-cluster with singles, doubles, and perturbative triples/complete basis set [CCSD(T)/CBS] level of theory. The Eint values in NENCI-2021 span a total of 225.3 kcal/mol, ranging from -38.5 to +186.8 kcal/mol, with a mean (median) Eint value of -1.06 kcal/mol (-2.39 kcal/mol). In addition, a wide range of intermolecular atom-pair distances are also present in NENCI-2021, where close intermolecular contacts involving atoms that are located within the so-called van der Waals envelope are prevalent; these interactions, in particular, pose an enormous challenge for molecular modeling and are observed in many important chemical and biological systems.
A detailed symmetry-adapted perturbation theory (SAPT)-based energy decomposition analysis also confirms the diverse and comprehensive nature of the intermolecular binding motifs present in NENCI-2021, which now includes a significant number of primarily induction-bound dimers (e.g., cation-π complexes). NENCI-2021 thus spans all regions of the SAPT ternary diagram, thereby warranting a new four-category classification scheme that includes complexes primarily bound by electrostatics (3499), induction (700), dispersion (1372), or mixtures thereof (2192). A critical error analysis performed on a representative set of intermolecular complexes in NENCI-2021 demonstrates that the Eint values provided herein have an average error of ±0.1 kcal/mol, even for complexes with strongly repulsive Eint values, and maximum errors of ±0.2-0.3 kcal/mol (i.e., ∼±1.0 kJ/mol) for the most challenging cases. For these reasons, we expect that NENCI-2021 will play an important role in the testing, training, and development of next-generation classical and polarizable force fields, density functional theory approximations, wavefunction theory methods, and machine learning based intra- and inter-molecular potentials.
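The counts quoted in this abstract are mutually consistent, which a few lines of arithmetic confirm. The sketch below only recombines numbers stated above; the variable names are illustrative.

```python
# Counts quoted in the NENCI-2021 abstract.
neutral_dimers = 101       # dimers from S66 and S101
ion_pi_complexes = 40      # cation-pi and anion-pi complexes
distances_per_pes = 7      # intermolecular distances sampled per PES
angles_neutral = 9         # angles per distance for the 101 dimers
angles_ion_pi = 5          # angles per distance for each ion-pi complex

total_pes = neutral_dimers + ion_pi_complexes
total_energies = (
    neutral_dimers * distances_per_pes * angles_neutral
    + ion_pi_complexes * distances_per_pes * angles_ion_pi
)

# Four-category SAPT classification quoted in the abstract.
sapt_categories = {
    "electrostatics": 3499,
    "induction": 700,
    "dispersion": 1372,
    "mixed": 2192,
}
category_total = sum(sapt_categories.values())

# Range of interaction energies in kcal/mol.
e_min, e_max = -38.5, 186.8
span = round(e_max - e_min, 1)

print(total_pes, total_energies, category_total, span)
```

The sampling grid gives 101 × 7 × 9 + 40 × 7 × 5 = 6363 + 1400 = 7763 energies, the four SAPT categories sum to the same 7763, and the energy range spans 225.3 kcal/mol, all matching the figures in the abstract.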
Affiliation(s)
- Zachary M Sparrow
- Department of Chemistry and Chemical Biology, Cornell University, Ithaca, New York 14853, USA
| | - Brian G Ernst
- Department of Chemistry and Chemical Biology, Cornell University, Ithaca, New York 14853, USA
| | - Paul T Joo
- Department of Chemistry and Chemical Biology, Cornell University, Ithaca, New York 14853, USA
| | - Ka Un Lao
- Department of Chemistry and Chemical Biology, Cornell University, Ithaca, New York 14853, USA
| | - Robert A DiStasio
- Department of Chemistry and Chemical Biology, Cornell University, Ithaca, New York 14853, USA
| |
|
38
|
Omar ÖH, Del Cueto M, Nematiaram T, Troisi A. High-throughput virtual screening for organic electronics: a comparative study of alternative strategies. J Mater Chem C 2021; 9:13557-13583. [PMID: 34745630 PMCID: PMC8515942 DOI: 10.1039/d1tc03256a] [Citation(s) in RCA: 16] [Impact Index Per Article: 5.3] [Received: 07/12/2021] [Accepted: 09/13/2021] [Indexed: 06/01/2023]
Abstract
We present a review of the field of high-throughput virtual screening for organic electronics materials focusing on the sequence of methodological choices that determine each virtual screening protocol. These choices are present in all high-throughput virtual screenings and addressing them systematically will lead to optimised workflows and improve their applicability. We consider the range of properties that can be computed and illustrate how their accuracy can be determined depending on the quality and size of the experimental datasets. The approaches to generate candidates for virtual screening are also extremely varied and their relative strengths and weaknesses are discussed. The analysis of high-throughput virtual screening is almost never limited to the identification of top candidates and often new patterns and structure-property relations are the most interesting findings of such searches. The review reveals a very dynamic field constantly adapting to match an evolving landscape of applications, methodologies and datasets.
Affiliation(s)
- Ömer H Omar
- Department of Chemistry, University of Liverpool, Liverpool L69 3BX, UK
| | - Marcos Del Cueto
- Department of Chemistry, University of Liverpool, Liverpool L69 3BX, UK
| | | | - Alessandro Troisi
- Department of Chemistry, University of Liverpool, Liverpool L69 3BX, UK
| |
|
39
|
|
40
|
Keith JA, Vassilev-Galindo V, Cheng B, Chmiela S, Gastegger M, Müller KR, Tkatchenko A. Combining Machine Learning and Computational Chemistry for Predictive Insights Into Chemical Systems. Chem Rev 2021; 121:9816-9872. [PMID: 34232033 PMCID: PMC8391798 DOI: 10.1021/acs.chemrev.1c00107] [Citation(s) in RCA: 223] [Impact Index Per Article: 74.3] [Received: 02/04/2021] [Indexed: 12/23/2022]
Abstract
Machine learning models are poised to make a transformative impact on chemical sciences by dramatically accelerating computational algorithms and amplifying insights available from computational chemistry methods. However, achieving this requires a confluence and coaction of expertise in computer science and physical sciences. This Review is written for new and experienced researchers working at the intersection of both fields. We first provide concise tutorials of computational chemistry and machine learning methods, showing how insights involving both can be achieved. We follow with a critical review of noteworthy applications that demonstrate how computational chemistry and machine learning can be used together to provide insightful (and useful) predictions in molecular and materials modeling, retrosyntheses, catalysis, and drug design.
Affiliation(s)
- John A. Keith
- Department of Chemical and Petroleum Engineering, Swanson School of Engineering, University of Pittsburgh, Pittsburgh, Pennsylvania 15261, United States
| | - Valentin Vassilev-Galindo
- Department of Physics and Materials Science, University of Luxembourg, L-1511 Luxembourg City, Luxembourg
| | - Bingqing Cheng
- Accelerate Programme for Scientific Discovery, Department of Computer Science and Technology, 15 J. J. Thomson Avenue, Cambridge CB3 0FD, United Kingdom
| | - Stefan Chmiela
- Department of Software Engineering and Theoretical Computer Science, Technische Universität Berlin, 10587, Berlin, Germany
| | - Michael Gastegger
- Department of Software Engineering and Theoretical Computer Science, Technische Universität Berlin, 10587, Berlin, Germany
| | - Klaus-Robert Müller
- Machine Learning Group, Technische Universität Berlin, 10587, Berlin, Germany
- Department of Artificial Intelligence, Korea University, Anam-dong, Seongbuk-gu, Seoul, 02841, Korea
- Max-Planck-Institut für Informatik, 66123 Saarbrücken, Germany
- Google Research, Brain Team, 10117 Berlin, Germany
| | - Alexandre Tkatchenko
- Department of Physics and Materials Science, University of Luxembourg, L-1511 Luxembourg City, Luxembourg
| |
|
41
|
Abstract
Chemical compound space (CCS), the set of all theoretically conceivable combinations of chemical elements and (meta-)stable geometries that make up matter, is colossal. The first-principles based virtual sampling of this space, for example, in search of novel molecules or materials which exhibit desirable properties, is therefore prohibitive for all but the smallest subsets and simplest properties. We review studies aimed at tackling this challenge using modern machine learning techniques based on (i) synthetic data, typically generated using quantum mechanics based methods, and (ii) model architectures inspired by quantum mechanics. Such Quantum mechanics based Machine Learning (QML) approaches combine the numerical efficiency of statistical surrogate models with an ab initio view on matter. They rigorously reflect the underlying physics in order to reach universality and transferability across CCS. While state-of-the-art approximations to quantum problems impose severe computational bottlenecks, recent QML based developments indicate the possibility of substantial acceleration without sacrificing the predictive power of quantum mechanics.
Affiliation(s)
- Bing Huang
- Faculty of Physics, University of Vienna, 1090 Vienna, Austria
| | - O. Anatole von Lilienfeld
- Faculty of Physics, University of Vienna, 1090 Vienna, Austria
- Institute of Physical Chemistry and National Center for Computational Design and Discovery of Novel Materials (MARVEL), Department of Chemistry, University of Basel, 4056 Basel, Switzerland
| |
|
42
|
Westermayr J, Maurer RJ. Physically inspired deep learning of molecular excitations and photoemission spectra. Chem Sci 2021; 12:10755-10764. [PMID: 34447563 PMCID: PMC8372319 DOI: 10.1039/d1sc01542g] [Citation(s) in RCA: 20] [Impact Index Per Article: 6.7] [Received: 03/17/2021] [Accepted: 06/29/2021] [Indexed: 12/29/2022] Open
Abstract
Modern functional materials consist of large molecular building blocks with significant chemical complexity which limits spectroscopic property prediction with accurate first-principles methods. Consequently, a targeted design of materials with tailored optoelectronic properties by high-throughput screening is bound to fail without efficient methods to predict molecular excited-state properties across chemical space. In this work, we present a deep neural network that predicts charged quasiparticle excitations for large and complex organic molecules with a rich elemental diversity and a size well out of reach of accurate many body perturbation theory calculations. The model exploits the fundamental underlying physics of molecular resonances as eigenvalues of a latent Hamiltonian matrix and is thus able to accurately describe multiple resonances simultaneously. The performance of this model is demonstrated for a range of organic molecules across chemical composition space and configuration space. We further showcase the model capabilities by predicting photoemission spectra at the level of the GW approximation for previously unseen conjugated molecules.
Affiliation(s)
- Julia Westermayr
- Department of Chemistry, University of Warwick, Gibbet Hill Road, Coventry CV4 7AL, UK
| | - Reinhard J Maurer
- Department of Chemistry, University of Warwick, Gibbet Hill Road, Coventry CV4 7AL, UK
| |
|
43
|
Cesar de Azevedo L, Pinheiro GA, Quiles MG, Da Silva JLF, Prati RC. Systematic Investigation of Error Distribution in Machine Learning Algorithms Applied to the Quantum-Chemistry QM9 Data Set Using the Bias and Variance Decomposition. J Chem Inf Model 2021; 61:4210-4223. [PMID: 34387994 DOI: 10.1021/acs.jcim.1c00503] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Indexed: 12/15/2022]
Abstract
Most machine learning applications in quantum-chemistry (QC) data sets rely on a single statistical error parameter such as the mean square error (MSE) to evaluate their performance. However, this approach has limitations or can even yield incorrect interpretations. Here, we report a systematic investigation of the two components of the MSE, i.e., the bias and variance, using the QM9 data set. To this end, we experiment with three descriptors, namely (i) symmetry functions (SF, with two-body and three-body functions), (ii) many-body tensor representation (MBTR, with two- and three-body terms), and (iii) smooth overlap of atomic positions (SOAP), to evaluate the prediction process's performance using different numbers of molecules in training samples and the effect of bias and variance on the final MSE. Overall, low sample sizes are related to higher MSE. Moreover, the bias component strongly influences the larger MSEs. Furthermore, there is little agreement among molecules with higher errors (outliers) across different descriptors. However, there is a high prevalence among the outliers intersection set and the convex hull volume of geometric coordinates (VCH). According to the obtained results with the distribution of MSE (and its components bias and variance) and the appearance of outliers, it is suggested to use ensembles of models with a low bias to minimize the MSE, more specifically when using a small number of molecules in the training set.
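The decomposition this abstract refers to can be illustrated for a single molecule. The sketch below is not the authors' protocol: it assumes noise-free reference values and a hypothetical set of predictions from models trained on different training samples, and it simply verifies numerically that the MSE equals the squared bias plus the variance. The function name `bias_variance` and the synthetic numbers are illustrative.

```python
import random
from statistics import fmean, pvariance


def bias_variance(predictions, target):
    """Split the MSE of repeated predictions for one target into bias^2 + variance."""
    mean_pred = fmean(predictions)
    mse = fmean([(p - target) ** 2 for p in predictions])
    bias_sq = (mean_pred - target) ** 2
    # Population variance of the predictions around their own mean.
    variance = pvariance(predictions, mu=mean_pred)
    return mse, bias_sq, variance


# Hypothetical data: 1000 models, each with a systematic offset of 0.3
# and random scatter of standard deviation 0.1 around it.
rng = random.Random(0)
target = 1.0
predictions = [target + 0.3 + rng.gauss(0.0, 0.1) for _ in range(1000)]

mse, bias_sq, variance = bias_variance(predictions, target)
print(f"MSE={mse:.4f}  bias^2={bias_sq:.4f}  variance={variance:.4f}")
```

In this synthetic case the bias term dominates the MSE, the situation in which averaging more models of the same kind helps little and a lower-bias model is needed.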
Affiliation(s)
- Luis Cesar de Azevedo
- Center of Mathematics, Computation and Cognition, Federal University of ABC, Av. dos Estados, 5001, 09210-580 Santo André, SP, Brazil
| | - Gabriel A Pinheiro
- Institute of Science and Technology, Federal University of São Paulo (Unifesp), 12247-014 São José dos Campos, SP, Brazil
| | - Marcos G Quiles
- Institute of Science and Technology, Federal University of São Paulo (Unifesp), 12247-014 São José dos Campos, SP, Brazil
| | - Juarez L F Da Silva
- São Carlos Institute of Chemistry, University of São Paulo, PO Box 780, 13560-970 São Carlos, SP, Brazil
| | - Ronaldo C Prati
- Center of Mathematics, Computation and Cognition, Federal University of ABC, Av. dos Estados, 5001, 09210-580 Santo André, SP, Brazil
| |
|
44
|
Westermayr J, Gastegger M, Schütt KT, Maurer RJ. Perspective on integrating machine learning into computational chemistry and materials science. J Chem Phys 2021; 154:230903. [PMID: 34241249 DOI: 10.1063/5.0047760] [Citation(s) in RCA: 67] [Impact Index Per Article: 22.3] [Indexed: 01/12/2023] Open
Abstract
Machine learning (ML) methods are being used in almost every conceivable area of electronic structure theory and molecular simulation. In particular, ML has become firmly established in the construction of high-dimensional interatomic potentials. Not a day goes by without another proof of principle being published on how ML methods can represent and predict quantum mechanical properties, be they observable, such as molecular polarizabilities, or not, such as atomic charges. As ML is becoming pervasive in electronic structure theory and molecular simulation, we provide an overview of how atomistic computational modeling is being transformed by the incorporation of ML approaches. From the perspective of the practitioner in the field, we assess how common workflows to predict structure, dynamics, and spectroscopy are affected by ML. Finally, we discuss how a tighter and lasting integration of ML methods with computational chemistry and materials science can be achieved and what it will mean for research practice, software development, and postgraduate training.
Affiliation(s)
- Julia Westermayr
- Department of Chemistry, University of Warwick, Gibbet Hill Road, Coventry CV4 7AL, United Kingdom
| | - Michael Gastegger
- Machine Learning Group, Technische Universität Berlin, 10587 Berlin, Germany
| | - Kristof T Schütt
- Machine Learning Group, Technische Universität Berlin, 10587 Berlin, Germany
| | - Reinhard J Maurer
- Department of Chemistry, University of Warwick, Gibbet Hill Road, Coventry CV4 7AL, United Kingdom
| |
|
45
|
Abstract
Theoretical simulations of electronic excitations and associated processes in molecules are indispensable for fundamental research and technological innovations. However, such simulations are notoriously challenging to perform with quantum mechanical methods. Advances in machine learning open many new avenues for assisting molecular excited-state simulations. In this Review, we track such progress, assess the current state of the art and highlight the critical issues to solve in the future. We overview a broad range of machine learning applications in excited-state research, which include the prediction of molecular properties, improvements of quantum mechanical methods for the calculations of excited-state properties and the search for new materials. Machine learning approaches can help us understand hidden factors that influence photo-processes, leading to a better control of such processes and new rules for the design of materials for optoelectronic applications.
|