1
Ferraz-Caetano J, Teixeira F, Cordeiro MNDS. Data-driven, explainable machine learning model for predicting volatile organic compounds' standard vaporization enthalpy. Chemosphere 2024; 359:142257. [PMID: 38719116 DOI: 10.1016/j.chemosphere.2024.142257] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/23/2024] [Revised: 04/18/2024] [Accepted: 05/04/2024] [Indexed: 05/21/2024]
Abstract
The accurate prediction of standard vaporization enthalpy (ΔvapHm°) for volatile organic compounds (VOCs) is of paramount importance in environmental chemistry, industrial applications and regulatory compliance. To overcome the limitations of traditional experimental methods for determining ΔvapHm° of VOCs, machine learning (ML) models enable high-throughput, cost-effective property estimation. Despite rising momentum, however, existing ML algorithms still present limitations in prediction accuracy and breadth of chemical application. In this work, we present a data-driven, explainable supervised ML model to predict ΔvapHm° of VOCs. The model was built on an established experimental database of 2410 unique molecules and 223 VOCs categorized by chemical groups. Among the supervised ML regression algorithms tested, the Random Forest successfully predicted VOCs' ΔvapHm° with a mean absolute error of 3.02 kJ mol⁻¹ and a 95% test score. The model was validated through the prediction of ΔvapHm° for a known database of VOCs and through molecular group hold-out tests. Through chemical feature importance analysis, this explainable model revealed that VOC polarizability, connectivity indexes and electrotopological state are key for the model's prediction accuracy. We thus present a replicable and explainable model, which can be further expanded towards the prediction of other thermodynamic properties of VOCs.
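The two evaluation ideas named in this abstract, mean absolute error and molecular-group hold-out testing, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy groups and enthalpy values are invented, the function names are hypothetical, and a trivial mean-value predictor stands in for the Random Forest used by the authors.

```python
from collections import defaultdict
from statistics import mean


def mean_absolute_error(y_true, y_pred):
    """Average of |observed - predicted| over all samples."""
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))


def group_holdout_splits(groups):
    """Yield (held_out_group, train_idx, test_idx), one chemical group at a time."""
    by_group = defaultdict(list)
    for i, g in enumerate(groups):
        by_group[g].append(i)
    for g, test_idx in by_group.items():
        train_idx = [i for i in range(len(groups)) if groups[i] != g]
        yield g, train_idx, test_idx


# Hypothetical toy data (chemical group, enthalpy in kJ/mol); not from the paper.
groups = ["alkane", "alkane", "alcohol", "alcohol", "ketone", "ketone"]
y = [30.0, 34.0, 42.0, 46.0, 35.0, 37.0]

results = {}
for g, train_idx, test_idx in group_holdout_splits(groups):
    # Stand-in model: predict the training-set mean for every held-out molecule.
    baseline = mean(y[i] for i in train_idx)
    preds = [baseline] * len(test_idx)
    results[g] = mean_absolute_error([y[i] for i in test_idx], preds)

for g, mae in results.items():
    print(f"held-out group {g}: MAE = {mae:.2f} kJ/mol")
```

Holding out a whole chemical group, rather than random molecules, tests whether a model extrapolates to chemistry it has not seen. A real regressor would replace the baseline line in the loop.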
Affiliation(s)
- José Ferraz-Caetano
  - LAQV-REQUIMTE - Department of Chemistry and Biochemistry - Faculty of Sciences, University of Porto - Rua do Campo Alegre, S/N, 4169-007, Porto, Portugal.
- Filipe Teixeira
  - CQUM - Centre of Chemistry, University of Minho, Campus de Gualtar, 4710-057, Braga, Portugal
- M Natália D S Cordeiro
  - LAQV-REQUIMTE - Department of Chemistry and Biochemistry - Faculty of Sciences, University of Porto - Rua do Campo Alegre, S/N, 4169-007, Porto, Portugal.
2
Tempke R, Musho T. Autonomous generation of single photon emitting materials. Nanoscale 2024; 16:10239-10249. [PMID: 38726673 DOI: 10.1039/d3nr04944b] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/31/2024]
Abstract
The utilization of machine learning in Materials Science underscores the critical importance of the quality and quantity of data in training models effectively. Unlike fields such as image processing and natural language processing, there is limited availability of atomistic datasets, leading to biases in training data. Particularly in the domain of materials discovery, there exists an issue of continuity in atomistic datasets. Experimental data sourced from literature and patents is usually only available for favorable data, resulting in bias in the training dataset. This study focuses on developing a SMILES-based model for generating synthetic datasets of quantum materials using a variational autoencoder. This study centers on the generation of a synthetic dataset of quantum materials specifically for quantum sensing applications, with a focus on two-level quantum molecules that exhibit a dipole blockade. The proposed technique offers an improved sampling algorithm by incorporating newly generated data into the sampling algorithm to create a more normally distributed dataset. Through this technique, the study was able to generate over 1 000 000 candidate quantum materials from a small dataset of only 8000 materials. The generated dataset identified several iodine-containing molecules as promising single photon emitting materials for potential quantum sensing applications.
Affiliation(s)
- Robert Tempke
  - Department of Mechanical, Materials and Aerospace Engineering, West Virginia University, P.O. Box 6106, Morgantown, WV, USA.
- Terence Musho
  - Department of Mechanical, Materials and Aerospace Engineering, West Virginia University, P.O. Box 6106, Morgantown, WV, USA.
3
Ferraz-Caetano J, Teixeira F, Cordeiro MNDS. Explainable Supervised Machine Learning Model To Predict Solvation Gibbs Energy. J Chem Inf Model 2024; 64:2250-2262. [PMID: 37603608 PMCID: PMC11005042 DOI: 10.1021/acs.jcim.3c00544] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/11/2023] [Indexed: 08/23/2023]
Abstract
Many challenges persist in developing accurate computational models for predicting solvation free energy (ΔGsol). Despite recent developments in Machine Learning (ML) methodologies that outperformed traditional quantum mechanical models, several issues remain concerning explanatory insights for broad chemical predictions with an acceptable speed-accuracy trade-off. To overcome this, we present a novel supervised ML model to predict the ΔGsol for an array of solvent-solute pairs. Using two different ensemble regressor algorithms, we made fast and accurate property predictions using open-source chemical features, encoding complex electronic, structural, and surface area descriptors for every solvent and solute. By integrating molecular properties and chemical interaction features, we have analyzed individual descriptor importance and optimized our model through explanatory information from feature groups. On aqueous and organic solvent databases, ML models revealed the predictive relevance of solutes with increasing polar surface area and decreasing polarizability, yielding better results than state-of-the-art benchmark Neural Network methods (without complex quantum mechanical or molecular dynamics simulations). Both algorithms successfully outperformed previous ΔGsol prediction methods, with a maximum absolute error of 0.22 ± 0.02 kcal mol⁻¹, further validated in an external benchmark database and with solvent hold-out tests. These explanatory and statistical insights allow a thoughtful application of this method for predicting other thermodynamic properties, stressing the relevance of ML modeling for further complex computational chemistry problems.
Affiliation(s)
- José Ferraz-Caetano
  - Department of Chemistry and Biochemistry - Faculty of Sciences, University of Porto - Rua do Campo Alegre, S/N, 4169-007 Porto, Portugal
- Filipe Teixeira
  - Centre of Chemistry, University of Minho, Campus de Gualtar, 4710-057 Braga, Portugal
- M. Natália D. S. Cordeiro
  - Department of Chemistry and Biochemistry - Faculty of Sciences, University of Porto - Rua do Campo Alegre, S/N, 4169-007 Porto, Portugal
4
Chew PY, Reinhardt A. Phase diagrams-Why they matter and how to predict them. J Chem Phys 2023; 158:030902. [PMID: 36681642 DOI: 10.1063/5.0131028] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/23/2022] Open
Abstract
Understanding the thermodynamic stability and metastability of materials can help us to, for example, gauge whether crystalline polymorphs in pharmaceutical formulations are likely to be durable. It can also help us to design experimental routes to novel phases with potentially interesting properties. In this Perspective, we provide an overview of how thermodynamic phase behavior can be quantified both in computer simulations and machine-learning approaches to determine phase diagrams, as well as combinations of the two. We review the basic workflow of free-energy computations for condensed phases, including some practical implementation advice, ranging from the Frenkel-Ladd approach to thermodynamic integration and to direct-coexistence simulations. We illustrate the applications of such methods on a range of systems from materials chemistry to biological phase separation. Finally, we outline some challenges, questions, and practical applications of phase-diagram determination which we believe are likely to be possible to address in the near future using such state-of-the-art free-energy calculations, which may provide fundamental insight into separation processes using multicomponent solvents.
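As a rough illustration of the thermodynamic integration mentioned in this abstract, the sketch below evaluates ΔF = ∫₀¹ ⟨∂U/∂λ⟩ dλ for a one-dimensional classical harmonic oscillator whose spring constant is switched linearly from k0 to k1. This toy system is an assumption chosen because the ensemble average is known analytically, ⟨∂U/∂λ⟩ = ½ (k1 − k0) kT / k(λ), so no simulation is needed. In practice each point of the integrand would come from a separate simulation at fixed λ. The function names and parameter values are hypothetical.

```python
import math


def trapezoid(xs, ys):
    """Trapezoidal-rule integral of tabulated ys over the grid xs."""
    return sum(
        0.5 * (ys[i] + ys[i + 1]) * (xs[i + 1] - xs[i]) for i in range(len(xs) - 1)
    )


def mean_dU_dlambda(lam, k0, k1, kT):
    """Analytic <dU/dlambda> for U = 0.5 * k(lam) * x**2, k(lam) = k0 + lam*(k1 - k0).

    Uses equipartition, <x**2> = kT / k(lam); a simulation would supply this
    average for systems without a closed-form answer.
    """
    k = k0 + lam * (k1 - k0)
    return 0.5 * (k1 - k0) * kT / k


# Hypothetical parameters in reduced units.
k0, k1, kT = 1.0, 4.0, 1.0

lams = [i / 200 for i in range(201)]  # lambda grid from 0 to 1
integrand = [mean_dU_dlambda(lam, k0, k1, kT) for lam in lams]

delta_F_ti = trapezoid(lams, integrand)
delta_F_exact = 0.5 * kT * math.log(k1 / k0)  # from the classical partition function

print(f"thermodynamic integration: {delta_F_ti:.5f}")
print(f"exact result:              {delta_F_exact:.5f}")
```

The same pattern, a harmonic reference system connected to the target by a coupling parameter, underlies the Frenkel-Ladd approach named in the abstract, although a real calculation involves many particles and sampled averages.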
Affiliation(s)
- Pin Yu Chew
  - Yusuf Hamied Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge CB2 1EW, United Kingdom
- Aleks Reinhardt
  - Yusuf Hamied Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge CB2 1EW, United Kingdom
5
Huoyu R, Zhiqiang Z, Guofang J, Zhanggao L, Zhenzhen X. Quantitative Structure-Property Relationship for Critical Temperature of Alkenes with Quantum-Chemical and Topological Indices. Russian Journal of Physical Chemistry A 2022. [DOI: 10.1134/s0036024422110267] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/08/2022]
6
Ren Y, Liao Z, Yang Y, Sun J, Jiang B, Wang J, Yang Y. Direct prediction of steam cracking products from naphtha bulk properties: Application of the two sub-networks ANN. Frontiers in Chemical Engineering 2022. [DOI: 10.3389/fceng.2022.983035] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/13/2022] Open
Abstract
Steam cracking of naphtha is an important process for the production of olefins. Applying artificial intelligence helps achieve high-frequency real-time optimization and process control. This work employs an artificial neural network (ANN) model with two sub-networks to simulate the naphtha steam cracking process. In the first sub-network, the feedstock composition ANN, the detailed feedstock compositions are determined from the limited naphtha bulk properties. In the second, the reactor ANN, the cracking product yields are predicted from the feedstock compositions and operating conditions. The combination of these two sub-networks can accurately and rapidly predict the product yields directly from naphtha bulk properties. Two different feedstock composition ANN strategies are proposed and compared. The results show that with the special design of dividing the output layer into five groups of PIONA, the prediction accuracy of product yields is significantly improved. The mean absolute error of 11 cracking products is 0.53 wt% for 472 test sets. The comparison results show that this indirect feedstock composition ANN has lower product prediction errors, not just a reduction of the total error of the feedstock composition. The critical factor is ensuring that PIONA contents are equal to the actual values. The use of an indirect feedstock composition strategy can effectively improve the prediction accuracy of the whole ANN model.
7
Zheng P, Yang W, Wu W, Isayev O, Dral PO. Toward Chemical Accuracy in Predicting Enthalpies of Formation with General-Purpose Data-Driven Methods. J Phys Chem Lett 2022; 13:3479-3491. [PMID: 35416675 DOI: 10.1021/acs.jpclett.2c00734] [Citation(s) in RCA: 14] [Impact Index Per Article: 7.0] [Indexed: 06/14/2023]
Abstract
Enthalpies of formation and reaction are important thermodynamic properties that have a crucial impact on the outcome of chemical transformations. Here we implement the calculation of enthalpies of formation with a general-purpose ANI-1ccx neural network atomistic potential. We demonstrate on a wide range of benchmark sets that both ANI-1ccx and our other general-purpose data-driven method AIQM1 approach the coveted chemical accuracy of 1 kcal/mol with the speed of semiempirical quantum mechanical methods (AIQM1) or faster (ANI-1ccx). It is remarkably achieved without specifically training the machine learning parts of ANI-1ccx or AIQM1 on formation enthalpies. Importantly, we show that these data-driven methods provide statistical means for uncertainty quantification of their predictions, which we use to detect and eliminate outliers and revise reference experimental data. Uncertainty quantification may also help in the systematic improvement of such data-driven methods.
Affiliation(s)
- Peikun Zheng
  - State Key Laboratory of Physical Chemistry of Solid Surfaces, Fujian Provincial Key Laboratory of Theoretical and Computational Chemistry, Department of Chemistry, and College of Chemistry and Chemical Engineering, Xiamen University, Xiamen 361005, China
- Wudi Yang
  - State Key Laboratory of Physical Chemistry of Solid Surfaces, Fujian Provincial Key Laboratory of Theoretical and Computational Chemistry, Department of Chemistry, and College of Chemistry and Chemical Engineering, Xiamen University, Xiamen 361005, China
- Wei Wu
  - State Key Laboratory of Physical Chemistry of Solid Surfaces, Fujian Provincial Key Laboratory of Theoretical and Computational Chemistry, Department of Chemistry, and College of Chemistry and Chemical Engineering, Xiamen University, Xiamen 361005, China
- Olexandr Isayev
  - Department of Chemistry, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213, United States
- Pavlo O Dral
  - State Key Laboratory of Physical Chemistry of Solid Surfaces, Fujian Provincial Key Laboratory of Theoretical and Computational Chemistry, Department of Chemistry, and College of Chemistry and Chemical Engineering, Xiamen University, Xiamen 361005, China
8
Dobbelaere MR, Plehiers PP, Van de Vijver R, Stevens CV, Van Geem KM. Learning Molecular Representations for Thermochemistry Prediction of Cyclic Hydrocarbons and Oxygenates. J Phys Chem A 2021; 125:5166-5179. [PMID: 34081474 DOI: 10.1021/acs.jpca.1c01956] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 11/28/2022]
Abstract
Accurate thermochemistry estimation of polycyclic molecules is crucial for kinetic modeling of chemical processes that use renewable and alternative feedstocks. In kinetic model generators, molecular properties are estimated rapidly with group additivity, but this method is known to have limitations for polycyclic structures. This issue has been resolved in our work by combining a geometry-based molecular representation with a deep neural network trained on ab initio data. Each molecule is transformed into a probabilistic vector from its interatomic distances, bond angles, and dihedral angles. The model is tested on a small experimental dataset (200 molecules) from the literature, a new medium-sized set (4000 molecules) with both open-shell and closed-shell species, calculated at the CBS-QB3 level with empirical corrections, and a large G4MP2-level QM9-based dataset (40 000 molecules). Heat capacities between 298.15 and 2500 K are calculated in the medium set with an average deviation of about 1.5 J mol⁻¹ K⁻¹ and the standard entropy at 298.15 K is predicted with an average error below 4 J mol⁻¹ K⁻¹. The standard enthalpy of formation at 298.15 K has an average out-of-sample error below 4 kJ mol⁻¹ on a QM9 training set size of around 15 000 molecules. By fitting NASA polynomials, the enthalpy of formation at higher temperatures can be calculated with the same accuracy as the standard enthalpy of formation. Uncertainty quantification by means of the ensemble standard deviation is included to indicate when molecules that are on the edge or outside of the application range of the model are evaluated.
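To make the NASA-polynomial step concrete, the sketch below evaluates heat capacity, enthalpy and entropy from one set of coefficients. It assumes the common 7-coefficient form, which the abstract does not specify: Cp/R = a1 + a2 T + a3 T² + a4 T³ + a5 T⁴, H/(RT) = a1 + a2 T/2 + a3 T²/3 + a4 T³/4 + a5 T⁴/5 + a6/T, and S/R = a1 ln T + a2 T + a3 T²/2 + a4 T³/3 + a5 T⁴/4 + a7. The coefficient values are invented placeholders, not fitted to any molecule.

```python
import math

R = 8.314462618  # molar gas constant, J mol^-1 K^-1


def cp(T, a):
    """Heat capacity in J mol^-1 K^-1 from 7-coefficient NASA polynomial a[0..6]."""
    return R * (a[0] + a[1] * T + a[2] * T**2 + a[3] * T**3 + a[4] * T**4)


def enthalpy(T, a):
    """Enthalpy in J mol^-1; a[5] carries the integration constant."""
    return R * T * (
        a[0]
        + a[1] * T / 2
        + a[2] * T**2 / 3
        + a[3] * T**3 / 4
        + a[4] * T**4 / 5
        + a[5] / T
    )


def entropy(T, a):
    """Entropy in J mol^-1 K^-1; a[6] carries the integration constant."""
    return R * (
        a[0] * math.log(T)
        + a[1] * T
        + a[2] * T**2 / 2
        + a[3] * T**3 / 3
        + a[4] * T**4 / 4
        + a[6]
    )


# Hypothetical placeholder coefficients (illustration only).
coeffs = [3.5, 1.0e-3, 0.0, 0.0, 0.0, -1000.0, 4.0]

for T in (298.15, 1000.0, 2000.0):
    print(
        f"T = {T:7.2f} K  Cp = {cp(T, coeffs):7.3f}  "
        f"H = {enthalpy(T, coeffs) / 1000:8.3f} kJ/mol  S = {entropy(T, coeffs):7.3f}"
    )
```

Because H and S are the analytic integrals of Cp and Cp/T, a single fitted coefficient set gives all three properties consistently at any temperature within the fitted range.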
Affiliation(s)
- Maarten R Dobbelaere
  - Laboratory for Chemical Technology, Department of Materials, Textiles and Chemical Engineering, Ghent University, Technologiepark 125, 9052 Gent, Belgium
- Pieter P Plehiers
  - Laboratory for Chemical Technology, Department of Materials, Textiles and Chemical Engineering, Ghent University, Technologiepark 125, 9052 Gent, Belgium
- Ruben Van de Vijver
  - Laboratory for Chemical Technology, Department of Materials, Textiles and Chemical Engineering, Ghent University, Technologiepark 125, 9052 Gent, Belgium
- Christian V Stevens
  - SynBioC Research Group, Department of Green Chemistry and Technology, Faculty of Bioscience Engineering, Ghent University, Coupure Links 653, 9000 Gent, Belgium
- Kevin M Van Geem
  - Laboratory for Chemical Technology, Department of Materials, Textiles and Chemical Engineering, Ghent University, Technologiepark 125, 9052 Gent, Belgium