1
Wang X, Zhang W, Zhang W. Dielectric Ceramics Database Automatically Constructed by Data Mining in the Literature. J Chem Inf Model 2024; 64:5931-5943. [PMID: 39042485 DOI: 10.1021/acs.jcim.4c00282] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/25/2024]
Abstract
Vast published dielectric ceramics literature is a natural database for big-data analysis, discovering structure-property relationships, and property prediction. We constructed a data-mining pipeline based on natural language processing (NLP) to extract property information from about 12,900 published dielectric ceramics articles and normalized more than 20 properties. The micro-F1 scores for sentence classification, named entities recognition, relation extraction (related), and relation extraction (same), are 91.6, 82.4, 91.4, and 88.3%, respectively. We demonstrated the distribution of some essential properties according to the publication years to reveal the tendency. In order to test the reliability of the data extraction, we trained an XGBoost model to predict the dielectric constant and used the SHAP module to interpret the contribution of each feature in order to identify some of the factors that determine the dielectric properties. The result shows that including Q × f in the model can increase the dielectric constant prediction accuracy. Our work can give some hints to experimentalists on their way to improve the performances of cutting-edge materials.
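The abstract reports micro-F1 scores for each extraction task. As a minimal sketch of how a micro-averaged F1 is usually computed (true positives, false positives, and false negatives are pooled over all classes before precision and recall are taken), the following may help. The class names and counts are hypothetical and are not taken from the paper, and the function name `micro_f1` is an illustrative assumption.

```python
def micro_f1(counts):
    """Micro-averaged F1 from per-class (tp, fp, fn) counts.

    Counts are pooled across classes first, so frequent classes
    weigh more than rare ones.
    """
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


# Hypothetical entity classes and counts, for illustration only.
toy_counts = {
    "material": (80, 10, 15),
    "property": (40, 5, 10),
    "value": (30, 10, 5),
}
score = micro_f1(toy_counts)
print(f"micro-F1 = {score:.3f}")
```

Pooling the counts this way is equivalent to 2·TP / (2·TP + FP + FN) over the whole test set, which is why micro-F1 can differ from the plain average of per-class F1 scores.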
Affiliation(s)
- Xiaochao Wang
  - School of Integrated Circuits Science and Engineering, University of Electronic Science and Technology of China, Chengdu 610054, P.R. China
- Wanli Zhang
  - School of Integrated Circuits Science and Engineering, University of Electronic Science and Technology of China, Chengdu 610054, P.R. China
- Wenxu Zhang
  - School of Integrated Circuits Science and Engineering, University of Electronic Science and Technology of China, Chengdu 610054, P.R. China
2
Matsumoto Y, Gotoh H. Compound Classification and Consideration of Correlation with Chemical Descriptors from Articles on Antioxidant Capacity Using Natural Language Processing. J Chem Inf Model 2024; 64:119-127. [PMID: 38118462 DOI: 10.1021/acs.jcim.3c01826] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/22/2023]
Abstract
In recent times, there has been a substantial increase in the number of articles focusing on antioxidants. However, the development of a comprehensive estimator for antioxidant capacity remains elusive due to the challenge of integrating information from these articles. Furthermore, the complexity of the antioxidant mechanism, which involves a multitude of factors, makes it difficult to establish a simple equation or correlation. Hence, there is a pressing need for a model that can effectively interpret the collective knowledge from these articles, especially from a chemistry perspective. In this research, we employed natural language processing techniques, specifically Word2Vec, to analyze articles related to antioxidant capacity. We extracted representation vectors of compound names from these documents and organized them into 10 distinct clusters. In our investigation of two of these clusters, we unveiled that the majority of the compounds in question were flavonoids and flavonoid glycosides. To establish a link between the descriptors and clusters, we utilized kernel density estimation and generated scatter plots to visualize their similarity. These visualizations clearly indicated a strong relationship between the descriptors and clusters, affirming that a tangible connection exists between word vectors and compound descriptors through a document analysis conducted with natural language processing techniques. This study represents a pioneering approach that utilizes document analysis to shed light on the field of antioxidant capacity research, marking a significant advancement in this domain.
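The abstract mentions kernel density estimation as the tool for comparing descriptor distributions across clusters. A minimal one-dimensional Gaussian KDE sketch is given below, assuming a fixed bandwidth chosen by hand. The descriptor values, the bandwidth, and the names `gaussian_kde`, `cluster_a`, and `cluster_b` are all hypothetical and are not taken from the paper.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a density function built from Gaussian kernels centred on each sample."""
    if not samples or bandwidth <= 0:
        raise ValueError("need at least one sample and a positive bandwidth")
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


# Hypothetical values of one chemical descriptor for compounds in two clusters.
cluster_a = [1.0, 1.2, 0.9, 1.1, 1.3]
cluster_b = [3.0, 3.4, 2.8, 3.1, 3.3]
kde_a = gaussian_kde(cluster_a, bandwidth=0.3)
kde_b = gaussian_kde(cluster_b, bandwidth=0.3)

# Riemann sums on a grid: total area under one density, and the area
# shared by both densities (a small overlap suggests the descriptor
# separates the two clusters well).
step = 0.01
grid = [i * step for i in range(-300, 800)]
area_a = sum(kde_a(x) for x in grid) * step
overlap = sum(min(kde_a(x), kde_b(x)) for x in grid) * step
print(f"area under kde_a = {area_a:.3f}, overlap = {overlap:.4f}")
```

In practice the bandwidth strongly affects the result, so a data-driven choice (for example a rule of thumb or cross-validation) would normally replace the hand-picked value used here.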
Affiliation(s)
- Yuto Matsumoto
  - Department of Chemistry and Life Science, Yokohama National University, 79-5 Tokiwadai, Hodogaya-ku, Yokohama 240-8501, Japan
- Hiroaki Gotoh
  - Department of Chemistry and Life Science, Yokohama National University, 79-5 Tokiwadai, Hodogaya-ku, Yokohama 240-8501, Japan
3
Machi K, Akiyama S, Nagata Y, Yoshioka M. OSPAR: A Corpus for Extraction of Organic Synthesis Procedures with Argument Roles. J Chem Inf Model 2023; 63:6619-6628. [PMID: 37859303 PMCID: PMC10647022 DOI: 10.1021/acs.jcim.3c01449] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/08/2023] [Revised: 10/05/2023] [Accepted: 10/06/2023] [Indexed: 10/21/2023]
Abstract
There is a pressing need for the automated extraction of chemical reaction information because of the rapid growth of scientific documents. The previously reported works in the literature for the procedure extraction either (a) did not consider the semantic relations between the action and argument or (b) defined a detailed schema for the extraction. The former method was insufficient for reproducing the reaction, while the latter methods were too specific to their own schema and did not consider the general semantic relation between the verb and argument. In addition, they did not provide an annotated text that aligned with the structured procedure. Along these lines, in this work, we propose a corpus named organic synthesis procedures with argument roles (OSPAR) that is annotated with rolesets to consider the semantic relation between the verb and argument. We also provide rolesets for chemical reactions, especially for organic synthesis, which represent the argument roles of actions in the corpus. More specifically, we annotated 112 organic synthesis procedures in journal articles from Organic Syntheses and defined 19 new rolesets in addition to 29 rolesets from an existing language resource (Proposition Bank). After that, we constructed a simple deep learning system trained on OSPAR and discussed the usefulness of the corpus by comparing it with chemical description language (XDL) generated by a natural language processing tool, namely, SynthReader. While our system's output required more detailed parsing, it covered comparable information against XDL. Moreover, we confirmed that the validation of the output action sequence was easy as it was aligned with the original text.
Affiliation(s)
- Kojiro Machi
  - Graduate School of Information Science and Technology, Hokkaido University, Kita 14, Nishi 9, Kita-ku, Sapporo, Hokkaido 060-0814, Japan
- Seiji Akiyama
  - Institute for Chemical Reaction Design and Discovery (WPI-ICReDD), Hokkaido University, Kita 21, Nishi 10, Kita-ku, Sapporo, Hokkaido 001-0021, Japan
- Yuuya Nagata
  - Institute for Chemical Reaction Design and Discovery (WPI-ICReDD), Hokkaido University, Kita 21, Nishi 10, Kita-ku, Sapporo, Hokkaido 001-0021, Japan
- Masaharu Yoshioka
  - Graduate School of Information Science and Technology, Hokkaido University, Kita 14, Nishi 9, Kita-ku, Sapporo, Hokkaido 060-0814, Japan
  - Institute for Chemical Reaction Design and Discovery (WPI-ICReDD), Hokkaido University, Kita 21, Nishi 10, Kita-ku, Sapporo, Hokkaido 001-0021, Japan
  - Faculty of Information Science and Technology, Hokkaido University, Kita 14, Nishi 9, Kita-ku, Sapporo, Hokkaido 060-0814, Japan
4
Reid JP, Betinol IO, Kuang Y. Mechanism to model: a physical organic chemistry approach to reaction prediction. Chem Commun (Camb) 2023; 59:10711-10721. [PMID: 37552047 DOI: 10.1039/d3cc03229a] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Indexed: 08/09/2023]
Abstract
The application of mechanistic generalizations is at the core of chemical reaction development and application. These strategies are rooted in physical organic chemistry where mechanistic understandings can be derived from one reaction and applied to explain another. Over time these techniques have evolved from rationalizing observed outcomes to leading experimental design through reaction prediction. In parallel, significant progression in asymmetric organocatalysis has expanded the reach of chiral transfer to new reactions with increased efficiency. However, the complex and diverse catalyst structures applied in this arena have rendered the generalization of asymmetric catalytic processes to be exceptionally challenging. Recognizing this, a portion of our research has been focused on understanding the transferability of chemical observations between similar reactions and exploiting this phenomenon as a platform for prediction. Through these experiences, we have relied on a working knowledge of reaction mechanism to guide the development and application of our models which have been advanced from simple qualitative rules to large statistical models for quantitative predictions. In this feature article, we describe the models acquired to generalize organocatalytic reaction mechanisms and demonstrate their use as a powerful approach for accelerating enantioselective synthesis.
Affiliation(s)
- Jolene P Reid
  - Department of Chemistry, University of British Columbia, 2036 Main Mall, Vancouver, British Columbia, V6T 1Z1, Canada
- Isaiah O Betinol
  - Department of Chemistry, University of British Columbia, 2036 Main Mall, Vancouver, British Columbia, V6T 1Z1, Canada
- Yutao Kuang
  - Department of Chemistry, University of British Columbia, 2036 Main Mall, Vancouver, British Columbia, V6T 1Z1, Canada
5
Wang R, Zhong Y, Dong X, Du M, Yuan H, Zou Y, Wang X, Lin Z, Xu D. Data Mining and Graph Network Deep Learning for Band Gap Prediction in Crystalline Borate Materials. Inorg Chem 2023; 62:4716-4726. [PMID: 36888968 DOI: 10.1021/acs.inorgchem.3c00233] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/10/2023]
Abstract
Crystalline borates are an important class of functional materials with wide applications in photocatalysis and laser technologies. Obtaining their band gap values in a timely and precise manner is a great challenge in material design due to the issues of computational accuracy and cost of first-principles methods. Although machine learning (ML) techniques have shown great successes in predicting the versatile properties of materials, their practicality is often limited by the data set quality. Here, by using a combination of natural language processing searches and domain knowledge, we built an experimental database of inorganic borates, including their chemical compositions, band gaps, and crystal structures. We performed graph network deep learning to predict the band gaps of borates with accuracy, and the results agreed favorably with experimental measurements from the visible-light to the deep-ultraviolet (DUV) region. For a realistic screening problem, our ML model could correctly identify most of the investigated DUV borates. Furthermore, the extrapolative ability of the model was validated against our newly synthesized borate crystal Ag3B6O10NO3, supplemented by the discussion of an ML-based material design for structural analogues. The applications and interpretability of the ML model were also evaluated extensively. Finally, we implemented a web-based application, which could be utilized conveniently in material engineering for the desired band gap. The philosophy behind this study is to use cost-effective data mining techniques to build high-quality ML models, which can provide useful clues for further material design.
Affiliation(s)
- Ruihan Wang
  - MOE Key Laboratory of Green Chemistry and Technology, College of Chemistry, Sichuan University, Chengdu, Sichuan 610064, PR China
- Yeshuang Zhong
  - Department of Physics, School of Biology and Engineering, Guizhou Medical University, Guiyang, Guizhou 550025, PR China
- Xuehua Dong
  - MOE Key Laboratory of Green Chemistry and Technology, College of Chemistry, Sichuan University, Chengdu, Sichuan 610064, PR China
- Meng Du
  - MOE Key Laboratory of Green Chemistry and Technology, College of Chemistry, Sichuan University, Chengdu, Sichuan 610064, PR China
- Haolun Yuan
  - MOE Key Laboratory of Green Chemistry and Technology, College of Chemistry, Sichuan University, Chengdu, Sichuan 610064, PR China
- Yurong Zou
  - MOE Key Laboratory of Green Chemistry and Technology, College of Chemistry, Sichuan University, Chengdu, Sichuan 610064, PR China
- Xin Wang
  - MOE Key Laboratory of Green Chemistry and Technology, College of Chemistry, Sichuan University, Chengdu, Sichuan 610064, PR China
- Zhien Lin
  - MOE Key Laboratory of Green Chemistry and Technology, College of Chemistry, Sichuan University, Chengdu, Sichuan 610064, PR China
- Dingguo Xu
  - MOE Key Laboratory of Green Chemistry and Technology, College of Chemistry, Sichuan University, Chengdu, Sichuan 610064, PR China
  - Research Center for Materials Genome Engineering, Sichuan University, Chengdu, Sichuan 610065, PR China