1
Alshehri AS, Horstmann KA, You F. Versatile Deep Learning Pipeline for Transferable Chemical Data Extraction. J Chem Inf Model 2024; 64:5888-5899. [PMID: 39009039 DOI: 10.1021/acs.jcim.4c00816] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/17/2024]
Abstract
Chemical information disseminated in scientific documents offers an untapped potential for deep learning-assisted insights and breakthroughs. Automated extraction efforts have shifted from resource-intensive manual extraction toward applying machine learning methods to streamline chemical data extraction. While current extraction models and pipelines have ushered in notable efficiency improvements, they often exhibit modest performance, compromising the accuracy of predictive models trained on extracted data. Further, current chemical pipelines lack both transferability (where a model trained on one task can be adapted to another relevant task with limited examples) and extensibility, which enables seamless adaptability for new extraction tasks. Addressing these gaps, we present ChemREL, a versatile chemical data extraction pipeline emphasizing performance, transferability, and extensibility. ChemREL utilizes a custom, diverse data set of chemical documents, labeled through an active learning strategy to extract two properties: normal melting point and lethal dose 50 (LD50). The normal melting point is selected for its prevalence in diverse contexts and wider literature, serving as the foundation for pipeline training. In contrast, LD50 evaluates the pipeline's transferability to an unrelated property, underscoring variance in its biological nature, toxicological context, and units, among other differences. With pretraining and fine-tuning, our pipeline outperforms existing methods and GPT-4, achieving F1-scores of 96.1% for entity identification and 97.0% for relation mapping, culminating in an overall F1-score of 95.4%. More importantly, ChemREL displays high transferability, effectively transitioning from melting point extraction to LD50 extraction with 10 randomly selected training documents.
Released as an open-source package, ChemREL aims to broaden access to chemical data extraction, enabling the construction of expansive relational data sets that propel discovery.
Affiliation(s)
- Abdulelah S Alshehri
  - Robert Frederick Smith School of Chemical and Biomolecular Engineering, Cornell University, Ithaca, New York 14853, United States
  - Department of Chemical Engineering, College of Engineering, King Saud University, Riyadh 11421, Saudi Arabia
- Kai A Horstmann
  - Department of Computer Science, Cornell University, Ithaca, New York 14853, United States
- Fengqi You
  - Robert Frederick Smith School of Chemical and Biomolecular Engineering, Cornell University, Ithaca, New York 14853, United States
2
Xie T, Wan Y, Zhou Y, Huang W, Liu Y, Linghu Q, Wang S, Kit C, Grazian C, Zhang W, Hoex B. Creation of a structured solar cell material dataset and performance prediction using large language models. PATTERNS (NEW YORK, N.Y.) 2024; 5:100955. [PMID: 38800367 PMCID: PMC11117053 DOI: 10.1016/j.patter.2024.100955] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/03/2023] [Revised: 02/26/2024] [Accepted: 02/27/2024] [Indexed: 05/29/2024]
Abstract
Materials scientists usually collect experimental data to summarize experiences and predict improved materials. However, a crucial issue is how to proficiently utilize unstructured data to update existing structured data, particularly in applied disciplines. This study introduces a new natural language processing (NLP) task called structured information inference (SII) to address this problem. We propose an end-to-end approach to summarize and organize the multi-layered device-level information from the literature into structured data. After comparing different methods, we fine-tuned LLaMA with an F1 score of 87.14% to update an existing perovskite solar cell dataset with articles published since its release, allowing its direct use in subsequent data analysis. Using structured information, we developed regression tasks to predict the electrical performance of solar cells. Our results demonstrate comparable performance to traditional machine-learning methods without feature selection and highlight the potential of large language models for scientific knowledge acquisition and material development.
Affiliation(s)
- Tong Xie
  - School of Photovoltaic and Renewable Energy Engineering, University of New South Wales, Kensington, NSW, Australia
  - GreenDynamics Pty. Ltd, Kensington, NSW, Australia
- Yuwei Wan
  - GreenDynamics Pty. Ltd, Kensington, NSW, Australia
  - Department of Linguistics and Translation, City University of Hong Kong, Hong Kong, China
- Yufei Zhou
  - Department of Linguistics and Translation, City University of Hong Kong, Hong Kong, China
- Wei Huang
  - School of Computer Science and Engineering, University of New South Wales, Kensington, NSW, Australia
- Yixuan Liu
  - GreenDynamics Pty. Ltd, Kensington, NSW, Australia
- Qingyuan Linghu
  - GreenDynamics Pty. Ltd, Kensington, NSW, Australia
  - School of Computer Science and Engineering, University of New South Wales, Kensington, NSW, Australia
- Shaozhou Wang
  - School of Photovoltaic and Renewable Energy Engineering, University of New South Wales, Kensington, NSW, Australia
  - GreenDynamics Pty. Ltd, Kensington, NSW, Australia
- Chunyu Kit
  - Department of Linguistics and Translation, City University of Hong Kong, Hong Kong, China
- Clara Grazian
  - School of Mathematics and Statistics, University of Sydney, Camperdown, NSW, Australia
  - DARE ARC Training Centre in Data Analytics for Resources and Environments, South Eveleigh, NSW, Australia
- Wenjie Zhang
  - School of Computer Science and Engineering, University of New South Wales, Kensington, NSW, Australia
- Bram Hoex
  - School of Photovoltaic and Renewable Energy Engineering, University of New South Wales, Kensington, NSW, Australia
3
Isazawa T, Cole JM. How Beneficial Is Pretraining on a Narrow Domain-Specific Corpus for Information Extraction about Photocatalytic Water Splitting? J Chem Inf Model 2024; 64:3205-3212. [PMID: 38544337 PMCID: PMC11040717 DOI: 10.1021/acs.jcim.4c00063] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/12/2024] [Revised: 03/16/2024] [Accepted: 03/18/2024] [Indexed: 04/23/2024]
Abstract
Language models trained on domain-specific corpora have been employed to increase performance on specialized tasks. However, little previous work has been reported on how specific a "domain-specific" corpus should be. Here, we test a number of language models trained on varyingly specific corpora by employing them in the task of extracting information about photocatalytic water splitting. We find that more specific corpora can benefit performance on downstream tasks. Furthermore, PhotocatalysisBERT, a model pretrained from scratch on scientific papers on photocatalytic water splitting, demonstrates improved performance over previous work in associating the correct photocatalyst with the correct photocatalytic activity during information extraction, achieving a precision of 60.8(+11.5)% and a recall of 37.2(+4.5)%.
Affiliation(s)
- Taketomo Isazawa
  - Cavendish Laboratory, Department of Physics, University of Cambridge, J. J. Thomson Avenue, Cambridge CB3 0HE, U.K.
- Jacqueline M. Cole
  - Cavendish Laboratory, Department of Physics, University of Cambridge, J. J. Thomson Avenue, Cambridge CB3 0HE, U.K.
  - ISIS Neutron and Muon Source, STFC Rutherford Appleton Laboratory, Harwell Science and Innovation Campus, Didcot, Oxfordshire OX11 0QX, U.K.
4
Deebansok S, Deng J, Le Calvez E, Zhu Y, Crosnier O, Brousse T, Fontaine O. Capacitive tendency concept alongside supervised machine-learning toward classifying electrochemical behavior of battery and pseudocapacitor materials. Nat Commun 2024; 15:1133. [PMID: 38326356 PMCID: PMC10850137 DOI: 10.1038/s41467-024-45394-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/13/2023] [Accepted: 01/19/2024] [Indexed: 02/09/2024]
Abstract
In recent decades, more than 100,000 scientific articles have been devoted to the development of electrode materials for supercapacitors and batteries. However, there is still intense debate surrounding the criteria for determining the electrochemical behavior involved in Faradaic reactions, as the issue is often complicated by the electrochemical signals produced by various electrode materials and their different physicochemical properties. The difficulty lies in the inability to determine which electrode type (battery vs. pseudocapacitor) these materials belong to via simple binary classification. To overcome this difficulty, we apply supervised machine learning for image classification to electrochemical shape analysis (over 5500 Cyclic Voltammetry curves and 2900 Galvanostatic Charge-Discharge curves), with the predicted confidence percentage reflecting the shape trend of the curve and thus defined as a manufacturer. It's called "capacitive tendency". This predictor not only transcends the limitations of human-based classification but also provides statistical trends regarding electrochemical behavior. Of note, and of particular importance to the electrochemical energy storage community, which publishes over a hundred articles per week, we have created an online tool to easily categorize their data.
Affiliation(s)
- Siraprapha Deebansok
  - Molecular Electrochemistry for Energy laboratory, VISTEC, Institute of Science and Technology, Rayong, 21210, Thailand
- Jie Deng
  - Institute for Advanced Study & College of Food and Biological Engineering, Chengdu University, Chengdu, 610106, China
- Etienne Le Calvez
  - Nantes Université, CNRS, Institut des Matériaux de Nantes Jean Rouxel, IMN, 44000, Nantes, France
  - Réseau sur le Stockage Électrochimique de l'Énergie (RS2E), CNRS FR 3459, 33 rue Saint Leu, 80039, Amiens, France
- Yachao Zhu
  - ICGM, Université de Montpellier, CNRS, 34293, Montpellier, France
- Olivier Crosnier
  - Nantes Université, CNRS, Institut des Matériaux de Nantes Jean Rouxel, IMN, 44000, Nantes, France
  - Réseau sur le Stockage Électrochimique de l'Énergie (RS2E), CNRS FR 3459, 33 rue Saint Leu, 80039, Amiens, France
- Thierry Brousse
  - Nantes Université, CNRS, Institut des Matériaux de Nantes Jean Rouxel, IMN, 44000, Nantes, France
  - Réseau sur le Stockage Électrochimique de l'Énergie (RS2E), CNRS FR 3459, 33 rue Saint Leu, 80039, Amiens, France
- Olivier Fontaine
  - Molecular Electrochemistry for Energy laboratory, VISTEC, Institute of Science and Technology, Rayong, 21210, Thailand
  - Institut Universitaire de France, 75005, Paris, France
5
Taqieddin A, Sarrouf S, Ehsan MF, Alshawabkeh AN. New Insights on Designing the Next-Generation Materials for Electrochemical Synthesis of Reactive Oxidative Species Towards Efficient and Scalable Water Treatment: A Review and Perspectives. JOURNAL OF ENVIRONMENTAL CHEMICAL ENGINEERING 2023; 11:111384. [PMID: 38186676 PMCID: PMC10769459 DOI: 10.1016/j.jece.2023.111384] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/09/2024]
Abstract
Electrochemical water remediation technologies offer several advantages and flexibility for water treatment and degradation of contaminants. These technologies generate reactive oxidative species (ROS) that degrade pollutants. For the implementation of these technologies at an industrial scale, efficient, scalable, and cost-effective in-situ ROS synthesis is necessary to degrade complex pollutant mixtures, treat large amounts of contaminated water, and clean water in a reasonable amount of time and at a reasonable cost. These targets are directly dependent on the materials used to generate the ROS, such as electrodes and catalysts. Here, we review the key design aspects of electrocatalytic materials for efficient in-situ ROS generation. We present a mechanistic understanding of ROS generation, including their reaction pathways, and integrate this with the key design considerations of the materials and the overall electrochemical reactor/cell. This involves tuning the interfacial interactions between the electrolyte and electrode, which can enhance the ROS generation rate by up to ~40% as discussed in this review. We also summarize the current and emerging materials for water remediation cells and create a structured dataset of about 500 electrodes and 130 catalysts used for ROS generation and water treatment. A perspective on accelerating the discovery and design of next-generation electrocatalytic materials is discussed through the application of integrated experimental and computational workflows. Overall, this article provides a comprehensive review and perspectives on designing and discovering materials for ROS synthesis, which are critical not only for successful implementation of electrochemical water remediation technologies but also for other electrochemical applications.
Affiliation(s)
- Amir Taqieddin
  - Department of Mechanical & Industrial Engineering, Northeastern University, Boston, MA 02115
- Stephanie Sarrouf
  - Department of Civil & Environmental Engineering, Northeastern University, Boston, MA 02115
- Muhammad Fahad Ehsan
  - Department of Civil & Environmental Engineering, Northeastern University, Boston, MA 02115
- Akram N. Alshawabkeh
  - Department of Civil & Environmental Engineering, Northeastern University, Boston, MA 02115
6
Wilary D, Cole JM. ReactionDataExtractor 2.0: A Deep Learning Approach for Data Extraction from Chemical Reaction Schemes. J Chem Inf Model 2023; 63:6053-6067. [PMID: 37729111 PMCID: PMC10565829 DOI: 10.1021/acs.jcim.3c00422] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 03/16/2023] [Indexed: 09/22/2023]
Abstract
Knowledge in the chemical domain is often disseminated graphically via chemical reaction schemes. The task of describing chemical transformations is greatly simplified by introducing reaction schemes that are composed of chemical diagrams and symbols. While intuitively understood by any chemist, like most graphical representations, such drawings are not easily understood by machines; this poses a challenge in the context of data extraction. Currently available tools are limited in their scope of extraction and require manual preprocessing, thus slowing down the speed of data extraction. We present a new tool, ReactionDataExtractor v2.0, which uses a combination of neural networks and symbolic artificial intelligence to effectively remove this barrier. We have evaluated our tool on a test set composed of reaction schemes that were taken from open-source journal articles and realized F1 score metrics between 75 and 96%. These evaluation metrics can be further improved by tuning our object-detection models to a specific chemical subdomain thanks to a data-driven approach that we have adopted with synthetically generated data. The system architecture of our tool is modular, which allows it to balance speed and accuracy to afford an autonomous, high-throughput solution for image-based chemical data extraction.
Affiliation(s)
- Damian M. Wilary
  - Cavendish Laboratory, Department of Physics, University of Cambridge, J. J. Thomson Avenue, Cambridge, CB3 0HE, U.K.
- Jacqueline M. Cole
  - Cavendish Laboratory, Department of Physics, University of Cambridge, J. J. Thomson Avenue, Cambridge, CB3 0HE, U.K.
  - ISIS Neutron and Muon Source, STFC Rutherford Appleton Laboratory, Harwell Science and Innovation Campus, Didcot, Oxfordshire OX11 0QX, U.K.
7
Glasby L, Gubsch K, Bence R, Oktavian R, Isoko K, Moosavi SM, Cordiner JL, Cole JC, Moghadam PZ. DigiMOF: A Database of Metal-Organic Framework Synthesis Information Generated via Text Mining. CHEMISTRY OF MATERIALS: A PUBLICATION OF THE AMERICAN CHEMICAL SOCIETY 2023; 35:4510-4524. [PMID: 37332681 PMCID: PMC10269341 DOI: 10.1021/acs.chemmater.3c00788] [Citation(s) in RCA: 7] [Impact Index Per Article: 7.0] [Received: 04/04/2023] [Revised: 04/29/2023] [Indexed: 06/20/2023]
Abstract
The vastness of materials space, particularly that which is concerned with metal-organic frameworks (MOFs), creates the critical problem of performing efficient identification of promising materials for specific applications. Although high-throughput computational approaches, including the use of machine learning, have been useful in rapid screening and rational design of MOFs, they tend to neglect descriptors related to their synthesis. One way to improve the efficiency of MOF discovery is to data-mine published MOF papers to extract the materials informatics knowledge contained within journal articles. Here, by adapting the chemistry-aware natural language processing tool, ChemDataExtractor (CDE), we generated an open-source database of MOFs focused on their synthetic properties: the DigiMOF database. Using the CDE web scraping package alongside the Cambridge Structural Database (CSD) MOF subset, we automatically downloaded 43,281 unique MOF journal articles, extracted 15,501 unique MOF materials, and text-mined over 52,680 associated properties including the synthesis method, solvent, organic linker, metal precursor, and topology. Additionally, we developed an alternative data extraction technique to obtain and transform the chemical names assigned to each CSD entry in order to determine linker types for each structure in the CSD MOF subset. This data enabled us to match MOFs to a list of known linkers provided by Tokyo Chemical Industry UK Ltd. (TCI) and analyze the cost of these important chemicals. This centralized, structured database reveals the MOF synthetic data embedded within thousands of MOF publications and contains further topology, metal type, accessible surface area, largest cavity diameter, pore limiting diameter, open metal sites, and density calculations for all 3D MOFs in the CSD MOF subset. 
The DigiMOF database and associated software are publicly available for other researchers to rapidly search for MOFs with specific properties, conduct further analysis of alternative MOF production pathways, and create additional parsers to search for additional desirable properties.
Affiliation(s)
- Lawson T. Glasby
  - Department of Chemical and Biological Engineering, The University of Sheffield, Sheffield S1 3JD, U.K.
- Kristian Gubsch
  - Department of Chemical and Biological Engineering, The University of Sheffield, Sheffield S1 3JD, U.K.
- Rosalee Bence
  - Department of Chemical and Biological Engineering, The University of Sheffield, Sheffield S1 3JD, U.K.
- Rama Oktavian
  - Department of Chemical and Biological Engineering, The University of Sheffield, Sheffield S1 3JD, U.K.
- Kesler Isoko
  - Department of Chemical and Biological Engineering, The University of Sheffield, Sheffield S1 3JD, U.K.
- Seyed Mohamad Moosavi
  - Chemical Engineering & Applied Chemistry, University of Toronto, Toronto, Ontario M5S 3E5, Canada
- Joan L. Cordiner
  - Department of Chemical and Biological Engineering, The University of Sheffield, Sheffield S1 3JD, U.K.
- Jason C. Cole
  - Cambridge Crystallographic Data Centre, Cambridge CB2 1EZ, U.K.
- Peyman Z. Moghadam
  - Department of Chemical and Biological Engineering, The University of Sheffield, Sheffield S1 3JD, U.K.
  - Department of Chemical Engineering, University College London, London WC1E 7JE, U.K.