1
Kevlishvili I, St Michel RG, Garrison AG, Toney JW, Adamji H, Jia H, Román-Leshkov Y, Kulik HJ. Leveraging natural language processing to curate the tmCAT, tmPHOTO, tmBIO, and tmSCO datasets of functional transition metal complexes. Faraday Discuss 2024. [PMID: 39301698 DOI: 10.1039/d4fd00087k] [Indexed: 09/22/2024]
Abstract
The breadth of transition metal chemical space covered by databases such as the Cambridge Structural Database and the derived computational database tmQM is not conducive to application-specific modeling and the development of structure-property relationships. Here, we employ both supervised and unsupervised natural language processing (NLP) techniques to link experimentally synthesized compounds in the tmQM database to their respective applications. Leveraging NLP models, we curate four distinct datasets: tmCAT for catalysis, tmPHOTO for photophysical activity, tmBIO for biological relevance, and tmSCO for magnetism. Analyzing the chemical substructures within each dataset reveals common chemical motifs in each of the designated applications. We then use these common chemical structures to augment our initial datasets for each application, yielding a total of 21 631 compounds in tmCAT, 4599 in tmPHOTO, 2782 in tmBIO, and 983 in tmSCO. These datasets are expected to accelerate the more targeted computational screening and development of refined structure-property relationships with machine learning.
Affiliation(s)
- Ilia Kevlishvili
  - Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA.
- Roland G St Michel
  - Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA.
  - Department of Materials Science and Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA.
- Aaron G Garrison
  - Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA.
- Jacob W Toney
  - Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA.
- Husain Adamji
  - Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA.
- Haojun Jia
  - Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA.
  - Department of Chemistry, Massachusetts Institute of Technology, Cambridge, MA 02139, USA.
- Yuriy Román-Leshkov
  - Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA.
  - Department of Chemistry, Massachusetts Institute of Technology, Cambridge, MA 02139, USA.
- Heather J Kulik
  - Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA.
  - Department of Chemistry, Massachusetts Institute of Technology, Cambridge, MA 02139, USA.
2
Ai Q, Meng F, Shi J, Pelkie B, Coley CW. Extracting structured data from organic synthesis procedures using a fine-tuned large language model. Digital Discovery 2024; 3:1822-1831. [PMID: 39157760 PMCID: PMC11322921 DOI: 10.1039/d4dd00091a] [Received: 04/06/2024] [Accepted: 07/30/2024] [Indexed: 08/20/2024]
Abstract
The popularity of data-driven approaches and machine learning (ML) techniques in the field of organic chemistry and its various subfields has increased the value of structured reaction data. Most data in chemistry is represented by unstructured text, and despite the vastness of the organic chemistry literature (papers, patents), conversion from unstructured text to structured data remains a largely manual endeavor. Software tools for this task would facilitate downstream applications such as reaction prediction and condition recommendation. In this study, we fine-tune a large language model (LLM) to extract reaction information from organic synthesis procedure text into structured data following the Open Reaction Database (ORD) schema, a comprehensive data structure designed for organic reactions. The fine-tuned model produces syntactically correct ORD records with an average accuracy of 91.25% for ORD "messages" (e.g., full compound, workups, or condition definitions) and 92.25% for individual data fields (e.g., compound identifiers, mass quantities), with the ability to recognize compound-referencing tokens and to infer reaction roles. We investigate its failure modes and evaluate performance on specific subtasks such as reaction role classification.
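To make the idea of field-level accuracy concrete, the following is a minimal Python sketch that compares a predicted record against a reference record, leaf by leaf. It is only an illustration: the nested dictionaries are a simplified, hypothetical stand-in and not the actual ORD schema, the names `flatten` and `field_accuracy` are invented here, and the scoring used in the paper may differ.

```python
def flatten(record, prefix=""):
    """Flatten nested dicts and lists into a {path: leaf_value} mapping."""
    flat = {}
    if isinstance(record, dict):
        for key, value in record.items():
            flat.update(flatten(value, f"{prefix}.{key}" if prefix else key))
    elif isinstance(record, list):
        for index, value in enumerate(record):
            flat.update(flatten(value, f"{prefix}[{index}]"))
    else:
        flat[prefix] = record
    return flat


def field_accuracy(predicted, reference):
    """Fraction of reference leaf fields reproduced exactly in the prediction."""
    ref = flatten(reference)
    pred = flatten(predicted)
    if not ref:
        return 1.0
    correct = sum(1 for path, value in ref.items()
                  if path in pred and pred[path] == value)
    return correct / len(ref)


# Hypothetical records (not real ORD messages): one field differs.
reference = {
    "inputs": [{"name": "benzaldehyde", "mass": {"value": 1.06, "units": "GRAM"}}],
    "conditions": {"temperature_c": 25},
}
predicted = {
    "inputs": [{"name": "benzaldehyde", "mass": {"value": 1.06, "units": "MILLIGRAM"}}],
    "conditions": {"temperature_c": 25},
}
score = field_accuracy(predicted, reference)
```

Exact matching on leaf values is the strictest possible choice; a real evaluation would likely need normalization of names and units before comparison.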
Affiliation(s)
- Qianxiang Ai
  - Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA, USA
- Fanwang Meng
  - Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA, USA
- Jiale Shi
  - Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA, USA
- Brenden Pelkie
  - Department of Chemical Engineering, University of Washington, Seattle, WA, USA
- Connor W Coley
  - Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA, USA
3
Huang Z, Li X, Li A, Yang Y, He L, Zhang Z, Wu S, Wang Y, Cai S, He Y, Liu X. MPNTEXT: An Interactive Platform for Automatically Extracting Metal-Polyphenol Networks and Their Applications from Scientific Literature. J Chem Inf Model 2024. [PMID: 39258795 DOI: 10.1021/acs.jcim.4c01093] [Indexed: 09/12/2024]
Abstract
In recent years, metal-polyphenol networks (MPNs) have gained significant attention due to their unique properties and broad applications across various fields. However, the burgeoning volume of MPN literature necessitates the automation of chemical information extraction from the extensive corpus of unstructured data, including scientific publications. To address this challenge, we proposed a platform named MPNTEXT, which utilized natural language processing techniques and machine learning algorithms to efficiently identify and extract pertinent information, thereby assisting users in comprehending complex MPNs and their textual descriptions of applications. Users can enter keywords, such as "Fe", "drug delivery", or "tannic acid", to retrieve relevant information, which is then presented in a structured format. This study aims to provide a user-friendly tool for collecting and retrieving MPN data and promotes data-driven material design. The platform offers researchers a more convenient and efficient way to design versatile MPNs and explore their applications.
Affiliation(s)
- Zihui Huang
  - School of Biomedical and Pharmaceutical Sciences, Guangdong University of Technology, Guangzhou 510006, China
- Xinyi Li
  - School of Biomedical and Pharmaceutical Sciences, Guangdong University of Technology, Guangzhou 510006, China
- Andi Li
  - School of Biomedical and Pharmaceutical Sciences, Guangdong University of Technology, Guangzhou 510006, China
- Yuhang Yang
  - School of Biomedical and Pharmaceutical Sciences, Guangdong University of Technology, Guangzhou 510006, China
- Liqiang He
  - School of Biomedical and Pharmaceutical Sciences, Guangdong University of Technology, Guangzhou 510006, China
- Zhiwen Zhang
  - School of Biomedical and Pharmaceutical Sciences, Guangdong University of Technology, Guangzhou 510006, China
- Siwei Wu
  - School of Biomedical and Pharmaceutical Sciences, Guangdong University of Technology, Guangzhou 510006, China
- Yang Wang
  - School of Biomedical and Pharmaceutical Sciences, Guangdong University of Technology, Guangzhou 510006, China
- Shuting Cai
  - School of Integrated Circuits, Guangdong University of Technology, Guangzhou 510006, China
- Yan He
  - School of Biomedical and Pharmaceutical Sciences, Guangdong University of Technology, Guangzhou 510006, China
- Xujie Liu
  - School of Biomedical and Pharmaceutical Sciences, Guangdong University of Technology, Guangzhou 510006, China
4
Liu W, Chen J, Wang H, Fu Z, Peijnenburg WJGM, Hong H. Perspectives on Advancing Multimodal Learning in Environmental Science and Engineering Studies. Environmental Science & Technology 2024. [PMID: 39226136 DOI: 10.1021/acs.est.4c03088] [Indexed: 09/05/2024]
Abstract
The environment faces increasing anthropogenic impacts, resulting in a rapid increase in environmental issues that undermine the natural capital essential for human wellbeing. These issues are complex and often influenced by various factors represented by data with different modalities. While machine learning (ML) provides data-driven tools for addressing the environmental issues, the current ML models in environmental science and engineering (ES&E) often neglect the utilization of multimodal data. With the advancement in deep learning, multimodal learning (MML) holds promise for comprehensive descriptions of the environmental issues by harnessing data from diverse modalities. This advancement has the potential to significantly elevate the accuracy and robustness of prediction models in ES&E studies, providing enhanced solutions for various environmental modeling tasks. This perspective summarizes MML methodologies and proposes potential applications of MML models in ES&E studies, including environmental quality assessment, prediction of chemical hazards, and optimization of pollution control techniques. Additionally, we discuss the challenges associated with implementing MML in ES&E and propose future research directions in this domain.
Affiliation(s)
- Wenjia Liu
  - Key Laboratory of Industrial Ecology and Environmental Engineering (Ministry of Education), Dalian Key Laboratory on Chemicals Risk Control and Pollution Prevention Technology, School of Environmental Science and Technology, Dalian University of Technology, Dalian 116024, China
- Jingwen Chen
  - Key Laboratory of Industrial Ecology and Environmental Engineering (Ministry of Education), Dalian Key Laboratory on Chemicals Risk Control and Pollution Prevention Technology, School of Environmental Science and Technology, Dalian University of Technology, Dalian 116024, China
- Haobo Wang
  - Key Laboratory of Industrial Ecology and Environmental Engineering (Ministry of Education), Dalian Key Laboratory on Chemicals Risk Control and Pollution Prevention Technology, School of Environmental Science and Technology, Dalian University of Technology, Dalian 116024, China
- Zhiqiang Fu
  - Key Laboratory of Industrial Ecology and Environmental Engineering (Ministry of Education), Dalian Key Laboratory on Chemicals Risk Control and Pollution Prevention Technology, School of Environmental Science and Technology, Dalian University of Technology, Dalian 116024, China
- Willie J G M Peijnenburg
  - Institute of Environmental Sciences (CML), Leiden University, Leiden 2300 RA, The Netherlands
  - Centre for Safety of Substances and Products, National Institute of Public Health and the Environment (RIVM), Bilthoven 3720 BA, The Netherlands
- Huixiao Hong
  - Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, U.S. Food and Drug Administration, Jefferson, Arkansas 72079, United States
5
Ahmad F, Muhmood T. Clinical translation of nanomedicine with integrated digital medicine and machine learning interventions. Colloids Surf B Biointerfaces 2024; 241:114041. [PMID: 38897022 DOI: 10.1016/j.colsurfb.2024.114041] [Received: 02/01/2024] [Revised: 06/11/2024] [Accepted: 06/13/2024] [Indexed: 06/21/2024]
Abstract
Nanomaterial-based therapeutics are transforming disease prevention, diagnosis, and treatment as nanotechnology grows more sophisticated at a breakneck pace, but very few reach the clinic because of inconsistencies in preclinical studies followed by regulatory hindrances. To tackle this, integrating nanomedicine discovery with digital medicine provides technologies that serve as tools for measuring specific biological activity. Redundancies in nanomedicine discovery can thus be overcome through on-site data acquisition and analytics that integrate intelligent sensors with artificial intelligence (AI) or machine learning (ML). Integrated AI/ML wearable sensors gather clinically relevant biochemical information directly from the subject's body and process the data so that physicians can make the right clinical decision(s) in a time- and cost-effective way. This review summarizes insights and recommends the infusion of actionable, big-data-computation-enabled sensors into the burgeoning field of nanomedicine in academia, research institutes, and the pharmaceutical industry, with potential for clinical translation. Furthermore, many blind spots exist in modern clinically relevant computation; one that could prevent ML-guided, low-cost new nanomedicine development from being successfully translated into the clinic is also discussed.
Affiliation(s)
- Farooq Ahmad
  - State Key Laboratory of Chemistry and Utilization of Carbon Based Energy Resources, College of Chemistry, Xinjiang University, Urumqi 830017, China.
- Tahir Muhmood
  - International Iberian Nanotechnology Laboratory (INL), Avenida Mestre José Veiga, Braga 4715-330, Portugal.
6
Chen C, Li SL, Xu YY, Liu J, Graham DW, Zhu YG. Characterising global antimicrobial resistance research explains why One Health solutions are slow in development: An application of AI-based gap analysis. Environment International 2024; 187:108680. [PMID: 38723455 DOI: 10.1016/j.envint.2024.108680] [Received: 03/31/2024] [Revised: 04/16/2024] [Accepted: 04/19/2024] [Indexed: 05/19/2024]
Abstract
The global health crisis posed by increasing antimicrobial resistance (AMR) implicitly requires solutions based on a One Health approach, yet multisectoral, multidisciplinary research on AMR is rare and huge gaps exist in the knowledge needed to guide integrated action. This is partly because a comprehensive survey of past research activity has never been performed due to the massive scale and diversity of published information. Here we compiled 254,738 articles on AMR using Artificial Intelligence (AI; i.e., Natural Language Processing, NLP) methods to create a database and information retrieval system for knowledge extraction on research performed over the last 20 years. Global maps were created that describe regional, methodological, and sectoral AMR research activities; they confirm that little intersectoral research has been performed, even though such research is key to guiding science-informed policy solutions to AMR, especially in low-income countries (LICs). Further, we show that greater harmonisation in research methods across sectors and regions is urgently needed. For example, differences in analytical methods used among sectors in AMR research, such as employing culture-based versus genomic methods, result in poor communication between sectors and partially explain why One Health-based solutions are not ensuing. Therefore, our analysis suggests that performing culture-based and genomic AMR analysis in tandem in all sectors is crucial for data integration and holistic One Health solutions. Finally, increased investment in capacity development in LICs should be prioritised as they are places where the AMR burden is often greatest. Our open-access database and AI methodology can be used to further develop, disseminate, and create new tools and practices for AMR knowledge and information sharing.
Affiliation(s)
- Cai Chen
  - Key Laboratory of Urban Environment and Health, Ningbo Observation and Research Station, Institute of Urban Environment, Chinese Academy of Sciences, Xiamen 361021, China; University of Chinese Academy of Sciences, Beijing 100049, China
- Shu-Le Li
  - Key Laboratory of Urban Environment and Health, Ningbo Observation and Research Station, Institute of Urban Environment, Chinese Academy of Sciences, Xiamen 361021, China; University of Chinese Academy of Sciences, Beijing 100049, China
- Yao-Yang Xu
  - Key Laboratory of Urban Environment and Health, Ningbo Observation and Research Station, Institute of Urban Environment, Chinese Academy of Sciences, Xiamen 361021, China; Zhejiang Key Laboratory of Urban Environmental Processes and Pollution Control, CAS Haixi Industrial Technology Innovation Center in Beilun, Ningbo 315830, China
- Jue Liu
  - Department of Epidemiology and Biostatistics, School of Public Health, Peking University, Beijing 100191, China; Institute for Global Health and Development, Peking University, Beijing 100191, China
- David W Graham
  - School of Engineering, Newcastle University, Newcastle, UK
- Yong-Guan Zhu
  - Key Laboratory of Urban Environment and Health, Ningbo Observation and Research Station, Institute of Urban Environment, Chinese Academy of Sciences, Xiamen 361021, China; Zhejiang Key Laboratory of Urban Environmental Processes and Pollution Control, CAS Haixi Industrial Technology Innovation Center in Beilun, Ningbo 315830, China; State Key Laboratory of Urban and Regional Ecology, Research Center for Eco-Environmental Sciences, Chinese Academy of Sciences, Beijing 100085, China
7
Blakey M, Pearman-Kanza S, Frey JG. Zombie cheminformatics: extraction and conversion of Wiswesser Line Notation (WLN) from chemical documents. J Cheminform 2024; 16:42. [PMID: 38622746 PMCID: PMC11017645 DOI: 10.1186/s13321-024-00831-2] [Received: 11/08/2023] [Accepted: 03/23/2024] [Indexed: 04/17/2024] [Open Access]
Abstract
PURPOSE Wiswesser Line Notation (WLN) is an old line notation for encoding chemical compounds for storage and processing by computers. Whilst the notation itself has long since been surpassed by SMILES and InChI, distribution of WLN during its active years was extensive. In the context of modernising chemical data, we present a comprehensive WLN parser developed using the OpenBabel toolkit, capable of translating WLN strings into various formats supported by the library. Furthermore, we have devised a specialised Finite State Machine, constructed from the rules of WLN, enabling the recognition and extraction of chemical strings out of large bodies of text. Available open-access WLN data with corresponding SMILES or InChI notation is rare; however, ChEMBL, ChemSpider and PubChem all contain WLN records which were used for conversion scoring. Our investigation revealed a notable proportion of inaccuracies within the database entries, and we have taken steps to rectify these errors whenever feasible. SCIENTIFIC CONTRIBUTION Tools for both the extraction and conversion of WLN from chemical documents have been successfully developed. Both the Deterministic Finite Automaton (DFA) and parser handle the majority of WLN rules officially endorsed in the three major WLN manuals, with the parser showing a clear jump in accuracy and chemical coverage over previous submissions. The GitHub repository can be found here: https://github.com/Mblakey/wiswesser.
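The general technique of pulling candidate strings out of running text with a deterministic finite automaton can be sketched in a few lines of Python. This is an illustration only: the three-state automaton below encodes a deliberately tiny toy grammar (uppercase letters, digits, `&`, and an internal `-`), not the WLN rules, and the names `char_class`, `TRANSITIONS`, and `extract_tokens` are hypothetical and unrelated to the authors' tool.

```python
# States of the toy automaton.
START, ACCEPT, AFTER_DASH = 0, 1, 2


def char_class(ch):
    """Map a character to the input class used by the transition table."""
    if ch.isascii() and (ch.isupper() or ch.isdigit()):
        return "alnum"
    if ch == "&":
        return "amp"
    if ch == "-":
        return "dash"
    return "other"


# Toy grammar (an assumption, NOT WLN): start with A-Z or 0-9, continue with
# A-Z, 0-9 or '&'; a '-' is allowed only between two such characters.
TRANSITIONS = {
    (START, "alnum"): ACCEPT,
    (ACCEPT, "alnum"): ACCEPT,
    (ACCEPT, "amp"): ACCEPT,
    (ACCEPT, "dash"): AFTER_DASH,
    (AFTER_DASH, "alnum"): ACCEPT,
}


def extract_tokens(text, min_len=3):
    """Scan text and return the longest accepted match at each position."""
    tokens = []
    i = 0
    while i < len(text):
        state = START
        last_accept_end = None
        j = i
        while j < len(text):
            nxt = TRANSITIONS.get((state, char_class(text[j])))
            if nxt is None:
                break  # no transition: the automaton is stuck
            state = nxt
            j += 1
            if state == ACCEPT:
                last_accept_end = j  # remember the longest accepted prefix
        if last_accept_end is not None and last_accept_end - i >= min_len:
            tokens.append(text[i:last_accept_end])
            i = last_accept_end
        else:
            i += 1
    return tokens


# Illustrative input; the uppercase strings are placeholders for the toy grammar.
found = extract_tokens("Found QVR and L6TJ-A1 in text, also T5Z-.")
```

Tracking the last accepting position gives longest-match behaviour, so a trailing `-` that cannot be completed is left out of the extracted token. The `min_len` threshold is an arbitrary guard against matching isolated capital letters in ordinary prose.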
Affiliation(s)
- Michael Blakey
  - Department of Chemistry, University of Southampton, University Road, Southampton, Hampshire, SO17 1BJ, UK
- Samantha Pearman-Kanza
  - Department of Chemistry, University of Southampton, University Road, Southampton, Hampshire, SO17 1BJ, UK
- Jeremy G Frey
  - Department of Chemistry, University of Southampton, University Road, Southampton, Hampshire, SO17 1BJ, UK
8
Arora S, Chettri S, Percha V, Kumar D, Latwal M. Artifical intelligence: a virtual chemist for natural product drug discovery. J Biomol Struct Dyn 2024; 42:3826-3835. [PMID: 37232451 DOI: 10.1080/07391102.2023.2216295] [Received: 01/16/2023] [Accepted: 05/12/2023] [Indexed: 05/27/2023]
Abstract
Nature is full of medicinal substances, and its products are perceived as privileged structures for interacting with protein drug targets. The structural heterogeneity and unusual characteristics of natural products (NPs) have inspired scientists to work on natural product-inspired medicine. Artificial intelligence (AI) can gear up NP drug finding to confront and uncover unexplored opportunities. AI-based, natural product-inspired drug discovery acts as an innovative tool for molecular design and lead discovery. Various machine learning models produce quickly synthesizable mimetics of natural product templates. Inventing novel natural product mimetics with computer-assisted technology provides a feasible strategy for obtaining natural products with defined bioactivities. AI's hit rate underlines its importance by improving trial patterns such as dose selection, trial life span, efficacy parameters, and biomarkers. Along these lines, AI methods can be a successful, targeted tool for formulating advanced medicinal applications for natural products. 'Prediction of the future of natural product-based drug discovery is not magic; actually, it's artificial intelligence.' Communicated by Ramaswamy H. Sarma.
Affiliation(s)
- Shefali Arora
  - Department of Chemistry, University of Petroleum and Energy Studies, Dehradun, Uttarakhand, India
- Sukanya Chettri
  - Department of Chemistry, University of Petroleum and Energy Studies, Dehradun, Uttarakhand, India
- Versha Percha
  - Department of Pharmaceutical Chemistry, Dolphin (PG) Institute of Biomedical and Natural Sciences, Dehradun, Uttarakhand, India
- Deepak Kumar
  - Department of Pharmaceutical Chemistry, Dolphin (PG) Institute of Biomedical and Natural Sciences, Dehradun, Uttarakhand, India
- Mamta Latwal
  - Department of Chemistry, University of Petroleum and Energy Studies, Dehradun, Uttarakhand, India
9
Zhang X, Zhou Z, Ming C, Sun YY. GPT-Assisted Learning of Structure-Property Relationships by Graph Neural Networks: Application to Rare-Earth-Doped Phosphors. J Phys Chem Lett 2023; 14:11342-11349. [PMID: 38064589 DOI: 10.1021/acs.jpclett.3c02848] [Indexed: 12/22/2023]
Abstract
Two challenges facing machine learning tasks in materials science are data set construction and descriptor design. Graph neural networks circumvent the need for empirical descriptors by encoding geometric information in graphs. Large language models have shown promise for database construction via text extraction. Here, we apply OpenAI's Generative Pre-trained Transformer 4 (GPT-4) and the Crystal Graph Convolutional Neural Network (CGCNN) to the problem of discovering rare-earth-doped phosphors for solid-state lighting. We used GPT-4 to datamine the chemical formulas and emission wavelengths of 264 Eu²⁺-doped phosphors from 274 articles. A CGCNN model was trained on the acquired data set, achieving a test R² of 0.77. Using this model, we predicted the emission wavelengths of over 40 000 inorganic materials. We also used transfer learning to fine-tune a bandgap-predicting CGCNN model for emission wavelength prediction. The workflow requires minimal human supervision and is generalizable to other fields.
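For readers unfamiliar with the reported metric, the coefficient of determination can be computed as one minus the ratio of the residual sum of squares to the total sum of squares. The Python sketch below shows that standard definition; the function name `r_squared` and the wavelength values are hypothetical and are not taken from the paper.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all true values are equal")
    return 1.0 - ss_res / ss_tot


# Hypothetical emission wavelengths in nm (illustrative values only).
measured = [450.0, 520.0, 610.0, 655.0]
predicted = [462.0, 508.0, 598.0, 660.0]
score = r_squared(measured, predicted)
```

A value of 1 means the predictions match the measurements exactly, and a value of 0 means the model does no better than always predicting the mean of the measured values.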
Affiliation(s)
- Xiang Zhang
  - State Key Laboratory of High Performance Ceramics and Superfine Microstructure, Shanghai Institute of Ceramics, Chinese Academy of Sciences, Shanghai 201899, People's Republic of China
- Zichun Zhou
  - State Key Laboratory of High Performance Ceramics and Superfine Microstructure, Shanghai Institute of Ceramics, Chinese Academy of Sciences, Shanghai 201899, People's Republic of China
- Chen Ming
  - State Key Laboratory of High Performance Ceramics and Superfine Microstructure, Shanghai Institute of Ceramics, Chinese Academy of Sciences, Shanghai 201899, People's Republic of China
- Yi-Yang Sun
  - State Key Laboratory of High Performance Ceramics and Superfine Microstructure, Shanghai Institute of Ceramics, Chinese Academy of Sciences, Shanghai 201899, People's Republic of China
10
Zhang K, Zhou X, Li S, Zhao L, Hu W, Cai A, Zeng Y, Wang Q, Wu M, Li G, Liu J, Ji H, Qin Y, Wu L. A General Strategy for Developing Ultrasensitive "Transistor-Like" Thermochromic Fluorescent Materials for Multilevel Information Encryption. Advanced Materials (Deerfield Beach, Fla.) 2023; 35:e2305472. [PMID: 37437082 DOI: 10.1002/adma.202305472] [Received: 06/07/2023] [Accepted: 07/10/2023] [Indexed: 07/14/2023]
Abstract
Thermochromic fluorescent materials (TFMs) exhibit great potential in information encryption applications but are limited by low thermosensitivity, poor color tunability, and a wide temperature-responsive range. Herein, a novel strategy for constructing highly sensitive TFMs with tunable emission (450-650 nm) toward multilevel information encryption is proposed, which employs polarity-sensitive fluorophores with donor-acceptor-donor (D-A-D) type structures as emitters and long-chain alkanes as thermosensitive loading matrixes. The structure-function relationships between the performance of TFMs and the structures of both fluorescent emitters and phase-change molecules are systematically studied. Benefiting from the above design, the obtained TFMs exhibit over 9500-fold fluorescence enhancement toward the temperature change, as well as ultrahigh relative temperature sensitivity up to 80% K⁻¹, which are confirmed for the first time. Thanks to the superior transducing performance, the above-prepared TFMs can be further developed as information-storage platforms within a relatively narrow interval of temperature variation, including temperature-dominated multicolored information display and multilevel information encryption. This work will not only provide a novel perspective for designing superior TFMs for information encryption but also bring inspiration to the design and preparation of other response-switching-type fluorescent probes with ultrahigh conversion efficiency.
Affiliation(s)
- Ke Zhang
  - Nantong Key Laboratory of Public Health and Medical Analysis, School of Public Health, Nantong University, Nantong, Jiangsu, 226019, China
- Xiaobo Zhou
  - Nantong Key Laboratory of Public Health and Medical Analysis, School of Public Health, Nantong University, Nantong, Jiangsu, 226019, China
- Shijie Li
  - Nantong Key Laboratory of Public Health and Medical Analysis, School of Public Health, Nantong University, Nantong, Jiangsu, 226019, China
- Lingfeng Zhao
  - Nantong Key Laboratory of Public Health and Medical Analysis, School of Public Health, Nantong University, Nantong, Jiangsu, 226019, China
- Wenqi Hu
  - Nantong Key Laboratory of Public Health and Medical Analysis, School of Public Health, Nantong University, Nantong, Jiangsu, 226019, China
- Aiting Cai
  - Nantong Key Laboratory of Public Health and Medical Analysis, School of Public Health, Nantong University, Nantong, Jiangsu, 226019, China
- Yuhan Zeng
  - Nantong Key Laboratory of Public Health and Medical Analysis, School of Public Health, Nantong University, Nantong, Jiangsu, 226019, China
- Qi Wang
  - Nantong Key Laboratory of Public Health and Medical Analysis, School of Public Health, Nantong University, Nantong, Jiangsu, 226019, China
- Mingmin Wu
  - Nantong Key Laboratory of Public Health and Medical Analysis, School of Public Health, Nantong University, Nantong, Jiangsu, 226019, China
- Guo Li
  - Nantong Key Laboratory of Public Health and Medical Analysis, School of Public Health, Nantong University, Nantong, Jiangsu, 226019, China
- Jinxia Liu
  - Nantong Key Laboratory of Public Health and Medical Analysis, School of Public Health, Nantong University, Nantong, Jiangsu, 226019, China
- Haiwei Ji
  - Nantong Key Laboratory of Public Health and Medical Analysis, School of Public Health, Nantong University, Nantong, Jiangsu, 226019, China
- Yuling Qin
  - Nantong Key Laboratory of Public Health and Medical Analysis, School of Public Health, Nantong University, Nantong, Jiangsu, 226019, China
- Li Wu
  - Nantong Key Laboratory of Public Health and Medical Analysis, School of Public Health, Nantong University, Nantong, Jiangsu, 226019, China
11
Li S, Zhang Y, Fang Z, Meng K, Tian R, He H, Sun S. Extracting the Synthetic Route of Pd-Based Catalysts in Methanol Steam Reforming from the Scientific Literature. J Chem Inf Model 2023; 63:6249-6260. [PMID: 37807535 DOI: 10.1021/acs.jcim.3c01442] [Indexed: 10/10/2023]
Abstract
The structured material synthesis route is crucial for chemists in performing experiments and modern applications such as machine learning material design. With the exponential growth of the chemical literature in recent years, manual extraction from the published literature is time-consuming and labor-intensive. This study focuses on developing an automated method for extracting Pd-based catalyst synthesis routes from the chemical literature. First, a paragraph classification model based on regular expressions is employed to identify paragraphs that contain material synthesis processes. The identified paragraphs are verified using machine learning techniques. Second, natural language processing techniques are applied to automatically parse the material synthesis routes from the identified paragraphs, generate regularized flowcharts, and output structured data. Lastly, we utilized the structured data of the synthesis routes to train machine learning models and predict the performance of the materials. The extracted material entities include the product, preparation method, precursor, support, loading, synthesis operation, and operation condition. This method avoids extensive manual data annotation and improves the scientific literature information acquisition efficiency. The accuracy of the 11 material entities exceeds 80%, and the accuracy of the method, support, precursor, drying time, and reduction time exceeds 90%.
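The first step described above, a regular-expression paragraph classifier, can be sketched as follows in Python. This is a minimal illustration under stated assumptions: the keyword list, the threshold `min_hits`, and the names `SYNTHESIS_PATTERN` and `is_synthesis_paragraph` are hypothetical, and the patterns used in the paper are not given in the abstract.

```python
import re

# Hypothetical synthesis-related keyword stems; the real patterns may differ.
SYNTHESIS_PATTERN = re.compile(
    r"\b(impregnat\w*|calcin\w*|precursor\w*|dried|drying|reduc\w*|stirr\w*|dissolv\w*)\b",
    re.IGNORECASE,
)


def is_synthesis_paragraph(paragraph, min_hits=2):
    """Flag a paragraph when it contains at least min_hits distinct matched words."""
    hits = {m.group(1).lower() for m in SYNTHESIS_PATTERN.finditer(paragraph)}
    return len(hits) >= min_hits


# Illustrative paragraphs written for this example, not quoted from any paper.
procedure = ("The Pd/ZnO catalyst was prepared by impregnation, dried at 110 C "
             "for 12 h, and calcined at 400 C.")
background = "Methanol steam reforming is a promising route to hydrogen production."

labels = [is_synthesis_paragraph(procedure), is_synthesis_paragraph(background)]
```

Requiring more than one distinct keyword is a simple way to reduce false positives from paragraphs that mention a single synthesis term in passing; a second-stage check, such as the machine learning verification the abstract describes, would still be needed.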
Collapse
Affiliation(s)
- Shuyuan Li
- Beijing Key Laboratory for Green Catalysis and Separation, Faculty of Environment and Life, Beijing University of Technology, Beijing 100124, China
| | - Yunjiang Zhang
- Beijing Key Laboratory for Green Catalysis and Separation, Faculty of Environment and Life, Beijing University of Technology, Beijing 100124, China
| | - Zhaolin Fang
- Beijing Key Laboratory for Green Catalysis and Separation, Faculty of Environment and Life, Beijing University of Technology, Beijing 100124, China
| | - Kong Meng
- Beijing Key Laboratory for Green Catalysis and Separation, Faculty of Environment and Life, Beijing University of Technology, Beijing 100124, China
| | - Rui Tian
- Beijing Engineering Research Center for IoT Software and Systems, Faculty of Information Technology, Beijing University of Technology, Beijing 100124, China
| | - Hong He
- Beijing Key Laboratory for Green Catalysis and Separation, Faculty of Environment and Life, Beijing University of Technology, Beijing 100124, China
| | - Shaorui Sun
- Beijing Key Laboratory for Green Catalysis and Separation, Faculty of Environment and Life, Beijing University of Technology, Beijing 100124, China
| |
|
12
|
Kiseleva OI, Kurbatov IY, Arzumanian VA, Ilgisonis EV, Zakharov SV, Poverennaya EV. The Expectation and Reality of the HepG2 Core Metabolic Profile. Metabolites 2023; 13:908. [PMID: 37623852 PMCID: PMC10456947 DOI: 10.3390/metabo13080908] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/25/2023] [Revised: 07/29/2023] [Accepted: 08/01/2023] [Indexed: 08/26/2023] Open
Abstract
To represent the composition of small molecules circulating in HepG2 cells and the formation of the "core" of characteristic metabolites that often attract researchers' attention, we conducted a meta-analysis of 56 datasets obtained through metabolomic profiling via mass spectrometry and NMR. We highlighted the 288 most commonly studied compounds of diverse chemical nature and analyzed metabolic processes involving these small molecules. Building a complete map of the metabolome of a cell, which encompasses the diversity of possible impacts on it, is a severe challenge for the scientific community, which is faced not only with natural limitations of experimental technologies, but also with the absence of transparent and widely accepted standards for processing and presenting the obtained metabolomic data. In formulating our research design, we aimed to reveal metabolites crucial to the HepG2 cell line, regardless of all chemical and/or physical impact factors. Unfortunately, the existing paradigm of data policy leads to a streetlight effect. When analyzing and reporting only target metabolites of interest, the community ignores the changes in the metabolomic landscape that hide many molecular secrets.
Affiliation(s)
- Olga I. Kiseleva
- Institute of Biomedical Chemistry, Pogodinskaya Street, 10, 119121 Moscow, Russia
| | - Ilya Y. Kurbatov
- Institute of Biomedical Chemistry, Pogodinskaya Street, 10, 119121 Moscow, Russia
| | - Viktoriia A. Arzumanian
- Institute of Biomedical Chemistry, Pogodinskaya Street, 10, 119121 Moscow, Russia
| | - Ekaterina V. Ilgisonis
- Institute of Biomedical Chemistry, Pogodinskaya Street, 10, 119121 Moscow, Russia
| | - Svyatoslav V. Zakharov
- Chemistry Department, Lomonosov Moscow State University, Leninskie gory Street, 1/3, 119991 Moscow, Russia
| | - Ekaterina V. Poverennaya
- Institute of Biomedical Chemistry, Pogodinskaya Street, 10, 119121 Moscow, Russia
| |
|
13
|
Xie W, Fan K, Zhang S, Li L. Multiple sampling schemes and deep learning improve active learning performance in drug-drug interaction information retrieval analysis from the literature. J Biomed Semantics 2023; 14:5. [PMID: 37248476 PMCID: PMC10228061 DOI: 10.1186/s13326-023-00287-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/09/2022] [Accepted: 04/29/2023] [Indexed: 05/31/2023] Open
Abstract
BACKGROUND Drug-drug interaction (DDI) information retrieval (IR) from the PubMed literature is an important natural language processing (NLP) task. For the first time, active learning (AL) is studied in DDI IR analysis. DDI IR analysis from PubMed abstracts faces the challenge of relatively few positive DDI samples among an overwhelmingly large number of negative samples. Random negative sampling and positive sampling are purposely designed to improve the efficiency of AL analysis. The consistency of random negative sampling and positive sampling is shown in the paper. RESULTS PubMed abstracts are divided into two pools. The screened pool contains all abstracts that pass the DDI keywords query in PubMed, while the unscreened pool includes all the other abstracts. At a prespecified recall rate of 0.95, DDI IR analysis precision is evaluated and compared. In screened pool IR analysis using a support vector machine (SVM), similarity sampling plus uncertainty sampling improves the precision over uncertainty sampling alone, from 0.89 to 0.92. In the unscreened pool IR analysis, the integrated random negative sampling, positive sampling, and similarity sampling improve the precision over uncertainty sampling alone, from 0.72 to 0.81. When we change the SVM to a deep learning method, all sampling schemes consistently improve DDI AL analysis in both the screened pool and the unscreened pool. Deep learning significantly improves precision over SVM: 0.96 vs. 0.92 in the screened pool, and 0.90 vs. 0.81 in the unscreened pool. CONCLUSIONS By integrating various sampling schemes and deep learning algorithms into AL, the DDI IR analysis from literature is significantly improved. Random negative sampling and positive sampling are highly effective methods for improving AL analysis where the positive and negative samples are extremely imbalanced.
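Uncertainty sampling, the baseline query strategy named in this abstract, can be sketched in a few lines. This is a generic illustration rather than the authors' implementation: the function name `uncertainty_sample`, the example probabilities, and the use of distance from 0.5 as the uncertainty score for a binary classifier are all assumptions.

```python
def uncertainty_sample(probabilities, k):
    """Return indices of the k pool items whose predicted P(positive) is closest to 0.5.

    `probabilities` holds one model score per unlabeled abstract; the items the
    model is least sure about are the ones sent to a human annotator next.
    """
    ranked = sorted(
        range(len(probabilities)),
        key=lambda i: abs(probabilities[i] - 0.5),
    )
    return ranked[:k]


# Hypothetical classifier scores for six unlabeled abstracts.
pool_probs = [0.02, 0.48, 0.91, 0.55, 0.10, 0.70]
print(uncertainty_sample(pool_probs, 2))  # → [1, 3]
```

With a heavily imbalanced pool, most scores sit near 0 and few items land near the decision boundary, which is one reason a study might complement this strategy with dedicated positive and random negative sampling, as the abstract describes.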
Affiliation(s)
- Weixin Xie
- Department of Biomedical Informatics, Ohio State University, Columbus, OH 43210 USA
| | - Kunjie Fan
- Department of Biomedical Informatics, Ohio State University, Columbus, OH 43210 USA
| | - Shijun Zhang
- Department of Biomedical Informatics, Ohio State University, Columbus, OH 43210 USA
| | - Lang Li
- Department of Biomedical Informatics, Ohio State University, Columbus, OH 43210 USA
| |
|
14
|
Wang L, Gao Y, Chen X, Cui W, Zhou Y, Luo X, Xu S, Du Y, Wang B. A corpus of CO2 electrocatalytic reduction process extracted from the scientific literature. Sci Data 2023; 10:175. [PMID: 36991006 PMCID: PMC10060421 DOI: 10.1038/s41597-023-02089-z] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 02/05/2023] [Accepted: 03/19/2023] [Indexed: 03/31/2023] Open
Abstract
The electrocatalytic CO2 reduction process has gained enormous attention for both environmental protection and chemical production. In this context, the design of new electrocatalysts with high activity and selectivity can draw inspiration from the abundant scientific literature. An annotated and verified corpus built from this large body of literature can assist the development of natural language processing (NLP) models, which can offer insight to help guide the understanding of the underlying mechanisms. To facilitate data mining in this direction, we present a benchmark corpus of 6,086 records manually extracted from 835 electrocatalytic publications, along with an extended corpus with 145,179 records in this article. In this corpus, nine types of knowledge such as material, regulation method, product, faradaic efficiency, cell setup, electrolyte, synthesis method, current density, and voltage are provided by either annotating or extracting. Machine learning algorithms can be applied to the corpus to help scientists find new and effective electrocatalysts. Furthermore, researchers familiar with NLP can use this corpus to design domain-specific named entity recognition (NER) models.
Affiliation(s)
- Ludi Wang
- Laboratory of Big Data Knowledge, Computer Network Information Center, Chinese Academy of Sciences, Beijing, 100083, China
| | - Yang Gao
- CAS Key Laboratory of Nanosystem and Hierarchical Fabrication, National Center for Nanoscience and Technology (NCNST), Beijing, 100190, China
| | - Xueqing Chen
- Laboratory of Big Data Knowledge, Computer Network Information Center, Chinese Academy of Sciences, Beijing, 100083, China
- University of Chinese Academy of Sciences, Beijing, 100049, China
| | - Wenjuan Cui
- Laboratory of Big Data Knowledge, Computer Network Information Center, Chinese Academy of Sciences, Beijing, 100083, China
- University of Chinese Academy of Sciences, Beijing, 100049, China
| | - Yuanchun Zhou
- Laboratory of Big Data Knowledge, Computer Network Information Center, Chinese Academy of Sciences, Beijing, 100083, China
- University of Chinese Academy of Sciences, Beijing, 100049, China
| | - Xinying Luo
- CAS Key Laboratory of Nanosystem and Hierarchical Fabrication, National Center for Nanoscience and Technology (NCNST), Beijing, 100190, China
| | - Shuaishuai Xu
- CAS Key Laboratory of Nanosystem and Hierarchical Fabrication, National Center for Nanoscience and Technology (NCNST), Beijing, 100190, China
| | - Yi Du
- Laboratory of Big Data Knowledge, Computer Network Information Center, Chinese Academy of Sciences, Beijing, 100083, China.
- University of Chinese Academy of Sciences, Beijing, 100049, China.
| | - Bin Wang
- CAS Key Laboratory of Nanosystem and Hierarchical Fabrication, National Center for Nanoscience and Technology (NCNST), Beijing, 100190, China.
| |
|
15
|
He K, Mao R, Gong T, Cambria E, Li C. JCBIE: a joint continual learning neural network for biomedical information extraction. BMC Bioinformatics 2022; 23:549. [PMID: 36536280 PMCID: PMC9761970 DOI: 10.1186/s12859-022-05096-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/11/2022] [Accepted: 12/05/2022] [Indexed: 12/23/2022] Open
Abstract
Extracting knowledge from heterogeneous data sources is fundamental for the construction of structured biomedical knowledge graphs (BKGs), where entities and relations are represented as nodes and edges in the graphs, respectively. Previous biomedical knowledge extraction methods considered only limited entity types and relations by using a task-specific training set, which is insufficient for large-scale BKG development and downstream task applications in different scenarios. To alleviate this issue, we propose a joint continual learning biomedical information extraction (JCBIE) network to extract entities and relations from different biomedical information datasets. By empirically studying different joint learning and continual learning strategies, the proposed JCBIE can learn and expand different types of entities and relations from different datasets. JCBIE uses two separate encoders in joint-feature extraction and can hence effectively avoid the feature confusion problem, compared with using a single hard-parameter-sharing encoder. Specifically, it allows us to adopt entity-augmented inputs to establish the interaction between named entity recognition and relation extraction. Finally, a novel evaluation mechanism is proposed for measuring cross-corpus generalization errors, which were ignored by traditional evaluation methods. Our empirical studies show that JCBIE achieves promising performance when a continual learning strategy is adopted with multiple corpora.
Affiliation(s)
- Kai He
- School of Computer Science and Technology, Xi’an Jiaotong University, Xi’an, Shaanxi, China
- Shaanxi Provincial Key Laboratory of Big Data Knowledge Engineering, Xi’an Jiaotong University, Xi’an, Shaanxi, China
- National Engineering Lab for Big Data Analytics, Xi’an Jiaotong University, Xi’an, Shaanxi, China
| | - Rui Mao
- School of Computer Science and Engineering, Nanyang Technological University, Singapore, Singapore
| | - Tieliang Gong
- School of Computer Science and Technology, Xi’an Jiaotong University, Xi’an, Shaanxi, China
- Shaanxi Provincial Key Laboratory of Big Data Knowledge Engineering, Xi’an Jiaotong University, Xi’an, Shaanxi, China
- National Engineering Lab for Big Data Analytics, Xi’an Jiaotong University, Xi’an, Shaanxi, China
| | - Erik Cambria
- School of Computer Science and Engineering, Nanyang Technological University, Singapore, Singapore
| | - Chen Li
- School of Computer Science and Technology, Xi’an Jiaotong University, Xi’an, Shaanxi, China
- Shaanxi Provincial Key Laboratory of Big Data Knowledge Engineering, Xi’an Jiaotong University, Xi’an, Shaanxi, China
- National Engineering Lab for Big Data Analytics, Xi’an Jiaotong University, Xi’an, Shaanxi, China
| |
|
16
|
Duan Y, Rosaleny LE, Coutinho JT, Giménez-Santamarina S, Scheie A, Baldoví JJ, Cardona-Serra S, Gaita-Ariño A. Data-driven design of molecular nanomagnets. Nat Commun 2022; 13:7626. [PMID: 36494346 PMCID: PMC9734471 DOI: 10.1038/s41467-022-35336-9] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 07/22/2022] [Accepted: 11/29/2022] [Indexed: 12/13/2022] Open
Abstract
Three decades of research in molecular nanomagnets have raised their magnetic memories from liquid helium to liquid nitrogen temperature thanks to a wise choice of the magnetic ion and coordination environment. Still, serendipity and chemical intuition played a main role. In order to establish a powerful framework for statistically driven chemical design, here we collected chemical and physical data for lanthanide-based nanomagnets, catalogued over 1400 published experiments, developed an interactive dashboard (SIMDAVIS) to visualise the dataset, and applied inferential statistical analysis. Our analysis shows that the Arrhenius energy barrier correlates unexpectedly well with the magnetic memory. Furthermore, as both Orbach and Raman processes can be affected by vibronic coupling, chemical design of the coordination scheme may be used to reduce the relaxation rates. Indeed, only bis-phthalocyaninato sandwiches and metallocenes, with rigid ligands, consistently present magnetic memory up to high temperature. Analysing magnetostructural correlations, we offer promising strategies for improvement, in particular for the preparation of pentagonal bipyramids, where even softer complexes are protected against molecular vibrations.
Affiliation(s)
- Yan Duan
- Instituto de Ciencia Molecular (ICMol), Universitat de València, C/Catedrático José Beltrán 2, 46980, Paterna, Spain
- Spin-X Institute, South China University of Technology, 510641, Guangzhou, People's Republic of China
| | - Lorena E Rosaleny
- Instituto de Ciencia Molecular (ICMol), Universitat de València, C/Catedrático José Beltrán 2, 46980, Paterna, Spain.
| | - Joana T Coutinho
- Instituto de Ciencia Molecular (ICMol), Universitat de València, C/Catedrático José Beltrán 2, 46980, Paterna, Spain.
- Centre for Rapid and Sustainable Product Development, Polytechnic of Leiria, 2430-028, Marinha Grande, Portugal.
| | - Silvia Giménez-Santamarina
- Instituto de Ciencia Molecular (ICMol), Universitat de València, C/Catedrático José Beltrán 2, 46980, Paterna, Spain
| | - Allen Scheie
- Neutron Scattering Division, Oak Ridge National Laboratory, Oak Ridge, TN, 37831, USA
| | - José J Baldoví
- Instituto de Ciencia Molecular (ICMol), Universitat de València, C/Catedrático José Beltrán 2, 46980, Paterna, Spain
| | - Salvador Cardona-Serra
- Instituto de Ciencia Molecular (ICMol), Universitat de València, C/Catedrático José Beltrán 2, 46980, Paterna, Spain
| | - Alejandro Gaita-Ariño
- Instituto de Ciencia Molecular (ICMol), Universitat de València, C/Catedrático José Beltrán 2, 46980, Paterna, Spain.
| |
|
17
|
Dinh TT, Vo-Chanh TP, Nguyen C, Huynh VQ, Vo N, Nguyen HD. Extract antibody and antigen names from biomedical literature. BMC Bioinformatics 2022; 23:524. [PMID: 36474140 PMCID: PMC9727932 DOI: 10.1186/s12859-022-04993-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/25/2022] [Accepted: 10/18/2022] [Indexed: 12/12/2022] Open
Abstract
BACKGROUND The roles of antibody and antigen are indispensable in targeted diagnosis, therapy, and biomedical discovery. On top of that, massive numbers of new scientific articles about antibodies and/or antigens are published each year, which is a precious knowledge resource that has yet to be exploited to its full potential. We, therefore, aim to develop a biomedical natural language processing tool that can automatically identify antibody and antigen entities from articles. RESULTS We first annotated an antibody-antigen corpus including 3210 relevant PubMed abstracts using a semi-automatic approach. The Inter-Annotator Agreement score of 3 annotators ranges from 91.46 to 94.31%, indicating that the annotations are consistent and the corpus is reliable. We then used the corpus to develop and optimize BiLSTM-CRF-based and BioBERT-based models. The models achieved overall F1 scores of 62.49% and 81.44%, respectively, which showed potential for newly studied entities. The two models served as the foundation for the development of a named entity recognition (NER) tool that automatically recognizes antibody and antigen names from biomedical literature. CONCLUSIONS Our antibody-antigen NER models enable users to automatically extract antibody and antigen names from scientific articles without manually scanning through vast amounts of data and information in the literature. The output of NER can be used to automatically populate antibody-antigen databases, support antibody validation, and help researchers find the most appropriate antibodies of interest. The packaged NER model is available at https://github.com/TrangDinh44/ABAG_BioBERT.git.
Affiliation(s)
- Thuy Trang Dinh
- Center for Bioscience and Biotechnology, University of Science, Ho Chi Minh City, Vietnam
- Vietnam National University, Ho Chi Minh City, Vietnam
| | - Trang Phuong Vo-Chanh
- Center for Bioscience and Biotechnology, University of Science, Ho Chi Minh City, Vietnam
- Vietnam National University, Ho Chi Minh City, Vietnam
| | - Chau Nguyen
- Center for Bioscience and Biotechnology, University of Science, Ho Chi Minh City, Vietnam
- Vietnam National University, Ho Chi Minh City, Vietnam
| | - Viet Quoc Huynh
- Center for Bioscience and Biotechnology, University of Science, Ho Chi Minh City, Vietnam
- Vietnam National University, Ho Chi Minh City, Vietnam
| | - Nam Vo
- Center for Bioscience and Biotechnology, University of Science, Ho Chi Minh City, Vietnam
- Vietnam National University, Ho Chi Minh City, Vietnam
- Laboratory of Molecular Biotechnology, University of Science, Ho Chi Minh City, Vietnam
| | - Hoang Duc Nguyen
- Center for Bioscience and Biotechnology, University of Science, Ho Chi Minh City, Vietnam
- Vietnam National University, Ho Chi Minh City, Vietnam
| |
|
18
|
Mai H, Le TC, Chen D, Winkler DA, Caruso RA. Machine Learning in the Development of Adsorbents for Clean Energy Application and Greenhouse Gas Capture. Advanced Science (Weinheim, Baden-Wurttemberg, Germany) 2022; 9:e2203899. [PMID: 36285802 PMCID: PMC9798988 DOI: 10.1002/advs.202203899] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 07/07/2022] [Revised: 09/27/2022] [Indexed: 06/04/2023]
Abstract
Addressing climate change challenges by reducing greenhouse gas levels requires innovative adsorbent materials for clean energy applications. Recent progress in machine learning has stimulated technological breakthroughs in the discovery, design, and deployment of materials with potential for high-performance and low-cost clean energy applications. This review summarizes basic machine learning methods (data collection, featurization, model generation, and model evaluation) and reviews their use in the development of robust adsorbent materials. Key case studies are provided where these methods are used to accelerate adsorbent materials design and discovery, optimize synthesis conditions, and understand complex feature-property relationships. The review provides a concise resource for researchers wishing to use machine learning methods to rapidly develop effective adsorbent materials with a positive impact on the environment.
Affiliation(s)
- Haoxin Mai
- Applied Chemistry and Environmental Science, School of Science, STEM College, RMIT University, Melbourne, Victoria 3001, Australia
| | - Tu C. Le
- School of Engineering, STEM College, RMIT University, GPO Box 2476, Melbourne, Victoria 3001, Australia
| | - Dehong Chen
- Applied Chemistry and Environmental Science, School of Science, STEM College, RMIT University, Melbourne, Victoria 3001, Australia
| | - David A. Winkler
- Monash Institute of Pharmaceutical Sciences, Monash University, Parkville, VIC 3052, Australia
- School of Biochemistry and Chemistry, La Trobe University, Kingsbury Drive, Bundoora 3042, Australia
- School of Pharmacy, University of Nottingham, Nottingham NG7 2RD, UK
| | - Rachel A. Caruso
- Applied Chemistry and Environmental Science, School of Science, STEM College, RMIT University, Melbourne, Victoria 3001, Australia
| |
|
19
|
Islamaj R, Leaman R, Cissel D, Coss C, Denicola J, Fisher C, Guzman R, Kochar PG, Miliaras N, Punske Z, Sekiya K, Trinh D, Whitman D, Schmidt S, Lu Z. NLM-Chem-BC7: manually annotated full-text resources for chemical entity annotation and indexing in biomedical articles. Database (Oxford) 2022; 2022:baac102. [PMID: 36458799 PMCID: PMC9716560 DOI: 10.1093/database/baac102] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/29/2022] [Revised: 10/17/2022] [Accepted: 11/28/2022] [Indexed: 12/03/2022]
Abstract
The automatic recognition of chemical names and their corresponding database identifiers in biomedical text is an important first step for many downstream text-mining applications. The task is even more challenging when considering the identification of these entities in the article's full text and, furthermore, the identification of candidate substances for that article's metadata [Medical Subject Heading (MeSH) article indexing]. The National Library of Medicine (NLM)-Chem track at BioCreative VII aimed to foster the development of algorithms that can predict with high quality the chemical entities in the biomedical literature and further identify the chemical substances that are candidates for article indexing. As a result of this challenge, the NLM-Chem track produced two comprehensive, manually curated corpora annotated with chemical entities and indexed with chemical substances: the chemical identification corpus and the chemical indexing corpus. The NLM-Chem BioCreative VII (NLM-Chem-BC7) Chemical Identification corpus consists of 204 full-text PubMed Central (PMC) articles, fully annotated for chemical entities by 12 NLM indexers for both span (i.e. named entity recognition) and normalization (i.e. entity linking) using MeSH. This resource was used for the training and testing of the Chemical Identification task to evaluate the accuracy of algorithms in predicting chemicals mentioned in recently published full-text articles. The NLM-Chem-BC7 Chemical Indexing corpus consists of 1333 recently published PMC articles, equipped with chemical substance indexing by manual experts at the NLM. This resource was used for the evaluation of the Chemical Indexing task, which evaluated the accuracy of algorithms in predicting the chemicals that should be indexed, i.e. appear in the listing of MeSH terms for the document. 
This set was further enriched after the challenge in two ways: (i) 11 NLM indexers manually verified each of the candidate terms appearing in the prediction results of the challenge participants, but not in the MeSH indexing, and the chemical indexing terms appearing in the MeSH indexing list, but not in the prediction results, and (ii) the challenge organizers algorithmically merged the chemical entity annotations in the full text for all predicted chemical entities and used a statistical approach to keep those with the highest degree of confidence. As a result, the NLM-Chem-BC7 Chemical Indexing corpus is a gold-standard corpus for chemical indexing of journal articles and a silver-standard corpus for chemical entity identification in full-text journal articles. Together, these resources are currently the most comprehensive resources for chemical entity recognition, and we demonstrate improvements in the chemical entity recognition algorithms. We detail the characteristics of these novel resources and make them available for the community. Database URL: https://ftp.ncbi.nlm.nih.gov/pub/lu/NLM-Chem-BC7-corpus/.
Affiliation(s)
- Rezarta Islamaj
- National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD 20894, USA
| | - Robert Leaman
- National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD 20894, USA
| | - David Cissel
- National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD 20894, USA
| | - Cathleen Coss
- National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD 20894, USA
| | - Joseph Denicola
- National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD 20894, USA
| | - Carol Fisher
- National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD 20894, USA
| | - Rob Guzman
- National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD 20894, USA
| | - Preeti Gokal Kochar
- National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD 20894, USA
| | - Nicholas Miliaras
- National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD 20894, USA
| | - Zoe Punske
- National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD 20894, USA
| | - Keiko Sekiya
- National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD 20894, USA
| | - Dorothy Trinh
- National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD 20894, USA
| | - Deborah Whitman
- National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD 20894, USA
| | - Susan Schmidt
- National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD 20894, USA
| | - Zhiyong Lu
- National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD 20894, USA
| |
|
20
|
Zhang X, Mao R, Cambria E. A survey on syntactic processing techniques. Artif Intell Rev 2022. [DOI: 10.1007/s10462-022-10300-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/10/2022]
|
21
|
Abstract
Artificial intelligence (AI) methods have been and are now being increasingly integrated in prediction software implemented in bioinformatics and its glycoscience branch known as glycoinformatics. AI techniques have evolved in the past decades, and their applications in glycoscience are not yet widespread. This limited use is partly explained by the peculiarities of glyco-data that are notoriously hard to produce and analyze. Nonetheless, as time goes on, the accumulation of glycomics, glycoproteomics, and glycan-binding data has reached a point where even the most recent deep learning methods can provide predictors with good performance. We discuss the historical development of the application of various AI methods in the broader field of glycoinformatics. A particular focus is placed on shining a light on challenges in glyco-data handling, contextualized by lessons learnt from related disciplines. Ending on the discussion of state-of-the-art deep learning approaches in glycoinformatics, we also envision the future of glycoinformatics, including developments that need to occur in order to truly unleash the capabilities of glycoscience in the systems biology era.
Affiliation(s)
- Daniel Bojar
- Department of Chemistry and Molecular Biology, University of Gothenburg, Gothenburg 41390, Sweden
- Wallenberg Centre for Molecular and Translational Medicine, University of Gothenburg, Gothenburg 41390, Sweden
| | - Frederique Lisacek
- Proteome Informatics Group, Swiss Institute of Bioinformatics, CH-1227 Geneva, Switzerland
- Computer Science Department & Section of Biology, University of Geneva, route de Drize 7, CH-1227, Geneva, Switzerland
| |
|
22
|
Mroz A, Posligua V, Tarzia A, Wolpert EH, Jelfs KE. Into the Unknown: How Computation Can Help Explore Uncharted Material Space. J Am Chem Soc 2022; 144:18730-18743. [PMID: 36206484 PMCID: PMC9585593 DOI: 10.1021/jacs.2c06833] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 06/29/2022] [Indexed: 11/28/2022]
Abstract
Novel functional materials are urgently needed to help combat the major global challenges facing humanity, such as climate change and resource scarcity. Yet, the traditional experimental materials discovery process is slow and the material space at our disposal is too vast to effectively explore using intuition-guided experimentation alone. Most experimental materials discovery programs necessarily focus on exploring the local space of known materials, so we are not fully exploiting the enormous potential material space, where more novel materials with unique properties may exist. Computation, facilitated by improvements in open-source software and databases, as well as computer hardware, has the potential to significantly accelerate the rational development of materials, but all too often is only used to postrationalize experimental observations. Thus, the true predictive power of computation, where theory leads experimentation, is not fully utilized. Here, we discuss the challenges to successful implementation of computation-driven materials discovery workflows, and then focus on the progress of the field, with a particular emphasis on the challenges to reaching novel materials.
Affiliation(s)
- Austin M. Mroz
- Department of Chemistry, Molecular Sciences Research Hub, Imperial College London, White City Campus, Wood Lane, London, W12 0BZ, U.K.
| | - Victor Posligua
- Department of Chemistry, Molecular Sciences Research Hub, Imperial College London, White City Campus, Wood Lane, London, W12 0BZ, U.K.
| | - Andrew Tarzia
- Department of Chemistry, Molecular Sciences Research Hub, Imperial College London, White City Campus, Wood Lane, London, W12 0BZ, U.K.
| | - Emma H. Wolpert
- Department of Chemistry, Molecular Sciences Research Hub, Imperial College London, White City Campus, Wood Lane, London, W12 0BZ, U.K.
| | - Kim E. Jelfs
- Department of Chemistry, Molecular Sciences Research Hub, Imperial College London, White City Campus, Wood Lane, London, W12 0BZ, U.K.
| |
|
23
|
Parastar H, Tauler R. Big (Bio)Chemical Data Mining Using Chemometric Methods: A Need for Chemists. Angew Chem Int Ed Engl 2022. [DOI: 10.1002/ange.201801134] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/09/2022]
Affiliation(s)
- Hadi Parastar
- Department of Chemistry Sharif University of Technology Tehran Iran
| | - Roma Tauler
- Department of Environmental Chemistry IDAEA-CSIC 08034 Barcelona Spain
| |
|
24
|
Ohms J. Validity of PubChem compounds supplied by Patentscope or SureChEMBL. World Patent Information 2022. [DOI: 10.1016/j.wpi.2022.102134] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/30/2022]
|
25
|
Ghosh S, Lu K. Band gap information extraction from materials science literature – a pilot study. Aslib J Inform Manag 2022. [DOI: 10.1108/ajim-03-2022-0141] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/17/2022]
Abstract
Purpose: The purpose of this paper is to present a preliminary work on extracting band gap information of materials from academic papers. With increasing demand for renewable energy, band gap information will help material scientists design and implement novel photovoltaic (PV) cells. Design/methodology/approach: The authors collected 1.44 million titles and abstracts of scholarly articles related to materials science, and then filtered the collection to 11,939 articles that potentially contain relevant information about materials and their band gap values. ChemDataExtractor was extended to extract information about PV materials and their band gap information. Evaluation was performed on randomly sampled information records of 415 papers. Findings: The findings of this study show that the current system is able to correctly extract information for 51.32% of articles, with partially correct extraction for 36.62% of articles and incorrect extraction for 12.04%. The authors have also identified the errors belonging to three main categories pertaining to chemical entity identification, band gap information and interdependency resolution. Future work will focus on addressing these errors to improve the performance of the system. Originality/value: The authors did not find any literature to date on band gap information extraction from academic text using automated methods. This work is unique and original. Band gap information is of importance to materials scientists in applications such as solar cells, light emitting diodes and laser diodes.
|
26
|
Tarasova OA, Rudik AV, Biziukova NY, Filimonov DA, Poroikov VV. Chemical named entity recognition in the texts of scientific publications using the naïve Bayes classifier approach. J Cheminform 2022; 14:55. [PMID: 35964150 PMCID: PMC9375066 DOI: 10.1186/s13321-022-00633-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/04/2022] [Accepted: 07/12/2022] [Indexed: 11/24/2022] Open
Abstract
Motivation: Application of chemical named entity recognition (CNER) algorithms allows retrieval of information from texts about chemical compound identifiers and creates associations with physical–chemical properties and biological activities. Scientific texts represent low-formalized sources of information. Most methods aimed at CNER are based on machine learning approaches, including conditional random fields and deep neural networks. In general, most machine learning approaches require either vector or sparse word representation of texts. Chemical named entities (CNEs) constitute only a small fraction of the whole text, and the datasets used for training are highly imbalanced. Methods and results: We propose a new method for extracting CNEs from texts based on the naïve Bayes classifier combined with specially developed filters. In contrast to the earlier developed CNER methods, our approach uses the representation of the data as a set of fragments of text (FoTs) with the subsequent preparation of a set of multi-n-grams (sequences from one to n symbols) for each FoT. Our approach may provide the recognition of novel CNEs. For the CHEMDNER corpus, the sensitivity (recall) was 0.95, precision was 0.74, specificity was 0.88, and balanced accuracy was 0.92 based on five-fold cross validation. We applied the developed algorithm to the extracted CNEs of potential Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) main protease (Mpro) inhibitors. A set of CNEs corresponding to the chemical substances evaluated in the biochemical assays used for the discovery of Mpro inhibitors was retrieved. Manual analysis of the appropriate texts showed that CNEs of potential SARS-CoV-2 Mpro inhibitors were successfully identified by our method.
Conclusion: The obtained results show that the proposed method can be used for filtering out words that are not related to CNEs; therefore, it can be successfully applied to the extraction of CNEs for the purposes of cheminformatics and medicinal chemistry. Supplementary Information: The online version contains supplementary material available at 10.1186/s13321-022-00633-4.
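The two quantitative ideas in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `multi_n_grams` and `balanced_accuracy` are hypothetical, "symbols" is assumed here to mean characters, and the naïve Bayes classifier and filters are not reproduced. The sketch only shows what a set of sequences "from one to n symbols" looks like for one text fragment, and that the reported balanced accuracy is consistent with the mean of the reported sensitivity and specificity.

```python
def multi_n_grams(fragment: str, n: int) -> list[str]:
    """Return every contiguous character sequence of length 1..n in `fragment`.

    Assumption: a "symbol" is a single character of the fragment.
    """
    grams = []
    for size in range(1, n + 1):
        for start in range(len(fragment) - size + 1):
            grams.append(fragment[start:start + size])
    return grams


def balanced_accuracy(sensitivity: float, specificity: float) -> float:
    """Balanced accuracy is the arithmetic mean of sensitivity and specificity."""
    return (sensitivity + specificity) / 2


# Illustrative fragment (not taken from the paper's corpus).
print(multi_n_grams("NaCl", 2))
# Values reported in the abstract: sensitivity 0.95, specificity 0.88.
print(balanced_accuracy(0.95, 0.88))
```

With the reported sensitivity of 0.95 and specificity of 0.88, the mean is about 0.915, which rounds to the 0.92 quoted in the abstract.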
Affiliation(s)
- O A Tarasova
- Laboratory of Structure-Function Based Drug Design, Institute of Biomedical Chemistry, 10 bldg. 8, Pogodinskaya Str., Moscow, 119121, Russia.
| | - A V Rudik
- Laboratory of Structure-Function Based Drug Design, Institute of Biomedical Chemistry, 10 bldg. 8, Pogodinskaya Str., Moscow, 119121, Russia
| | - N Yu Biziukova
- Laboratory of Structure-Function Based Drug Design, Institute of Biomedical Chemistry, 10 bldg. 8, Pogodinskaya Str., Moscow, 119121, Russia
| | - D A Filimonov
- Laboratory of Structure-Function Based Drug Design, Institute of Biomedical Chemistry, 10 bldg. 8, Pogodinskaya Str., Moscow, 119121, Russia
| | - V V Poroikov
- Laboratory of Structure-Function Based Drug Design, Institute of Biomedical Chemistry, 10 bldg. 8, Pogodinskaya Str., Moscow, 119121, Russia
| |
|
27
|
Mai H, Le TC, Chen D, Winkler DA, Caruso RA. Machine Learning for Electrocatalyst and Photocatalyst Design and Discovery. Chem Rev 2022; 122:13478-13515. [PMID: 35862246 DOI: 10.1021/acs.chemrev.2c00061] [Citation(s) in RCA: 72] [Impact Index Per Article: 36.0] [Indexed: 12/17/2022]
Abstract
Electrocatalysts and photocatalysts are key to a sustainable future, generating clean fuels, reducing the impact of global warming, and providing solutions to environmental pollution. Improved processes for catalyst design and a better understanding of electro/photocatalytic processes are essential for improving catalyst effectiveness. Recent advances in data science and artificial intelligence have great potential to accelerate electrocatalysis and photocatalysis research, particularly the rapid exploration of large materials chemistry spaces through machine learning. Here a comprehensive introduction to, and critical review of, machine learning techniques used in electrocatalysis and photocatalysis research are provided. Sources of electro/photocatalyst data and current approaches to representing these materials by mathematical features are described, the most commonly used machine learning methods summarized, and the quality and utility of electro/photocatalyst models evaluated. Illustrations of how machine learning models are applied to novel electro/photocatalyst discovery and used to elucidate electrocatalytic or photocatalytic reaction mechanisms are provided. The review offers a guide for materials scientists on the selection of machine learning methods for electrocatalysis and photocatalysis research. The application of machine learning to catalysis science represents a paradigm shift in the way advanced, next-generation catalysts will be designed and synthesized.
Affiliation(s)
- Haoxin Mai
- Applied Chemistry and Environmental Science, School of Science, STEM College, RMIT University, GPO Box 2476, Melbourne, Victoria 3001, Australia
| | - Tu C Le
- School of Engineering, STEM College, RMIT University, GPO Box 2476, Melbourne, Victoria 3001, Australia
| | - Dehong Chen
- Applied Chemistry and Environmental Science, School of Science, STEM College, RMIT University, GPO Box 2476, Melbourne, Victoria 3001, Australia
| | - David A Winkler
- Monash Institute of Pharmaceutical Sciences, Monash University, Parkville, Victoria 3052, Australia.,Biochemistry and Chemistry, La Trobe University, Kingsbury Drive, Bundoora, Victoria 3042, Australia.,School of Pharmacy, University of Nottingham, Nottingham NG7 2RD, United Kingdom
| | - Rachel A Caruso
- Applied Chemistry and Environmental Science, School of Science, STEM College, RMIT University, GPO Box 2476, Melbourne, Victoria 3001, Australia
| |
|
28
|
Yan R, Jiang X, Wang W, Dang D, Su Y. Materials information extraction via automatically generated corpus. Sci Data 2022; 9:401. [PMID: 35831367 PMCID: PMC9279422 DOI: 10.1038/s41597-022-01492-2] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 03/09/2022] [Accepted: 06/28/2022] [Indexed: 11/12/2022] Open
Abstract
Information Extraction (IE) in Natural Language Processing (NLP) aims to extract structured information from unstructured text to assist a computer in understanding natural language. Machine learning-based IE methods bring more intelligence and possibilities but require an extensive and accurate labeled corpus. In the materials science domain, giving reliable labels is a laborious task that requires the efforts of many professionals. To reduce manual intervention and automatically generate a materials corpus during IE, in this work, we propose a semi-supervised IE framework for materials via an automatically generated corpus. Taking the superalloy data extraction in our previous work as an example, the proposed framework uses Snorkel to automatically label the corpus containing property values. Then an Ordered Neurons-Long Short-Term Memory (ON-LSTM) network is adopted to train an information extraction model on the generated corpus. The experimental results show that the F1-scores for γ’ solvus temperature, density and solidus temperature of superalloys are 83.90%, 94.02% and 89.27%, respectively. Furthermore, we conduct similar experiments on other materials; the experimental results show that the proposed framework is universal in the field of materials.
Affiliation(s)
- Rongen Yan
- School of Artificial Intelligence, Beijing Normal University, Beijing, 100875, China
| | - Xue Jiang
- Beijing Advanced Innovation Center for Materials Genome Engineering, Institute for Advanced Materials and Technology, University of Science and Technology Beijing, Beijing, 100083, China.,Collaborative Innovation Center of Steel Technology, University of Science and Technology Beijing, Beijing, 100083, China
| | - Weiren Wang
- Beijing Advanced Innovation Center for Materials Genome Engineering, Institute for Advanced Materials and Technology, University of Science and Technology Beijing, Beijing, 100083, China
| | - Depeng Dang
- School of Artificial Intelligence, Beijing Normal University, Beijing, 100875, China.
| | - Yanjing Su
- Beijing Advanced Innovation Center for Materials Genome Engineering, Institute for Advanced Materials and Technology, University of Science and Technology Beijing, Beijing, 100083, China.
| |
|
29
|
Abstract
The emerging field of material-based data science requires information-rich databases to generate useful results, which are currently sparse in the stress engineering domain. To this end, this study uses the ‘materials-aware’ text-mining toolkit, ChemDataExtractor, to auto-generate databases of yield-strength and grain-size values by extracting such information from the literature. The precision of the extracted data is 83.0% for yield strength and 78.8% for grain size. The automatically-extracted data were organised into four databases: a Yield Strength, Grain Size, Engineering-Ready Yield Strength and Combined database. For further validation of the databases, the Combined database was used to plot the Hall-Petch relationship for the alloy AZ31, and similar results to the literature were found, demonstrating how one can make use of these automatically-extracted datasets.
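For readers unfamiliar with the Hall-Petch relationship mentioned in this abstract, its standard textbook form is given here for orientation. The symbols are the conventional ones and are not taken from the cited study:

```latex
\sigma_y = \sigma_0 + k_y \, d^{-1/2}
```

Here \(\sigma_y\) is the yield strength, \(\sigma_0\) is the friction stress of the lattice, \(k_y\) is a material-dependent strengthening coefficient, and \(d\) is the average grain size. Plotting extracted yield-strength values against \(d^{-1/2}\) should therefore give an approximately straight line, which is the kind of check the abstract describes.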
|
30
|
Li M, Tian S, Meng F, Yin M, Yue Q, Wang S, Bu W, Luo L. Continuously Multiplexed Ultrastrong Raman Probes by Precise Isotopic Polymer Backbone Doping for Multidimensional Information Storage and Encryption. Nano Letters 2022; 22:4544-4551. [PMID: 35604007 DOI: 10.1021/acs.nanolett.2c01443] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 06/15/2023]
Abstract
Raman-based super multiplexing has attracted great interest in imaging, biological analysis, identity security, and information storage. It still remains a great challenge to synthesize a large number of different Raman-active molecules to fulfill the Raman color palette. Here, we report a facile and systematic strategy to construct continuously multiplexed ultrastrong Raman probes. By precisely incorporating different ratios of 13C isotope into the backbone of poly(deca-4,6-diynedioic acid) (PDDA), we can obtain a library of PDDAs with tunable double-bond Raman frequencies and adjustable intensity ratios of two triple-bond (13C≡13C and 12C≡12C) Raman peaks, while retaining the ultrastrong Raman signals and physicochemical properties of the polymer. We also demonstrate the successful application of 13C-doped PDDAs as security inks to generate a novel 3D matrix barcode system for information encryption and high-density data storage. The isotopically doped PDDA series herein pave a new way to advance Raman-based super multiplexing for diverse applications.
Affiliation(s)
- Mengyang Li
- National Engineering Research Center for Nanomedicine, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, P. R. China
| | - Sidan Tian
- National Engineering Research Center for Nanomedicine, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, P. R. China
| | - Fanling Meng
- National Engineering Research Center for Nanomedicine, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, P. R. China
- Key Laboratory of Molecular Biophysics of Minister of Education, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, P. R. China
| | - Mingming Yin
- National Engineering Research Center for Nanomedicine, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, P. R. China
| | - Qiang Yue
- National Engineering Research Center for Nanomedicine, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, P. R. China
| | - Shun Wang
- MOE Key Laboratory of Fundamental Physical Quantities Measurement & Hubei Key Laboratory of Gravitation and Quantum Physics, PGMF and School of Physics, Huazhong University of Science and Technology, Wuhan 430074, P. R. China
| | - Wenting Bu
- Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, P. R. China
| | - Liang Luo
- National Engineering Research Center for Nanomedicine, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, P. R. China
- Key Laboratory of Molecular Biophysics of Minister of Education, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, P. R. China
| |
|
31
|
Text-mined dataset of gold nanoparticle synthesis procedures, morphologies, and size entities. Sci Data 2022; 9:234. [PMID: 35618761 PMCID: PMC9135747 DOI: 10.1038/s41597-022-01321-6] [Citation(s) in RCA: 11] [Impact Index Per Article: 5.5] [Received: 09/30/2021] [Accepted: 04/08/2022] [Indexed: 12/13/2022] Open
Abstract
Gold nanoparticles are highly desired for a range of technological applications due to their tunable properties, which are dictated by the size and shape of the constituent particles. Many heuristic methods for controlling the morphological characteristics of gold nanoparticles are well known. However, the underlying mechanisms controlling their size and shape remain poorly understood, partly due to the immense range of possible combinations of synthesis parameters. Data-driven methods can offer insight to help guide understanding of these underlying mechanisms, so long as sufficient synthesis data are available. To facilitate data mining in this direction, we have constructed and made publicly available a dataset of codified gold nanoparticle synthesis protocols and outcomes extracted directly from the nanoparticle materials science literature using natural language processing and text-mining techniques. This dataset contains 5,154 data records, each representing a single gold nanoparticle synthesis article, filtered from a database of 4,973,165 publications. Each record contains codified synthesis protocols and extracted morphological information from a total of 7,608 experimental and 12,519 characterization paragraphs.
Measurement(s): gold nanoparticle morphology • gold nanoparticle size • gold nanoparticle synthesis data
Technology Type(s): natural language processing
|
32
|
Dataset of solution-based inorganic materials synthesis procedures extracted from the scientific literature. Sci Data 2022; 9:231. [PMID: 35614129 PMCID: PMC9132903 DOI: 10.1038/s41597-022-01317-2] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Received: 10/07/2021] [Accepted: 04/05/2022] [Indexed: 11/10/2022] Open
Abstract
The development of a materials synthesis route is usually based on heuristics and experience. A possible new approach would be to apply data-driven approaches to learn the patterns of synthesis from past experience and use them to predict the syntheses of novel materials. However, this route is impeded by the lack of a large-scale database of synthesis formulations. In this work, we applied advanced machine learning and natural language processing techniques to construct a dataset of 35,675 solution-based synthesis procedures extracted from the scientific literature. Each procedure contains essential synthesis information including the precursors and target materials, their quantities, and the synthesis actions and corresponding attributes. Every procedure is also augmented with the reaction formula. Through this work, we are making freely available the first large dataset of solution-based inorganic materials synthesis procedures.
Measurement(s): solution-based inorganic synthesis data
Technology Type(s): natural language processing
|
33
|
Serov N, Vinogradov V. Inverse Material Search and Synthesis Verification by Hand Drawings via Transfer Learning and Contour Detection. Small Methods 2022; 6:e2101619. [PMID: 35285181 DOI: 10.1002/smtd.202101619] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 12/28/2021] [Revised: 02/12/2022] [Indexed: 06/14/2023]
Abstract
Nano- and micromaterials of various morphologies and compositions have extensive use in many different areas. However, the search for procedures giving custom nanomaterials with the desired structure, shape, and size remains a challenge and is often implemented by manual article screening. Here, for the first time, scanning and transmission electron microscopy inverse image search and hand drawing-based search via transfer learning are developed, namely, by repurposing the VGG16 convolutional neural network for image feature extraction and image similarity determination. Moreover, the use of this platform is demonstrated on the calcium carbonate system, where the data are acquired by random high-throughput experimental synthesis, and on Au nanoparticle data extracted from the articles. This approach can be used for advanced nanomaterials search and synthesis procedure verification, and can be further combined with machine learning solutions to provide data-driven nanomaterials discovery.
Affiliation(s)
- Nikita Serov
- International Institute "Solution Chemistry of Advanced Materials and Technologies", ITMO University, Saint Petersburg, 191002, Russian Federation
| | - Vladimir Vinogradov
- International Institute "Solution Chemistry of Advanced Materials and Technologies", ITMO University, Saint Petersburg, 191002, Russian Federation
| |
|
34
|
Trewartha A, Walker N, Huo H, Lee S, Cruse K, Dagdelen J, Dunn A, Persson KA, Ceder G, Jain A. Quantifying the advantage of domain-specific pre-training on named entity recognition tasks in materials science. Patterns (New York, N.Y.) 2022; 3:100488. [PMID: 35465225 PMCID: PMC9024010 DOI: 10.1016/j.patter.2022.100488] [Citation(s) in RCA: 20] [Impact Index Per Article: 10.0] [Received: 10/24/2021] [Revised: 01/21/2022] [Accepted: 03/15/2022] [Indexed: 11/03/2022]
Abstract
A bottleneck in efficiently connecting new materials discoveries to established literature has arisen due to an increase in publications. This problem may be addressed by using named entity recognition (NER) to extract structured summary-level data from unstructured materials science text. We compare the performance of four NER models on three materials science datasets. The four models include a bidirectional long short-term memory (BiLSTM) and three transformer models (BERT, SciBERT, and MatBERT) with increasing degrees of domain-specific materials science pre-training. MatBERT improves over the other two BERT_BASE-based models by 1%–12%, implying that domain-specific pre-training provides measurable advantages. Despite relative architectural simplicity, the BiLSTM model consistently outperforms BERT, perhaps due to its domain-specific pre-trained word embeddings. Furthermore, MatBERT and SciBERT models outperform the original BERT model to a greater extent in the small data limit. MatBERT’s higher-quality predictions should accelerate the extraction of structured data from materials science literature.
Efficient extraction of information from materials science literature is needed. Domain-specific materials science pre-training improves results. Even simpler domain-specific models can outperform more complex general models.
A bottleneck in efficiently connecting new materials discoveries to established literature has arisen due to a massive increase in publications. Four different language models are trained to automatically collect important information from materials science articles. We compare a simple model (BiLSTM) with materials science knowledge to three variants of a more complex model: one with general knowledge (BERT), one with general scientific knowledge (SciBERT), and one with materials science knowledge (MatBERT). We find that MatBERT performs the best overall. This implies that language models with greater extents of materials science knowledge will perform better on materials science-related tasks. The simpler model even consistently outperforms BERT. Furthermore, the performance gaps grow when the models are given fewer examples of information extraction to learn from. MatBERT’s higher-quality results should accelerate the collection of information from materials science literature.
Affiliation(s)
- Amalie Trewartha
- Materials Sciences Division, Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, CA 94720, USA
| | - Nicholas Walker
- Energy Technologies Area, Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, CA 94720, USA
| | - Haoyan Huo
- Materials Sciences Division, Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, CA 94720, USA.,Department of Materials Science and Engineering, University of California, Berkeley, 210 Hearst Memorial Mining Building, Berkeley, CA 94720, USA
| | - Sanghoon Lee
- Energy Technologies Area, Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, CA 94720, USA.,Department of Materials Science and Engineering, University of California, Berkeley, 210 Hearst Memorial Mining Building, Berkeley, CA 94720, USA
| | - Kevin Cruse
- Materials Sciences Division, Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, CA 94720, USA.,Department of Materials Science and Engineering, University of California, Berkeley, 210 Hearst Memorial Mining Building, Berkeley, CA 94720, USA
| | - John Dagdelen
- Energy Technologies Area, Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, CA 94720, USA.,Department of Materials Science and Engineering, University of California, Berkeley, 210 Hearst Memorial Mining Building, Berkeley, CA 94720, USA
| | - Alexander Dunn
- Energy Technologies Area, Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, CA 94720, USA.,Department of Materials Science and Engineering, University of California, Berkeley, 210 Hearst Memorial Mining Building, Berkeley, CA 94720, USA
| | - Kristin A Persson
- Molecular Foundry, Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, CA 94720, USA.,Department of Materials Science and Engineering, University of California, Berkeley, 210 Hearst Memorial Mining Building, Berkeley, CA 94720, USA
| | - Gerbrand Ceder
- Materials Sciences Division, Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, CA 94720, USA.,Department of Materials Science and Engineering, University of California, Berkeley, 210 Hearst Memorial Mining Building, Berkeley, CA 94720, USA
| | - Anubhav Jain
- Energy Technologies Area, Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, CA 94720, USA
| |
|
35
|
Zhu M, Cole JM. PDFDataExtractor: A Tool for Reading Scientific Text and Interpreting Metadata from the Typeset Literature in the Portable Document Format. J Chem Inf Model 2022; 62:1633-1643. [PMID: 35349259 PMCID: PMC9049592 DOI: 10.1021/acs.jcim.1c01198] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 11/28/2022]
Abstract
The layout of portable document format (PDF) files is constant to any screen, and the metadata therein are latent, compared to mark-up languages such as HTML and XML. No semantic tags are usually provided, and a PDF file is not designed to be edited or its data interpreted by software. However, data held in PDF files need to be extracted in order to comply with open-source data requirements that are now government-regulated. In the chemical domain, related chemical and property data also need to be found, and their correlations need to be exploited to enable data science in areas such as data-driven materials discovery. Such relationships may be realized using text-mining software such as the “chemistry-aware” natural-language-processing tool, ChemDataExtractor; however, this tool has limited data-extraction capabilities from PDF files. This study presents the PDFDataExtractor tool, which can act as a plug-in to ChemDataExtractor. It outperforms other PDF-extraction tools for the chemical literature by coupling its functionalities to the chemical-named entity-recognition capabilities of ChemDataExtractor. The intrinsic PDF-reading abilities of ChemDataExtractor are much improved. The system features a template-based architecture. This enables semantic information to be extracted from the PDF files of scientific articles in order to reconstruct the logical structure of articles. While other existing PDF-extracting tools focus on quantity mining, this template-based system is more focused on quality mining on different layouts. PDFDataExtractor outputs information in JSON and plain text, including the metadata of a PDF file, such as paper title, authors, affiliation, email, abstract, keywords, journal, year, document object identifier (DOI), reference, and issue number. With a self-created evaluation article set, PDFDataExtractor achieved promising precision for all key assessed metadata areas of the document text.
Affiliation(s)
- Miao Zhu
- Cavendish Laboratory, Department of Physics, University of Cambridge, J. J. Thomson Avenue, Cambridge CB3 0HE, U.K
| | - Jacqueline M Cole
- Cavendish Laboratory, Department of Physics, University of Cambridge, J. J. Thomson Avenue, Cambridge CB3 0HE, U.K.,ISIS Neutron and Muon Source, STFC Rutherford Appleton Laboratory, Harwell Science and Innovation Campus, Didcot, Oxfordshire OX11 0QX, U.K.,Department of Chemical Engineering and Biotechnology, University of Cambridge, West Cambridge Site, Philippa Fawcett Drive, Cambridge CB3 0AS, U.K
| |
|
36
|
Nandy A, Terrones G, Arunachalam N, Duan C, Kastner DW, Kulik HJ. MOFSimplify, machine learning models with extracted stability data of three thousand metal-organic frameworks. Sci Data 2022; 9:74. [PMID: 35277533 PMCID: PMC8917177 DOI: 10.1038/s41597-022-01181-0] [Citation(s) in RCA: 26] [Impact Index Per Article: 13.0] [Received: 09/30/2021] [Accepted: 01/17/2022] [Indexed: 11/09/2022] Open
Abstract
We report a workflow and the output of a natural language processing (NLP)-based procedure to mine the extant metal–organic framework (MOF) literature describing structurally characterized MOFs and their solvent removal and thermal stabilities. We obtain over 2,000 solvent removal stability measures from text mining and 3,000 thermal decomposition temperatures from thermogravimetric analysis data. We assess the validity of our NLP methods and the accuracy of our extracted data by comparing to a hand-labeled subset. Machine learning (ML, i.e. artificial neural network) models trained on this data using graph- and pore-geometry-based representations enable prediction of stability on new MOFs with quantified uncertainty. Our web interface, MOFSimplify, provides users access to our curated data and enables them to harness that data for predictions on new MOFs. MOFSimplify also encourages community feedback on existing data and on ML model predictions for community-based active learning for improved MOF stability models.
Measurement(s): thermal decomposition
Technology Type(s): thermogravimetry
Affiliation(s)
- Aditya Nandy
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA, 02139, USA.,Department of Chemistry, Massachusetts Institute of Technology, Cambridge, MA, 02139, USA
| | - Gianmarco Terrones
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA, 02139, USA
| | - Naveen Arunachalam
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA, 02139, USA
| | - Chenru Duan
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA, 02139, USA; Department of Chemistry, Massachusetts Institute of Technology, Cambridge, MA, 02139, USA
| | - David W Kastner
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA, 02139, USA; Department of Biological Engineering, Massachusetts Institute of Technology, Cambridge, MA, 02139, USA
| | - Heather J Kulik
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA, 02139, USA.
| |
|
37
|
Affiliation(s)
- Leo H. Chiang
- Core R&D The Dow Chemical Company Lake Jackson Texas 77566 USA
| | - Birgit Braun
- Core R&D The Dow Chemical Company Lake Jackson Texas 77566 USA
| | - Zhenyu Wang
- Chemometrics, AI & Statistics The Dow Chemical Company Lake Jackson Texas 77566 USA
| | - Ivan Castillo
- Chemometrics, AI & Statistics The Dow Chemical Company Lake Jackson Texas 77566 USA
| |
|
38
|
Saldívar-González FI, Aldas-Bulos VD, Medina-Franco JL, Plisson F. Natural product drug discovery in the artificial intelligence era. Chem Sci 2022; 13:1526-1546. [PMID: 35282622 PMCID: PMC8827052 DOI: 10.1039/d1sc04471k] [Citation(s) in RCA: 50] [Impact Index Per Article: 25.0] [Received: 08/13/2021] [Accepted: 12/10/2021] [Indexed: 12/19/2022] Open
Abstract
Natural products (NPs) are primarily recognized as privileged structures to interact with protein drug targets. Their unique characteristics and structural diversity continue to marvel scientists for developing NP-inspired medicines, even though the pharmaceutical industry has largely given up. High-performance computer hardware, extensive storage, accessible software and affordable online education have democratized the use of artificial intelligence (AI) in many sectors and research areas. The last decades have introduced natural language processing and machine learning algorithms, two subfields of AI, to tackle NP drug discovery challenges and open up opportunities. In this article, we review and discuss the rational applications of AI approaches developed to assist in discovering bioactive NPs and capturing the molecular "patterns" of these privileged structures for combinatorial design or target selectivity.
Affiliation(s)
- F I Saldívar-González
- DIFACQUIM Research Group, School of Chemistry, Department of Pharmacy, Universidad Nacional Autónoma de México Avenida Universidad 3000 04510 Mexico Mexico
| | - V D Aldas-Bulos
- Unidad de Genómica Avanzada, Laboratorio Nacional de Genómica para la Biodiversidad (Langebio), Centro de Investigación y de Estudios Avanzados del IPN Irapuato Guanajuato Mexico
| | - J L Medina-Franco
- DIFACQUIM Research Group, School of Chemistry, Department of Pharmacy, Universidad Nacional Autónoma de México Avenida Universidad 3000 04510 Mexico Mexico
| | - F Plisson
- CONACYT - Unidad de Genómica Avanzada, Laboratorio Nacional de Genómica para la Biodiversidad (Langebio), Centro de Investigación y de Estudios Avanzados del IPN Irapuato Guanajuato Mexico
| |
|
39
|
Li Y, Yu L, Liu J, Guo L, Wu Y, Wu X. NetDPO: (delta, gamma)-approximate pattern matching with gap constraints under one-off condition. APPL INTELL 2022. [DOI: 10.1007/s10489-021-03000-2] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 01/17/2023]
|
40
|
Designing a multilayer film via machine learning of scientific literature. Sci Rep 2022; 12:930. [PMID: 35042971 PMCID: PMC8766440 DOI: 10.1038/s41598-022-05010-7] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 03/25/2021] [Accepted: 01/04/2022] [Indexed: 12/23/2022] Open
Abstract
Scientists who design chemical substances often use materials informatics (MI), a data-driven approach with either computer simulation or artificial intelligence (AI). MI is a valuable technique, but applying it to layered structures is difficult. Most of the proposed computer-aided material search techniques use atomic or molecular simulations, which are limited to small areas. Some AI approaches have planned layered structures, but they require a physical theory or abundant experimental results. There is no universal design tool for multilayer films in MI. Here, we show a multilayer film can be designed through machine learning (ML) of experimental procedures extracted from chemical-coating articles. We converted material names according to International Union of Pure and Applied Chemistry rules and stored them in databases for each fabrication step without any physicochemical theory. Compared with experimental results, which depend on the authors, experimental protocols have the advantage of being nearly uniform and suffering less data loss. Connecting scientific knowledge through ML enables us to predict untrained film structures. This suggests that AI imitates research activity, which is normally inspired by other scientific achievements and can thus be used as a general design technique.
|
41
|
Cai X, Wang N, Yang L, Mei X. Global-local neighborhood based network representation for citation recommendation. APPL INTELL 2022. [DOI: 10.1007/s10489-021-02964-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/02/2022]
|
42
|
Chen K, Tian H, Li B, Rangarajan S. A chemistry‐inspired neural network kinetic model for oxidative coupling of methane from high‐throughput data. AIChE J 2022. [DOI: 10.1002/aic.17584] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 01/05/2023]
Affiliation(s)
- Kexin Chen
- Department of Chemical and Biomolecular Engineering Lehigh University Bethlehem Pennsylvania USA
| | - Huijie Tian
- Department of Chemical and Biomolecular Engineering Lehigh University Bethlehem Pennsylvania USA
| | - Bowen Li
- Department of Chemical and Biomolecular Engineering Lehigh University Bethlehem Pennsylvania USA
| | - Srinivas Rangarajan
- Department of Chemical and Biomolecular Engineering Lehigh University Bethlehem Pennsylvania USA
| |
|
43
|
Sun C, Yang Z, Wang L, Zhang Y, Lin H, Wang J. Deep learning with language models improves named entity recognition for PharmaCoNER. BMC Bioinformatics 2021; 22:602. [PMID: 34920700 PMCID: PMC8684061 DOI: 10.1186/s12859-021-04260-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/19/2021] [Accepted: 05/31/2021] [Indexed: 11/25/2022] Open
Abstract
BACKGROUND: The recognition of pharmacological substances, compounds and proteins is essential for biomedical relation extraction, knowledge graph construction, drug discovery, as well as medical question answering. Although considerable efforts have been made to recognize biomedical entities in English texts, to date only a few limited attempts have been made to recognize them in biomedical texts in other languages. PharmaCoNER is a named entity recognition challenge to recognize pharmacological entities from Spanish texts. Because there are currently abundant resources in the field of natural language processing, how to leverage these resources for the PharmaCoNER challenge is a question worth studying. METHODS: Inspired by the success of deep learning with language models, we compare and explore various representative BERT models to promote the development of the PharmaCoNER task. RESULTS: The experimental results show that deep learning with language models can effectively improve model performance on the PharmaCoNER dataset. Our method achieves state-of-the-art performance on the PharmaCoNER dataset, with a max F1-score of 92.01%. CONCLUSION: For the BERT models on the PharmaCoNER dataset, biomedical domain knowledge has a greater impact on model performance than the native language (i.e., Spanish). The BERT models can obtain competitive performance by using WordPiece to alleviate the out-of-vocabulary limitation. The performance of the BERT model can be further improved by constructing a specific vocabulary based on domain knowledge. Moreover, the character case also has a certain impact on model performance.
Affiliation(s)
- Cong Sun
- School of Computer Science and Technology, Dalian University of Technology, Dalian, China
| | - Zhihao Yang
- School of Computer Science and Technology, Dalian University of Technology, Dalian, China.
| | - Lei Wang
- Beijing Institute of Health Administration and Medical Information, Beijing, China.
| | - Yin Zhang
- Beijing Institute of Health Administration and Medical Information, Beijing, China
| | - Hongfei Lin
- School of Computer Science and Technology, Dalian University of Technology, Dalian, China
| | - Jian Wang
- School of Computer Science and Technology, Dalian University of Technology, Dalian, China
| |
|
44
|
Mai H, Chen D, Tachibana Y, Suzuki H, Abe R, Caruso RA. Developing sustainable, high-performance perovskites in photocatalysis: design strategies and applications. Chem Soc Rev 2021; 50:13692-13729. [PMID: 34842873 DOI: 10.1039/d1cs00684c] [Citation(s) in RCA: 36] [Impact Index Per Article: 12.0] [Indexed: 12/15/2022]
Abstract
Solar energy is attractive because it is free, renewable, abundant and sustainable. Photocatalysis is one of the feasible routes to utilize solar energy for the degradation of pollutants and the production of fuel. Perovskites and their derivatives have received substantial attention in both photocatalytic wastewater treatment and energy production because of their highly tailorable structural and physicochemical properties. This review illustrates the basic principles of photocatalytic reactions and the application of these principles to the design of robust and sustainable perovskite photocatalysts. It details the structures of the perovskites and the physics and chemistry behind photocatalytic reactions and describes the advantages and limitations of popular strategies for the design of photoactive perovskites. This is followed by examples of how these strategies are applied to enhance the photocatalytic efficiency of oxide, halide and oxyhalide perovskites, with a focus on materials with potential for practical application, that is, not containing scarce or toxic elements. It is expected that this overview of the development of photocatalysts and deeper understanding of photocatalytic principles will accelerate the exploitation of efficient perovskite photocatalysts and bring about effective solutions to the energy and environmental crisis.
Affiliation(s)
- Haoxin Mai
- Applied Chemistry and Environmental Science, School of Science, STEM College, RMIT University, GPO Box 2476, Melbourne, Victoria 3001, Australia.
| | - Dehong Chen
- Applied Chemistry and Environmental Science, School of Science, STEM College, RMIT University, GPO Box 2476, Melbourne, Victoria 3001, Australia.
| | - Yasuhiro Tachibana
- School of Engineering, STEM College, RMIT University, Bundoora, Victoria 3083, Australia
| | - Hajime Suzuki
- Department of Energy and Hydrocarbon Chemistry, Graduate School of Engineering, Kyoto University, Katsura, Nishikyo-ku, Kyoto 615-8510, Japan
| | - Ryu Abe
- Department of Energy and Hydrocarbon Chemistry, Graduate School of Engineering, Kyoto University, Katsura, Nishikyo-ku, Kyoto 615-8510, Japan
| | - Rachel A Caruso
- Applied Chemistry and Environmental Science, School of Science, STEM College, RMIT University, GPO Box 2476, Melbourne, Victoria 3001, Australia.
| |
|
45
|
Yakimovich A, Beaugnon A, Huang Y, Ozkirimli E. Labels in a haystack: Approaches beyond supervised learning in biomedical applications. PATTERNS (NEW YORK, N.Y.) 2021; 2:100383. [PMID: 34950904 PMCID: PMC8672145 DOI: 10.1016/j.patter.2021.100383] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/12/2022]
Abstract
Recent advances in biomedical machine learning demonstrate great potential for data-driven techniques in health care and biomedical research. However, this potential has thus far been hampered by both the scarcity of annotated data in the biomedical domain and the diversity of the domain's subfields. While unsupervised learning is capable of finding unknown patterns in the data by design, supervised learning requires human annotation to achieve the desired performance through training. With the latter performing vastly better than the former, the need for annotated datasets is high, but they are costly and laborious to obtain. This review explores a family of approaches existing between the supervised and the unsupervised problem setting. The goal of these algorithms is to make more efficient use of the available labeled data. The advantages and limitations of each approach are addressed and perspectives are provided.
Affiliation(s)
- Artur Yakimovich
- Roche Pharma International Informatics, Roche Products Limited, Welwyn Garden City, UK
| | - Anaël Beaugnon
- Roche Pharma International Informatics, Roche, Boulogne-Billancourt, France
| | - Yi Huang
- Roche Pharma International Informatics, Roche (China) Holding Ltd., Shanghai, China
| | - Elif Ozkirimli
- Roche Pharma International Informatics, F. Hoffmann-La Roche AG, Kaiseraugst, Switzerland
| |
|
46
|
Zhang H, Li Q, Yang Y, Ji X, Sessler JL. Unlocking Chemically Encrypted Information Using Three Types of External Stimuli. J Am Chem Soc 2021; 143:18635-18642. [PMID: 34719924 DOI: 10.1021/jacs.1c08558] [Citation(s) in RCA: 31] [Impact Index Per Article: 10.3] [Indexed: 12/15/2022]
Abstract
Encryption is critical to information security; however, existing chemical-based information encryption strategies are still in their infancy. We report here a new approach to chemical encryption involving a supramolecular gel QR (quick response) code with multiple encryption functions. Three color "turn-on" supramolecular polymer gels, G1-G3, were prepared that produce pink, purple, and yellow colors when subject to treatment with acetic acid vapor, UV light, and methanolic FeCl3, respectively. As the result of hydrogen-bonding interactions at the gel interfaces, the three gels can be assembled to produce gel G4. Engraving a QR code pattern onto G4 then gave gel G5. When one or two stimuli are applied to the individual pieces corresponding to the QR engraved versions of the gels G1-G3 making up G5, a complete scannable pattern is not displayed, and the stored information cannot be recognized. Only when three different stimuli are applied at the same time does G5 give a complete recognizable pattern allowing the stored information to be retrieved. This strategy was applied to the decryption-based opening of a coded lock.
Affiliation(s)
- Hanwei Zhang
- Key Laboratory of Material Chemistry for Energy Conversion and Storage, Ministry of Education, Hubei Key Laboratory of Material Chemistry and Service Failure, School of Chemistry and Chemical Engineering, Huazhong University of Science and Technology, Wuhan 430074, People's Republic of China
| | - Qingyun Li
- Key Laboratory of Material Chemistry for Energy Conversion and Storage, Ministry of Education, Hubei Key Laboratory of Material Chemistry and Service Failure, School of Chemistry and Chemical Engineering, Huazhong University of Science and Technology, Wuhan 430074, People's Republic of China
| | - Yabi Yang
- Key Laboratory of Material Chemistry for Energy Conversion and Storage, Ministry of Education, Hubei Key Laboratory of Material Chemistry and Service Failure, School of Chemistry and Chemical Engineering, Huazhong University of Science and Technology, Wuhan 430074, People's Republic of China
| | - Xiaofan Ji
- Key Laboratory of Material Chemistry for Energy Conversion and Storage, Ministry of Education, Hubei Key Laboratory of Material Chemistry and Service Failure, School of Chemistry and Chemical Engineering, Huazhong University of Science and Technology, Wuhan 430074, People's Republic of China
| | - Jonathan L Sessler
- Department of Chemistry, The University of Texas at Austin, 105 E. 24th Street A5300, Austin, Texas 78712, United States
| |
|
47
|
Joshi RP, Kumar N. Artificial Intelligence for Autonomous Molecular Design: A Perspective. Molecules 2021; 26:6761. [PMID: 34833853 PMCID: PMC8619999 DOI: 10.3390/molecules26226761] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 08/16/2021] [Revised: 10/23/2021] [Accepted: 10/29/2021] [Indexed: 11/23/2022] Open
Abstract
Domain-aware artificial intelligence has been increasingly adopted in recent years to expedite molecular design in various applications, including drug design and discovery. Recent advances in areas such as physics-informed machine learning and reasoning, software engineering, high-end hardware development, and computing infrastructures are providing opportunities to build scalable and explainable AI molecular discovery systems. This could improve a design hypothesis through feedback analysis, data integration that can provide a basis for the introduction of end-to-end automation for compound discovery and optimization, and enable more intelligent searches of chemical space. Several state-of-the-art ML architectures are predominantly and independently used for predicting the properties of small molecules, their high throughput synthesis, and screening, iteratively identifying and optimizing lead therapeutic candidates. However, such deep learning and ML approaches also raise considerable conceptual, technical, scalability, and end-to-end error quantification challenges, as well as skepticism about the current AI hype to build automated tools. To this end, synergistically and intelligently using these individual components along with robust quantum physics-based molecular representation and data generation tools in a closed-loop holds enormous promise for accelerated therapeutic design to critically analyze the opportunities and challenges for their more widespread application. This article aims to identify the most recent technology and breakthrough achieved by each of the components and discusses how such autonomous AI and ML workflows can be integrated to radically accelerate the protein target or disease model-based probe design that can be iteratively validated experimentally. Taken together, this could significantly reduce the timeline for end-to-end therapeutic discovery and optimization upon the arrival of any novel zoonotic transmission event. Our article serves as a guide for medicinal, computational chemistry and biology, analytical chemistry, and the ML community to practice autonomous molecular design in precision medicine and drug discovery.
Affiliation(s)
| | - Neeraj Kumar
- Computational Biology Group, Biological Science Division, Pacific Northwest National Laboratory, 902 Battelle Blvd, Richland, WA 99352, USA
| |
|
48
|
Weber JM, Guo Z, Zhang C, Schweidtmann AM, Lapkin AA. Chemical data intelligence for sustainable chemistry. Chem Soc Rev 2021; 50:12013-12036. [PMID: 34520507 DOI: 10.1039/d1cs00477h] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Indexed: 02/05/2023]
Abstract
This study highlights new opportunities for optimal reaction route selection from large chemical databases brought about by the rapid digitalisation of chemical data. The chemical industry requires a transformation towards more sustainable practices, eliminating its dependencies on fossil fuels and limiting its impact on the environment. However, identifying more sustainable process alternatives is, at present, a cumbersome, manual, iterative process, based on chemical intuition and modelling. We give a perspective on methods for automated discovery and assessment of competitive sustainable reaction routes based on renewable or waste feedstocks. Three key areas of transition are outlined and reviewed based on their state-of-the-art as well as bottlenecks: (i) data, (ii) evaluation metrics, and (iii) decision-making. We elucidate their synergies and interfaces since only together these areas can bring about the most benefit. The field of chemical data intelligence offers the opportunity to identify the inherently more sustainable reaction pathways and to identify opportunities for a circular chemical economy. Our review shows that at present the field of data brings about most bottlenecks, such as data completion and data linkage, but also offers the principal opportunity for advancement.
Affiliation(s)
- Jana M Weber
- Department of Chemical Engineering and Biotechnology, University of Cambridge, West Cambridge Site, Philippa Fawcett Drive, Cambridge CB3 0AS, UK; Chemical Data Intelligence (CDI) Pte Ltd, Robinson Road, #02-00, 068898, Singapore
| | - Zhen Guo
- Chemical Data Intelligence (CDI) Pte Ltd, Robinson Road, #02-00, 068898, Singapore; Cambridge Centre for Advanced Research and Education in Singapore, CARES Ltd. 1 CREATE Way, CREATE Tower #05-05, 138602, Singapore
| | - Chonghuan Zhang
- Department of Chemical Engineering and Biotechnology, University of Cambridge, West Cambridge Site, Philippa Fawcett Drive, Cambridge CB3 0AS, UK.
| | - Artur M Schweidtmann
- Department of Chemical Engineering, Delft University of Technology, Van der Maasweg 9, Delft 2629 HZ, The Netherlands
| | - Alexei A Lapkin
- Department of Chemical Engineering and Biotechnology, University of Cambridge, West Cambridge Site, Philippa Fawcett Drive, Cambridge CB3 0AS, UK; Chemical Data Intelligence (CDI) Pte Ltd, Robinson Road, #02-00, 068898, Singapore; Cambridge Centre for Advanced Research and Education in Singapore, CARES Ltd. 1 CREATE Way, CREATE Tower #05-05, 138602, Singapore
| |
|
49
|
Chang YC, Chiu YW, Chuang TW. Linguistic Pattern-infused Dual-channel BiLSTM with Attention for Dengue Case Summary Generation from ProMED-mail database (Preprint). JMIR Public Health Surveill 2021; 8:e34583. [PMID: 35830225 PMCID: PMC9491834 DOI: 10.2196/34583] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/30/2021] [Revised: 04/15/2022] [Accepted: 05/27/2022] [Indexed: 11/13/2022] Open
Affiliation(s)
- Yung-Chun Chang
- Graduate Institute of Data Science, Taipei Medical University, Taipei, Taiwan
- Clinical Big Data Research Center, Taipei Medical University Hospital, Taipei, Taiwan
| | - Yu-Wen Chiu
- Graduate Institute of Data Science, Taipei Medical University, Taipei, Taiwan
- Department of Molecular Parasitology and Tropical Diseases, School of Medicine, College of Medicine, Taipei Medical University, Taipei, Taiwan
| | - Ting-Wu Chuang
- Department of Molecular Parasitology and Tropical Diseases, School of Medicine, College of Medicine, Taipei Medical University, Taipei, Taiwan
| |
|
50
|
NERWS: Towards Improving Information Retrieval of Digital Library Management System Using Named Entity Recognition and Word Sense. BIG DATA AND COGNITIVE COMPUTING 2021. [DOI: 10.3390/bdcc5040059] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 11/16/2022]
Abstract
An information retrieval (IR) system is the core of many applications, including digital library management systems (DLMS). The IR-based DLMS depends on either the title with keywords or content as symbolic strings. In contrast, it ignores the meaning of the content or what it indicates. Many researchers tried to improve IR systems either using the named entity recognition (NER) technique or the words’ meaning (word sense) and implemented the improvements with a specific language. However, they did not test the IR system using NER and word sense disambiguation together to study the behavior of this system in the presence of these techniques. This paper aims to improve the information retrieval system used by the DLMS by adding the NER and word sense disambiguation (WSD) together for the English and Arabic languages. For NER, a voting technique was used among three completely different classifiers: rules-based, conditional random field (CRF), and bidirectional LSTM-CNN. For WSD, an examples-based method was used to implement it for the first time with the English language. For the IR system, a vector space model (VSM) was used to test the information retrieval system, and it was tested on samples from the library of the University of Kufa for the Arabic and English languages. The overall system results show that the precision, recall, and F-measures were increased from 70.9%, 74.2%, and 72.5% to 89.7%, 91.5%, and 90.6% for the English language and from 66.3%, 69.7%, and 68.0% to 89.3%, 87.1%, and 88.2% for the Arabic language.
|