1
Ting JM, Tamayo-Mendoza T, Petersen SR, Van Reet J, Ahmed UA, Snell NJ, Fisher JD, Stern M, Oviedo F. Frontiers in nonviral delivery of small molecule and genetic drugs, driven by polymer chemistry and machine learning for materials informatics. Chem Commun (Camb) 2023; 59:14197-14209. [PMID: 37955165 DOI: 10.1039/d3cc04705a] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/14/2023]
Abstract
Materials informatics (MI) has immense potential to accelerate the pace of innovation and new product development in biotechnology. Close collaborations between skilled physical and life scientists with data scientists are being established in pursuit of leveraging MI tools in automation and artificial intelligence (AI) to predict material properties in vitro and in vivo. However, the scarcity of large, standardized, and labeled materials data for connecting structure-function relationships represents one of the largest hurdles to overcome. In this Highlight, focus is brought to emerging developments in polymer-based therapeutic delivery platforms, where teams generate large experimental datasets around specific therapeutics and successfully establish a design-to-deployment cycle of specialized nanocarriers. Three select collaborations demonstrate how custom-built polymers protect and deliver small molecules, nucleic acids, and proteins, representing ideal use-cases for machine learning to understand how molecular-level interactions impact drug stabilization and release. We conclude with our perspectives on how MI innovations in automation efficiencies and digitalization of data, coupled with fundamental insight and creativity from the polymer science community, can accelerate translation of more gene therapies into lifesaving medicines.
2
Shetty P, Rajan AC, Kuenneth C, Gupta S, Panchumarti LP, Holm L, Zhang C, Ramprasad R. A general-purpose material property data extraction pipeline from large polymer corpora using natural language processing. NPJ COMPUTATIONAL MATERIALS 2023; 9:52. [PMID: 37033291 PMCID: PMC10073792 DOI: 10.1038/s41524-023-01003-w] [Citation(s) in RCA: 7] [Impact Index Per Article: 7.0] [Received: 09/16/2022] [Accepted: 03/16/2023] [Indexed: 06/19/2023]
Abstract
The ever-increasing number of materials science articles makes it hard to infer chemistry-structure-property relations from literature. We used natural language processing methods to automatically extract material property data from the abstracts of polymer literature. As a component of our pipeline, we trained MaterialsBERT, a language model, using 2.4 million materials science abstracts, which outperforms other baseline models in three out of five named entity recognition datasets. Using this pipeline, we obtained ~300,000 material property records from ~130,000 abstracts in 60 hours. The extracted data was analyzed for a diverse range of applications such as fuel cells, supercapacitors, and polymer solar cells to recover non-trivial insights. The data extracted through our pipeline is made available at polymerscholar.org which can be used to locate material property data recorded in abstracts. This work demonstrates the feasibility of an automatic pipeline that starts from published literature and ends with extracted material property information.
Affiliation(s)
- Pranav Shetty
- School of Computational Science & Engineering, Atlanta, GA USA
- Arunkumar Chitteth Rajan
- School of Materials Science and Engineering, Georgia Institute of Technology, 771 Ferst Drive NW, Atlanta, 30332 GA USA
- Chris Kuenneth
- School of Materials Science and Engineering, Georgia Institute of Technology, 771 Ferst Drive NW, Atlanta, 30332 GA USA
- Sonakshi Gupta
- Department of Metallurgy Engineering and Materials Science, Indian Institute of Technology, Indore, Madhya Pradesh India
- Lakshmi Prerana Panchumarti
- School of Materials Science and Engineering, Georgia Institute of Technology, 771 Ferst Drive NW, Atlanta, 30332 GA USA
- Lauren Holm
- School of Materials Science and Engineering, Georgia Institute of Technology, 771 Ferst Drive NW, Atlanta, 30332 GA USA
- Chao Zhang
- School of Computational Science & Engineering, Atlanta, GA USA
- Rampi Ramprasad
- School of Materials Science and Engineering, Georgia Institute of Technology, 771 Ferst Drive NW, Atlanta, 30332 GA USA
3
Steinmann SN, Wang Q, Seh ZW. How machine learning can accelerate electrocatalysis discovery and optimization. MATERIALS HORIZONS 2023; 10:393-406. [PMID: 36541226 DOI: 10.1039/d2mh01279k] [Citation(s) in RCA: 8] [Impact Index Per Article: 8.0] [Indexed: 06/17/2023]
Abstract
Advances in machine learning (ML) provide the means to bypass bottlenecks in the discovery of new electrocatalysts using traditional approaches. In this review, we highlight the currently achieved work in ML-accelerated discovery and optimization of electrocatalysts via a tight collaboration between computational models and experiments. First, the applicability of available methods for constructing machine-learned potentials (MLPs), which provide accurate energies and forces for atomistic simulations, is discussed. Meanwhile, the current challenges for MLPs in the context of electrocatalysis are highlighted. Then, we review the recent progress in predicting catalytic activities using surrogate models, including microkinetic simulations and more global proxies thereof. Several typical applications of using ML to rationalize thermodynamic proxies and predict the adsorption and activation energies are also discussed. Next, recent developments of ML-assisted experiments for catalyst characterization, synthesis optimization and reaction condition optimization are illustrated. In particular, the applications in ML-enhanced spectra analysis and the use of ML to interpret experimental kinetic data are highlighted. Additionally, we show how robotics are applied to high-throughput synthesis, characterization and testing of electrocatalysts to accelerate the materials exploration process and how this equipment can be assembled into self-driven laboratories.
Affiliation(s)
- Qing Wang
- Univ Lyon, ENS de Lyon, CNRS, Laboratoire de Chimie UMR 5182, Lyon, France.
- Zhi Wei Seh
- Institute of Materials Research and Engineering, Agency for Science, Technology and Research (A*STAR), 2 Fusionopolis Way, Innovis, 138634, Singapore.
4
Martin TB, Audus DJ. Emerging Trends in Machine Learning: A Polymer Perspective. ACS POLYMERS AU 2023. [DOI: 10.1021/acspolymersau.2c00053] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/19/2023]
Affiliation(s)
- Tyler B. Martin
- National Institute of Standards and Technology, Gaithersburg, Maryland 20899, United States
- Debra J. Audus
- National Institute of Standards and Technology, Gaithersburg, Maryland 20899, United States
5
Huang S, Cole JM. BatteryBERT: A Pretrained Language Model for Battery Database Enhancement. J Chem Inf Model 2022; 62:6365-6377. [PMID: 35533012 PMCID: PMC9795558 DOI: 10.1021/acs.jcim.2c00035] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Indexed: 01/07/2023]
Abstract
A great number of scientific papers are published every year in the field of battery research, which forms a huge textual data source. However, it is difficult to explore and retrieve useful information efficiently from these large unstructured sets of text. The Bidirectional Encoder Representations from Transformers (BERT) model, trained on a large data set in an unsupervised way, provides a route to process the scientific text automatically with minimal human effort. To this end, we realized six battery-related BERT models, namely, BatteryBERT, BatteryOnlyBERT, and BatterySciBERT, each of which consists of both cased and uncased models. They have been trained specifically on a corpus of battery research papers. The pretrained BatteryBERT models were then fine-tuned on downstream tasks, including battery paper classification and extractive question-answering for battery device component classification that distinguishes anode, cathode, and electrolyte materials. Our BatteryBERT models were found to outperform the original BERT models on the specific battery tasks. The fine-tuned BatteryBERT was then used to perform battery database enhancement. We also provide a website application for its interactive use and visualization.
Affiliation(s)
- Shu Huang
- Cavendish Laboratory, Department of Physics, University of Cambridge, J.J. Thomson Avenue, Cambridge CB3 0HE, U.K.
- Jacqueline M. Cole
- Cavendish Laboratory, Department of Physics, University of Cambridge, J.J. Thomson Avenue, Cambridge CB3 0HE, U.K.
- ISIS Neutron and Muon Source, Rutherford Appleton Laboratory, Harwell Science and Innovation Campus, Didcot, Oxfordshire OX11 0QX, U.K.
6
Abstract
The application of machine learning to the materials domain has traditionally struggled with two major challenges: a lack of large, curated data sets and the need to understand the physics behind the machine-learning prediction. The former problem is particularly acute in the polymers domain. Here we aim to simultaneously tackle these challenges through the incorporation of scientific knowledge, thus, providing improved predictions for smaller data sets, both under interpolation and extrapolation, and a degree of explainability. We focus on imperfect theories, as they are often readily available and easier to interpret. Using a system of a polymer in different solvent qualities, we explore numerous methods for incorporating theory into machine learning using different machine-learning models, including Gaussian process regression. Ultimately, we find that encoding the functional form of the theory performs best followed by an encoding of the numeric values of the theory.
Affiliation(s)
- Debra J Audus
- Materials Science and Engineering Division, National Institute of Standards and Technology, Gaithersburg, Maryland 20899, United States
- Austin McDannald
- Materials Measurement Science Division, National Institute of Standards and Technology, Gaithersburg, Maryland 20899, United States
- Brian DeCost
- Materials Measurement Science Division, National Institute of Standards and Technology, Gaithersburg, Maryland 20899, United States
7
Kumar R. Materiomically Designed Polymeric Vehicles for Nucleic Acids: Quo Vadis? ACS APPLIED BIO MATERIALS 2022; 5:2507-2535. [PMID: 35642794 DOI: 10.1021/acsabm.2c00346] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/28/2022]
Abstract
Despite rapid advances in molecular biology, particularly in site-specific genome editing technologies, such as CRISPR/Cas9 and base editing, financial and logistical challenges hinder a broad population from accessing and benefiting from gene therapy. To improve the affordability and scalability of gene therapy, we need to deploy chemically defined, economical, and scalable materials, such as synthetic polymers. For polymers to deliver nucleic acids efficaciously to targeted cells, they must optimally combine design attributes, such as architecture, length, composition, spatial distribution of monomers, basicity, hydrophilic-hydrophobic phase balance, or protonation degree. Designing polymeric vectors for specific nucleic acid payloads is a multivariate optimization problem wherein even minuscule deviations from the optimum are poorly tolerated. To explore the multivariate polymer design space rapidly, efficiently, and fruitfully, we must integrate parallelized polymer synthesis, high-throughput biological screening, and statistical modeling. Although materiomics approaches promise to streamline polymeric vector development, several methodological ambiguities must be resolved. For instance, establishing a flexible polymer ontology that accommodates recent synthetic advances, enforcing uniform polymer characterization and data reporting standards, and implementing multiplexed in vitro and in vivo screening studies require considerable planning, coordination, and effort. This contribution will acquaint readers with the challenges associated with materiomics approaches to polymeric gene delivery and offers guidelines for overcoming these challenges. 
Here, we summarize recent developments in combinatorial polymer synthesis, high-throughput screening of polymeric vectors, omics-based approaches to polymer design, barcoding schemes for pooled in vitro and in vivo screening, and identify materiomics-inspired research directions that will realize the long-unfulfilled clinical potential of polymeric carriers in gene therapy.
Affiliation(s)
- Ramya Kumar
- Department of Chemical & Biological Engineering, Colorado School of Mines, 1613 Illinois St, Golden, Colorado 80401, United States
8
Teruya E, Takeuchi T, Morita H, Hayashi T, Ono K. ARTS: autonomous research topic selection system using word embeddings and network analysis. MACHINE LEARNING: SCIENCE AND TECHNOLOGY 2022. [DOI: 10.1088/2632-2153/ac61eb] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/11/2022]
Abstract
The materials science research process has become increasingly autonomous due to the remarkable progress in artificial intelligence. However, autonomous research topic selection (ARTS) has not yet been fully explored due to the difficulty of estimating its promise and the lack of previous research. This paper introduces an ARTS system that autonomously selects potential research topics that are likely to reveal new scientific facts yet have not been the subject of much previous research by analyzing vast numbers of articles. Potential research topics are selected by analyzing the difference between two research concept networks constructed from research information in articles: one that represents the promise of research topics and is constructed from word embeddings, and one that represents known facts and past research activities and is constructed from statistical information on the appearance patterns of research concepts. The ARTS system is also equipped with functions to search and visualize information about selected research topics to assist in the final determination of a research topic by a scientist. We developed the ARTS system using approximately 100 00 articles published in the Computational Materials Science journal. The results of our evaluation demonstrated that research topics studied after 2016 could be generated autonomously from an analysis of the articles published before 2015. This suggests that potential research topics can be effectively selected by using the ARTS system.
9
Shetty P, Ramprasad R. Machine-Guided Polymer Knowledge Extraction Using Natural Language Processing: The Example of Named Entity Normalization. J Chem Inf Model 2021; 61:5377-5385. [PMID: 34752101 DOI: 10.1021/acs.jcim.1c00554] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Indexed: 11/28/2022]
Abstract
A rich body of literature has emerged in recent years that discusses the extraction of structured information from materials science text through named entity recognition models. Relatively little work has been done to address the "normalization" of extracted entities, that is, recognizing that two or more seemingly different entities actually refer to the same entity in reality. In this work, we address the normalization of polymer named entities, polymers being a class of materials that often have a variety of common names for the same material in addition to the IUPAC name. We have trained supervised clustering models using Word2Vec and fastText word embeddings reported in previous work so that named entities referring to the same polymer are categorized within the same cluster in the word embedding space. We report the use of parameterized cosine distance functions to cluster and normalize textually derived entities, achieving an F1 score of 0.85. Furthermore, a labeled data set of polymer names was utilized to train our model and to infer the true total number of unique polymers that are actively reported in the literature. For ∼15,500 polymer named entities extracted from our corpus of 0.5 million papers, we detected 6734 unique clusters (i.e., unique polymers), 632 of which were manually curated to train the normalization model. This work will serve as a critical ingredient in a natural language processing-based pipeline for the automatic and efficient extraction of knowledge from the polymer literature.
Affiliation(s)
- Pranav Shetty
- School of Computational Science & Engineering, Georgia Institute of Technology, 771 Ferst Drive NW, Atlanta, Georgia 30332, United States
- Rampi Ramprasad
- School of Materials Science and Engineering, Georgia Institute of Technology, 771 Ferst Drive NW, Atlanta, Georgia 30332, United States
10
IP Analytics and Machine Learning Applied to Create Process Visualization Graphs for Chemical Utility Patents. Processes (Basel) 2021. [DOI: 10.3390/pr9081342] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2022]
Abstract
Researchers must read and understand a large volume of technical papers, including patent documents, to fully grasp the state-of-the-art technological progress in a given domain. Chemical research is particularly challenging with the fast growth of newly registered utility patents (also known as intellectual property or IP) that provide detailed descriptions of the processes used to create a new chemical or a new process to manufacture a known chemical. The researcher must be able to understand the latest patents and literature in order to develop new chemicals and processes that do not infringe on existing claims and processes. This research uses text mining, integrated machine learning, and knowledge visualization techniques to effectively and accurately support the extraction and graphical presentation of chemical processes disclosed in patent documents. The computer framework trains a machine learning model called ALBERT for automatic paragraph text classification. ALBERT separates chemical and non-chemical descriptive paragraphs from a patent for effective chemical term extraction. The ChemDataExtractor is used to classify chemical terms, such as inputs, units, and reactions from the chemical paragraphs. A computer-supported graph-based knowledge representation interface is developed to plot the extracted chemical terms and their chemical process links as a network of nodes with connecting arcs. The computer-supported chemical knowledge visualization approach helps researchers to quickly understand the innovative and unique chemical or processes of any chemical patent of interest.
11
Kuenneth C, Schertzer W, Ramprasad R. Copolymer Informatics with Multitask Deep Neural Networks. Macromolecules 2021. [DOI: 10.1021/acs.macromol.1c00728] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Indexed: 12/19/2022]
Affiliation(s)
- Christopher Kuenneth
- School of Materials Science and Engineering, Georgia Institute of Technology, Atlanta, Georgia 30332, United States
- William Schertzer
- School of Materials Science and Engineering, Georgia Institute of Technology, Atlanta, Georgia 30332, United States
- Rampi Ramprasad
- School of Materials Science and Engineering, Georgia Institute of Technology, Atlanta, Georgia 30332, United States
12
Siraj A, Lim DY, Tayara H, Chong KT. UbiComb: A Hybrid Deep Learning Model for Predicting Plant-Specific Protein Ubiquitylation Sites. Genes (Basel) 2021; 12:717. [PMID: 34064731 PMCID: PMC8151217 DOI: 10.3390/genes12050717] [Citation(s) in RCA: 14] [Impact Index Per Article: 4.7] [Received: 04/15/2021] [Revised: 05/06/2021] [Accepted: 05/07/2021] [Indexed: 12/11/2022]
Abstract
Protein ubiquitylation is an essential post-translational modification process that performs a critical role in a wide range of biological functions, even a degenerative role in certain diseases, and is consequently used as a promising target for the treatment of various diseases. Owing to the significant role of protein ubiquitylation, these sites can be identified by enzymatic approaches, mass spectrometry analysis, and combinations of multidimensional liquid chromatography and tandem mass spectrometry. However, these large-scale experimental screening techniques are time consuming, expensive, and laborious. To overcome the drawbacks of experimental methods, machine learning and deep learning-based predictors were considered for prediction in a timely and cost-effective manner. In the literature, several computational predictors have been published across species; however, predictors are species-specific because of the unclear patterns in different species. In this study, we proposed a novel approach for predicting plant ubiquitylation sites using a hybrid deep learning model by utilizing convolutional neural network and long short-term memory. The proposed method uses the actual protein sequence and physicochemical properties as inputs to the model and provides more robust predictions. The proposed predictor achieved the best result with accuracy values of 80% and 81% and F-scores of 79% and 82% on the 10-fold cross-validation and an independent dataset, respectively. Moreover, we also compared the testing of the independent dataset with popular ubiquitylation predictors; the results demonstrate that our model significantly outperforms the other methods in prediction classification results.
Affiliation(s)
- Arslan Siraj
- Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju 54896, Korea
- Dae Yeong Lim
- Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju 54896, Korea
- Hilal Tayara
- School of International Engineering and Science, Jeonbuk National University, Jeonju 54896, Korea
- Correspondence: (H.T.); (K.T.C.)
- Kil To Chong
- Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju 54896, Korea
- Advanced Electronics and Information Research Center, Jeonbuk National University, Jeonju 54896, Korea
- Correspondence: (H.T.); (K.T.C.)