1
Haas BC, Kalyani D, Sigman MS. Applying statistical modeling strategies to sparse datasets in synthetic chemistry. Science Advances 2025; 11:eadt3013. [PMID: 39742471 DOI: 10.1126/sciadv.adt3013] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/20/2024] [Accepted: 11/20/2024] [Indexed: 01/03/2025]
Abstract
The application of statistical modeling in organic chemistry is emerging as a standard practice for probing structure-activity relationships and as a predictive tool for many optimization objectives. This review is intended as a tutorial for those entering the area of statistical modeling in chemistry. We provide case studies to highlight the considerations and approaches that can be used to successfully analyze datasets in low-data regimes, a common situation given the experimental demands of organic chemistry. Statistical modeling hinges on the data (what is being modeled), descriptors (how data are represented), and algorithms (how data are modeled). Herein, we focus on how various reaction outputs (e.g., yield, rate, selectivity, solubility, stability, and turnover number) and data structures (e.g., binned, heavily skewed, and distributed) influence the choice of algorithm used for constructing predictive and chemically insightful statistical models.
Affiliation(s)
- Brittany C Haas
  - Department of Chemistry, University of Utah, Salt Lake City, UT 84112, USA
- Matthew S Sigman
  - Department of Chemistry, University of Utah, Salt Lake City, UT 84112, USA
2
Yu Y, Hossain MM, Sikder R, Qi Z, Huo L, Chen R, Dou W, Shi B, Ye T. Exploring the potential of machine learning to understand the occurrence and health risks of haloacetic acids in a drinking water distribution system. The Science of the Total Environment 2024; 951:175573. [PMID: 39153609 DOI: 10.1016/j.scitotenv.2024.175573] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/08/2024] [Revised: 08/07/2024] [Accepted: 08/14/2024] [Indexed: 08/19/2024]
Abstract
Determining the occurrence of disinfection byproducts (DBPs) in drinking water distribution systems (DWDS) remains challenging. Predicting DBPs using readily available water quality parameters can help to understand DBP-associated risks and capture the complex interrelationships between water quality and DBP occurrence. In this study, we collected drinking water samples from a distribution network throughout a year and measured the related water quality parameters (WQPs) and haloacetic acids (HAAs). Twelve machine learning (ML) algorithms were evaluated. Random Forest (RF) achieved the best performance (i.e., R² of 0.78 and RMSE of 7.74) for predicting HAA concentration. Instead of using cytotoxicity or genotoxicity separately as the surrogate for evaluating toxicity associated with HAAs, we created a health risk index (HRI) that was calculated as the sum of cytotoxicity and genotoxicity of HAAs following the widely used Tic-Tox approach. Similarly, ML models were developed to predict the HRI, and the RF model was found to perform the best, obtaining R² of 0.69 and RMSE of 0.38. To further explore advanced ML approaches, we developed three models using uncertainty-based active learning. Our findings revealed that the Categorical Boosting Regression (CAT) model developed through active learning substantially outperformed other models, achieving R² of 0.87 and 0.82 for predicting concentration and the HRI, respectively. Feature importance analysis with the CAT model revealed that temperature, ions (e.g., chloride and nitrate), and DOC concentration in the distribution network had a significant impact on the occurrence of HAAs. Meanwhile, chloride ion, pH, ORP, and free chlorine were found to be the most important features for HRI prediction. This study demonstrates that ML has potential for predicting HAA occurrence and toxicity. By identifying key WQPs impacting HAA occurrence and toxicity, this research offers valuable insights for targeted DBP mitigation strategies.
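For readers unfamiliar with the two metrics quoted in this abstract, the following is a minimal sketch of how R2 (the coefficient of determination) and RMSE are conventionally computed. The function names and the numbers are illustrative assumptions, not code or data from the study.

```python
import math


def rmse(y_true, y_pred):
    """Root-mean-square error between observed and predicted values."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical observed and predicted values (not data from the study).
observed = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]

print(f"RMSE = {rmse(observed, predicted):.3f}")
print(f"R2   = {r2(observed, predicted):.3f}")
```

RMSE carries the units of the predicted quantity, whereas R2 is dimensionless, which is why the two are usually reported together.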
Affiliation(s)
- Ying Yu
  - School of Environmental Science and Engineering, Xiamen University of Technology, Xiamen 361024, China; Drinking Water Science and Technology, Research Center for Eco-Environmental Sciences, Chinese Academy of Sciences, Beijing 100085, China; Key Laboratory of Water Resources Utilization and Protection, Xiamen city, Xiamen 361005, China
- Md Mahjib Hossain
  - Department of Civil and Environmental Engineering, South Dakota School of Mines and Technology, Rapid City, SD 57701, USA
- Rabbi Sikder
  - Department of Civil and Environmental Engineering, South Dakota School of Mines and Technology, Rapid City, SD 57701, USA
- Zhenguo Qi
  - Drinking Water Science and Technology, Research Center for Eco-Environmental Sciences, Chinese Academy of Sciences, Beijing 100085, China
- Lixin Huo
  - Drinking Water Science and Technology, Research Center for Eco-Environmental Sciences, Chinese Academy of Sciences, Beijing 100085, China
- Ruya Chen
  - School of Environmental Science and Engineering, Zhejiang Gongshang University, Hangzhou 310018, Zhejiang, China
- Wenyue Dou
  - Key Laboratory of Industrial Pollution Control and Reuse of Jiangsu Province, College of Environmental Engineering, Xuzhou University of Technology, Xuzhou 221018, China
- Baoyou Shi
  - Drinking Water Science and Technology, Research Center for Eco-Environmental Sciences, Chinese Academy of Sciences, Beijing 100085, China
- Tao Ye
  - Department of Civil and Environmental Engineering, South Dakota School of Mines and Technology, Rapid City, SD 57701, USA
3
Nur A, Lai JY, Ch'ng ACW, Choong YS, Wan Isa WYH, Lim TS. A review of in vitro stochastic and non-stochastic affinity maturation strategies for phage display derived monoclonal antibodies. Int J Biol Macromol 2024; 277:134217. [PMID: 39069045 DOI: 10.1016/j.ijbiomac.2024.134217] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/15/2024] [Revised: 07/24/2024] [Accepted: 07/25/2024] [Indexed: 07/30/2024]
Abstract
Monoclonal antibodies identified using display technologies like phage display occasionally suffer from a lack of affinity, making them unsuitable for application. This drawback is circumvented with the application of affinity maturation. Affinity maturation is an essential step in the natural evolution of antibodies in the immune system. The evolution of molecular-based methods has seen the development of various mutagenesis approaches. This allows the natural evolutionary process during somatic hypermutation to be replicated in the laboratory, so that affinity maturation can fine-tune the affinity and selectivity of antibodies. In this review, we will discuss affinity maturation strategies for mAbs generated through phage display systems. The review will highlight various in vitro stochastic and non-stochastic affinity maturation approaches that include, but are not limited to, random mutagenesis, site-directed mutagenesis, and gene synthesis.
Affiliation(s)
- Alia Nur
  - Institute for Research in Molecular Medicine, Universiti Sains Malaysia, 11800 Penang, Malaysia
- Jing Yi Lai
  - Institute for Research in Molecular Medicine, Universiti Sains Malaysia, 11800 Penang, Malaysia
- Angela Chiew Wen Ch'ng
  - Institute for Research in Molecular Medicine, Universiti Sains Malaysia, 11800 Penang, Malaysia
- Yee Siew Choong
  - Institute for Research in Molecular Medicine, Universiti Sains Malaysia, 11800 Penang, Malaysia
- Wan Yus Haniff Wan Isa
  - School of Medical Sciences, Department of Medicine, Universiti Sains Malaysia, 16150 Kubang Kerian, Kelantan, Malaysia
- Theam Soon Lim
  - Institute for Research in Molecular Medicine, Universiti Sains Malaysia, 11800 Penang, Malaysia; Analytical Biochemistry Research Centre, Universiti Sains Malaysia, 11800 Penang, Malaysia
4
Xu H, Zhao Y, Zhang Y, Han J, Zan P, He S, Bo X. Deep active learning with high structural discriminability for molecular mutagenicity prediction. Commun Biol 2024; 7:1071. [PMID: 39217273 PMCID: PMC11366013 DOI: 10.1038/s42003-024-06758-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/09/2023] [Accepted: 08/21/2024] [Indexed: 09/04/2024] [Open Access]
Abstract
The assessment of mutagenicity is essential in drug discovery, as mutagenic compounds may lead to cancer and germ cell damage. Although in silico methods have been proposed for mutagenicity prediction, their performance is hindered by the scarcity of labeled molecules. Moreover, experimental mutagenicity testing can be time-consuming and costly. One solution to reduce the annotation cost is active learning, where the algorithm actively selects the most valuable molecules from a vast chemical space and presents them to the oracle (e.g., a human expert) for annotation, thereby rapidly improving the model's predictive performance with a smaller annotation cost. In this paper, we propose muTOX-AL, a deep active learning framework, which can actively explore the chemical space and identify the most valuable molecules, resulting in competitive performance with a small number of labeled samples. The experimental results show that, compared to the random sampling strategy, muTOX-AL can reduce the number of training molecules by about 57%. Additionally, muTOX-AL exhibits outstanding molecular structural discriminability, allowing it to pick molecules with high structural similarity but opposite properties.
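The select-then-annotate loop described in this abstract can be illustrated with a generic pool-based uncertainty-sampling round. This is a minimal sketch only: the abstract does not state muTOX-AL's selection criterion, so the uncertainty score, the function names, and the toy model and oracle below are all assumptions made for illustration.

```python
import math


def uncertainty(p):
    """Binary-label uncertainty score: 1.0 at p = 0.5, 0.0 at p = 0 or 1."""
    return 1.0 - abs(2.0 * p - 1.0)


def select_queries(pool, predict_proba, k):
    """Return the k pool items whose predicted probability is most uncertain."""
    ranked = sorted(pool, key=lambda x: uncertainty(predict_proba(x)), reverse=True)
    return ranked[:k]


def active_learning_round(labeled, pool, predict_proba, oracle, k):
    """One round: pick k uncertain items, ask the oracle, move them out of the pool."""
    queries = select_queries(pool, predict_proba, k)
    for x in queries:
        labeled[x] = oracle(x)  # the oracle supplies the true label
    remaining = [x for x in pool if x not in queries]
    return labeled, remaining


# Toy stand-ins (all hypothetical): items are plain numbers, the "model" is a
# fixed logistic curve, and the "oracle" labels an item positive if it exceeds 0.
def toy_model(x):
    return 1.0 / (1.0 + math.exp(-x))


def toy_oracle(x):
    return int(x > 0)


labeled, pool = active_learning_round({}, [-3.0, -0.2, 0.1, 2.5], toy_model, toy_oracle, k=2)
print(labeled)  # the two items nearest the decision boundary, now labeled
print(pool)     # the items left unlabeled for later rounds
```

In a real workflow the model would be retrained on the enlarged labeled set after each round, and the loop would stop once the annotation budget is spent.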
Affiliation(s)
- Huiyan Xu
  - Shanghai Key Laboratory of Power Station Automation Technology, School of Mechatronics Engineering and Automation, Shanghai University, Shanghai, China
  - Academy of Military Medical Sciences, Beijing, China
- Yanpeng Zhao
  - Academy of Military Medical Sciences, Beijing, China
- Yixin Zhang
  - Academy of Military Medical Sciences, Beijing, China
- Junshan Han
  - Academy of Military Medical Sciences, Beijing, China
- Peng Zan
  - Shanghai Key Laboratory of Power Station Automation Technology, School of Mechatronics Engineering and Automation, Shanghai University, Shanghai, China
- Song He
  - Academy of Military Medical Sciences, Beijing, China
- Xiaochen Bo
  - Academy of Military Medical Sciences, Beijing, China
5
Tom G, Schmid SP, Baird SG, Cao Y, Darvish K, Hao H, Lo S, Pablo-García S, Rajaonson EM, Skreta M, Yoshikawa N, Corapi S, Akkoc GD, Strieth-Kalthoff F, Seifrid M, Aspuru-Guzik A. Self-Driving Laboratories for Chemistry and Materials Science. Chem Rev 2024; 124:9633-9732. [PMID: 39137296 PMCID: PMC11363023 DOI: 10.1021/acs.chemrev.4c00055] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/15/2024]
Abstract
Self-driving laboratories (SDLs) promise an accelerated application of the scientific method. Through the automation of experimental workflows, along with autonomous experimental planning, SDLs hold the potential to greatly accelerate research in chemistry and materials discovery. This review provides an in-depth analysis of the state-of-the-art in SDL technology, its applications across various scientific disciplines, and the potential implications for research and industry. This review additionally provides an overview of the enabling technologies for SDLs, including their hardware, software, and integration with laboratory infrastructure. Most importantly, this review explores the diverse range of scientific domains where SDLs have made significant contributions, from drug discovery and materials science to genomics and chemistry. We provide a comprehensive review of existing real-world examples of SDLs, their different levels of automation, and the challenges and limitations associated with each domain.
Affiliation(s)
- Gary Tom
  - Department of Chemistry, University of Toronto, 80 St. George St, Toronto, Ontario M5S 3H6, Canada
  - Department of Computer Science, University of Toronto, 40 St. George St, Toronto, Ontario M5S 2E4, Canada
  - Vector Institute for Artificial Intelligence, 661 University Ave Suite 710, Toronto, Ontario M5G 1M1, Canada
- Stefan P. Schmid
  - Department of Chemistry and Applied Biosciences, ETH Zurich, Vladimir-Prelog-Weg 1, CH-8093 Zurich, Switzerland
- Sterling G. Baird
  - Acceleration Consortium, 80 St. George St, Toronto, Ontario M5S 3H6, Canada
- Yang Cao
  - Department of Chemistry, University of Toronto, 80 St. George St, Toronto, Ontario M5S 3H6, Canada
  - Department of Computer Science, University of Toronto, 40 St. George St, Toronto, Ontario M5S 2E4, Canada
  - Acceleration Consortium, 80 St. George St, Toronto, Ontario M5S 3H6, Canada
- Kourosh Darvish
  - Department of Computer Science, University of Toronto, 40 St. George St, Toronto, Ontario M5S 2E4, Canada
  - Vector Institute for Artificial Intelligence, 661 University Ave Suite 710, Toronto, Ontario M5G 1M1, Canada
  - Acceleration Consortium, 80 St. George St, Toronto, Ontario M5S 3H6, Canada
- Han Hao
  - Department of Chemistry, University of Toronto, 80 St. George St, Toronto, Ontario M5S 3H6, Canada
  - Department of Computer Science, University of Toronto, 40 St. George St, Toronto, Ontario M5S 2E4, Canada
  - Acceleration Consortium, 80 St. George St, Toronto, Ontario M5S 3H6, Canada
- Stanley Lo
  - Department of Chemistry, University of Toronto, 80 St. George St, Toronto, Ontario M5S 3H6, Canada
- Sergio Pablo-García
  - Department of Chemistry, University of Toronto, 80 St. George St, Toronto, Ontario M5S 3H6, Canada
  - Department of Computer Science, University of Toronto, 40 St. George St, Toronto, Ontario M5S 2E4, Canada
- Ella M. Rajaonson
  - Department of Chemistry, University of Toronto, 80 St. George St, Toronto, Ontario M5S 3H6, Canada
  - Vector Institute for Artificial Intelligence, 661 University Ave Suite 710, Toronto, Ontario M5G 1M1, Canada
- Marta Skreta
  - Department of Computer Science, University of Toronto, 40 St. George St, Toronto, Ontario M5S 2E4, Canada
  - Vector Institute for Artificial Intelligence, 661 University Ave Suite 710, Toronto, Ontario M5G 1M1, Canada
- Naruki Yoshikawa
  - Department of Computer Science, University of Toronto, 40 St. George St, Toronto, Ontario M5S 2E4, Canada
  - Vector Institute for Artificial Intelligence, 661 University Ave Suite 710, Toronto, Ontario M5G 1M1, Canada
- Samantha Corapi
  - Department of Chemistry, University of Toronto, 80 St. George St, Toronto, Ontario M5S 3H6, Canada
- Gun Deniz Akkoc
  - Forschungszentrum Jülich GmbH, Helmholtz Institute for Renewable Energy Erlangen-Nürnberg, Cauerstr. 1, 91058 Erlangen, Germany
  - Department of Chemical and Biological Engineering, Friedrich-Alexander Universität Erlangen-Nürnberg, Egerlandstr. 3, 91058 Erlangen, Germany
- Felix Strieth-Kalthoff
  - Department of Chemistry, University of Toronto, 80 St. George St, Toronto, Ontario M5S 3H6, Canada
  - Department of Computer Science, University of Toronto, 40 St. George St, Toronto, Ontario M5S 2E4, Canada
  - School of Mathematics and Natural Sciences, University of Wuppertal, Gaußstraße 20, 42119 Wuppertal, Germany
- Martin Seifrid
  - Department of Chemistry, University of Toronto, 80 St. George St, Toronto, Ontario M5S 3H6, Canada
  - Department of Computer Science, University of Toronto, 40 St. George St, Toronto, Ontario M5S 2E4, Canada
  - Department of Materials Science and Engineering, North Carolina State University, Raleigh, North Carolina 27695, United States of America
- Alán Aspuru-Guzik
  - Department of Chemistry, University of Toronto, 80 St. George St, Toronto, Ontario M5S 3H6, Canada
  - Department of Computer Science, University of Toronto, 40 St. George St, Toronto, Ontario M5S 2E4, Canada
  - Vector Institute for Artificial Intelligence, 661 University Ave Suite 710, Toronto, Ontario M5G 1M1, Canada
  - Acceleration Consortium, 80 St. George St, Toronto, Ontario M5S 3H6, Canada
  - Department of Chemical Engineering & Applied Chemistry, University of Toronto, Toronto, Ontario M5S 3E5, Canada
  - Department of Materials Science & Engineering, University of Toronto, Toronto, Ontario M5S 3E4, Canada
  - Lebovic Fellow, Canadian Institute for Advanced Research (CIFAR), 661 University Ave, Toronto, Ontario M5G 1M1, Canada
6
Su Y, Wang X, Ye Y, Xie Y, Xu Y, Jiang Y, Wang C. Automation and machine learning augmented by large language models in a catalysis study. Chem Sci 2024; 15:12200-12233. [PMID: 39118602 PMCID: PMC11304797 DOI: 10.1039/d3sc07012c] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/31/2023] [Accepted: 06/21/2024] [Indexed: 08/10/2024] [Open Access]
Abstract
Recent advancements in artificial intelligence and automation are transforming catalyst discovery and design from a traditional trial-and-error manual mode into intelligent, high-throughput digital methodologies. This transformation is driven by four key components: high-throughput information extraction, automated robotic experimentation, real-time feedback for iterative optimization, and interpretable machine learning for generating new knowledge. These innovations have given rise to the development of self-driving labs and significantly accelerated materials research. Over the past two years, the emergence of large language models (LLMs) has added a new dimension to this field, providing unprecedented flexibility in information integration, decision-making, and interaction with human researchers. This review explores how LLMs are reshaping catalyst design, heralding a revolutionary change in the field.
Affiliation(s)
- Yuming Su
  - iChem, State Key Laboratory of Physical Chemistry of Solid Surfaces, College of Chemistry and Chemical Engineering, Xiamen University, Xiamen 361005, P. R. China
  - Innovation Laboratory for Sciences and Technologies of Energy Materials of Fujian Province (IKKEM), Xiamen 361005, P. R. China
- Xue Wang
  - iChem, State Key Laboratory of Physical Chemistry of Solid Surfaces, College of Chemistry and Chemical Engineering, Xiamen University, Xiamen 361005, P. R. China
- Yuanxiang Ye
  - Institute of Artificial Intelligence, Xiamen University, Xiamen 361005, P. R. China
- Yibo Xie
  - Institute of Artificial Intelligence, Xiamen University, Xiamen 361005, P. R. China
- Yujing Xu
  - iChem, State Key Laboratory of Physical Chemistry of Solid Surfaces, College of Chemistry and Chemical Engineering, Xiamen University, Xiamen 361005, P. R. China
- Yibin Jiang
  - Innovation Laboratory for Sciences and Technologies of Energy Materials of Fujian Province (IKKEM), Xiamen 361005, P. R. China
- Cheng Wang
  - iChem, State Key Laboratory of Physical Chemistry of Solid Surfaces, College of Chemistry and Chemical Engineering, Xiamen University, Xiamen 361005, P. R. China
  - Innovation Laboratory for Sciences and Technologies of Energy Materials of Fujian Province (IKKEM), Xiamen 361005, P. R. China
7
Voinarovska V, Kabeshov M, Dudenko D, Genheden S, Tetko IV. When Yield Prediction Does Not Yield Prediction: An Overview of the Current Challenges. J Chem Inf Model 2024; 64:42-56. [PMID: 38116926 PMCID: PMC10778086 DOI: 10.1021/acs.jcim.3c01524] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/21/2023] [Revised: 11/29/2023] [Accepted: 11/30/2023] [Indexed: 12/21/2023]
Abstract
Machine Learning (ML) techniques face significant challenges when predicting advanced chemical properties, such as yield, feasibility of chemical synthesis, and optimal reaction conditions. These challenges stem from the high-dimensional nature of the prediction task and the myriad essential variables involved, ranging from reactants and reagents to catalysts, temperature, and purification processes. Successfully developing a reliable predictive model not only holds the potential for optimizing high-throughput experiments but can also elevate existing retrosynthetic predictive approaches and bolster a plethora of applications within the field. In this review, we systematically evaluate the efficacy of current ML methodologies in chemoinformatics, shedding light on their milestones and inherent limitations. Additionally, a detailed examination of a representative case study provides insights into the prevailing issues related to data availability and transferability in the discipline.
Affiliation(s)
- Varvara Voinarovska
  - Molecular AI, Discovery Sciences R&D, AstraZeneca, 431 83 Gothenburg, Sweden
  - TUM Graduate School, Faculty of Chemistry, Technical University of Munich, 85748 Garching, Germany
- Mikhail Kabeshov
  - Molecular AI, Discovery Sciences R&D, AstraZeneca, 431 83 Gothenburg, Sweden
- Dmytro Dudenko
  - Enamine Ltd., 78 Chervonotkatska str., 02094 Kyiv, Ukraine
- Samuel Genheden
  - Molecular AI, Discovery Sciences R&D, AstraZeneca, 431 83 Gothenburg, Sweden
- Igor V. Tetko
  - Molecular Targets and Therapeutics Center, Helmholtz Munich - Deutsches Forschungszentrum für Gesundheit und Umwelt (GmbH), Institute of Structural Biology, 85764 Neuherberg, Germany
8
Raghavan P, Haas BC, Ruos ME, Schleinitz J, Doyle AG, Reisman SE, Sigman MS, Coley CW. Dataset Design for Building Models of Chemical Reactivity. ACS Central Science 2023; 9:2196-2204. [PMID: 38161380 PMCID: PMC10755851 DOI: 10.1021/acscentsci.3c01163] [Citation(s) in RCA: 14] [Impact Index Per Article: 7.0] [Received: 09/20/2023] [Revised: 11/06/2023] [Accepted: 11/15/2023] [Indexed: 01/03/2024]
Abstract
Models can codify our understanding of chemical reactivity and serve a useful purpose in the development of new synthetic processes via, for example, evaluating hypothetical reaction conditions or in silico substrate tolerance. Perhaps the most determining factor is the composition of the training data and whether it is sufficient to train a model that can make accurate predictions over the full domain of interest. Here, we discuss the design of reaction datasets in ways that are conducive to data-driven modeling, emphasizing the idea that training set diversity and model generalizability rely on the choice of molecular or reaction representation. We additionally discuss the experimental constraints associated with generating common types of chemistry datasets and how these considerations should influence dataset design and model building.
Affiliation(s)
- Priyanka Raghavan
  - Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
- Brittany C. Haas
  - Department of Chemistry, University of Utah, Salt Lake City, Utah 84112, United States
- Madeline E. Ruos
  - Department of Chemistry & Biochemistry, University of California, Los Angeles, Los Angeles, California 90095, United States
- Jules Schleinitz
  - Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, California 91125, United States
- Abigail G. Doyle
  - Department of Chemistry & Biochemistry, University of California, Los Angeles, Los Angeles, California 90095, United States
- Sarah E. Reisman
  - Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, California 91125, United States
- Matthew S. Sigman
  - Department of Chemistry, University of Utah, Salt Lake City, Utah 84112, United States
- Connor W. Coley
  - Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
  - Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
9
Ghadi YY, Mazhar T, Shah SFA, Haq I, Ahmad W, Ouahada K, Hamam H. Integration of federated learning with IoT for smart cities applications, challenges, and solutions. PeerJ Comput Sci 2023; 9:e1657. [PMID: 38192447 PMCID: PMC10773731 DOI: 10.7717/peerj-cs.1657] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/07/2023] [Accepted: 09/29/2023] [Indexed: 01/10/2024]
Abstract
In the past few years, privacy concerns have grown, making the financial models of businesses more vulnerable to attack. In many cases, it is hard to overemphasize the importance of monitoring things in real time with data from Internet of Things (IoT) devices. Both the makers and the users of IoT devices face major problems when they try to use Artificial Intelligence (AI) techniques in real-world applications where data must be collected and processed at a central location. Federated learning (FL) provides a decentralized, cooperative AI system that can be used by many AI-enabled IoT applications. This is possible because FL can train AI models on distributed IoT devices that do not need to share data. FL allows local models to be trained on local data and to share their knowledge to improve a global model. Also, shared learning allows models from all over the world to be trained using data from all over the world. This article looks at the IoT in all of its forms, including "smart" businesses, "smart" cities, "smart" transportation, and "smart" healthcare. This study looks at the safety problems that the federated learning with IoT (FL-IoT) area has brought to market. This research is needed because federated learning is a new technique and little work has been done on the challenges faced during its integration with IoT. This research also helps in real-world applications where encrypted data must be sent from one place to another. The intended audience of this article is researchers and graduate students.
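The idea that local models share their knowledge to improve a global model is often realized with federated averaging, in which a server combines client parameters weighted by each client's sample count. The abstract does not name an aggregation rule, so the following is a minimal sketch under that assumption; the function name, parameter vectors, and counts are hypothetical.

```python
def federated_average(local_weights, sample_counts):
    """Combine client parameter vectors into one global vector.

    Each client's parameters are weighted by the number of local samples
    it trained on; only parameters, never raw data, reach the server.
    """
    total = sum(sample_counts)
    n_params = len(local_weights[0])
    return [
        sum(w[i] * n for w, n in zip(local_weights, sample_counts)) / total
        for i in range(n_params)
    ]


# Hypothetical example: two IoT clients, each holding a two-parameter model.
client_weights = [[1.0, 2.0], [3.0, 4.0]]
client_samples = [1, 3]  # the second client trained on three times as much data

global_weights = federated_average(client_weights, client_samples)
print(global_weights)
```

In practice the server would send the averaged parameters back to the clients and repeat the exchange over many rounds.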
Affiliation(s)
- Yazeed Yasin Ghadi
  - Department of Computer Science and Software Engineering, Al Ain University, Abu Dhabi, UAE
- Tehseen Mazhar
  - Department of Computer Science, Virtual University of Pakistan, Lahore, Punjab, Pakistan
- Syed Faisal Abbas Shah
  - Department of Computer Science, Virtual University of Pakistan, Lahore, Punjab, Pakistan
- Inayatul Haq
  - School of Electrical and Information Engineering, Zhengzhou University, Zhengzhou, Henan, China
- Wasim Ahmad
  - Department of Computer Science and Information Technology, University of Malakand, Chakdara, Dir, Pakistan
- Khmaies Ouahada
  - School of Electrical Engineering, Department of Electrical and Electronic Engineering Science, University of Johannesburg, Johannesburg, South Africa
- Habib Hamam
  - School of Electrical Engineering, Department of Electrical and Electronic Engineering Science, University of Johannesburg, Johannesburg, South Africa
  - Commune d’Akanda, International Institute of Technology and Management, BP Libreville, Estuaire, Gabon
  - Faculty of Engineering, University of Moncton, Moncton, New Brunswick, Canada
  - College of Computer Science and Engineering, University of Ha’il, Ha’il, Saudi Arabia
  - Production & Skills Development, Spectrum of Knowledge Production & Skills Development, Sfax, Tunisia
10
Luo Y, Liu Y, Peng J. Calibrated geometric deep learning improves kinase-drug binding predictions. Nat Mach Intell 2023; 5:1390-1401. [PMID: 38962391 PMCID: PMC11221792 DOI: 10.1038/s42256-023-00751-0] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 12/20/2022] [Accepted: 09/29/2023] [Indexed: 07/05/2024]
Abstract
Protein kinases regulate various cellular functions and hold significant pharmacological promise in cancer and other diseases. Although kinase inhibitors are one of the largest groups of approved drugs, much of the human kinome remains unexplored but potentially druggable. Computational approaches, such as machine learning, offer efficient solutions for exploring kinase-compound interactions and uncovering novel binding activities. Despite the increasing availability of three-dimensional (3D) protein and compound structures, existing methods predominantly focus on exploiting local features from one-dimensional protein sequences and two-dimensional molecular graphs to predict binding affinities, overlooking the 3D nature of the binding process. Here we present KDBNet, a deep learning algorithm that incorporates 3D protein and molecule structure data to predict binding affinities. KDBNet uses graph neural networks to learn structure representations of protein binding pockets and drug molecules, capturing the geometric and spatial characteristics of binding activity. In addition, we introduce an algorithm to quantify and calibrate the uncertainties of KDBNet's predictions, enhancing its utility in model-guided discovery in chemical or protein space. Experiments demonstrated that KDBNet outperforms existing deep learning models in predicting kinase-drug binding affinities. The uncertainties estimated by KDBNet are informative and well-calibrated with respect to prediction errors. When integrated with a Bayesian optimization framework, KDBNet enables data-efficient active learning and accelerates the exploration and exploitation of diverse high-binding kinase-drug pairs.
Affiliation(s)
- Yunan Luo
  - School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, GA, USA
  - These authors contributed equally: Yunan Luo, Yang Liu
- Yang Liu
  - Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, IL, USA
  - These authors contributed equally: Yunan Luo, Yang Liu
- Jian Peng
  - Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, IL, USA
11
Shim E, Tewari A, Cernak T, Zimmerman PM. Machine Learning Strategies for Reaction Development: Toward the Low-Data Limit. J Chem Inf Model 2023; 63:3659-3668. [PMID: 37312524 PMCID: PMC11163943 DOI: 10.1021/acs.jcim.3c00577] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 06/15/2023]
Abstract
Machine learning models are increasingly being utilized to predict outcomes of organic chemical reactions. A large amount of reaction data is used to train these models, which is in stark contrast to how expert chemists discover and develop new reactions by leveraging information from a small number of relevant transformations. Transfer learning and active learning are two strategies that can operate in low-data situations, which may help fill this gap and promote the use of machine learning for tackling real-world challenges in organic synthesis. This Perspective introduces active and transfer learning and connects these to potential opportunities and directions for further research, especially in the area of prospective development of chemical transformations.
Collapse
Affiliation(s)
- Eunjae Shim
- Department of Chemistry, University of Michigan, Ann Arbor, Michigan 48109, United States
| | - Ambuj Tewari
- Department of Statistics, University of Michigan, Ann Arbor, Michigan 48109, United States
- Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, Michigan 48109, United States
| | - Tim Cernak
- Department of Chemistry, University of Michigan, Ann Arbor, Michigan 48109, United States
- Department of Medicinal Chemistry, University of Michigan, Ann Arbor, Michigan 48109, United States
| | - Paul M Zimmerman
- Department of Chemistry, University of Michigan, Ann Arbor, Michigan 48109, United States
| |
|
12
|
Shields MD, Gurley K, Catarelli R, Chauhan M, Ojeda-Tuz M, Masters FJ. Active learning applied to automated physical systems increases the rate of discovery. Sci Rep 2023; 13:8402. [PMID: 37225752 DOI: 10.1038/s41598-023-35257-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/12/2022] [Accepted: 05/15/2023] [Indexed: 05/26/2023] Open
Abstract
Active machine learning is widely used in computational studies where repeated numerical simulations can be conducted on high performance computers without human intervention. But translation of these active learning methods to physical systems has proven more difficult and the accelerated pace of discoveries aided by these methods remains as yet unrealized. Through the presentation of a general active learning framework and its application to large-scale boundary layer wind tunnel experiments, we demonstrate that the active learning framework used so successfully in computational studies is directly applicable to the investigation of physical experimental systems and the corresponding improvements in the rate of discovery can be transformative. We specifically show that, for our wind tunnel experiments, we are able to achieve in approximately 300 experiments a learning objective that would be impossible using traditional methods.
Affiliation(s)
- Michael D Shields
- Department of Civil and Systems Engineering, Johns Hopkins University, Baltimore, MD, 21212, USA.
| | - Kurtis Gurley
- Department of Civil and Coastal Engineering, University of Florida, Gainesville, FL, 32611, USA
| | - Ryan Catarelli
- Department of Civil and Coastal Engineering, University of Florida, Gainesville, FL, 32611, USA
| | - Mohit Chauhan
- Department of Civil and Systems Engineering, Johns Hopkins University, Baltimore, MD, 21212, USA
| | - Mariel Ojeda-Tuz
- Department of Civil and Coastal Engineering, University of Florida, Gainesville, FL, 32611, USA
| | - Forrest J Masters
- Department of Civil and Coastal Engineering, University of Florida, Gainesville, FL, 32611, USA
| |
|
13
|
Taylor CJ, Pomberger A, Felton KC, Grainger R, Barecka M, Chamberlain TW, Bourne RA, Johnson CN, Lapkin AA. A Brief Introduction to Chemical Reaction Optimization. Chem Rev 2023; 123:3089-3126. [PMID: 36820880 PMCID: PMC10037254 DOI: 10.1021/acs.chemrev.2c00798] [Citation(s) in RCA: 62] [Impact Index Per Article: 31.0] [Received: 11/15/2022] [Indexed: 02/24/2023]
Abstract
From the start of a synthetic chemist's training, experiments are conducted based on recipes from textbooks and manuscripts that achieve clean reaction outcomes, allowing the scientist to develop practical skills and some chemical intuition. This procedure is often kept long into a researcher's career, as new recipes are developed based on similar reaction protocols, and intuition-guided deviations are conducted through learning from failed experiments. However, when attempting to understand chemical systems of interest, it has been shown that model-based, algorithm-based, and miniaturized high-throughput techniques outperform human chemical intuition and achieve reaction optimization in a much more time- and material-efficient manner; this is covered in detail in this paper. As many synthetic chemists are not exposed to these techniques in undergraduate teaching, this leads to a disproportionate number of scientists that wish to optimize their reactions but are unable to use these methodologies or are simply unaware of their existence. This review highlights the basics, and the cutting-edge, of modern chemical reaction optimization as well as its relation to process scale-up and can thereby serve as a reference for inspired scientists for each of these techniques, detailing several of their respective applications.
Affiliation(s)
- Connor J. Taylor
- Astex Pharmaceuticals, 436 Cambridge Science Park, Milton Road, Cambridge CB4 0QA, U.K.
- Innovation Centre in Digital Molecular Technologies, Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge CB2 1EW, U.K.
| | - Alexander Pomberger
- Innovation Centre in Digital Molecular Technologies, Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge CB2 1EW, U.K.
| | - Kobi C. Felton
- Department of Chemical Engineering and Biotechnology, University of Cambridge, Philippa Fawcett Drive, Cambridge CB3 0AS, U.K.
| | - Rachel Grainger
- Astex Pharmaceuticals, 436 Cambridge Science Park, Milton Road, Cambridge CB4 0QA, U.K.
| | - Magda Barecka
- Chemical Engineering Department, Northeastern University, 360 Huntington Avenue, Boston, Massachusetts 02115, United States
- Chemistry and Chemical Biology Department, Northeastern University, 360 Huntington Avenue, Boston, Massachusetts 02115, United States
- Cambridge Centre for Advanced Research and Education in Singapore, 1 Create Way, 138602 Singapore
| | - Thomas W. Chamberlain
- Institute of Process Research and Development, School of Chemistry and School of Chemical and Process Engineering, University of Leeds, Leeds LS2 9JT, U.K.
| | - Richard A. Bourne
- Institute of Process Research and Development, School of Chemistry and School of Chemical and Process Engineering, University of Leeds, Leeds LS2 9JT, U.K.
| | | | - Alexei A. Lapkin
- Innovation Centre in Digital Molecular Technologies, Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge CB2 1EW, U.K.
| |
|
14
|
Yang CI, Li YP. Explainable uncertainty quantifications for deep learning-based molecular property prediction. J Cheminform 2023; 15:13. [PMID: 36737786 PMCID: PMC9898940 DOI: 10.1186/s13321-023-00682-3] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 09/10/2022] [Accepted: 01/15/2023] [Indexed: 02/05/2023] Open
Abstract
Quantifying uncertainty in machine learning is important in new research areas with scarce high-quality data. In this work, we develop an explainable uncertainty quantification method for deep learning-based molecular property prediction. This method can capture aleatoric and epistemic uncertainties separately and attribute the uncertainties to atoms present in the molecule. The atom-based uncertainty method provides an extra layer of chemical insight to the estimated uncertainties, i.e., one can analyze individual atomic uncertainty values to diagnose the chemical component that introduces uncertainty to the prediction. Our experiments suggest that atomic uncertainty can detect unseen chemical structures and identify chemical species whose data are potentially associated with significant noise. Furthermore, we propose a post-hoc calibration method to refine the uncertainty quantified by ensemble models for better confidence interval estimates. This work improves uncertainty calibration and provides a framework for assessing whether and why a prediction should be considered unreliable.
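The separation of aleatoric and epistemic uncertainty mentioned in this abstract can be illustrated with the decomposition commonly used for ensembles. The following is a minimal sketch under stated assumptions, not the authors' method: it assumes each ensemble member returns a predicted mean and a predicted variance for one molecule, and the function name `decompose_uncertainty` is hypothetical. The atom-level attribution and the post-hoc calibration described in the abstract are not reproduced here.

```python
from statistics import fmean, pvariance


def decompose_uncertainty(means: list[float], variances: list[float]) -> tuple[float, float, float]:
    """Split ensemble uncertainty for one molecule (illustrative, assumed setup).

    means:     predicted mean from each ensemble member
    variances: predicted (noise) variance from each ensemble member
    """
    if len(means) != len(variances) or not means:
        raise ValueError("need one mean and one variance per ensemble member")
    # Aleatoric part: average of the noise variances the members predict.
    aleatoric = fmean(variances)
    # Epistemic part: disagreement between members, i.e. variance of their means.
    epistemic = pvariance(means)
    return aleatoric, epistemic, aleatoric + epistemic


# Two members that agree on the noise level but disagree on the prediction.
aleatoric, epistemic, total = decompose_uncertainty([1.0, 3.0], [0.25, 0.25])
```

In this toy case the disagreement between members dominates, which is the pattern one would expect for a structure far from the training data.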
Affiliation(s)
- Chu-I Yang
- Department of Chemical Engineering, National Taiwan University, No. 1, Sec. 4, Roosevelt Road, Taipei, 10617 Taiwan
| | - Yi-Pei Li
- Department of Chemical Engineering, National Taiwan University, No. 1, Sec. 4, Roosevelt Road, Taipei, 10617 Taiwan; Taiwan International Graduate Program (TIGP), Academia Sinica, No. 128, Sec. 2, Academia Road, Taipei, 11529 Taiwan
| |
|
15
|
Tu Z, Stuyver T, Coley CW. Predictive chemistry: machine learning for reaction deployment, reaction development, and reaction discovery. Chem Sci 2023; 14:226-244. [PMID: 36743887 PMCID: PMC9811563 DOI: 10.1039/d2sc05089g] [Citation(s) in RCA: 29] [Impact Index Per Article: 14.5] [Received: 09/12/2022] [Accepted: 11/25/2022] [Indexed: 11/29/2022] Open
Abstract
The field of predictive chemistry relates to the development of models able to describe how molecules interact and react. It encompasses the long-standing task of computer-aided retrosynthesis, but is far more reaching and ambitious in its goals. In this review, we summarize several areas where predictive chemistry models hold the potential to accelerate the deployment, development, and discovery of organic reactions and advance synthetic chemistry.
Affiliation(s)
- Zhengkai Tu
- Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, MA 02139, USA
| | - Thijs Stuyver
- Department of Chemical Engineering, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, MA 02139, USA
| | - Connor W Coley
- Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, MA 02139, USA
- Department of Chemical Engineering, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, MA 02139, USA
| |
|
16
|
Rakhimbekova A, Lopukhov A, Klyachko N, Kabanov A, Madzhidov TI, Tropsha A. Efficient design of peptide-binding polymers using active learning approaches. J Control Release 2023; 353:903-914. [PMID: 36402234 DOI: 10.1016/j.jconrel.2022.11.023] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 05/01/2022] [Revised: 10/21/2022] [Accepted: 11/13/2022] [Indexed: 12/23/2022]
Abstract
Active learning (AL) has become a subject of active recent research both in industry and academia as an efficient approach for rapid design and discovery of novel chemicals, materials, and polymers. Herein, we have assessed the applicability of AL for the discovery of polymeric micelle formulations for poorly soluble drugs. We were motivated by the key advantages of this approach making it a desirable strategy for rational design of drug delivery systems due to its ability to (i) employ relatively small datasets for model development, (ii) iterate between model development and model assessment using small external datasets that can be either generated in focused experimental studies or formed from subsets of the initial training data, and (iii) progressively evolve models towards increasingly more reliable predictions and the identification of novel chemicals with the desired properties. In this study, we compared various AL protocols for their effectiveness in finding biologically active molecules using synthetic datasets. We have investigated the dependency of AL performance on the size of the initial training set, the relative complexity of the task, and the choice of the initial training dataset. We found that AL techniques as applied to regression modeling offer no benefits over random search, while AL used for classification tasks performs better than models built for randomly selected training sets but still quite far from perfect. Finally, the best performing AL approach was employed to discover and experimentally validate novel binding polymers for a case study of asialoglycoprotein receptor (ASGPR).
Affiliation(s)
- Assima Rakhimbekova
- A.M. Butlerov Institute of Chemistry, Kazan Federal University, Kazan 420008, Russia
| | - Anton Lopukhov
- Laboratory of Chemical Design of Bionanomaterials, Faculty of Chemistry, M.V. Lomonosov Moscow State University, Moscow, Russia
| | - Natalia Klyachko
- Laboratory of Chemical Design of Bionanomaterials, Faculty of Chemistry, M.V. Lomonosov Moscow State University, Moscow, Russia
| | - Alexander Kabanov
- Laboratory of Chemical Design of Bionanomaterials, Faculty of Chemistry, M.V. Lomonosov Moscow State University, Moscow, Russia; Center for Nanotechnology in Drug Delivery, Division of Pharmacoengineering and Molecular Pharmaceutics, Eshelman School of Pharmacy, University of North Carolina at Chapel Hill, NC, USA
| | - Timur I Madzhidov
- A.M. Butlerov Institute of Chemistry, Kazan Federal University, Kazan 420008, Russia
| | - Alexander Tropsha
- Laboratory for Molecular Modeling, Division of Chemical Biology and Medicinal Chemistry, UNC Eshelman School of Pharmacy, University of North Carolina, Chapel Hill, NC 27599, USA.
| |
|
17
|
Viet Johansson S, Gummesson Svensson H, Bjerrum E, Schliep A, Haghir Chehreghani M, Tyrchan C, Engkvist O. Using Active Learning to Develop Machine Learning Models for Reaction Yield Prediction. Mol Inform 2022; 41:e2200043. [PMID: 35732584 DOI: 10.1002/minf.202200043] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 02/21/2022] [Accepted: 06/22/2022] [Indexed: 01/05/2023]
Abstract
Computer aided synthesis planning, suggesting synthetic routes for molecules of interest, is a rapidly growing field. The machine learning methods used are often dependent on access to large datasets for training, but finite experimental budgets limit how much data can be obtained from experiments. This suggests the use of schemes for data collection such as active learning, which identifies the data points of highest impact for model accuracy, and which has been used in recent studies with success. However, little has been done to explore the robustness of the methods predicting reaction yield when used together with active learning to reduce the amount of experimental data needed for training. This study aims to investigate the influence of machine learning algorithms and the number of initial data points on reaction yield prediction for two public high-throughput experimentation datasets. Our results show that active learning based on output margin reached a pre-defined AUROC faster than random sampling on both datasets. Analysis of feature importance of the trained machine learning models suggests active learning had a larger influence on the model accuracy when only a few features were important for the model prediction.
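The margin-based selection mentioned in this abstract can be sketched as follows. This is a hedged illustration, not the study's code: it assumes a binary classifier (for example, high versus low yield) that outputs a probability for the positive class, and the names `margin`, `select_by_margin`, and the reaction identifiers are hypothetical. The idea is to query the unlabeled reactions whose two class probabilities are closest to each other.

```python
def margin(p_positive: float) -> float:
    """Gap between the two class probabilities of a binary classifier."""
    return abs(p_positive - (1.0 - p_positive))


def select_by_margin(pool_probs: dict[str, float], batch_size: int) -> list[str]:
    """Return the IDs of the unlabeled reactions with the smallest margin.

    pool_probs maps a reaction ID to the predicted probability of the
    positive class. Ties are broken by ID so the result is deterministic.
    """
    ranked = sorted(pool_probs, key=lambda rid: (margin(pool_probs[rid]), rid))
    return ranked[:batch_size]


# Hypothetical pool: the model is least sure about rxn_b and rxn_d.
pool = {"rxn_a": 0.95, "rxn_b": 0.52, "rxn_c": 0.10, "rxn_d": 0.40}
next_experiments = select_by_margin(pool, batch_size=2)
```

After the selected reactions are run and labeled, the model would be retrained and the selection repeated; a random-sampling baseline simply replaces `select_by_margin` with a random draw from the pool.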
Affiliation(s)
- Simon Viet Johansson
- Molecular AI, Discovery Sciences, R&D, AstraZeneca, SE-431 83, Mölndal, Sweden; Department of Computer Science and Engineering, Chalmers University of Technology and University of Gothenburg, SE-412 96, Göteborg, Sweden
| | - Hampus Gummesson Svensson
- Molecular AI, Discovery Sciences, R&D, AstraZeneca, SE-431 83, Mölndal, Sweden; Department of Computer Science and Engineering, Chalmers University of Technology and University of Gothenburg, SE-412 96, Göteborg, Sweden
| | - Esben Bjerrum
- Molecular AI, Discovery Sciences, R&D, AstraZeneca, SE-431 83, Mölndal, Sweden
| | - Alexander Schliep
- Department of Computer Science and Engineering, Chalmers University of Technology and University of Gothenburg, SE-412 96, Göteborg, Sweden
| | - Morteza Haghir Chehreghani
- Department of Computer Science and Engineering, Chalmers University of Technology and University of Gothenburg, SE-412 96, Göteborg, Sweden
| | - Christian Tyrchan
- Medicinal Chemistry, Research and Early Development, Respiratory and Immunology (R&I), BioPharmaceuticals R&D, AstraZeneca, SE-431 83, Mölndal, Sweden
| | - Ola Engkvist
- Molecular AI, Discovery Sciences, R&D, AstraZeneca, SE-431 83, Mölndal, Sweden; Department of Computer Science and Engineering, Chalmers University of Technology and University of Gothenburg, SE-412 96, Göteborg, Sweden
| |
|
18
|
Yarish D, Garkot S, Grygorenko OO, Radchenko DS, Moroz YS, Gurbych O. Advancing molecular graphs with descriptors for the prediction of chemical reaction yields. J Comput Chem 2022; 44:76-92. [PMID: 36264601 DOI: 10.1002/jcc.27016] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 06/05/2022] [Revised: 08/31/2022] [Accepted: 09/05/2022] [Indexed: 11/08/2022]
Abstract
Chemical yield is the percentage of the reactants converted to the desired products. Chemists use predictive algorithms to select high-yielding reactions and score synthesis routes, saving time and reagents. This study suggests a novel graph neural network architecture for chemical yield prediction. The network combines structural information about participants of the transformation as well as molecular and reaction-level descriptors. It works with incomplete chemical reactions and generates reactants-product atom mapping. We show that the network benefits from advanced information by comparing it with several machine learning models and molecular representations. Models included logistic regression, support vector machine, CatBoost, and Bidirectional Encoder Representations from Transformers. Molecular representations included extended-connectivity fingerprints, Morgan fingerprints, SMILESVec embeddings, and textual. Classification and regression objectives were assessed for each model and feature set. The goal of each classification model was to separate zero- and non-zero-yielding reactions. The models were trained and evaluated on a proprietary dataset of 10 reaction types. Also, the models were benchmarked on two public single reaction type datasets. The study was supplemented with analysis of data, results, and errors, as well as the impact of steric factors, side reactions, isolation, and purification efficiency. The supplementary code is available at https://github.com/SoftServeInc/yield-paper.
Affiliation(s)
| | - Sofiya Garkot
- SoftServe, Inc., Lviv, Ukraine; Ukrainian Catholic University, Lviv, Ukraine
| | - Oleksandr O Grygorenko
- Enamine Ltd., Kyiv, Ukraine; Taras Shevchenko National University of Kyiv, Kyiv, Ukraine
| | - Dmytro S Radchenko
- Enamine Ltd., Kyiv, Ukraine; Taras Shevchenko National University of Kyiv, Kyiv, Ukraine
| | - Yurii S Moroz
- Taras Shevchenko National University of Kyiv, Kyiv, Ukraine; Chemspace LLC, Kyiv, Ukraine
| | - Oleksandr Gurbych
- Lviv Polytechnic National University, Lviv, Ukraine; Blackthorn AI, Ltd., London, UK
| |
|
19
|
Gensch T, Smith SR, Colacot TJ, Timsina YN, Xu G, Glasspoole BW, Sigman MS. Design and Application of a Screening Set for Monophosphine Ligands in Cross-Coupling. ACS Catal 2022. [DOI: 10.1021/acscatal.2c01970] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 12/22/2022]
Affiliation(s)
- Tobias Gensch
- Department of Chemistry, TU Berlin, Straße des 17. Juni 135, Sekr. C2, 10623 Berlin, Germany
| | - Sleight R. Smith
- Department of Chemistry, University of Utah, 315 South 1400 East, Salt Lake City, Utah 84112, United States
| | - Thomas J. Colacot
- MilliporeSigma, 6000 N. Teutonia Ave, Milwaukee, Wisconsin 53209, United States
| | - Yam N. Timsina
- MilliporeSigma, 6000 N. Teutonia Ave, Milwaukee, Wisconsin 53209, United States
| | - Guolin Xu
- MilliporeSigma, 6000 N. Teutonia Ave, Milwaukee, Wisconsin 53209, United States
| | - Ben W. Glasspoole
- MilliporeSigma, 6000 N. Teutonia Ave, Milwaukee, Wisconsin 53209, United States
| | - Matthew S. Sigman
- Department of Chemistry, University of Utah, 315 South 1400 East, Salt Lake City, Utah 84112, United States
| |
|
20
|
Shim E, Kammeraad JA, Xu Z, Tewari A, Cernak T, Zimmerman PM. Predicting reaction conditions from limited data through active transfer learning. Chem Sci 2022; 13:6655-6668. [PMID: 35756521 PMCID: PMC9172577 DOI: 10.1039/d1sc06932b] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Received: 12/10/2021] [Accepted: 05/10/2022] [Indexed: 12/30/2022] Open
Abstract
Transfer and active learning have the potential to accelerate the development of new chemical reactions, using prior data and new experiments to inform models that adapt to the target area of interest. This article shows how specifically tuned machine learning models, based on random forest classifiers, can expand the applicability of Pd-catalyzed cross-coupling reactions to types of nucleophiles unknown to the model. First, model transfer is shown to be effective when reaction mechanisms and substrates are closely related, even when models are trained on relatively small numbers of data points. Then, a model simplification scheme is tested and found to provide comparative predictivity on reactions of new nucleophiles that include unseen reagent combinations. Lastly, for a challenging target where model transfer only provides a modest benefit over random selection, an active transfer learning strategy is introduced to improve model predictions. Simple models, composed of a small number of decision trees with limited depths, are crucial for securing generalizability, interpretability, and performance of active transfer learning.
Affiliation(s)
- Eunjae Shim
- Department of Chemistry, University of Michigan, Ann Arbor, MI, USA
| | - Joshua A Kammeraad
- Department of Chemistry, University of Michigan, Ann Arbor, MI, USA
- Department of Statistics, University of Michigan, Ann Arbor, MI, USA
| | - Ziping Xu
- Department of Statistics, University of Michigan, Ann Arbor, MI, USA
| | - Ambuj Tewari
- Department of Statistics, University of Michigan, Ann Arbor, MI, USA
- Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, MI, USA
| | - Tim Cernak
- Department of Chemistry, University of Michigan, Ann Arbor, MI, USA
- Department of Medicinal Chemistry, University of Michigan, Ann Arbor, MI, USA
| | - Paul M Zimmerman
- Department of Chemistry, University of Michigan, Ann Arbor, MI, USA
| |
|
21
|
Probst D, Schwaller P, Reymond JL. Reaction classification and yield prediction using the differential reaction fingerprint DRFP. DIGITAL DISCOVERY 2022; 1:91-97. [PMID: 35515081 PMCID: PMC8996827 DOI: 10.1039/d1dd00006c] [Citation(s) in RCA: 47] [Impact Index Per Article: 15.7] [Received: 08/26/2021] [Accepted: 01/12/2022] [Indexed: 01/19/2023]
Abstract
Predicting the nature and outcome of reactions using computational methods is a crucial tool to accelerate chemical research. The recent application of deep learning-based learned fingerprints to reaction classification and reaction yield prediction has shown an impressive increase in performance compared to previous methods such as DFT- and structure-based fingerprints. However, learned fingerprints require large training data sets, are inherently biased, and are based on complex deep learning architectures. Here we present the differential reaction fingerprint DRFP. The DRFP algorithm takes a reaction SMILES as an input and creates a binary fingerprint based on the symmetric difference of two sets containing the circular molecular n-grams generated from the molecules listed left and right from the reaction arrow, respectively, without the need for distinguishing between reactants and reagents. We show that DRFP performs better than DFT-based fingerprints in reaction yield prediction and other structure-based fingerprints in reaction classification, reaching the performance of state-of-the-art learned fingerprints in both tasks while being data-independent. Differential Reaction Fingerprint DRFP is a chemical reaction fingerprint enabling simple machine learning models running on standard hardware to reach DFT- and deep learning-based accuracies in reaction yield prediction and reaction classification.
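The set logic described in this abstract (shingle both sides of the reaction arrow, take the symmetric difference, fold the result into a binary vector) can be illustrated with a toy sketch. It is not the DRFP implementation: as a loudly stated assumption, plain character substrings of the SMILES text stand in for the circular molecular n-grams, which in practice would require a cheminformatics toolkit, and the names `shingles`, `side_shingles`, and `toy_drfp` are hypothetical.

```python
import hashlib


def shingles(smiles: str, max_len: int = 3) -> set[str]:
    """Toy stand-in for circular n-grams: all substrings of up to max_len characters."""
    return {
        smiles[i:i + n]
        for n in range(1, max_len + 1)
        for i in range(len(smiles) - n + 1)
    }


def side_shingles(side: str) -> set[str]:
    """Union of the shingles of every dot-separated molecule on one side."""
    found: set[str] = set()
    for molecule in side.split("."):
        if molecule:
            found |= shingles(molecule)
    return found


def toy_drfp(reaction_smiles: str, n_bits: int = 256) -> list[int]:
    """Fold the symmetric difference of left and right shingles into n_bits bits."""
    parts = reaction_smiles.split(">")
    if len(parts) != 3:
        raise ValueError("expected reaction SMILES of the form 'reactants>agents>products'")
    # Reactants and reagents are pooled, so no role assignment is needed.
    left = side_shingles(parts[0]) | side_shingles(parts[1])
    right = side_shingles(parts[2])
    fingerprint = [0] * n_bits
    for token in left ^ right:  # shingles present on exactly one side
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        fingerprint[int.from_bytes(digest[:4], "big") % n_bits] = 1
    return fingerprint


# Ethanol to acetaldehyde: only the shingles that change set any bits.
fp = toy_drfp("CCO>>CC=O")
```

Because only shingles that appear on exactly one side survive the symmetric difference, unchanged parts of the molecules contribute nothing, and a reaction that maps a molecule onto itself yields an all-zero vector.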
Affiliation(s)
- Daniel Probst
- Department of Chemistry and Biochemistry, University of Bern, Freiestrasse 3, 3012 Bern, Switzerland
| | | | - Jean-Louis Reymond
- Department of Chemistry and Biochemistry, University of Bern, Freiestrasse 3, 3012 Bern, Switzerland
| |
|
22
|
Lu J, Zhang Y. Unified Deep Learning Model for Multitask Reaction Predictions with Explanation. J Chem Inf Model 2022; 62:1376-1387. [PMID: 35266390 PMCID: PMC8960360 DOI: 10.1021/acs.jcim.1c01467] [Citation(s) in RCA: 18] [Impact Index Per Article: 6.0] [Indexed: 11/29/2022]
Abstract
There is significant interest and importance to develop robust machine learning models to assist organic chemistry synthesis. Typically, task-specific machine learning models for distinct reaction prediction tasks have been developed. In this work, we develop a unified deep learning model, T5Chem, for a variety of chemical reaction prediction tasks by adapting the "Text-to-Text Transfer Transformer" (T5) framework in natural language processing (NLP). On the basis of self-supervised pretraining with PubChem molecules, the T5Chem model can achieve state-of-the-art performances for four distinct types of task-specific reaction prediction tasks using four different open-source data sets, including reaction type classification on USPTO_TPL, forward reaction prediction on USPTO_MIT, single-step retrosynthesis on USPTO_50k, and reaction yield prediction on high-throughput C-N coupling reactions. Meanwhile, we introduced a new unified multitask reaction prediction data set USPTO_500_MT, which can be used to train and test five different types of reaction tasks, including the above four as well as a new reagent suggestion task. Our results showed that models trained with multiple tasks are more robust and can benefit from mutual learning on related tasks. Furthermore, we demonstrated the use of SHAP (SHapley Additive exPlanations) to explain T5Chem predictions at the functional group level, which provides a way to demystify sequence-based deep learning models in chemistry. T5Chem is accessible through https://yzhang.hpc.nyu.edu/T5Chem.
Affiliation(s)
- Jieyu Lu
- Department of Chemistry, New York University, New York, New York 10003, United States
| | - Yingkai Zhang
- Department of Chemistry, New York University, New York, New York 10003, United States
- NYU-ECNU Center for Computational Chemistry at NYU Shanghai, Shanghai 200062, China
| |
|
23
|
Soliman SS, El-Haddad AE, Sedik GA, Elghobashy MR, Zaazaa HE, Saad AS. Experimentally designed chemometric models for the assay of toxic adulterants in turmeric powder. RSC Adv 2022; 12:9087-9094. [PMID: 35424884 PMCID: PMC8985183 DOI: 10.1039/d2ra00697a] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/01/2022] [Accepted: 03/07/2022] [Indexed: 11/23/2022] Open
Abstract
Turmeric is an indispensable culinary spice in different cultures and a principal component in traditional remedies. Toxic metanil yellow (MY), acid orange 7 (AO) and lead chromate (LCM) are deliberately added to adulterate turmeric powder. This work compares the ability of multivariate chemometric models with those of artificial intelligent networks to enhance the selectivity of spectral data for the rapid assay of these three adulterants in turmeric powder. Using a custom experimental design, we provide a data-driven optimization for the sensitive parameters of the partial least squares model (PLS), artificial neural network (ANN) and genetic algorithm (GA). The optimized models are validated using sets of genuine turmeric samples from five different geographical regions spiked with standard adulterant concentrations. The optimized GA-PLS and GA-ANN models reduce the root mean square error of prediction by 18.4%, 31.1% and 55.3% and 25.0%, 69.9% and 88.4% for MY, AO and LCM, respectively.
Affiliation(s)
- Shymaa S Soliman
- Analytical Chemistry Department, Faculty of Pharmacy, October 6 University, PO Box 12858, 6 October City, Giza, Egypt
| | - Alaadin E El-Haddad
- Pharmacognosy Department, Faculty of Pharmacy, October 6 University, PO Box 12858, 6 October City, Giza, Egypt
| | - Ghada A Sedik
- Analytical Chemistry Department, Faculty of Pharmacy, Cairo University, El-Kasr El-Aini Street, Cairo 11562, Egypt
| | - Mohamed R Elghobashy
- Analytical Chemistry Department, Faculty of Pharmacy, Cairo University, El-Kasr El-Aini Street, Cairo 11562, Egypt
- Analytical Chemistry Department, Faculty of Pharmacy, October 6 University, PO Box 12858, 6 October City, Giza, Egypt
| | - Hala E Zaazaa
- Analytical Chemistry Department, Faculty of Pharmacy, Cairo University, El-Kasr El-Aini Street, Cairo 11562, Egypt
| | - Ahmed S Saad
- Analytical Chemistry Department, Faculty of Pharmacy, Cairo University, El-Kasr El-Aini Street, Cairo 11562, Egypt
- Medicinal Chemistry Department, PharmD Program, Egypt-Japan University of Science and Technology (E-JUST), New Borg El-Arab City, Alexandria 21934, Egypt
| |
|
24
|
Sharma KG, Kaisare NS, Goyal H. A recurrent neural network model for biomass gasification chemistry. REACT CHEM ENG 2022. [DOI: 10.1039/d1re00409c] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 01/18/2023]
Abstract
A recurrent neural network model is built to predict the temporal evolution of chemical species during biomass gasification.
Affiliation(s)
- Krishna Gopal Sharma
- Department of Computer Science and Engineering, Indian Institute of Technology Madras, Chennai, Tamil Nadu 600036, India
| | - Niket S. Kaisare
- Department of Chemical Engineering, Indian Institute of Technology Madras, Chennai, Tamil Nadu 600036, India
| | - Himanshu Goyal
- Department of Chemical Engineering, Indian Institute of Technology Madras, Chennai, Tamil Nadu 600036, India
| |
|
25
|
Pomberger A, Pedrina McCarthy AA, Khan A, Sung S, Taylor CJ, Gaunt MJ, Colwell L, Walz D, Lapkin AA. The effect of chemical representation on active machine learning towards closed-loop optimization. REACT CHEM ENG 2022. [DOI: 10.1039/d2re00008c] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 01/22/2023]
Abstract
Multivariate chemical reaction optimization involving catalytic systems is a non-trivial task due to the high number of tuneable parameters and discrete choices.
Affiliation(s)
- A. Pomberger
- Department of Chemical Engineering and Biotechnology, University of Cambridge, Cambridge CB3 0AS, UK
| | | | - A. Khan
- Department of Chemical Engineering and Biotechnology, University of Cambridge, Cambridge CB3 0AS, UK
| | - S. Sung
- Cambridge Centre for Advanced Research and Education in Singapore Ltd., CREATE Tower 05-05, 138602 Singapore
| | - C. J. Taylor
- Department of Chemical Engineering and Biotechnology, University of Cambridge, Cambridge CB3 0AS, UK
- Astex Pharmaceuticals, 436 Cambridge Science Park, Milton, Cambridge CB4 0QA, UK
| | - M. J. Gaunt
- Yusuf Hamied Department of Chemistry, University of Cambridge, Cambridge CB2 1EW, UK
| | - L. Colwell
- Yusuf Hamied Department of Chemistry, University of Cambridge, Cambridge CB2 1EW, UK
| | - D. Walz
- BASF SE Data Science for Materials, Carl-Bosch-Strasse 38, 67056 Ludwigshafen am Rhein, Germany
| | - A. A. Lapkin
- Department of Chemical Engineering and Biotechnology, University of Cambridge, Cambridge CB3 0AS, UK
- Cambridge Centre for Advanced Research and Education in Singapore Ltd., CREATE Tower 05-05, 138602 Singapore
| |
Collapse
|
26
|
Nandiwale KY, Hart T, Zahrt AF, Nambiar AMK, Mahesh PT, Mo Y, Nieves-Remacha MJ, Johnson MD, García-Losada P, Mateos C, Rincón JA, Jensen KF. Continuous stirred-tank reactor cascade platform for self-optimization of reactions involving solids. REACT CHEM ENG 2022. [DOI: 10.1039/d2re00054g] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 12/16/2022]
Abstract
Research-scale fully automated flow platform for reaction self-optimization with solids handling facilitates identification of optimal conditions for continuous manufacturing of pharmaceuticals while reducing amounts of raw materials consumed.
Affiliation(s)
- Kakasaheb Y. Nandiwale
  - Department of Chemical Engineering, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, Massachusetts 02139, USA
- Travis Hart
  - Department of Chemical Engineering, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, Massachusetts 02139, USA
- Andrew F. Zahrt
  - Department of Chemical Engineering, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, Massachusetts 02139, USA
- Anirudh M. K. Nambiar
  - Department of Chemical Engineering, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, Massachusetts 02139, USA
- Prajwal T. Mahesh
  - Department of Chemical Engineering, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, Massachusetts 02139, USA
- Yiming Mo
  - Department of Chemical Engineering, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, Massachusetts 02139, USA
- Martin D. Johnson
  - Small Molecule Design and Development, Eli Lilly and Company, Indianapolis, Indiana 46285, USA
- Pablo García-Losada
  - Centro de Investigación Lilly S.A., Avda. de la Industria 30, Alcobendas-Madrid 28108, Spain
- Carlos Mateos
  - Centro de Investigación Lilly S.A., Avda. de la Industria 30, Alcobendas-Madrid 28108, Spain
- Juan A. Rincón
  - Centro de Investigación Lilly S.A., Avda. de la Industria 30, Alcobendas-Madrid 28108, Spain
- Klavs F. Jensen
  - Department of Chemical Engineering, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, Massachusetts 02139, USA
|
27
|
Chakkingal A, Janssens P, Poissonnier J, Barrios AJ, Virginie M, Khodakov AY, Thybaut JW. Machine learning based interpretation of microkinetic data: a Fischer–Tropsch synthesis case study. REACT CHEM ENG 2022. [DOI: 10.1039/d1re00351h] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 12/27/2022]
Abstract
A systematic approach for analysing kinetic data and identifying hidden trends using interpretation techniques in data science with the ANN.
Affiliation(s)
- Anoop Chakkingal
  - Laboratory for Chemical Technology (LCT), Department of Materials, Textiles and Chemical Engineering, Ghent University, Technologiepark 125, 9052 Ghent, Belgium
  - CNRS, Centrale Lille, Univ. Lille, ENSCL, Univ. Artois, UMR 8181 – UCCS – Unité de Catalyse et Chimie du Solide, F-59000 Lille, France
- Pieter Janssens
  - Laboratory for Chemical Technology (LCT), Department of Materials, Textiles and Chemical Engineering, Ghent University, Technologiepark 125, 9052 Ghent, Belgium
- Jeroen Poissonnier
  - Laboratory for Chemical Technology (LCT), Department of Materials, Textiles and Chemical Engineering, Ghent University, Technologiepark 125, 9052 Ghent, Belgium
- Alan J. Barrios
  - Laboratory for Chemical Technology (LCT), Department of Materials, Textiles and Chemical Engineering, Ghent University, Technologiepark 125, 9052 Ghent, Belgium
  - CNRS, Centrale Lille, Univ. Lille, ENSCL, Univ. Artois, UMR 8181 – UCCS – Unité de Catalyse et Chimie du Solide, F-59000 Lille, France
- Mirella Virginie
  - CNRS, Centrale Lille, Univ. Lille, ENSCL, Univ. Artois, UMR 8181 – UCCS – Unité de Catalyse et Chimie du Solide, F-59000 Lille, France
- Andrei Y. Khodakov
  - CNRS, Centrale Lille, Univ. Lille, ENSCL, Univ. Artois, UMR 8181 – UCCS – Unité de Catalyse et Chimie du Solide, F-59000 Lille, France
- Joris W. Thybaut
  - Laboratory for Chemical Technology (LCT), Department of Materials, Textiles and Chemical Engineering, Ghent University, Technologiepark 125, 9052 Ghent, Belgium
|
28
|
Busk J, Bjørn Jørgensen P, Bhowmik A, Schmidt MN, Winther O, Vegge T. Calibrated uncertainty for molecular property prediction using ensembles of message passing neural networks. MACHINE LEARNING: SCIENCE AND TECHNOLOGY 2021. [DOI: 10.1088/2632-2153/ac3eb3] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Indexed: 12/13/2022] Open
Abstract
Data-driven methods based on machine learning have the potential to accelerate computational analysis of atomic structures. In this context, reliable uncertainty estimates are important for assessing confidence in predictions and enabling decision making. However, machine learning models can produce badly calibrated uncertainty estimates and it is therefore crucial to detect and handle uncertainty carefully. In this work we extend a message passing neural network designed specifically for predicting properties of molecules and materials with a calibrated probabilistic predictive distribution. The method presented in this paper differs from previous work by considering both aleatoric and epistemic uncertainty in a unified framework, and by recalibrating the predictive distribution on unseen data. Through computer experiments, we show that our approach results in accurate models for predicting molecular formation energies with well calibrated uncertainty in and out of the training data distribution on two public molecular benchmark datasets, QM9 and PC9. The proposed method provides a general framework for training and evaluating neural network ensemble models that are able to produce accurate predictions of properties of molecules with well calibrated uncertainty estimates.
|
29
|
Miljković F, Rodríguez-Pérez R, Bajorath J. Impact of Artificial Intelligence on Compound Discovery, Design, and Synthesis. ACS OMEGA 2021; 6:33293-33299. [PMID: 34926881 PMCID: PMC8674916 DOI: 10.1021/acsomega.1c05512] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 10/04/2021] [Accepted: 11/18/2021] [Indexed: 05/17/2023]
Abstract
As in other areas, artificial intelligence (AI) is heavily promoted in different scientific fields, including chemistry. Although chemistry traditionally tends to be a conservative field and slower than others to adopt new concepts, AI is increasingly being investigated across chemical disciplines. In medicinal chemistry, supported by computer-aided drug design and cheminformatics, computational methods have long been employed to aid in the search for and optimization of active compounds. We are currently witnessing a multitude of AI-related publications in the medicinal-chemistry-relevant literature and anticipate that the numbers will further increase. Often, advances through AI promoted in such reports are difficult to reconcile or remain questionable, which hampers the acceptance of computational work in interdisciplinary environments. Herein we attempt to highlight selected investigations in which AI has shown promise to impact medicinal chemistry in areas such as compound design and synthesis.
Affiliation(s)
- Filip Miljković
  - Department of Life Science Informatics and Data Science, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Friedrich-Hirzebruch-Allee 6, D-53115 Bonn, Germany
  - Data Science and AI, Imaging and Data Analytics, Clinical Pharmacology & Safety Sciences, R&D, AstraZeneca, SE-431 83 Gothenburg, Sweden
- Raquel Rodríguez-Pérez
  - Department of Life Science Informatics and Data Science, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Friedrich-Hirzebruch-Allee 6, D-53115 Bonn, Germany
  - Novartis Institutes for Biomedical Research, Novartis Campus, CH-4002 Basel, Switzerland
- Jürgen Bajorath
  - Department of Life Science Informatics and Data Science, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Friedrich-Hirzebruch-Allee 6, D-53115 Bonn, Germany
  - Phone: 49-228-7369-100.
|
30
|
Chu HY, Wong ASL. Facilitating Machine Learning-Guided Protein Engineering with Smart Library Design and Massively Parallel Assays. ADVANCED GENETICS (HOBOKEN, N.J.) 2021; 2:2100038. [PMID: 36619853 PMCID: PMC9744531 DOI: 10.1002/ggn2.202100038] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 09/07/2021] [Revised: 11/08/2021] [Indexed: 01/11/2023]
Abstract
Protein design plays an important role in recent medical advances from antibody therapy to vaccine design. Typically, exhaustive mutational screens or directed evolution experiments are used to identify the best design or to improve on the wild-type variant. Even with high-throughput screening on pooled libraries and Next-Generation Sequencing to boost the scale of read-outs, surveying all the variants with combinatorial mutations for their empirical fitness scores is still orders of magnitude beyond the capacity of existing experimental settings. To tackle this challenge, in-silico approaches using machine learning to predict the fitness of novel variants based on a subset of empirical measurements are now employed. These machine learning models turn out to be useful in many cases, on the premise that the experimentally determined fitness scores and the amino-acid descriptors of the models are informative. The machine learning models can guide the search for the highest fitness variants, resolve complex epistatic relationships, and highlight bio-physical rules for protein folding. Using machine learning-guided approaches, researchers can build more focused libraries, thus relieving themselves from labor-intensive screens and fast-tracking the optimization process. Here, we describe the current advances in massive-scale variant screens, and how machine learning and mutagenesis strategies can be integrated to accelerate protein engineering. More specifically, we examine strategies to make screens more economical, informative, and effective in the discovery of useful variants.
Affiliation(s)
- Hoi Yee Chu
  - Laboratory of Combinatorial Genetics and Synthetic Biology, School of Biomedical Sciences, The University of Hong Kong, Hong Kong 852, China
- Alan S. L. Wong
  - Laboratory of Combinatorial Genetics and Synthetic Biology, School of Biomedical Sciences, The University of Hong Kong, Hong Kong 852, China
  - Electrical and Electronic Engineering, The University of Hong Kong, Pokfulam, Hong Kong 852, China
|
31
|
Weber JM, Guo Z, Zhang C, Schweidtmann AM, Lapkin AA. Chemical data intelligence for sustainable chemistry. Chem Soc Rev 2021; 50:12013-12036. [PMID: 34520507 DOI: 10.1039/d1cs00477h] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Indexed: 02/05/2023]
Abstract
This study highlights new opportunities for optimal reaction route selection from large chemical databases brought about by the rapid digitalisation of chemical data. The chemical industry requires a transformation towards more sustainable practices, eliminating its dependencies on fossil fuels and limiting its impact on the environment. However, identifying more sustainable process alternatives is, at present, a cumbersome, manual, iterative process, based on chemical intuition and modelling. We give a perspective on methods for automated discovery and assessment of competitive sustainable reaction routes based on renewable or waste feedstocks. Three key areas of transition are outlined and reviewed based on their state-of-the-art as well as bottlenecks: (i) data, (ii) evaluation metrics, and (iii) decision-making. We elucidate their synergies and interfaces since only together these areas can bring about the most benefit. The field of chemical data intelligence offers the opportunity to identify the inherently more sustainable reaction pathways and to identify opportunities for a circular chemical economy. Our review shows that at present the field of data brings about most bottlenecks, such as data completion and data linkage, but also offers the principal opportunity for advancement.
Affiliation(s)
- Jana M Weber
  - Department of Chemical Engineering and Biotechnology, University of Cambridge, West Cambridge Site, Philippa Fawcett Drive, Cambridge CB3 0AS, UK
  - Chemical Data Intelligence (CDI) Pte Ltd, Robinson Road, #02-00, 068898, Singapore
- Zhen Guo
  - Chemical Data Intelligence (CDI) Pte Ltd, Robinson Road, #02-00, 068898, Singapore
  - Cambridge Centre for Advanced Research and Education in Singapore, CARES Ltd. 1 CREATE Way, CREATE Tower #05-05, 138602, Singapore
- Chonghuan Zhang
  - Department of Chemical Engineering and Biotechnology, University of Cambridge, West Cambridge Site, Philippa Fawcett Drive, Cambridge CB3 0AS, UK
- Artur M Schweidtmann
  - Department of Chemical Engineering, Delft University of Technology, Van der Maasweg 9, Delft 2629 HZ, The Netherlands
- Alexei A Lapkin
  - Department of Chemical Engineering and Biotechnology, University of Cambridge, West Cambridge Site, Philippa Fawcett Drive, Cambridge CB3 0AS, UK
  - Chemical Data Intelligence (CDI) Pte Ltd, Robinson Road, #02-00, 068898, Singapore
  - Cambridge Centre for Advanced Research and Education in Singapore, CARES Ltd. 1 CREATE Way, CREATE Tower #05-05, 138602, Singapore
|
32
|
Soleimany AP, Amini A, Goldman S, Rus D, Bhatia SN, Coley CW. Evidential Deep Learning for Guided Molecular Property Prediction and Discovery. ACS CENTRAL SCIENCE 2021; 7:1356-1367. [PMID: 34471680 PMCID: PMC8393200 DOI: 10.1021/acscentsci.1c00546] [Citation(s) in RCA: 40] [Impact Index Per Article: 10.0] [Received: 05/05/2021] [Indexed: 05/12/2023]
Abstract
While neural networks achieve state-of-the-art performance for many molecular modeling and structure-property prediction tasks, these models can struggle with generalization to out-of-domain examples, exhibit poor sample efficiency, and produce uncalibrated predictions. In this paper, we leverage advances in evidential deep learning to demonstrate a new approach to uncertainty quantification for neural network-based molecular structure-property prediction at no additional computational cost. We develop both evidential 2D message passing neural networks and evidential 3D atomistic neural networks and apply these networks across a range of different tasks. We demonstrate that evidential uncertainties enable (1) calibrated predictions where uncertainty correlates with error, (2) sample-efficient training through uncertainty-guided active learning, and (3) improved experimental validation rates in a retrospective virtual screening campaign. Our results suggest that evidential deep learning can provide an efficient means of uncertainty quantification useful for molecular property prediction, discovery, and design tasks in the chemical and physical sciences.
Affiliation(s)
- Ava P. Soleimany
  - Harvard-MIT Division of Health Sciences and Technology, MIT, Cambridge, Massachusetts 02139, United States
  - Graduate Program in Biophysics, Harvard University, Boston, Massachusetts 02115, United States
  - Microsoft Research New England, Cambridge, Massachusetts 02142, United States
- Alexander Amini
  - Department of Electrical Engineering and Computer Science, MIT, Cambridge, Massachusetts 02139, United States
- Samuel Goldman
  - Computational and Systems Biology, MIT, Cambridge, Massachusetts 02139, United States
- Daniela Rus
  - Department of Electrical Engineering and Computer Science, MIT, Cambridge, Massachusetts 02139, United States
- Sangeeta N. Bhatia
  - Harvard-MIT Division of Health Sciences and Technology, MIT, Cambridge, Massachusetts 02139, United States
  - Department of Electrical Engineering and Computer Science, MIT, Cambridge, Massachusetts 02139, United States
  - Howard Hughes Medical Institute, Cambridge, Massachusetts 02139, United States
- Connor W. Coley
  - Department of Chemical Engineering, MIT, Cambridge, Massachusetts 02139, United States
|
33
|
Wan Z, Wang QD, Liu D, Liang J. Accelerating the optimization of enzyme-catalyzed synthesis conditions via machine learning and reactivity descriptors. Org Biomol Chem 2021; 19:6267-6273. [PMID: 34195743 DOI: 10.1039/d1ob01066b] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/21/2022]
Abstract
Enzyme-catalyzed synthesis reactions are of crucial importance for a wide range of applications. An accurate and rapid selection of optimal synthesis conditions is crucial and challenging for both human knowledge and computer predictions. In this work, a new scenario, which combines a data-driven machine learning (ML) model with reactivity descriptors, is developed to predict the optimal enzyme-catalyzed synthesis conditions and the reaction yield. Fourteen reactivity descriptors in total are constructed to describe 125 reactions (classified into five categories) included in different reaction mechanisms. Nineteen ML models are developed to train the dataset and the Quadratic support vector machine (SVM) model is found to exhibit the best performance. The Quadratic SVM model is then used to predict the optimal reaction conditions, which are subsequently used to obtain the highest yield among 109 200 reaction conditions with different molar ratios of substrates, solvents, water contents, enzyme concentrations and temperatures for each reaction. The proposed protocol should be generally applicable to a diverse range of chemical reactions and provides a black-box evaluation for optimizing the reaction conditions of organic synthesis reactions.
Affiliation(s)
- Zhongyu Wan
  - Jiangsu Key Laboratory of Coal-based Greenhouse Gas Control and Utilization, Low Carbon Energy Institute and School of Chemical Engineering, China University of Mining and Technology, Xuzhou, 221008, People's Republic of China
  - School of Science, City University of Hong Kong, Hong Kong SAR 999077, People's Republic of China
- Quan-De Wang
  - Jiangsu Key Laboratory of Coal-based Greenhouse Gas Control and Utilization, Low Carbon Energy Institute and School of Chemical Engineering, China University of Mining and Technology, Xuzhou, 221008, People's Republic of China
- Dongchang Liu
  - School of Science, Xi'an Polytechnic University, Xi'an 710048, People's Republic of China
  - Department of Physics, Sungkyunkwan University, Suwon 16419, Korea
- Jinhu Liang
  - School of Environment and Safety Engineering, North University of China, Taiyuan 030051, People's Republic of China
|
34
|
Koutsoukos S, Philippi F, Malaret F, Welton T. A review on machine learning algorithms for the ionic liquid chemical space. Chem Sci 2021; 12:6820-6843. [PMID: 34123314 PMCID: PMC8153233 DOI: 10.1039/d1sc01000j] [Citation(s) in RCA: 49] [Impact Index Per Article: 12.3] [Received: 02/19/2021] [Accepted: 04/28/2021] [Indexed: 01/05/2023] Open
Abstract
There are thousands of papers published every year investigating the properties and possible applications of ionic liquids. Industrial use of these exceptional fluids requires adequate understanding of their physical properties, in order to create the ionic liquid that will optimally suit the application. Computational property prediction arose from the urgent need to minimise the time and cost that would be required to experimentally test different combinations of ions. This review discusses the use of machine learning algorithms as property prediction tools for ionic liquids (either as standalone methods or in conjunction with molecular dynamics simulations), presents common problems of training datasets and proposes ways that could lead to more accurate and efficient models.
Affiliation(s)
- Spyridon Koutsoukos
  - Department of Chemistry, Molecular Sciences Research Hub, Imperial College London, White City Campus, London W12 0BZ, UK
- Frederik Philippi
  - Department of Chemistry, Molecular Sciences Research Hub, Imperial College London, White City Campus, London W12 0BZ, UK
- Francisco Malaret
  - Department of Chemical Engineering, Imperial College London, South Kensington Campus, London SW7 2AZ, UK
- Tom Welton
  - Department of Chemistry, Molecular Sciences Research Hub, Imperial College London, White City Campus, London W12 0BZ, UK
|
35
|
Schwaller P, Vaucher AC, Laino T, Reymond JL. Prediction of chemical reaction yields using deep learning. MACHINE LEARNING-SCIENCE AND TECHNOLOGY 2021. [DOI: 10.1088/2632-2153/abc81d] [Citation(s) in RCA: 41] [Impact Index Per Article: 10.3] [Indexed: 11/11/2022]
|
36
|
Wang D, Xing M, Wei Y, Wang L, Wang R, Shen Q. Modeling of Nucleation and Growth in the Synthesis of PbS Colloidal Quantum Dots Under Variable Temperatures. ACS OMEGA 2021; 6:3701-3710. [PMID: 33585750 PMCID: PMC7876681 DOI: 10.1021/acsomega.0c05223] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 10/26/2020] [Accepted: 12/23/2020] [Indexed: 05/03/2023]
Abstract
Lead sulfide colloidal quantum dots (PbS CQDs) are IV-VI semiconductor nanocrystals that have attracted enormous interest in recent years because of their unique physicochemical properties. Controlling the size, size distribution, and yield of PbS CQDs is a key priority for improving their properties in photovoltaic and energy storage applications. Despite many systematic studies of PbS CQD synthesis from various perspectives, details of how the formation mechanism affects the size, concentration, and size distribution of PbS CQDs under complicated reaction conditions remain poorly understood. In this work, an improved kinetic rate equation (IKRE) model is employed to describe PbS CQD formation under variable solution temperatures. After establishing the necessary discretized equations and reviewing the link between model parameters and experimental information, a parametric study is performed to explore the model's features. In addition, a set of experimental data is compared with IKRE model fits, which are used to obtain the corresponding thermodynamic and kinetic parameters that can further affect CQD growth over longer timescales. This method establishes the relationship between the nucleation and Ostwald ripening stages, which could enable future large-scale manufacturing of CQDs.
Affiliation(s)
- Dandan Wang
  - Beijing Engineering Research Centre of Sustainable Energy and Buildings, School of Environment and Energy Engineering, Beijing University of Civil Engineering and Architecture, Beijing 100044, China
- Meibo Xing
  - Beijing Engineering Research Centre of Sustainable Energy and Buildings, School of Environment and Energy Engineering, Beijing University of Civil Engineering and Architecture, Beijing 100044, China
- Yuyao Wei
  - Beijing Engineering Research Centre of Sustainable Energy and Buildings, School of Environment and Energy Engineering, Beijing University of Civil Engineering and Architecture, Beijing 100044, China
- Longxiang Wang
  - Beijing Engineering Research Centre of Sustainable Energy and Buildings, School of Environment and Energy Engineering, Beijing University of Civil Engineering and Architecture, Beijing 100044, China
- Ruixiang Wang
  - Beijing Engineering Research Centre of Sustainable Energy and Buildings, School of Environment and Energy Engineering, Beijing University of Civil Engineering and Architecture, Beijing 100044, China
  - Phone: +86-10-68322133. Fax: +86-10-68322133
- Qing Shen
  - Faculty of Informatics and Engineering, The University of Electro-Communications, Chofu, Tokyo 182-8585, Japan
|
37
|
Eyke NS, Koscher BA, Jensen KF. Toward Machine Learning-Enhanced High-Throughput Experimentation. TRENDS IN CHEMISTRY 2021. [DOI: 10.1016/j.trechm.2020.12.001] [Citation(s) in RCA: 17] [Impact Index Per Article: 4.3] [Indexed: 12/12/2022]
|
38
|
Zahrt AF, Rose BT, Darrow WT, Henle JJ, Denmark SE. Computational methods for training set selection and error assessment applied to catalyst design: guidelines for deciding which reactions to run first and which to run next. REACT CHEM ENG 2021. [DOI: 10.1039/d1re00013f] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Indexed: 01/14/2023]
Abstract
Different subset selection methods are examined to guide catalyst selection in optimization campaigns. Error assessment methods are used to quantitatively inform selection of new catalyst candidates from in silico libraries of catalyst structures.
Affiliation(s)
- Andrew F. Zahrt
  - 245 Roger Adams Laboratory, Department of Chemistry, University of Illinois, Urbana, USA
- Brennan T. Rose
  - 245 Roger Adams Laboratory, Department of Chemistry, University of Illinois, Urbana, USA
- William T. Darrow
  - 245 Roger Adams Laboratory, Department of Chemistry, University of Illinois, Urbana, USA
- Jeremy J. Henle
  - 245 Roger Adams Laboratory, Department of Chemistry, University of Illinois, Urbana, USA
- Scott E. Denmark
  - 245 Roger Adams Laboratory, Department of Chemistry, University of Illinois, Urbana, USA
|
39
|
Mervin LH, Johansson S, Semenova E, Giblin KA, Engkvist O. Uncertainty quantification in drug design. Drug Discov Today 2020; 26:474-489. [PMID: 33253918 DOI: 10.1016/j.drudis.2020.11.027] [Citation(s) in RCA: 29] [Impact Index Per Article: 5.8] [Received: 05/21/2020] [Revised: 07/13/2020] [Accepted: 11/23/2020] [Indexed: 01/03/2023]
Abstract
Machine learning and artificial intelligence are increasingly being applied to the drug-design process as a result of the development of novel algorithms, growing access, the falling cost of computation and the development of novel technologies for generating chemically and biologically relevant data. There has been recent progress in fields such as molecular de novo generation, synthetic route prediction and, to some extent, property predictions. Despite this, most research in these fields has focused on improving the accuracy of the technologies, rather than on quantifying the uncertainty in the predictions. Uncertainty quantification will become a key component in autonomous decision making and will be crucial for integrating machine learning and chemistry automation to create an autonomous design-make-test-analyse cycle. This review covers the empirical, frequentist and Bayesian approaches to uncertainty quantification, and outlines how they can be used for drug design. We also outline the impact of uncertainty quantification on decision making.
Affiliation(s)
- Lewis H Mervin
  - Molecular AI, Discovery Sciences, R&D, AstraZeneca, Cambridge, UK
- Simon Johansson
  - Department of Computer Science and Engineering, Chalmers University of Technology, Gothenburg, Sweden
  - Molecular AI, Discovery Sciences, R&D, AstraZeneca, Gothenburg, Sweden
- Elizaveta Semenova
  - Data Sciences and Quantitative Biology, Discovery Sciences, R&D, AstraZeneca, Cambridge, UK
- Kathryn A Giblin
  - Medicinal Chemistry, Research and Early Development, Oncology R&D, AstraZeneca, Cambridge, UK
- Ola Engkvist
  - Molecular AI, Discovery Sciences, R&D, AstraZeneca, Gothenburg, Sweden
|