1
Wossnig L, Furtmann N, Buchanan A, Kumar S, Greiff V. Best practices for machine learning in antibody discovery and development. Drug Discov Today 2024; 29:104025. [PMID: 38762089 DOI: 10.1016/j.drudis.2024.104025] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/14/2023] [Revised: 04/25/2024] [Accepted: 05/13/2024] [Indexed: 05/20/2024]
Abstract
In the past 40 years, therapeutic antibody discovery and development have advanced considerably, with machine learning (ML) offering a promising way to speed up the process by reducing costs and the number of experiments required. Recent progress in ML-guided antibody design and development (D&D) has been hindered by the diversity of data sets and evaluation methods, which makes it difficult to conduct comparisons and assess utility. Establishing standards and guidelines will be crucial for the wider adoption of ML and the advancement of the field. This perspective critically reviews current practices, highlights common pitfalls and proposes method development and evaluation guidelines for various ML-based techniques in therapeutic antibody D&D. Addressing challenges across the ML process, best practices are recommended for each stage to enhance reproducibility and progress.
Affiliation(s)
- Leonard Wossnig
- LabGenius Ltd, The Biscuit Factory, 100 Drummond Road, London SE16 4DG, UK; Department of Computer Science, University College London, 66-72 Gower St, London WC1E 6EA, UK.
- Norbert Furtmann
- R&D Large Molecules Research Platform, Sanofi Deutschland GmbH, Industriepark Höchst, Frankfurt Am Main, Germany
- Andrew Buchanan
- Biologics Engineering, R&D, AstraZeneca, Cambridge CB2 0AA, UK
- Sandeep Kumar
- Computational Protein Design and Modeling Group, Computational Science, Moderna Therapeutics, 200 Technology Square, Cambridge, MA 02139, USA
- Victor Greiff
- Department of Immunology and Oslo University Hospital, University of Oslo, Oslo, Norway
2
Roth JP, Bajorath J. Relationship between prediction accuracy and uncertainty in compound potency prediction using deep neural networks and control models. Sci Rep 2024; 14:6536. [PMID: 38503823 PMCID: PMC10950896 DOI: 10.1038/s41598-024-57135-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/19/2023] [Accepted: 03/14/2024] [Indexed: 03/21/2024]
Abstract
The assessment of prediction variance or uncertainty contributes to the evaluation of machine learning models. In molecular machine learning, uncertainty quantification is an evolving area of research where currently no standard approaches or general guidelines are available. We have carried out a detailed analysis of deep neural network variants and simple control models for compound potency prediction to study relationships between prediction accuracy and uncertainty. For comparably accurate predictions obtained with models of different complexity, highly variable prediction uncertainties were detected using different metrics. Furthermore, a strong dependence of prediction characteristics and uncertainties on potency levels of test compounds was observed, often leading to over- or under-confident model decisions with respect to the expected variance of predictions. Moreover, neural network models responded very differently to training set modifications. Taken together, our findings indicate that there is little, if any, correlation between compound potency prediction accuracy and uncertainty, especially for deep neural network models, when predictions are assessed on the basis of currently used metrics for uncertainty quantification.
Affiliation(s)
- Jannik P Roth
- Department of Life Science Informatics and Data Science, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Friedrich-Hirzebruch-Allee 5/6, 53115, Bonn, Germany
- Jürgen Bajorath
- Department of Life Science Informatics and Data Science, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Friedrich-Hirzebruch-Allee 5/6, 53115, Bonn, Germany.
3
Azevedo PHRDA, Peçanha BRDB, Flores-Junior LAP, Alves TF, Dias LRS, Muri EMF, Lima CHDS. In silico drug repurposing by combining machine learning classification model and molecular dynamics to identify a potential OGT inhibitor. J Biomol Struct Dyn 2024; 42:1417-1428. [PMID: 37054524 DOI: 10.1080/07391102.2023.2199868] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/20/2022] [Accepted: 04/01/2023] [Indexed: 04/15/2023]
Abstract
O-linked N-acetylglucosamine (O-GlcNAc) is a unique intracellular post-translational glycosylation at the hydroxyl group of serine or threonine residues in nuclear, cytoplasmic and mitochondrial proteins. The enzyme O-GlcNAc transferase (OGT) is responsible for adding GlcNAc, and anomalies in this process can lead to the development of diseases associated with metabolic imbalance, such as diabetes and cancer. Repurposing approved drugs can be an attractive way to discover new targets while reducing the time and cost of drug design. This work focuses on drug repurposing to OGT targets by virtual screening of FDA-approved drugs through consensus machine learning (ML) models built from an imbalanced dataset. We developed a classification model using docking scores and ligand descriptors. Resampling the dataset with the SMOTE approach gave excellent statistical values for five of the seven ML algorithms used to create models from the training set, with sensitivity, specificity and accuracy over 90% and a Matthews correlation coefficient greater than 0.8. The pose analysis obtained by molecular docking showed only H-bond interaction with the OGT C-Cat domain. The molecular dynamics simulation showed that the lack of H-bond interactions with the C- and N-catalytic domains allowed the drug to exit the binding site. Our results showed that the non-steroidal anti-inflammatory celecoxib could be a potential OGT inhibitor.
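The four statistics quoted in this abstract all derive from the cells of a binary confusion matrix. The following is a minimal sketch of those formulas; the function name `binary_metrics` and the counts in the example are hypothetical and chosen only for illustration, not taken from the paper.

```python
import math


def binary_metrics(tp, tn, fp, fn):
    """Sensitivity, specificity, accuracy and Matthews correlation
    coefficient (MCC) from binary confusion-matrix counts.

    Assumes each class has at least one true member (tp + fn > 0, tn + fp > 0).
    """
    sensitivity = tp / (tp + fn)            # true-positive rate
    specificity = tn / (tn + fp)            # true-negative rate
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention MCC is set to 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# Hypothetical counts, for illustration only.
example = binary_metrics(tp=45, tn=46, fp=4, fn=5)
```

Unlike accuracy, MCC uses all four cells, which is why it is often reported alongside accuracy when classes are imbalanced.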
Affiliation(s)
- Tatiana Fialho Alves
- Laboratório de Química Medicinal, Faculdade de Farmácia, Universidade Federal Fluminense, Niterói, RJ, Brazil
- Luiza Rosaria Sousa Dias
- Laboratório de Química Medicinal, Faculdade de Farmácia, Universidade Federal Fluminense, Niterói, RJ, Brazil
- Estela Maris Freitas Muri
- Laboratório de Química Medicinal, Faculdade de Farmácia, Universidade Federal Fluminense, Niterói, RJ, Brazil
4
Back S, Aspuru-Guzik A, Ceriotti M, Gryn'ova G, Grzybowski B, Gu GH, Hein J, Hippalgaonkar K, Hormázabal R, Jung Y, Kim S, Kim WY, Moosavi SM, Noh J, Park C, Schrier J, Schwaller P, Tsuda K, Vegge T, von Lilienfeld OA, Walsh A. Accelerated chemical science with AI. Digital Discovery 2024; 3:23-33. [PMID: 38239898 PMCID: PMC10793638 DOI: 10.1039/d3dd00213f] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/25/2023] [Accepted: 12/06/2023] [Indexed: 01/22/2024]
Abstract
In light of the pressing need for practical materials and molecular solutions to renewable energy and health problems, to name just two examples, one wonders how to accelerate research and development in the chemical sciences, so as to address the time it takes to bring materials from initial discovery to commercialization. Artificial intelligence (AI)-based techniques, in particular, are having a transformative and accelerating impact on many, if not most, technological domains. To shed light on these questions, the authors and participants gathered in person for the ASLLA Symposium on the theme of 'Accelerated Chemical Science with AI' at Gangneung, Republic of Korea. We present the findings, ideas, comments, and often contentious opinions expressed during four panel discussions related to the respective general topics: 'Data', 'New applications', 'Machine learning algorithms', and 'Education'. All discussions were recorded, transcribed into text using OpenAI's Whisper, and summarized using LG AI Research's EXAONE LLM, followed by revision by all authors. For the broader benefit of current researchers, educators in higher education, and academic bodies such as associations, publishers, librarians, and companies, we provide chemistry-specific recommendations and summarize the resulting conclusions.
Affiliation(s)
- Seoin Back
- Department of Chemical and Biomolecular Engineering, Institute of Emergent Materials, Sogang University Seoul Republic of Korea
- Alán Aspuru-Guzik
- Departments of Chemistry, Computer Science, University of Toronto St. George Campus Toronto ON Canada
- Acceleration Consortium and Vector Institute for Artificial Intelligence Toronto ON M5S 1M1 Canada
- Michele Ceriotti
- Laboratory of Computational Science and Modeling (COSMO), École Polytechnique Fédérale de Lausanne Lausanne Switzerland
- Ganna Gryn'ova
- Heidelberg Institute for Theoretical Studies (HITS gGmbH) 69118 Heidelberg Germany
- Interdisciplinary Center for Scientific Computing, Heidelberg University 69120 Heidelberg Germany
- Bartosz Grzybowski
- Center for Algorithmic and Robotized Synthesis (CARS), Institute for Basic Science (IBS) Ulsan Republic of Korea
- Institute of Organic Chemistry, Polish Academy of Sciences Warsaw Poland
- Department of Chemistry, Ulsan National Institute of Science and Technology Ulsan Republic of Korea
- Geun Ho Gu
- Department of Energy Engineering, Korea Institute of Energy Technology (KENTECH) Naju 58330 Republic of Korea
- Jason Hein
- Department of Chemistry, University of British Columbia Vancouver BC V6T 1Z1 Canada
- Kedar Hippalgaonkar
- School of Materials Science and Engineering, Nanyang Technological University 50 Nanyang Avenue Singapore 639798 Singapore
- Institute of Materials Research and Engineering, Agency for Science Technology and Research 2 Fusionopolis Way, 08-03 Singapore 138634 Singapore
- Yousung Jung
- Department of Chemical and Biomolecular Engineering, KAIST Daejeon Republic of Korea
- School of Chemical and Biological Engineering, Interdisciplinary Program in Artificial Intelligence, Seoul National University 1 Gwanak-ro, Gwanak-gu Seoul 08826 Republic of Korea
- Seonah Kim
- Department of Chemistry, Colorado State University 1301 Center Avenue Fort Collins CO 80523 USA
- Woo Youn Kim
- Department of Chemistry, KAIST Daejeon Republic of Korea
- Seyed Mohamad Moosavi
- Chemical Engineering & Applied Chemistry, University of Toronto Toronto Ontario M5S 3E5 Canada
- Juhwan Noh
- Chemical Data-Driven Research Center, Korea Research Institute of Chemical Technology Daejeon 34114 Republic of Korea
- Joshua Schrier
- Department of Chemistry, Fordham University The Bronx NY 10458 USA
- Philippe Schwaller
- Laboratory of Artificial Chemical Intelligence (LIAC) & National Centre of Competence in Research (NCCR) Catalysis, École Polytechnique Fédérale de Lausanne Lausanne Switzerland
- Koji Tsuda
- Graduate School of Frontier Sciences, The University of Tokyo Kashiwa Chiba 277-8561 Japan
- Center for Basic Research on Materials, National Institute for Materials Science Tsukuba Ibaraki 305-0044 Japan
- RIKEN Center for Advanced Intelligence Project Tokyo 103-0027 Japan
- Tejs Vegge
- Department of Energy Conversion and Storage, Technical University of Denmark 301 Anker Engelunds vej, Kongens Lyngby Copenhagen 2800 Denmark
- O Anatole von Lilienfeld
- Acceleration Consortium and Vector Institute for Artificial Intelligence Toronto ON M5S 1M1 Canada
- Departments of Chemistry, Materials Science and Engineering, and Physics, University of Toronto, St George Campus Toronto ON Canada
- Machine Learning Group, Technische Universität Berlin and Berlin Institute for the Foundations of Learning and Data 10587 Berlin Germany
- Aron Walsh
- Department of Materials, Imperial College London London SW7 2AZ UK
- Department of Physics, Ewha Women's University Seoul Republic of Korea
5
Askenazi EM, Lazar EA, Grinberg I. Identification of High-Reliability Regions of Machine Learning Predictions Based on Materials Chemistry. J Chem Inf Model 2023; 63:7350-7362. [PMID: 37983482 DOI: 10.1021/acs.jcim.3c01684] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/22/2023]
Abstract
Progress in the application of machine learning (ML) methods to materials design is hindered by the lack of understanding of the reliability of ML predictions, in particular, for the application of ML to small data sets often found in materials science. Using ML prediction for transparent conductor oxide formation energy and band gap, dilute solute diffusion, and perovskite formation energy, band gap, and lattice parameter as examples, we demonstrate that (1) construction of a convex hull in feature space that encloses accurately predicted systems can be used to identify regions in feature space for which ML predictions are highly reliable; (2) analysis of the systems enclosed by the convex hull can be used to extract physical understanding; and (3) materials that satisfy all well-known chemical and physical principles that make a material physically reasonable are likely to be similar and show strong relationships between the properties of interest and the standard features used in ML. We also show that, as with composition-structure-property relationships, including materials from classes with different chemical properties in the ML training data set does not improve the accuracy of ML prediction, and that an ML model is likely to give reliable results for narrow classes of similar materials even when it shows large errors on a data set consisting of several classes of materials.
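Point (1) of this abstract can be illustrated with a small sketch. It assumes a two-dimensional feature space and a user-chosen error tolerance; the function names, the tolerance, and the use of Andrew's monotone-chain hull algorithm are illustrative assumptions, not the authors' implementation.

```python
def cross(o, a, b):
    """z-component of (a - o) x (b - o); positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other.
    return lower[:-1] + upper[:-1]


def inside_hull(hull, q):
    """True if q lies inside or on the boundary of a counter-clockwise hull."""
    n = len(hull)
    if n < 3:
        return False
    return all(cross(hull[i], hull[(i + 1) % n], q) >= 0 for i in range(n))


def reliable_hull(features, abs_errors, tol):
    """Hull enclosing the systems whose absolute prediction error is <= tol."""
    accurate = [f for f, e in zip(features, abs_errors) if e <= tol]
    return convex_hull(accurate)
```

A new candidate whose features fall inside the returned hull would then be treated as lying in a region where the model has been accurate so far. Real feature spaces have more than two dimensions, where a general-purpose hull routine would be needed.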
Affiliation(s)
- Evan M Askenazi
- Department of Chemistry, Bar-Ilan University, Ramat Gan 52900, Israel
- Emanuel A Lazar
- Department of Mathematics, Bar-Ilan University, Ramat Gan 52900, Israel
- Ilya Grinberg
- Department of Chemistry, Bar-Ilan University, Ramat Gan 52900, Israel
6
Marković G, Manojlović V, Ružić J, Sokić M. Predicting Low-Modulus Biocompatible Titanium Alloys Using Machine Learning. Materials (Basel, Switzerland) 2023; 16:6355. [PMID: 37834492 PMCID: PMC10573332 DOI: 10.3390/ma16196355] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/16/2023] [Revised: 09/10/2023] [Accepted: 09/18/2023] [Indexed: 10/15/2023]
Abstract
Titanium alloys have been used for decades as the main components in the production of various orthopedic and dental elements. However, modern applications require titanium alloys with a low Young's modulus and without cytotoxic alloying elements. Machine learning was used to analyze biocompatible titanium alloys and to predict the composition of Ti alloys with a low Young's modulus. A database was created using experimental data for alloy composition, Young's modulus, and mechanical and thermal properties of biocompatible titanium alloys. The Extra Tree Regression model was built to predict the Young's modulus of titanium alloys. By processing data of 246 alloys, the specific heat was discovered to be the most influential parameter that contributes to the lowering of the Young's modulus of titanium alloys. Further, the Monte Carlo method was used to predict the composition of future alloys with the desired properties. Simulation results of ten million samples, with predefined conditions for obtaining titanium alloys with a Young's modulus lower than 70 GPa, show that it is possible to obtain several multicomponent alloys, consisting of five main elements: titanium, zirconium, tin, manganese and niobium.
Affiliation(s)
- Gordana Marković
- Institute for Technology of Nuclear and Other Mineral Raw Materials, 11000 Belgrade, Serbia
- Vaso Manojlović
- Faculty of Technology and Metallurgy, University of Belgrade, 11000 Belgrade, Serbia
- Jovana Ružić
- Department of Materials, “Vinča” Institute of Nuclear Sciences—National Institute of the Republic of Serbia, University of Belgrade, 11000 Belgrade, Serbia
- Miroslav Sokić
- Institute for Technology of Nuclear and Other Mineral Raw Materials, 11000 Belgrade, Serbia
7
Kuntz D, Wilson AK. Machine learning, artificial intelligence, and chemistry: how smart algorithms are reshaping simulation and the laboratory. Pure Appl Chem 2022. [DOI: 10.1515/pac-2022-0202] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/15/2022]
Abstract
Machine learning and artificial intelligence are increasingly gaining in prominence through image analysis, language processing, and automation, to name a few applications. Machine learning is also making profound changes in chemistry. From revisiting decades-old analytical techniques for the purpose of creating better calibration curves, to assisting and accelerating traditional in silico simulations, to automating entire scientific workflows, to being used as an approach to deduce underlying physics of unexplained chemical phenomena, machine learning and artificial intelligence are reshaping chemistry, accelerating scientific discovery, and yielding new insights. This review provides an overview of machine learning and artificial intelligence from a chemist’s perspective and focuses on a number of examples of the use of these approaches in computational chemistry and in the laboratory.
Affiliation(s)
- David Kuntz
- Department of Chemistry, University of North Texas, Denton, TX 76201, USA
- Angela K. Wilson
- Department of Chemistry, Michigan State University, East Lansing, MI 48824, USA
8
Quach CD, Gilmer JB, Pert D, Mason-Hogans A, Iacovella CR, Cummings PT, McCabe C. High-throughput screening of tribological properties of monolayer films using molecular dynamics and machine learning. J Chem Phys 2022; 156:154902. [PMID: 35459321 DOI: 10.1063/5.0080838] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Indexed: 12/14/2022]
Abstract
Monolayer films have shown promise as a lubricating layer to reduce friction and wear of mechanical devices with separations on the nanoscale. These films have a vast design space with many tunable properties that can affect their tribological effectiveness. For example, terminal group chemistry, film composition, and backbone chemistry can all lead to films with significantly different tribological properties. This design space, however, is very difficult to explore without a combinatorial approach and an automatable, reproducible, and extensible workflow to screen for promising candidate films. Using the Molecular Simulation Design Framework (MoSDeF), a combinatorial screening study was performed to explore 9747 unique monolayer films (116 964 total simulations), and a machine learning (ML) model using a random forest regressor, an ensemble learning technique, was used to explore the role of terminal group chemistry and its effect on tribological effectiveness. The most promising films were found to contain small terminal groups such as cyano and ethylene. The ML model was subsequently applied to screen terminal group candidates identified from the ChEMBL small molecule library. Approximately 193 131 unique film candidates were screened, with a speed-up in analysis of approximately five orders of magnitude compared to simulation alone. The ML model could thus be used as a predictive tool to greatly speed up the initial screening of promising candidate films for future simulation studies, suggesting that computational screening in combination with ML can greatly increase the throughput in combinatorial approaches to generate in silico data and then train ML models in a controlled, self-consistent fashion.
Affiliation(s)
- Co D Quach
- Department of Chemical and Biomolecular Engineering, Vanderbilt University, Nashville, Tennessee 37235, USA
- Justin B Gilmer
- Interdisciplinary Materials Science, Vanderbilt University, Nashville, Tennessee 37235, USA
- Daniel Pert
- Department of Chemical and Biomolecular Engineering, Vanderbilt University, Nashville, Tennessee 37235, USA
- Akanke Mason-Hogans
- Department of Chemical and Biomolecular Engineering, Vanderbilt University, Nashville, Tennessee 37235, USA
- Christopher R Iacovella
- Department of Chemical and Biomolecular Engineering, Vanderbilt University, Nashville, Tennessee 37235, USA
- Peter T Cummings
- Department of Chemical and Biomolecular Engineering, Vanderbilt University, Nashville, Tennessee 37235, USA
- Clare McCabe
- Department of Chemical and Biomolecular Engineering, Vanderbilt University, Nashville, Tennessee 37235, USA
9
Kabir HMD, Khanam S, Khozeimeh F, Khosravi A, Mondal SK, Nahavandi S, Acharya UR. Aleatory-aware deep uncertainty quantification for transfer learning. Comput Biol Med 2022; 143:105246. [PMID: 35131610 DOI: 10.1016/j.compbiomed.2022.105246] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 11/29/2021] [Revised: 12/30/2021] [Accepted: 01/12/2022] [Indexed: 11/17/2022]
Abstract
Without uncertainty quantification (UQ), the user has no indication of the credibility of outcomes from deep neural networks (DNNs). However, current Deep UQ classification models capture mostly epistemic uncertainty. Therefore, this paper proposes an aleatory-aware Deep UQ method for classification problems. First, we train DNNs through transfer learning and collect numeric output posteriors for all training samples instead of logical outputs. Then we determine the probability of a certain class occurring from the K-nearest output posteriors of the same DNN in the training samples. We name this probability the opacity score, as the paper focuses on the detection of opacity on X-ray images. This score reflects the level of aleatory uncertainty in the sample. When the NN is certain about the classification of the sample, the probability of one class becomes much higher than the probabilities of the others. Probabilities for different classes become close to each other for a highly uncertain classification outcome. To capture the epistemic uncertainty, we train multiple DNNs with different random initializations, model selection, and augmentations to observe the effect of these training parameters on prediction and uncertainty. To reduce execution time, we first obtain features from the pre-trained NN. Then we apply the features to the ensemble of fully connected layers to get the distribution of the opacity score during the test. We also train several ResNet and DenseNet DNNs to observe the effect of model selection on prediction and uncertainty. The paper also demonstrates a patient referral framework based on the proposed uncertainty quantification. The scripts of the proposed method are available at the following link: https://github.com/dipuk0506/Aleatory-aware-UQ.
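The K-nearest step described in this abstract can be read in more than one way. The sketch below shows one possible reading, under stated assumptions: Euclidean distance between posterior vectors, and the class probability taken as the fraction of the K nearest training samples carrying that label. The function name and toy values are hypothetical, and the authors' linked scripts remain the reference for the actual method.

```python
import math


def knn_class_probability(train_posteriors, train_labels, query_posterior,
                          k, target_class):
    """Fraction of the k training samples whose DNN output posteriors are
    closest to the query posterior and whose true label is target_class.

    Assumption: 'closest' means smallest Euclidean distance between
    posterior vectors; the abstract does not name the distance.
    """
    order = sorted(
        range(len(train_posteriors)),
        key=lambda i: math.dist(train_posteriors[i], query_posterior),
    )
    nearest = order[:k]
    return sum(train_labels[i] == target_class for i in nearest) / len(nearest)


# Toy two-class posteriors, for illustration only.
train_p = [(0.9, 0.1), (0.8, 0.2), (0.55, 0.45), (0.5, 0.5), (0.2, 0.8), (0.1, 0.9)]
labels = [0, 0, 1, 0, 1, 1]
confident = knn_class_probability(train_p, labels, (0.85, 0.15), k=2, target_class=0)
ambiguous = knn_class_probability(train_p, labels, (0.52, 0.48), k=2, target_class=1)
```

In this toy case the confident query lands among training samples that all share one label, while the ambiguous query lands where labels are mixed, which is the behaviour the abstract attributes to a score reflecting aleatory uncertainty.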
Affiliation(s)
- H M Dipu Kabir
- Institute for Intelligent Systems Research and Innovation (IISRI), Deakin University, Australia.
- Fahime Khozeimeh
- Institute for Intelligent Systems Research and Innovation (IISRI), Deakin University, Australia
- Abbas Khosravi
- Institute for Intelligent Systems Research and Innovation (IISRI), Deakin University, Australia
- Subrota Kumar Mondal
- Faculty of Information Technology, Macau University of Science and Technology, Macao
- Saeid Nahavandi
- Institute for Intelligent Systems Research and Innovation (IISRI), Deakin University, Australia; Harvard Paulson School of Engineering and Applied Sciences, Harvard University, Allston, MA 02134, USA
- U Rajendra Acharya
- Department of ECE, Ngee Ann Polytechnic, 535 Clementi Road, 599489, Singapore; Department of Biomedical Engineering, School of Science and Technology, SUSS University, Singapore; Department of Biomedical Informatics and Medical Engineering, Asia University, Taichung, Taiwan
10
Pernot P. The long road to calibrated prediction uncertainty in computational chemistry. J Chem Phys 2022; 156:114109. [DOI: 10.1063/5.0084302] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 01/23/2023]
Abstract
Uncertainty quantification (UQ) in computational chemistry (CC) is still in its infancy. Very few CC methods are designed to provide a confidence level on their predictions, and most users still rely improperly on the mean absolute error as an accuracy metric. The development of reliable UQ methods is essential, notably for CC to be used confidently in industrial processes. A review of the CC-UQ literature shows that there is no common standard procedure to report or validate prediction uncertainty. I consider here analysis tools using concepts (calibration and sharpness) developed in meteorology and machine learning for the validation of probabilistic forecasters. These tools are adapted to CC-UQ and applied to datasets of prediction uncertainties provided by composite methods, Bayesian ensemble methods, machine learning, and a posteriori statistical methods.
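Calibration and sharpness can be made concrete with a few summary statistics. The sketch below shows generic checks that are common in the forecasting literature, not necessarily the specific tools developed in the paper; it assumes each prediction comes with a signed error and a predicted standard uncertainty, and all function names are illustrative.

```python
import statistics


def interval_coverage(errors, sigmas, z=1.96):
    """Fraction of errors falling inside the +/- z*sigma interval.

    For normally distributed errors with calibrated sigmas this should be
    close to 0.95 when z = 1.96.
    """
    hits = sum(abs(e) <= z * s for e, s in zip(errors, sigmas))
    return hits / len(errors)


def mean_squared_z(errors, sigmas):
    """Mean of (error / sigma)^2; close to 1 for unbiased, calibrated sigmas.

    Values well above 1 suggest over-confidence, well below 1 under-confidence.
    """
    return statistics.fmean([(e / s) ** 2 for e, s in zip(errors, sigmas)])


def sharpness(sigmas):
    """Mean predicted uncertainty; smaller is sharper, given calibration."""
    return statistics.fmean(sigmas)


# Toy values, for illustration only.
errors = [0.1, -0.2, 0.3, -2.5]
sigmas = [1.0, 1.0, 1.0, 1.0]
coverage = interval_coverage(errors, sigmas)
msz = mean_squared_z(errors, sigmas)
```

The two notions are complementary: a forecaster can be calibrated but useless if its uncertainties are very wide, so sharpness is only compared among methods that are already calibrated.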
Affiliation(s)
- Pascal Pernot
- Institut de Chimie Physique, UMR8000 CNRS, Université Paris-Saclay, 91405 Orsay, France
11
Abstract
In Portugal, the dropout rate of university courses is around 29%. Understanding the reasons behind such a high dropout rate can drastically improve the success of students and universities. This work applies existing data mining techniques to predict academic dropout, mainly from academic grades. Four different machine learning techniques are presented and analyzed. The dataset consists of 331 students who were previously enrolled in the Computer Engineering degree at the Universidade de Trás-os-Montes e Alto Douro (UTAD). The study aims to detect students who may prematurely drop out using existing methods. In the first phase, the most relevant data features were identified using the Permutation Feature Importance technique. In the second phase, several methods to predict the dropouts were applied. Then, each machine learning technique’s results were displayed and compared to select the best approach to predict academic dropout. The methods used achieved good results, reaching an F1-Score of 81% in the final test set, which suggests that students’ marks somehow incorporate their living conditions.
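Permutation Feature Importance scores a feature by how much a fitted model's performance drops when that feature's column is shuffled. The sketch below is a minimal stdlib version under stated assumptions: rows are tuples, `predict` is any fitted classifier wrapped as a function, and accuracy is used as the score for brevity (the study reports F1). The toy data and threshold rule are hypothetical.

```python
import random


def accuracy(predict, X, y):
    """Fraction of rows for which predict(row) equals the true label."""
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)


def permutation_importance(predict, X, y, n_repeats=20, seed=0):
    """Mean drop in accuracy when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    importances = []
    for col in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            shuffled = [row[col] for row in X]
            rng.shuffle(shuffled)
            # Rebuild each row with only this one column permuted.
            X_perm = [row[:col] + (v,) + row[col + 1:]
                      for row, v in zip(X, shuffled)]
            drops.append(baseline - accuracy(predict, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances


# Hypothetical toy data: (grade, unrelated flag) and a pass/fail label.
X = [(4, 1), (6, 0), (8, 1), (9, 0), (11, 1), (13, 0), (15, 1), (18, 0)]
y = [0, 0, 0, 0, 1, 1, 1, 1]


def predict(row):
    """Stand-in for a fitted model: uses only the grade."""
    return int(row[0] >= 10)


importances = permutation_importance(predict, X, y)
```

Because the stand-in model ignores the second column, shuffling it cannot change any prediction and its importance is exactly zero, while shuffling the grade column degrades accuracy.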
12
Zhong S, Zhang K, Bagheri M, Burken JG, Gu A, Li B, Ma X, Marrone BL, Ren ZJ, Schrier J, Shi W, Tan H, Wang T, Wang X, Wong BM, Xiao X, Yu X, Zhu JJ, Zhang H. Machine Learning: New Ideas and Tools in Environmental Science and Engineering. Environmental Science & Technology 2021; 55:12741-12754. [PMID: 34403250 DOI: 10.1021/acs.est.1c01339] [Citation(s) in RCA: 92] [Impact Index Per Article: 30.7] [Indexed: 05/25/2023]
Abstract
The rapid increase in both the quantity and complexity of data that are being generated daily in the field of environmental science and engineering (ESE) demands accompanying advances in data analytics. Advanced data analysis approaches, such as machine learning (ML), have become indispensable tools for revealing hidden patterns or deducing correlations for which conventional analytical methods face limitations or challenges. However, ML concepts and practices have not been widely utilized by researchers in ESE. This feature explores the potential of ML to revolutionize data analysis and modeling in the ESE field, and covers the essential knowledge needed for such applications. First, we use five examples to illustrate how ML addresses complex ESE problems. We then summarize four major types of applications of ML in ESE: making predictions; extracting feature importance; detecting anomalies; and discovering new materials or chemicals. Next, we introduce the essential knowledge required and current shortcomings in ML applications in ESE, with a focus on three important but often overlooked components when applying ML: correct model development, proper model interpretation, and sound applicability analysis. Finally, we discuss challenges and future opportunities in the application of ML tools in ESE to highlight the potential of ML in this field.
Affiliation(s)
- Shifa Zhong
- Department of Civil and Environmental Engineering, Case Western Reserve University, Cleveland, Ohio 44106, United States
| | - Kai Zhang
- Department of Civil and Environmental Engineering, Case Western Reserve University, Cleveland, Ohio 44106, United States
| | - Majid Bagheri
- Department of Civil, Architectural, and Environmental Engineering, Missouri University of Science and Technology, Rolla, Missouri 65409, United States
| | - Joel G Burken
- Department of Civil, Architectural, and Environmental Engineering, Missouri University of Science and Technology, Rolla, Missouri 65409, United States
| | - April Gu
- Department of Civil and Environmental Engineering, Cornell University, Ithaca, New York 14850, United States
| | - Baikun Li
- Department of Civil and Environmental Engineering, University of Connecticut, Storrs, Connecticut 06269, United States
| | - Xingmao Ma
- Department of Civil and Environmental Engineering, Texas A&M University, College Station, Texas, 77843, United States
| | - Babetta L Marrone
- Bioscience Division, Los Alamos National Laboratory, Los Alamos, New Mexico 87545, United States
| | - Zhiyong Jason Ren
- Department of Civil and Environmental Engineering, Princeton University, Princeton, New Jersey 08544, United States
| | - Joshua Schrier
- Department of Chemistry, Fordham University, The Bronx, New York 10458 United States
| | - Wei Shi
- School of Environment, Nanjing University, Nanjing, 210093 China
| | - Haoyue Tan
- School of Environment, Nanjing University, Nanjing, 210093 China
| | - Tianbao Wang
- Department of Civil and Environmental Engineering, University of Connecticut, Storrs, Connecticut 06269, United States
| | - Xu Wang
- School of Civil and Environmental Engineering, Harbin Institute of Technology (Shenzhen), Shenzhen 518055, China
- Research Center for Eco-Environmental Sciences, Chinese Academy of Sciences, Beijing 100085, China
| | - Bryan M Wong
- Department of Chemical & Environmental Engineering, Materials Science & Engineering Program, University of California-Riverside, Riverside, California 92521 United States
| | - Xusheng Xiao
- Department of Computer and Data Sciences, Case Western Reserve University, Cleveland, Ohio 44106, United States
| | - Xiong Yu
- Department of Civil and Environmental Engineering, Case Western Reserve University, Cleveland, Ohio 44106, United States
| | - Jun-Jie Zhu
- Department of Civil and Environmental Engineering, Princeton University, Princeton, New Jersey 08544, United States
| | - Huichun Zhang
- Department of Civil and Environmental Engineering, Case Western Reserve University, Cleveland, Ohio 44106, United States
|
13
|
Tynes M, Gao W, Burrill DJ, Batista ER, Perez D, Yang P, Lubbers N. Pairwise Difference Regression: A Machine Learning Meta-algorithm for Improved Prediction and Uncertainty Quantification in Chemical Search. J Chem Inf Model 2021; 61:3846-3857. [PMID: 34347460 DOI: 10.1021/acs.jcim.1c00670] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Indexed: 11/28/2022]
Abstract
Machine learning (ML) plays a growing role in the design and discovery of chemicals, aiming to reduce the need to perform expensive experiments and simulations. ML for such applications is promising but difficult, as models must generalize to vast chemical spaces from small training sets and must have reliable uncertainty quantification metrics to identify and prioritize unexplored regions. Ab initio computational chemistry and chemical intuition alike often take advantage of differences between chemical conditions, rather than their absolute structure or state, to generate more reliable results. We have developed an analogous comparison-based approach for ML regression, called pairwise difference regression (PADRE), which is applicable to arbitrary underlying learning models and operates on pairs of input data points. During training, the model learns to predict differences between all possible pairs of input points. During prediction, the test points are paired with all training set points, giving rise to a set of predictions that can be treated as a distribution of which the mean is treated as a final prediction and the dispersion is treated as an uncertainty measure. Pairwise difference regression was shown to reliably improve the performance of the random forest algorithm across five chemical ML tasks. Additionally, the pair-derived dispersion is both well correlated with model error and performs well in active learning. We also show that this method is competitive with state-of-the-art neural network techniques. Thus, pairwise difference regression is a promising tool for candidate selection algorithms used in chemical discovery.
Affiliation(s)
- Michael Tynes
- Theoretical Division, Los Alamos National Laboratory, Los Alamos, New Mexico 87545, United States; Center for Nonlinear Studies, Los Alamos National Laboratory, Los Alamos, New Mexico 87545, United States
| | - Wenhao Gao
- Computer, Computational, and Statistical Sciences Division, Los Alamos National Laboratory, Los Alamos, New Mexico 87545, United States; Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
| | - Daniel J Burrill
- Theoretical Division, Los Alamos National Laboratory, Los Alamos, New Mexico 87545, United States; Center for Nonlinear Studies, Los Alamos National Laboratory, Los Alamos, New Mexico 87545, United States
| | - Enrique R Batista
- Theoretical Division, Los Alamos National Laboratory, Los Alamos, New Mexico 87545, United States; Center for Nonlinear Studies, Los Alamos National Laboratory, Los Alamos, New Mexico 87545, United States
| | - Danny Perez
- Theoretical Division, Los Alamos National Laboratory, Los Alamos, New Mexico 87545, United States
| | - Ping Yang
- Theoretical Division, Los Alamos National Laboratory, Los Alamos, New Mexico 87545, United States
| | - Nicholas Lubbers
- Computer, Computational, and Statistical Sciences Division, Los Alamos National Laboratory, Los Alamos, New Mexico 87545, United States
|
14
|
De Breuck PP, Evans ML, Rignanese GM. Robust model benchmarking and bias-imbalance in data-driven materials science: a case study on MODNet. J Phys Condens Matter 2021; 33:404002. [PMID: 34237716 DOI: 10.1088/1361-648x/ac1280] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 02/03/2021] [Accepted: 07/08/2021] [Indexed: 06/13/2023]
Abstract
As the number of novel data-driven approaches to material science continues to grow, it is crucial to perform consistent quality, reliability and applicability assessments of model performance. In this paper, we benchmark the Materials Optimal Descriptor Network (MODNet) method and architecture against the recently released MatBench v0.1, a curated test suite of materials datasets. MODNet is shown to outperform current leaders on 6 of the 13 tasks, while closely matching the current leaders on a further 2 tasks; MODNet performs particularly well when the number of samples is below 10 000. Attention is paid to two topics of concern when benchmarking models. First, we encourage the reporting of a more diverse set of metrics as it leads to a more comprehensive and holistic comparison of model performance. Second, an equally important task is the uncertainty assessment of a model towards a target domain. Significant variations in validation errors can be observed, depending on the imbalance and bias in the training set (i.e., similarity between training and application space). By using an ensemble MODNet model, confidence intervals can be built and the uncertainty on individual predictions can be quantified. Imbalance and bias issues are often overlooked, and yet are important for successful real-world applications of machine learning in materials science and condensed matter.
Affiliation(s)
- Pierre-Paul De Breuck
- Université catholique de Louvain (UCLouvain), Institute of Condensed Matter and Nanosciences (IMCN), Chemin des Étoiles 8, B-1348 Louvain-la-Neuve, Belgium
| | - Matthew L Evans
- Université catholique de Louvain (UCLouvain), Institute of Condensed Matter and Nanosciences (IMCN), Chemin des Étoiles 8, B-1348 Louvain-la-Neuve, Belgium
| | - Gian-Marco Rignanese
- Université catholique de Louvain (UCLouvain), Institute of Condensed Matter and Nanosciences (IMCN), Chemin des Étoiles 8, B-1348 Louvain-la-Neuve, Belgium
|