1
|
Snyder SH, Vignaux PA, Ozalp MK, Gerlach J, Puhl AC, Lane TR, Corbett J, Urbina F, Ekins S. The Goldilocks paradigm: comparing classical machine learning, large language models, and few-shot learning for drug discovery applications. Commun Chem 2024; 7:134. [PMID: 38866916 PMCID: PMC11169557 DOI: 10.1038/s42004-024-01220-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/21/2023] [Accepted: 06/04/2024] [Indexed: 06/14/2024] Open
Abstract
Recent advances in machine learning (ML) have led to newer model architectures including transformers (large language models, LLMs) showing state of the art results in text generation and image analysis as well as few-shot learning (FSLC) models which offer predictive power with extremely small datasets. These new architectures may offer promise, yet the 'no-free lunch' theorem suggests that no single model algorithm can outperform at all possible tasks. Here, we explore the capabilities of classical (SVR), FSLC, and transformer models (MolBART) over a range of dataset tasks and show a 'goldilocks zone' for each model type, in which dataset size and feature distribution (i.e. dataset "diversity") determines the optimal algorithm strategy. When datasets are small ( < 50 molecules), FSLC tend to outperform both classical ML and transformers. When datasets are small-to-medium sized (50-240 molecules) and diverse, transformers outperform both classical models and few-shot learning. Finally, when datasets are of larger and of sufficient size, classical models then perform the best, suggesting that the optimal model to choose likely depends on the dataset available, its size and diversity. These findings may help to answer the perennial question of which ML algorithm is to be used when faced with a new dataset.
Collapse
Affiliation(s)
- Scott H Snyder
- Collaborations Pharmaceuticals, Inc., 840 Main Campus Drive, Lab 3510, Raleigh, NC, 27606, USA
| | - Patricia A Vignaux
- Collaborations Pharmaceuticals, Inc., 840 Main Campus Drive, Lab 3510, Raleigh, NC, 27606, USA
| | - Mustafa Kemal Ozalp
- Collaborations Pharmaceuticals, Inc., 840 Main Campus Drive, Lab 3510, Raleigh, NC, 27606, USA
| | - Jacob Gerlach
- Collaborations Pharmaceuticals, Inc., 840 Main Campus Drive, Lab 3510, Raleigh, NC, 27606, USA
| | - Ana C Puhl
- Collaborations Pharmaceuticals, Inc., 840 Main Campus Drive, Lab 3510, Raleigh, NC, 27606, USA
| | - Thomas R Lane
- Collaborations Pharmaceuticals, Inc., 840 Main Campus Drive, Lab 3510, Raleigh, NC, 27606, USA
| | - John Corbett
- Collaborations Pharmaceuticals, Inc., 840 Main Campus Drive, Lab 3510, Raleigh, NC, 27606, USA
| | - Fabio Urbina
- Collaborations Pharmaceuticals, Inc., 840 Main Campus Drive, Lab 3510, Raleigh, NC, 27606, USA.
| | - Sean Ekins
- Collaborations Pharmaceuticals, Inc., 840 Main Campus Drive, Lab 3510, Raleigh, NC, 27606, USA.
| |
Collapse
|
2
|
Li Y, Liu B, Deng J, Guo Y, Du H. Image-based molecular representation learning for drug development: a survey. Brief Bioinform 2024; 25:bbae294. [PMID: 38920347 PMCID: PMC11200195 DOI: 10.1093/bib/bbae294] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/12/2024] [Revised: 05/19/2024] [Accepted: 06/08/2024] [Indexed: 06/27/2024] Open
Abstract
Artificial intelligence (AI) powered drug development has received remarkable attention in recent years. It addresses the limitations of traditional experimental methods that are costly and time-consuming. While there have been many surveys attempting to summarize related research, they only focus on general AI or specific aspects such as natural language processing and graph neural network. Considering the rapid advance on computer vision, using the molecular image to enable AI appears to be a more intuitive and effective approach since each chemical substance has a unique visual representation. In this paper, we provide the first survey on image-based molecular representation for drug development. The survey proposes a taxonomy based on the learning paradigms in computer vision and reviews a large number of corresponding papers, highlighting the contributions of molecular visual representation in drug development. Besides, we discuss the applications, limitations and future directions in the field. We hope this survey could offer valuable insight into the use of image-based molecular representation learning in the context of drug development.
Collapse
Affiliation(s)
- Yue Li
- Division of Gastroenterology, Dongzhimen Hospital, Beijing University of Chinese Medicine, No. 5 Haiyun Warehouse, 100700, Beijing, China
| | - Bingyan Liu
- School of Computer Science, Beijing University of Posts and Telecommunications, No.10 Xituchen Street, 100876, Beijing, China
| | - Jinyan Deng
- Division of Gastroenterology, Dongzhimen Hospital, Beijing University of Chinese Medicine, No. 5 Haiyun Warehouse, 100700, Beijing, China
| | - Yi Guo
- Division of Gastroenterology, Dongzhimen Hospital, Beijing University of Chinese Medicine, No. 5 Haiyun Warehouse, 100700, Beijing, China
| | - Hongbo Du
- Division of Gastroenterology, Dongzhimen Hospital, Beijing University of Chinese Medicine, No. 5 Haiyun Warehouse, 100700, Beijing, China
- Institute of Liver Disease, Beijing University of Chinese Medicine, No. 5 Haiyun Warehouse, 100700, Beijing, China
| |
Collapse
|
3
|
Liu JY, Sayes CM. Modeling mixtures interactions in environmental toxicology. ENVIRONMENTAL TOXICOLOGY AND PHARMACOLOGY 2024; 106:104380. [PMID: 38309542 DOI: 10.1016/j.etap.2024.104380] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 12/01/2023] [Revised: 01/26/2024] [Accepted: 01/30/2024] [Indexed: 02/05/2024]
Abstract
In the environment, organisms are exposed to mixtures of different toxicants, which may interact in ways that are difficult to predict when only considering each component individually. Adapting and expanding tools from pharmacology, the toxicology field uses analytical, graphical, and computational methods to identify and quantify interactions in multi-component mixtures. The two general frameworks are concentration addition, where components have similar modes of action and their effects sum together, or independent action, where components have dissimilar modes of action and do not interact. Other interaction behaviors include synergism and antagonism, where the combined effects are more or less than the additive sum of individual effects. This review covers foundational theory, methods, an in-depth survey of original research from the past 20 years, current trends, and future directions. As humans and ecosystems are exposed to increasingly complex mixtures of environmental contaminants, analyzing mixtures interactions will continue to become a more critical aspect of toxicological research.
Collapse
Affiliation(s)
- James Y Liu
- Department of Environmental Science, Baylor University, Waco, TX, USA
| | - Christie M Sayes
- Department of Environmental Science, Baylor University, Waco, TX, USA.
| |
Collapse
|
4
|
McGibbon M, Shave S, Dong J, Gao Y, Houston DR, Xie J, Yang Y, Schwaller P, Blay V. From intuition to AI: evolution of small molecule representations in drug discovery. Brief Bioinform 2023; 25:bbad422. [PMID: 38033290 PMCID: PMC10689004 DOI: 10.1093/bib/bbad422] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/01/2023] [Revised: 10/13/2023] [Accepted: 11/01/2023] [Indexed: 12/02/2023] Open
Abstract
Within drug discovery, the goal of AI scientists and cheminformaticians is to help identify molecular starting points that will develop into safe and efficacious drugs while reducing costs, time and failure rates. To achieve this goal, it is crucial to represent molecules in a digital format that makes them machine-readable and facilitates the accurate prediction of properties that drive decision-making. Over the years, molecular representations have evolved from intuitive and human-readable formats to bespoke numerical descriptors and fingerprints, and now to learned representations that capture patterns and salient features across vast chemical spaces. Among these, sequence-based and graph-based representations of small molecules have become highly popular. However, each approach has strengths and weaknesses across dimensions such as generality, computational cost, inversibility for generative applications and interpretability, which can be critical in informing practitioners' decisions. As the drug discovery landscape evolves, opportunities for innovation continue to emerge. These include the creation of molecular representations for high-value, low-data regimes, the distillation of broader biological and chemical knowledge into novel learned representations and the modeling of up-and-coming therapeutic modalities.
Collapse
Affiliation(s)
- Miles McGibbon
- Institute of Quantitative Biology, Biochemistry and Biotechnology, University of Edinburgh, Edinburgh, Scotland EH9 3BF, United Kingdom
| | - Steven Shave
- Institute of Quantitative Biology, Biochemistry and Biotechnology, University of Edinburgh, Edinburgh, Scotland EH9 3BF, United Kingdom
| | - Jie Dong
- Xiangya School of Pharmaceutical Sciences, Central South University, Changsha, 410013, China
| | - Yumiao Gao
- Institute of Quantitative Biology, Biochemistry and Biotechnology, University of Edinburgh, Edinburgh, Scotland EH9 3BF, United Kingdom
| | - Douglas R Houston
- Institute of Quantitative Biology, Biochemistry and Biotechnology, University of Edinburgh, Edinburgh, Scotland EH9 3BF, United Kingdom
| | - Jiancong Xie
- Key Laboratory of Machine Intelligence and Advanced Computing, Sun Yat-Sen University, Guangzhou, 510000, China
| | - Yuedong Yang
- Key Laboratory of Machine Intelligence and Advanced Computing, Sun Yat-Sen University, Guangzhou, 510000, China
| | - Philippe Schwaller
- Laboratory of Artificial Chemical Intelligence (LIAC), Institut des Sciences et Ingénierie Chimiques, Ecole Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
| | - Vincent Blay
- Institute of Quantitative Biology, Biochemistry and Biotechnology, University of Edinburgh, Edinburgh, Scotland EH9 3BF, United Kingdom
| |
Collapse
|
5
|
Wang N, Zhang Y, Wang W, Ye Z, Chen H, Hu G, Ouyang D. How can machine learning and multiscale modeling benefit ocular drug development? Adv Drug Deliv Rev 2023; 196:114772. [PMID: 36906232 DOI: 10.1016/j.addr.2023.114772] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/16/2022] [Revised: 02/06/2023] [Accepted: 03/05/2023] [Indexed: 03/12/2023]
Abstract
The eyes possess sophisticated physiological structures, diverse disease targets, limited drug delivery space, distinctive barriers, and complicated biomechanical processes, requiring a more in-depth understanding of the interactions between drug delivery systems and biological systems for ocular formulation development. However, the tiny size of the eyes makes sampling difficult and invasive studies costly and ethically constrained. Developing ocular formulations following conventional trial-and-error formulation and manufacturing process screening procedures is inefficient. Along with the popularity of computational pharmaceutics, non-invasive in silico modeling & simulation offer new opportunities for the paradigm shift of ocular formulation development. The current work first systematically reviews the theoretical underpinnings, advanced applications, and unique advantages of data-driven machine learning and multiscale simulation approaches represented by molecular simulation, mathematical modeling, and pharmacokinetic (PK)/pharmacodynamic (PD) modeling for ocular drug development. Following this, a new computer-driven framework for rational pharmaceutical formulation design is proposed, inspired by the potential of in silico explorations in understanding drug delivery details and facilitating drug formulation design. Lastly, to promote the paradigm shift, integrated in silico methodologies were highlighted, and discussions on data challenges, model practicality, personalized modeling, regulatory science, interdisciplinary collaboration, and talent training were conducted in detail with a view to achieving more efficient objective-oriented pharmaceutical formulation design.
Collapse
Affiliation(s)
- Nannan Wang
- State Key Laboratory of Quality Research in Chinese Medicine, Institute of Chinese Medical Sciences (ICMS), University of Macau, Macau, China
| | - Yunsen Zhang
- State Key Laboratory of Quality Research in Chinese Medicine, Institute of Chinese Medical Sciences (ICMS), University of Macau, Macau, China
| | - Wei Wang
- State Key Laboratory of Quality Research in Chinese Medicine, Institute of Chinese Medical Sciences (ICMS), University of Macau, Macau, China
| | - Zhuyifan Ye
- State Key Laboratory of Quality Research in Chinese Medicine, Institute of Chinese Medical Sciences (ICMS), University of Macau, Macau, China
| | - Hongyu Chen
- State Key Laboratory of Quality Research in Chinese Medicine, Institute of Chinese Medical Sciences (ICMS), University of Macau, Macau, China; Faculty of Science and Technology (FST), University of Macau, Macau, China
| | - Guanghui Hu
- Faculty of Science and Technology (FST), University of Macau, Macau, China
| | - Defang Ouyang
- State Key Laboratory of Quality Research in Chinese Medicine, Institute of Chinese Medical Sciences (ICMS), University of Macau, Macau, China; Department of Public Health and Medicinal Administration, Faculty of Health Sciences (FHS), University of Macau, Macau, China.
| |
Collapse
|
6
|
Lane TR, Harris J, Urbina F, Ekins S. Comparing LD 50/LC 50 Machine Learning Models for Multiple Species. ACS CHEMICAL HEALTH & SAFETY 2023. [DOI: 10.1021/acs.chas.2c00088] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/25/2023]
Affiliation(s)
- Thomas R. Lane
- Collaborations Pharmaceuticals, Inc., 840 Main Campus Drive, Lab 3510, Raleigh, North Carolina 27606, United States
| | - Joshua Harris
- Collaborations Pharmaceuticals, Inc., 840 Main Campus Drive, Lab 3510, Raleigh, North Carolina 27606, United States
| | - Fabio Urbina
- Collaborations Pharmaceuticals, Inc., 840 Main Campus Drive, Lab 3510, Raleigh, North Carolina 27606, United States
| | - Sean Ekins
- Collaborations Pharmaceuticals, Inc., 840 Main Campus Drive, Lab 3510, Raleigh, North Carolina 27606, United States
| |
Collapse
|