1
Blanchard AE, Bhowmik D, Fox Z, Gounley J, Glaser J, Akpa BS, Irle S. Adaptive language model training for molecular design. J Cheminform 2023; 15:59. [PMID: 37291633] [DOI: 10.1186/s13321-023-00719-7] [Received: 05/09/2022] [Accepted: 04/03/2023] Open Access
Abstract
The vast size of chemical space necessitates computational approaches to automate and accelerate the design of molecular sequences to guide experimental efforts for drug discovery. Genetic algorithms provide a useful framework to incrementally generate molecules by applying mutations to known chemical structures. Recently, masked language models have been applied to automate the mutation process by leveraging large compound libraries to learn commonly occurring chemical sequences (i.e., using tokenization) and predict rearrangements (i.e., using mask prediction). Here, we consider how language models can be adapted to improve molecule generation for different optimization tasks. We use two different generation strategies for comparison, fixed and adaptive. The fixed strategy uses a pre-trained model to generate mutations; the adaptive strategy trains the language model on each new generation of molecules selected for target properties during optimization. Our results show that the adaptive strategy allows the language model to more closely fit the distribution of molecules in the population. Therefore, for enhanced fitness optimization, we suggest the use of the fixed strategy during an initial phase followed by the use of the adaptive strategy. We demonstrate the impact of adaptive training by searching for molecules that optimize both heuristic metrics, drug-likeness and synthesizability, as well as predicted protein binding affinity from a surrogate model. Our results show that the adaptive strategy provides a significant improvement in fitness optimization compared to the fixed pre-trained model, empowering the application of language models to molecular design tasks.
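The fixed-versus-adaptive loop the abstract describes can be illustrated in miniature. This is a hedged sketch only: the paper uses a transformer masked language model over chemical sequences, whereas here a trivial unigram token pool stands in for the model, and the fitness function (counting a character) is a hypothetical placeholder, not one of the paper's metrics. What survives the simplification is the structure: mask-and-fill mutation, selection on fitness, and (in adaptive mode) re-fitting the model on each new generation.

```python
import random

# Toy stand-in for masked-language-model mutation inside a genetic algorithm.
# A unigram "model" (a pool of tokens) replaces the paper's transformer.

def fit_model(population):
    """'Train' by collecting the token distribution of the population."""
    return [t for seq in population for t in seq]

def mutate(seq, model, rng):
    """Mask one random position and fill it with a model-sampled token."""
    i = rng.randrange(len(seq))
    return seq[:i] + rng.choice(model) + seq[i + 1:]

def optimize(population, fitness, generations, adaptive, rng):
    model = fit_model(population)              # "pre-trained" on the start pool
    for _ in range(generations):
        children = [mutate(s, model, rng) for s in population]
        pool = population + children
        pool.sort(key=fitness, reverse=True)
        population = pool[:len(population)]    # keep the fittest molecules
        if adaptive:
            model = fit_model(population)      # re-train on the new generation
    return population

rng = random.Random(0)
# Placeholder task: maximize the number of 'C' characters in a 5-token string.
pop = ["CNOCN", "OONNC", "CCONO", "NNOOC"]
best = optimize(pop, fitness=lambda s: s.count("C"),
                generations=20, adaptive=True, rng=rng)
```

With `adaptive=True`, tokens that dominate the selected population dominate the mutation model, mirroring the paper's observation that adaptive training fits the model to the evolving population; `adaptive=False` reproduces the fixed strategy.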
Affiliation(s)
- Andrew E Blanchard
- Computational Sciences and Engineering Division, Oak Ridge National Laboratory, Oak Ridge, TN, 37831, USA
- Debsindhu Bhowmik
- Computational Sciences and Engineering Division, Oak Ridge National Laboratory, Oak Ridge, TN, 37831, USA
- Zachary Fox
- Computational Sciences and Engineering Division, Oak Ridge National Laboratory, Oak Ridge, TN, 37831, USA
- John Gounley
- Computational Sciences and Engineering Division, Oak Ridge National Laboratory, Oak Ridge, TN, 37831, USA
- Jens Glaser
- National Center for Computational Sciences, Oak Ridge National Laboratory, Oak Ridge, TN, 37831, USA
- Belinda S Akpa
- Biosciences Division, Oak Ridge National Laboratory, Oak Ridge, TN, 37831, USA
- Chemical & Biomolecular Engineering, University of Tennessee, Knoxville, TN, 37996, USA
- Stephan Irle
- Computational Sciences and Engineering Division, Oak Ridge National Laboratory, Oak Ridge, TN, 37831, USA
2
Amirahmadi A, Ohlsson M, Etminani K, Melander O, Björk J. A Masked Language Model for Multi-Source EHR Trajectories Contextual Representation Learning. Stud Health Technol Inform 2023; 302:609-610. [PMID: 37203760] [DOI: 10.3233/shti230217]
Abstract
Using electronic health records data and machine learning to guide future decisions needs to address challenges, including 1) long/short-term dependencies and 2) interactions between diseases and interventions. Bidirectional transformers have effectively addressed the first challenge. Here we tackled the latter challenge by masking one source (e.g., ICD10 codes) and training the transformer to predict it using other sources (e.g., ATC codes).
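The cross-source masking idea can be shown as a small data-construction sketch. This is an assumption-laden illustration, not the paper's pipeline: the tuple-based visit layout, the `mask_source` helper, and the specific ICD-10/ATC codes are hypothetical, chosen only to show how every token from one source is masked so the transformer must recover it from the other sources.

```python
# Sketch of cross-source masking for one EHR visit (assumed data layout):
# all tokens from one source (here ICD-10 diagnoses) are replaced by [MASK],
# and the model is trained to predict them from the remaining sources
# (here ATC drug codes).

MASK = "[MASK]"

def mask_source(visit_tokens, source_to_mask):
    """Return (model input, prediction targets) for one visit."""
    inputs, targets = [], []
    for source, code in visit_tokens:
        if source == source_to_mask:
            inputs.append(MASK)
            targets.append(code)   # tokens the transformer must recover
        else:
            inputs.append(code)    # visible context from the other source
    return inputs, targets

visit = [("ICD10", "I10"), ("ATC", "C09AA05"),
         ("ICD10", "E11"), ("ATC", "A10BA02")]
inp, tgt = mask_source(visit, "ICD10")
# inp == ['[MASK]', 'C09AA05', '[MASK]', 'A10BA02']
# tgt == ['I10', 'E11']
```

Unlike standard BERT masking, which hides random positions, masking an entire source forces the model to learn the interactions between sources, e.g., between prescribed drugs and diagnoses.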
Affiliation(s)
- Ali Amirahmadi
- Center for Applied Intelligent Systems Research, Halmstad University, Sweden
- Mattias Ohlsson
- Center for Applied Intelligent Systems Research, Halmstad University, Sweden
- Centre for Environmental and Climate Science, Lund University, Sweden
- Kobra Etminani
- Center for Applied Intelligent Systems Research, Halmstad University, Sweden
- Olle Melander
- Division of Occupational and Environmental Medicine, Lund University, Sweden
- Jonas Björk
- Department of Clinical Sciences, Lund University, Sweden