1
Jiang J, Li Y, Zhang R, Liu Y. INTransformer: Data augmentation-based contrastive learning by injecting noise into transformer for molecular property prediction. J Mol Graph Model 2024; 128:108703. [PMID: 38228013 DOI: 10.1016/j.jmgm.2024.108703] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/10/2023] [Revised: 12/05/2023] [Accepted: 01/02/2024] [Indexed: 01/18/2024]
Abstract
Molecular property prediction plays an essential role in drug discovery for identifying the candidate molecules with target properties. Deep learning models usually require sufficient labeled data to train good prediction models. However, the size of labeled data is usually small for molecular property prediction, which brings great challenges to deep learning-based molecular property prediction methods. Furthermore, the global information of molecules is critical for predicting molecular properties. Therefore, we propose INTransformer for molecular property prediction, which is a data augmentation method via contrastive learning to alleviate the limitations of the labeled molecular data while enhancing the ability to capture global information. Specifically, INTransformer consists of two identical Transformer sub-encoders to extract the molecular representation from the original SMILES and noisy SMILES respectively, while achieving the goal of data augmentation. To reduce the influence of noise, we use contrastive learning to ensure the molecular encoding of noisy SMILES is consistent with that of the original input so that the molecular representation information can be better extracted by INTransformer. Experiments on various benchmark datasets show that INTransformer achieved competitive performance for molecular property prediction tasks compared with the baselines and state-of-the-art methods.
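The consistency objective described in this abstract can be illustrated with a small sketch. The abstract does not state the exact loss, so the InfoNCE-style formulation, the cosine similarity, the temperature value, and the toy two-dimensional encodings below are all assumptions made for illustration, not the authors' implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(clean, noisy, temperature=0.1):
    """Mean InfoNCE loss: clean[i] should match noisy[i] and not noisy[j != i]."""
    total = 0.0
    for i, anchor in enumerate(clean):
        logits = [cosine(anchor, other) / temperature for other in noisy]
        # Numerically stable log-sum-exp over all candidates in the batch.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denominator - logits[i]
    return total / len(clean)

# Toy encodings standing in for the outputs of the two sub-encoders (hypothetical).
clean = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # noisy views that stay close to their originals
shuffled = [[0.1, 0.9], [0.9, 0.1]]  # noisy views that drifted to the wrong molecule

loss_aligned = info_nce(clean, aligned)
loss_shuffled = info_nce(clean, shuffled)
```

The loss is small when each noisy encoding stays closest to its own original and large when it does not, which is the behaviour a consistency objective of this kind is meant to reward.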
Affiliation(s)
- Jing Jiang
  - Key Laboratory of Linguistic and Cultural Computing, Ministry of Education, Northwest Minzu University, Lanzhou 730030, China.
- Yachao Li
  - Key Laboratory of Linguistic and Cultural Computing, Ministry of Education, Northwest Minzu University, Lanzhou 730030, China.
- Ruisheng Zhang
  - School of Information Science and Engineering, Lanzhou University, Lanzhou 730000, China.
- Yunwu Liu
  - School of Information Science and Engineering, Lanzhou University, Lanzhou 730000, China.
2
Sai L, Fu L, Zhao J. Predicting Binding Energies and Electronic Properties of Boron Nitride Fullerenes Using a Graph Convolutional Network. J Chem Inf Model 2024; 64:2645-2653. [PMID: 38117935 DOI: 10.1021/acs.jcim.3c01708] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/22/2023]
Abstract
As isoelectronic counterparts of carbon fullerenes, medium-sized boron nitride clusters also prefer cage structures composed of even-sized polygons. As the cluster size increases, the number of cage isomers grows rapidly, and determining the ground state structure requires a tremendous amount of DFT calculations. Herein, we develop a graph convolutional network (GCN) that can describe the energy of a (BN)n cage by its topology connection. We define a vertex feature vector on a dual polyhedron by the permutation of the neighbor vertices' degree and aggregate the information on vertices by two graph convolutional layers to learn the local feature of the dual polyhedron. The GCN is trained on (BN)28 and subsequently tested on (BN)23 and (BN)24 data sets, which satisfactorily reproduce the order of isomer energies from DFT calculations. We further employ the trained GCN to predict the ground state structures within the size range of n = 25-32, which agree well with DFT results. Using the same GCN framework, we also successfully trained the highest-occupied or lowest-unoccupied orbital energies of (BN)28 isomers. The present graph convolutional network establishes a direct mapping between the topological connection and the energetic or electronic properties of a cage-like cluster or molecule.
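A rough sketch of the pipeline outlined in this abstract (degree-based vertex features, two graph convolutional layers, a graph-level output) follows. The abstract does not give the feature encoding, the normalisation, or the readout, so the sorted neighbour degrees, the mean aggregation with a self-loop, the sum readout, the identity weights, and the tetrahedron toy graph are all assumptions chosen to keep the example checkable by hand; it is not a BN cage dual and not the authors' network.

```python
def neighbor_degree_features(adj, width):
    """Feature of a vertex: sorted degrees of its neighbours, zero-padded to `width`."""
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    feats = {}
    for v, nbrs in adj.items():
        degs = sorted(degree[u] for u in nbrs)
        feats[v] = [float(d) for d in degs] + [0.0] * (width - len(degs))
    return feats

def gcn_layer(adj, feats, weight):
    """Average over a vertex and its neighbours, multiply by `weight`, apply ReLU."""
    out = {}
    for v, nbrs in adj.items():
        group = [v] + list(nbrs)
        dim = len(feats[v])
        mean = [sum(feats[u][k] for u in group) / len(group) for k in range(dim)]
        # zip(*weight) iterates over the columns of the (in_dim x out_dim) matrix.
        out[v] = [max(0.0, sum(m * w for m, w in zip(mean, col))) for col in zip(*weight)]
    return out

def readout(feats, weights):
    """Sum vertex embeddings, then take a weighted sum to get one scalar per graph."""
    pooled = [sum(col) for col in zip(*feats.values())]
    return sum(p * w for p, w in zip(pooled, weights))

# Toy graph: a tetrahedron, where every vertex has three neighbours.
adj = {0: [1, 2, 3], 1: [0, 2, 3], 2: [0, 1, 3], 3: [0, 1, 2]}
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

x0 = neighbor_degree_features(adj, width=3)
h1 = gcn_layer(adj, x0, identity)
h2 = gcn_layer(adj, h1, identity)
score = readout(h2, [1.0, 1.0, 1.0])
```

Because the tetrahedron is fully symmetric, every vertex ends up with the same embedding. In a trained model the weight matrices and readout weights would be fitted to reference energies rather than fixed.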
Affiliation(s)
- Linwei Sai
  - Department of Mathematics, Hohai University, Changzhou 213200, China
- Li Fu
  - Key Laboratory of Materials Modification by Laser, Ion and Electron Beams, Dalian University of Technology, Ministry of Education, Dalian 116024, China
- Jijun Zhao
  - Key Laboratory of Materials Modification by Laser, Ion and Electron Beams, Dalian University of Technology, Ministry of Education, Dalian 116024, China
3
Jiang J, Zhang R, Yuan Y, Li T, Li G, Zhao Z, Yu Z. NoiseMol: A noise-robusted data augmentation via perturbing noise for molecular property prediction. J Mol Graph Model 2023; 121:108454. [PMID: 36963306 DOI: 10.1016/j.jmgm.2023.108454] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 01/10/2023] [Revised: 03/05/2023] [Accepted: 03/13/2023] [Indexed: 03/17/2023]
Abstract
Simplified Molecular-Input Line-Entry System (SMILES) is one of the most widely used molecular representations for molecular property prediction. We conjecture that all the characters in the SMILES string of a molecule are essential for making up the molecule, but most of them contribute little to determining a particular property of the molecule. We verified this conjecture in a preliminary experiment. Motivated by the result, we propose to inject proper noisy information into the SMILES to augment the training data by increasing the diversity of the labeled molecules. To this end, we explore injecting perturbing noise into the original labeled SMILES strings to construct augmented data, alleviating the limitation of the labeled compound data and helping the model extract more useful molecular representations for molecular property prediction. Specifically, we directly adopt mask, swap, deletion, and fusion operations on SMILES strings to randomly mask, swap, and delete atoms in SMILES strings. The augmented data is then used with one of two strategies: each epoch alternately feeds the original and perturbing noisy molecules, or each batch alternately feeds the original and perturbing noisy molecules. We conduct experiments on both Transformer and BiGRU models to validate the effectiveness of the method on widely used datasets from MoleculeNet and ZINC. Experimental results demonstrate that the proposed method outperforms strong baselines on all the datasets. NoiseMol obtains the best performance on BBBP and FDA when compared with state-of-the-art methods. In addition, NoiseMol achieves the best accuracy on LogP. Therefore, injecting perturbing noise into the labeled SMILES strings is an effective and efficient method that improves the prediction performance, generalization, and robustness of deep learning models.
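The mask, swap, and deletion operations named in this abstract can be sketched as follows. The tokenizer, the atom pattern, the `[MASK]` placeholder, and the function names are hypothetical simplifications introduced here; the abstract does not define them, and the fusion operation is left out because the abstract does not describe it. The perturbed strings are noise for training and are not guaranteed to be valid SMILES.

```python
import random
import re

# Simplified tokenizer: bracket atoms, two-letter halogens, then any single character.
TOKEN_RE = re.compile(r"\[[^\]]+\]|Br|Cl|.")
# Tokens treated as atoms: bracket atoms, halogens, organic-subset and aromatic symbols.
ATOM_RE = re.compile(r"\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]")

def tokenize(smiles):
    return TOKEN_RE.findall(smiles)

def atom_positions(tokens):
    return [i for i, tok in enumerate(tokens) if ATOM_RE.fullmatch(tok)]

def mask_atom(tokens, rng, mask_token="[MASK]"):
    """Replace one randomly chosen atom token with a mask placeholder."""
    out = list(tokens)
    out[rng.choice(atom_positions(out))] = mask_token
    return out

def swap_atoms(tokens, rng):
    """Exchange two randomly chosen atom tokens (needs at least two atoms)."""
    out = list(tokens)
    i, j = rng.sample(atom_positions(out), 2)
    out[i], out[j] = out[j], out[i]
    return out

def delete_atom(tokens, rng):
    """Remove one randomly chosen atom token."""
    out = list(tokens)
    del out[rng.choice(atom_positions(out))]
    return out

rng = random.Random(0)  # fixed seed so repeated runs give the same noise
original = "CC(=O)Oc1ccccc1C(=O)O"  # aspirin
tokens = tokenize(original)
masked = mask_atom(tokens, rng)
swapped = swap_atoms(tokens, rng)
deleted = delete_atom(tokens, rng)
```

Joining a token list with `"".join(...)` gives the noisy string, which could then be fed alternately with the original per epoch or per batch, as the abstract describes.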
Affiliation(s)
- Jing Jiang
  - School of Information Science and Engineering, Lanzhou University, Lanzhou, Gansu, China; Key Laboratory of China's Ethnic Languages and Information Technology of Ministry of Education, Northwest Minzu University, Lanzhou, Gansu, China.
- Ruisheng Zhang
  - School of Information Science and Engineering, Lanzhou University, Lanzhou, Gansu, China.
- Yongna Yuan
  - School of Information Science and Engineering, Lanzhou University, Lanzhou, Gansu, China.
- Tongfeng Li
  - School of Information Science and Engineering, Lanzhou University, Lanzhou, Gansu, China; Computer College, Qinghai Normal University, Xining, Qinghai, China.
- Gaili Li
  - School of Information Science and Engineering, Lanzhou University, Lanzhou, Gansu, China.
- Zhili Zhao
  - School of Information Science and Engineering, Lanzhou University, Lanzhou, Gansu, China.
- Zhixuan Yu
  - School of Information Science and Engineering, Lanzhou University, Lanzhou, Gansu, China.
4
Tran TTV, Tayara H, Chong KT. Recent Studies of Artificial Intelligence on In Silico Drug Distribution Prediction. Int J Mol Sci 2023; 24:1815. [PMID: 36768139 PMCID: PMC9915725 DOI: 10.3390/ijms24031815] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/14/2022] [Revised: 01/11/2023] [Accepted: 01/13/2023] [Indexed: 01/19/2023] Open
Abstract
Drug distribution is an important process in pharmacokinetics because it has the potential to influence both the amount of medicine reaching the active sites and the effectiveness as well as safety of the drug. The main causes of 90% of drug failures in clinical development are lack of efficacy and uncontrolled toxicity. In recent years, several advances and promising developments in drug distribution property prediction have been achieved, especially in silico, which helped to drastically reduce the time and expense of screening undesired drug candidates. In this study, we provide comprehensive knowledge of drug distribution background, influencing factors, and artificial intelligence-based distribution property prediction models from 2019 to the present. Additionally, we gathered and analyzed public databases and datasets commonly utilized by the scientific community for distribution prediction. The distribution property prediction performance of five large ADMET prediction tools is mentioned as a benchmark for future research. On this basis, we also offer future challenges in drug distribution prediction and research directions. We hope that this review will provide researchers with helpful insight into distribution prediction, thus facilitating the development of innovative approaches for drug discovery.
Affiliation(s)
- Thi Tuyet Van Tran
  - Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju 54896, Republic of Korea
  - Department of Information Technology, An Giang University, Long Xuyen 880000, Vietnam
  - Vietnam National University–Ho Chi Minh City, Ho Chi Minh 700000, Vietnam
- Hilal Tayara
  - School of International Engineering and Science, Jeonbuk National University, Jeonju 54896, Republic of Korea
- Kil To Chong
  - Advances Electronics and Information Research Center, Jeonbuk National University, Jeonju 54896, Republic of Korea
5
Lim S, Lee S, Piao Y, Choi M, Bang D, Gu J, Kim S. On modeling and utilizing chemical compound information with deep learning technologies: A task-oriented approach. Comput Struct Biotechnol J 2022; 20:4288-4304. [PMID: 36051875 PMCID: PMC9399946 DOI: 10.1016/j.csbj.2022.07.049] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/02/2022] [Revised: 07/29/2022] [Accepted: 07/29/2022] [Indexed: 11/22/2022] Open
Abstract
A large number of chemical compounds are available in databases such as PubChem and ZINC. However, the currently known compounds, though numerous, represent only a fraction of all possible compounds, a set known as chemical space. Many of the compounds in these databases are annotated with properties and assay data that can be used for drug discovery efforts. For this goal, a number of machine learning algorithms have been developed, and recent deep learning technologies can be effectively used to navigate chemical space, especially for unknown chemical compounds, in terms of drug-related tasks. In this article, we survey how deep learning technologies can model and utilize chemical compound information in a task-oriented way by exploiting annotated properties and assay data in chemical compound databases. We first compile the kinds of tasks that machine learning methods aim to accomplish. Then, we survey deep learning technologies to show their modeling power and current applications for accomplishing drug-related tasks. Next, we survey deep learning techniques that address the insufficiency of annotated data for more effective navigation of chemical space. Chemical compound information alone may not be powerful enough for drug-related tasks, so we survey what kinds of information, such as assay and gene expression data, can be used to improve the prediction power of deep learning models. Finally, we conclude this survey with four important newly developed technologies that are yet to be fully incorporated into computational analysis of chemical information.
Affiliation(s)
- Sangsoo Lim
  - Bioinformatics Institute, Seoul National University, Gwanak-ro 1, Gwanak-gu, Seoul 08826, South Korea
- Sangseon Lee
  - Institute of Computer Technology, Seoul National University, Gwanak-ro 1, Gwanak-gu, Seoul 08826, South Korea
- Yinhua Piao
  - Department of Computer Science and Engineering, Seoul National University, Gwanak-ro 1, Gwanak-gu, Seoul 08826, South Korea
- MinGyu Choi
  - Department of Chemistry, Seoul National University, Gwanak-ro 1, Gwanak-gu, Seoul 08826, South Korea
  - AIGENDRUG Co., Ltd., Gwanak-ro 1, Gwanak-gu, Seoul 08826, South Korea
- Dongmin Bang
  - Interdisciplinary Program in Bioinformatics, Seoul National University, Gwanak-ro 1, Gwanak-gu, Seoul 08826, South Korea
- Jeonghyeon Gu
  - Interdisciplinary Program in Artificial Intelligence, Seoul National University, Gwanak-ro 1, Gwanak-gu, Seoul 08826, South Korea
- Sun Kim
  - Department of Computer Science and Engineering, Seoul National University, Gwanak-ro 1, Gwanak-gu, Seoul 08826, South Korea
  - Interdisciplinary Program in Artificial Intelligence, Seoul National University, Gwanak-ro 1, Gwanak-gu, Seoul 08826, South Korea
  - MOGAM Institute for Biomedical Research, Yong-in 16924, South Korea
  - AIGENDRUG Co., Ltd., Gwanak-ro 1, Gwanak-gu, Seoul 08826, South Korea
6
Gu Y, Zheng S, Xu Z, Yin Q, Li L, Li J. An efficient curriculum learning-based strategy for molecular graph learning. Brief Bioinform 2022; 23:6562682. [DOI: 10.1093/bib/bbac099] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 12/06/2021] [Revised: 01/18/2022] [Accepted: 02/27/2022] [Indexed: 12/14/2022] Open
Abstract
Computational methods have been widely applied to resolve various core issues in drug discovery, such as molecular property prediction. In recent years, deep learning, a data-driven computational method, has achieved a number of impressive successes in various domains. In drug discovery, graph neural networks (GNNs) take molecular graph data as input and learn graph-level representations in non-Euclidean space. A large number of well-performing GNNs have been proposed for molecular graph learning. However, the efficient use of molecular data during the training process has not received enough attention. Curriculum learning (CL) has been proposed as a training strategy that rearranges the training queue based on the calculated difficulty of each sample, yet the effectiveness of CL has not been determined in molecular graph learning. In this study, inspired by chemical domain knowledge and task prior information, we propose a novel CL-based training strategy, called CurrMG, to improve the training efficiency of molecular graph learning. Consisting of a difficulty measurer and a training scheduler, CurrMG is designed as a plug-and-play module that is model-independent and easy to use on molecular data. Extensive experiments demonstrated that molecular graph learning models could benefit from CurrMG and gain noticeable improvement on five GNN models and eight molecular property prediction tasks (overall improvement is 4.08%). We further observed CurrMG’s encouraging potential in resource-constrained molecular property prediction. These results indicate that CurrMG can be used as a reliable and efficient training strategy for molecular graph learning.
Availability: The source code is available at https://github.com/gu-yaowen/CurrMG.
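The two components named in this abstract, a difficulty measurer and a training scheduler, can be illustrated with a generic curriculum-learning sketch. This is not the CurrMG code from the repository above: the length-based difficulty score, the linear pacing function, the `start_fraction` parameter, and all function names are hypothetical stand-ins for whatever chemistry-informed measures and schedule the authors use.

```python
def difficulty(smiles):
    """Hypothetical difficulty measurer: longer SMILES strings count as harder."""
    return len(smiles)

def linear_pace(step, total_steps, start_fraction=0.25):
    """Fraction of the difficulty-sorted training set exposed at a given step."""
    progress = min(1.0, step / total_steps)
    return start_fraction + (1.0 - start_fraction) * progress

def curriculum_pool(samples, step, total_steps):
    """Training scheduler: return the easiest samples allowed at this step."""
    ordered = sorted(samples, key=difficulty)
    size = max(1, round(linear_pace(step, total_steps) * len(ordered)))
    return ordered[:size]

samples = ["CCO", "C", "CC(=O)Oc1ccccc1C(=O)O", "c1ccccc1"]
early = curriculum_pool(samples, step=0, total_steps=10)   # easiest quarter only
late = curriculum_pool(samples, step=10, total_steps=10)   # full training set
```

Batches would be drawn from the returned pool at each step, so the model sees easy molecules first and the full dataset by the end. Because the scheduler only reorders and filters data, a module of this shape stays independent of the model being trained.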
Affiliation(s)
- Yaowen Gu
  - Institute of Medical Information (IMI), Chinese Academy of Medical Sciences and Peking Union Medical College (CAMS & PUMC), Beijing 100020, China
- Si Zheng
  - Institute of Medical Information (IMI), Chinese Academy of Medical Sciences and Peking Union Medical College (CAMS & PUMC), Beijing 100020, China
  - Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
- Zidu Xu
  - Institute of Medical Information (IMI), Chinese Academy of Medical Sciences and Peking Union Medical College (CAMS & PUMC), Beijing 100020, China
- Qijin Yin
  - Ministry of Education Key Laboratory of Bioinformatics, Bioinformatics Division at the Beijing National Research Center for Information Science and Technology, Center for Synthetic and Systems Biology, Department of Automation, Tsinghua University, Beijing 100084, China
- Liang Li
  - Key Laboratory of Antibiotic Bioengineering of National Health and Family Planning Commission (NHFPC), Institute of Medicinal Biotechnology (IMB), Chinese Academy of Medical Sciences and Peking Union Medical College (CAMS & PUMC), Beijing 100020, China
- Jiao Li
  - Institute of Medical Information (IMI), Chinese Academy of Medical Sciences and Peking Union Medical College (CAMS & PUMC), Beijing 100020, China