1
Zhang C, Zhai Y, Gong Z, Duan H, She YB, Yang YF, Su A. Transfer learning across different chemical domains: virtual screening of organic materials with deep learning models pretrained on small molecule and chemical reaction data. J Cheminform 2024; 16:89. [PMID: 39080777 PMCID: PMC11290278 DOI: 10.1186/s13321-024-00886-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/05/2024] [Accepted: 07/21/2024] [Indexed: 08/02/2024] Open
Abstract
Machine learning is becoming a preferred method for the virtual screening of organic materials due to its cost-effectiveness over traditional computationally demanding techniques. However, the scarcity of labeled data for organic materials poses a significant challenge for training advanced machine learning models. This study showcases the potential of utilizing databases of drug-like small molecules and chemical reactions to pretrain the BERT model, enhancing its performance in the virtual screening of organic materials. By fine-tuning the BERT models with data from five virtual screening tasks, the version pretrained with the USPTO-SMILES dataset achieved R² scores exceeding 0.94 for three tasks and over 0.81 for two others. This performance surpasses that of models pretrained on the small molecule or organic materials databases and outperforms three traditional machine learning models trained directly on virtual screening data. The success of the USPTO-SMILES pretrained BERT model can be attributed to the diverse array of organic building blocks in the USPTO database, offering a broader exploration of the chemical space. The study further suggests that accessing a reaction database with a wider range of reactions than the USPTO could further enhance model performance. Overall, this research validates the feasibility of applying transfer learning across different chemical domains for the efficient virtual screening of organic materials.

Scientific contribution

This study verifies the feasibility of applying transfer learning to large language models across different chemical fields to support the virtual screening of organic materials. By comparing transfer learning from different chemical fields across a variety of organic material molecules, it achieves high-precision virtual screening of organic materials.
Affiliation(s)
- Chengwei Zhang
- State Key Laboratory Breeding Base of Green Chemistry-Synthesis Technology, Key Laboratory of Green Chemistry-Synthesis Technology of Zhejiang Province, College of Chemical Engineering, Zhejiang University of Technology, Hangzhou, 310014, Zhejiang, China
- Yushuang Zhai
- State Key Laboratory Breeding Base of Green Chemistry-Synthesis Technology, Key Laboratory of Green Chemistry-Synthesis Technology of Zhejiang Province, College of Chemical Engineering, Zhejiang University of Technology, Hangzhou, 310014, Zhejiang, China
- Ziyang Gong
- Key Laboratory of Pharmaceutical Engineering of Zhejiang Province, Key Laboratory for Green Pharmaceutical Technologies and Related Equipment of Ministry of Education, Collaborative Innovation Center of Yangtze River Delta Region Green Pharmaceuticals, Zhejiang University of Technology, Hangzhou, 310014, People's Republic of China
- Hongliang Duan
- Faculty of Applied Sciences, Macao Polytechnic University, Macao, 999078, China
- Yuan-Bin She
- State Key Laboratory Breeding Base of Green Chemistry-Synthesis Technology, Key Laboratory of Green Chemistry-Synthesis Technology of Zhejiang Province, College of Chemical Engineering, Zhejiang University of Technology, Hangzhou, 310014, Zhejiang, China
- Yun-Fang Yang
- State Key Laboratory Breeding Base of Green Chemistry-Synthesis Technology, Key Laboratory of Green Chemistry-Synthesis Technology of Zhejiang Province, College of Chemical Engineering, Zhejiang University of Technology, Hangzhou, 310014, Zhejiang, China
- An Su
- State Key Laboratory Breeding Base of Green Chemistry-Synthesis Technology, Key Laboratory of Green Chemistry-Synthesis Technology of Zhejiang Province, College of Chemical Engineering, Zhejiang University of Technology, Hangzhou, 310014, Zhejiang, China.
- Key Laboratory of Pharmaceutical Engineering of Zhejiang Province, Key Laboratory for Green Pharmaceutical Technologies and Related Equipment of Ministry of Education, Collaborative Innovation Center of Yangtze River Delta Region Green Pharmaceuticals, Zhejiang University of Technology, Hangzhou, 310014, People's Republic of China.
2
Su A, Cheng Y, Zhang C, Yang YF, She YB, Rajan K. An artificial intelligence platform for automated PFAS subgroup classification: A discovery tool for PFAS screening. Sci Total Environ 2024; 921:171229. [PMID: 38402985 DOI: 10.1016/j.scitotenv.2024.171229] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/31/2023] [Revised: 01/27/2024] [Accepted: 02/21/2024] [Indexed: 02/27/2024]
Abstract
Since structural analyses and toxicity assessments have not been able to keep up with the discovery of unknown per- and polyfluoroalkyl substances (PFAS), there is an urgent need for effective categorization and grouping of PFAS. In this study, we presented PFAS-Atlas, an artificial intelligence-based platform containing a rule-based automatic classification system and a machine learning-based grouping model. Compared with previously developed classification software, the platform's classification system follows the latest Organization for Economic Co-operation and Development (OECD) definition of PFAS and reduces the number of uncategorized PFAS. In addition, the platform incorporates deep unsupervised learning models to visualize the chemical space of PFAS by clustering similar structures and linking related classes. Through real-world use cases, we demonstrate that PFAS-Atlas can rapidly screen for relationships between chemical structure and persistence, bioaccumulation, or toxicity data for PFAS. The platform can also guide the planning of the PFAS testing strategy by showing which PFAS classes urgently require further attention. Ultimately, the release of PFAS-Atlas will benefit both the PFAS research and regulation communities.
Affiliation(s)
- An Su
- State Key Laboratory Breeding Base of Green Chemistry-Synthesis Technology, Key Laboratory of Green Chemistry-Synthesis Technology of Zhejiang Province, College of Chemical Engineering, Zhejiang University of Technology, Hangzhou, Zhejiang 310014, China; Key Laboratory of Pharmaceutical Engineering of Zhejiang Province, Collaborative Innovation Center of Yangtze River Delta Region Green Pharmaceuticals, Zhejiang University of Technology, Hangzhou, Zhejiang 310014, PR China.
- Yingying Cheng
- State Key Laboratory Breeding Base of Green Chemistry-Synthesis Technology, Key Laboratory of Green Chemistry-Synthesis Technology of Zhejiang Province, College of Chemical Engineering, Zhejiang University of Technology, Hangzhou, Zhejiang 310014, China; Key Laboratory of Pharmaceutical Engineering of Zhejiang Province, Collaborative Innovation Center of Yangtze River Delta Region Green Pharmaceuticals, Zhejiang University of Technology, Hangzhou, Zhejiang 310014, PR China
- Chengwei Zhang
- State Key Laboratory Breeding Base of Green Chemistry-Synthesis Technology, Key Laboratory of Green Chemistry-Synthesis Technology of Zhejiang Province, College of Chemical Engineering, Zhejiang University of Technology, Hangzhou, Zhejiang 310014, China
- Yun-Fang Yang
- State Key Laboratory Breeding Base of Green Chemistry-Synthesis Technology, Key Laboratory of Green Chemistry-Synthesis Technology of Zhejiang Province, College of Chemical Engineering, Zhejiang University of Technology, Hangzhou, Zhejiang 310014, China
- Yuan-Bin She
- State Key Laboratory Breeding Base of Green Chemistry-Synthesis Technology, Key Laboratory of Green Chemistry-Synthesis Technology of Zhejiang Province, College of Chemical Engineering, Zhejiang University of Technology, Hangzhou, Zhejiang 310014, China.
- Krishna Rajan
- Department of Materials Design and Innovation, University at Buffalo, Buffalo, NY 14260-1660, United States.
3
Zhang H, Lu C, Yao Q, Jiao Q. In silico study to identify novel NEK7 inhibitors from natural sources by a combination strategy. Mol Divers 2024:10.1007/s11030-024-10838-4. [PMID: 38598164 DOI: 10.1007/s11030-024-10838-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/02/2023] [Accepted: 03/06/2024] [Indexed: 04/11/2024]
Abstract
Cancer poses a significant global health challenge and significantly contributes to mortality. NEK7, related to the NIMA protein kinase family, plays a crucial role in spindle assembly and cell division. The dysregulation of NEK7 is closely linked to the onset and progression of various cancers, especially colon and breast cancer, making it a promising target for cancer therapy. Nevertheless, the shortage of high-quality NEK7 inhibitors highlights the need for new therapeutic strategies. In this study, we utilized a multidisciplinary approach, including virtual screening, molecular docking, pharmacokinetics, molecular dynamics simulations (MDs), and MM/PBSA calculations, to evaluate natural compounds as NEK7 inhibitors comprehensively. Through various docking strategies, we identified three natural compounds: (-)-balanol, digallic acid, and scutellarin. Molecular docking revealed significant interactions at residues such as GLU112 and ALA114, with docking scores of -15.054, -13.059, and -11.547 kcal/mol, respectively, highlighting their potential as NEK7 inhibitors. MDs confirmed the stability of these compounds at the NEK7-binding site. Hydrogen bond analysis during simulations revealed consistent interactions, supporting their strong binding capacity. MM/PBSA analysis identified other crucial amino acids contributing to binding affinity, including ILE20, VAL28, ILE75, LEU93, ALA94, LYS143, PHE148, LEU160, and THR161, crucial for stabilizing the complex. This research demonstrated that these compounds exceeded dabrafenib in binding energy, according to MM/PBSA calculations, underscoring their effectiveness as NEK7 inhibitors. ADME/T predictions showed lower oral toxicity for these compounds, suggesting their potential for further development. This study highlights the promise of these natural compounds as bases for creating more potent derivatives with significant biological activities, paving the way for future experimental validation.
Affiliation(s)
- Heng Zhang
- State Key Laboratory of Pharmaceutical Biotechnology, School of Life Sciences, Nanjing University, Nanjing, 210023, China
- Chenhong Lu
- State Key Laboratory of Pharmaceutical Biotechnology, School of Life Sciences, Nanjing University, Nanjing, 210023, China
- Qilong Yao
- State Key Laboratory of Pharmaceutical Biotechnology, School of Life Sciences, Nanjing University, Nanjing, 210023, China
- Qingcai Jiao
- State Key Laboratory of Pharmaceutical Biotechnology, School of Life Sciences, Nanjing University, Nanjing, 210023, China.
4
Dral PO. AI in computational chemistry through the lens of a decade-long journey. Chem Commun (Camb) 2024; 60:3240-3258. [PMID: 38444290 DOI: 10.1039/d4cc00010b] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/07/2024]
Abstract
This article gives a perspective on the progress of AI tools in computational chemistry through the lens of the author's decade-long contributions put in the wider context of the trends in this rapidly expanding field. This progress over the last decade is tremendous: while a decade ago we had a glimpse of what was to come through many proof-of-concept studies, now we witness the emergence of many AI-based computational chemistry tools that are mature enough to make faster and more accurate simulations increasingly routine. Such simulations in turn allow us to validate and even revise experimental results, deepen our understanding of the physicochemical processes in nature, and design better materials, devices, and drugs. The rapid introduction of powerful AI tools gives rise to unique challenges and opportunities that are discussed in this article too.
Affiliation(s)
- Pavlo O Dral
- State Key Laboratory of Physical Chemistry of Solid Surfaces, College of Chemistry and Chemical Engineering, Fujian Provincial Key Laboratory of Theoretical and Computational Chemistry, and Innovation Laboratory for Sciences and Technologies of Energy Materials of Fujian Province (IKKEM), Xiamen University, Xiamen, Fujian 361005, China.
5
Zhang Z, Zhang C, Zhang Y, Deng S, Yang YF, Su A, She YB. Predicting band gaps of MOFs on small data by deep transfer learning with data augmentation strategies. RSC Adv 2023; 13:16952-16962. [PMID: 37288371 PMCID: PMC10243186 DOI: 10.1039/d3ra02142d] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 04/01/2023] [Accepted: 05/31/2023] [Indexed: 06/09/2023] Open
Abstract
Porphyrin-based MOFs combine the unique photophysical and electrochemical properties of metalloporphyrins with the catalytic efficiency of MOF materials, making them an important candidate for light energy harvesting and conversion. However, accurate prediction of the band gap of porphyrin-based MOFs is hampered by their complex structure-function relationships. Although machine learning (ML) has performed well in predicting the properties of MOFs with large training datasets, such ML applications become challenging when the training data size of the materials is small. In this study, we first constructed a dataset of 202 porphyrin-based MOFs using DFT computations and increased the training data size using two data augmentation strategies. After that, four state-of-the-art neural network models were pre-trained with the recognized open-source database QMOF and fine-tuned with our augmented self-curated datasets. The GCN models predicted the band gaps of the porphyrin-based materials with the lowest RMSE of 0.2767 eV and MAE of 0.1463 eV. In addition, the data augmentation strategy rotation and mirroring effectively decreased the RMSE by 38.51% and MAE by 50.05%. This study demonstrates that, when proper transfer learning and data augmentation strategies are applied, machine learning models can predict the properties of MOFs using small training data.
Affiliation(s)
- Zhihui Zhang
- College of Chemical Engineering, Zhejiang University of Technology, Hangzhou 310014, China
- Chengwei Zhang
- College of Chemical Engineering, Zhejiang University of Technology, Hangzhou 310014, China
- Yutao Zhang
- College of Chemical Engineering, Zhejiang University of Technology, Hangzhou 310014, China
- Shengwei Deng
- College of Chemical Engineering, Zhejiang University of Technology, Hangzhou 310014, China
- Yun-Fang Yang
- College of Chemical Engineering, Zhejiang University of Technology, Hangzhou 310014, China
- An Su
- College of Chemical Engineering, Zhejiang University of Technology, Hangzhou 310014, China
- Yuan-Bin She
- College of Chemical Engineering, Zhejiang University of Technology, Hangzhou 310014, China