1
|
Kashafutdinova IM, Poyezzhayeva A, Gimadiev T, Madzhidov T. Active learning approaches in molecule pKi prediction. Mol Inform 2024:e202400154. [PMID: 39105614 DOI: 10.1002/minf.202400154] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/02/2024] [Revised: 06/21/2024] [Accepted: 06/23/2024] [Indexed: 08/07/2024]
Abstract
During the early stages of drug design, identifying compounds with suitable bioactivities is crucial. Given the vast array of potential drug databases, it's feasible to assay only a limited subset of candidates. The optimal method for selecting the candidates, aiming to minimize the overall number of assays, involves an active learning (AL) approach. In this work, we benchmarked a range of AL strategies with two main objectives: (1) to identify a strategy that ensures high model performance and (2) to select molecules with desired properties using minimal assays. To evaluate the different AL strategies, we employed the simulated AL workflow based on "virtual" experiments. These experiments leveraged ChEMBL datasets, which come with known biological activity values for the molecules. Furthermore, for classification tasks, we proposed the hybrid selection strategy that unified both exploration and exploitation AL strategies into a single acquisition function, defined by parameters n and c. We have also shown that popular minimal margin and maximal variance selection approaches for exploration selection correspond to minimization of the hybrid acquisition function with n=1 and 2 respectively. The balance between the exploration and exploitation strategies can be adjusted using a coefficient (c), making the optimal strategy selection straightforward. The primary strength of the hybrid selection method lies in its adaptability; it offers the flexibility to adjust the criteria for molecule selection based on the specific task by modifying the value of the contribution coefficient. Our analysis revealed that, in regression tasks, AL strategies didn't succeed at ensuring high model performance, however, they were successful in selecting molecules with desired properties using minimal number of tests. In analogous experiments in classification tasks, exploration strategy and the hybrid selection function with a constant c<1 (for n=1) and c≤0.2 (for n=2) were effective in achieving the goal of constructing a high-performance predictive model using minimal data. When searching for molecules with desired properties, exploitation, and the hybrid function with c≥1 (n=1) and c≥0.7 (n=2) demonstrated efficiency identifying molecules in fewer iterations compared to random selection method. Notably, when the hybrid function was set to an intermediate coefficient value (c=0.7), it successfully addressed both tasks simultaneously.
Collapse
Affiliation(s)
- I M Kashafutdinova
- A.M. Butlerov Institute of Chemistry, Kazan Federal University, Kazan, 420008, Russia
| | - A Poyezzhayeva
- A.M. Butlerov Institute of Chemistry, Kazan Federal University, Kazan, 420008, Russia
| | - T Gimadiev
- A.M. Butlerov Institute of Chemistry, Kazan Federal University, Kazan, 420008, Russia
| | - T Madzhidov
- Chemistry Solutions, Elsevier, London, EC2Y 5AS, UK
| |
Collapse
|
2
|
Chen Y, Du Z, Ren X, Pan C, Zhu Y, Li Z, Meng T, Yao X. mRNA-CLA: An interpretable deep learning approach for predicting mRNA subcellular localization. Methods 2024; 227:17-26. [PMID: 38705502 DOI: 10.1016/j.ymeth.2024.04.018] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/22/2023] [Revised: 03/30/2024] [Accepted: 04/28/2024] [Indexed: 05/07/2024] Open
Abstract
Messenger RNA (mRNA) is vital for post-transcriptional gene regulation, acting as the direct template for protein synthesis. However, the methods available for predicting mRNA subcellular localization need to be improved and enhanced. Notably, few existing algorithms can annotate mRNA sequences with multiple localizations. In this work, we propose the mRNA-CLA, an innovative multi-label subcellular localization prediction framework for mRNA, leveraging a deep learning approach with a multi-head self-attention mechanism. The framework employs a multi-scale convolutional layer to extract sequence features across different regions and uses a self-attention mechanism explicitly designed for each sequence. Paired with Position Weight Matrices (PWMs) derived from the convolutional neural network layers, our model offers interpretability in the analysis. In particular, we perform a base-level analysis of mRNA sequences from diverse subcellular localizations to determine the nucleotide specificity corresponding to each site. Our evaluations demonstrate that the mRNA-CLA model substantially outperforms existing methods and tools.
Collapse
Affiliation(s)
- Yifan Chen
- Institute of Artificial Intelligence Application, College of Computer and Information Engineering, Central South University of Forestry and Technology, Changsha, Hunan 410004, China
| | - Zhenya Du
- Guangzhou Xinhua University, 510520, Guangzhou, China
| | - Xuanbai Ren
- College of Information Science and Engineering, Hunan University, Changsha, Hunan, China
| | - Chu Pan
- College of Information Science and Engineering, Hunan University, Changsha, Hunan, China
| | - Yangbin Zhu
- Manufacturing and Electronic Engineering, Wenzhou University of Technology, 325027, Wenzhou, China.
| | - Zhen Li
- Institute of Computational Science and Technology, Guangzhou University, Guangzhou, 510006, China.
| | - Tao Meng
- Institute of Artificial Intelligence Application, College of Computer and Information Engineering, Central South University of Forestry and Technology, Changsha, Hunan 410004, China
| | - Xiaojun Yao
- Faculty of Applied Sciences, Macao Polytechnic University, 999078, Macao.
| |
Collapse
|
3
|
Fu X, Chen Y, Tian S. DlncRNALoc: A discrete wavelet transform-based model for predicting lncRNA subcellular localization. MATHEMATICAL BIOSCIENCES AND ENGINEERING : MBE 2023; 20:20648-20667. [PMID: 38124569 DOI: 10.3934/mbe.2023913] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/23/2023]
Abstract
The prediction of long non-coding RNA (lncRNA) subcellular localization is essential to the understanding of its function and involvement in cellular regulation. Traditional biological experimental methods are costly and time-consuming, making computational methods the preferred approach for predicting lncRNA subcellular localization (LSL). However, existing computational methods have limitations due to the structural characteristics of lncRNAs and the uneven distribution of data across subcellular compartments. We propose a discrete wavelet transform (DWT)-based model for predicting LSL, called DlncRNALoc. We construct a physicochemical property matrix of a 2-tuple bases based on lncRNA sequences, and we introduce a DWT lncRNA feature extraction method. We use the Synthetic Minority Over-sampling Technique (SMOTE) for oversampling and the local fisher discriminant analysis (LFDA) algorithm to optimize feature information. The optimized feature vectors are fed into support vector machine (SVM) to construct a predictive model. DlncRNALoc has been applied for a five-fold cross-validation on the three sets of benchmark datasets. Extensive experiments have demonstrated the superiority and effectiveness of the DlncRNALoc model in predicting LSL.
Collapse
Affiliation(s)
- Xiangzheng Fu
- Neher's Biophysics Laboratory for Innovative Drug Discovery, State Key Laboratory of Quality Research in Chinese Medicine, Macau Institute for Applied Research in Medicine and Health, Macau University of Science and Technology, Macao, China
- College of Information Science and Engineering, Hunan University, Changsha, Hunan, China
- Department of Basic Biology, Changsha Medical College, Changsha, Hunan, China
| | - Yifan Chen
- College of Information Science and Engineering, Hunan University, Changsha, Hunan, China
- Department of Basic Biology, Changsha Medical College, Changsha, Hunan, China
| | - Sha Tian
- Department of Internal Medicine, College of Integrated Chinese and Western Medicine, Hunan University of Chinese Medicine, Changsha, Hunan, China
| |
Collapse
|
4
|
Ju H, Bai J, Jiang J, Che Y, Chen X. Comparative evaluation and analysis of DNA N4-methylcytosine methylation sites using deep learning. Front Genet 2023; 14:1254827. [PMID: 37671040 PMCID: PMC10476523 DOI: 10.3389/fgene.2023.1254827] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/07/2023] [Accepted: 07/31/2023] [Indexed: 09/07/2023] Open
Abstract
DNA N4-methylcytosine (4mC) is significantly involved in biological processes, such as DNA expression, repair, and replication. Therefore, accurate prediction methods are urgently needed. Deep learning methods have transformed applications that previously require sequencing expertise into engineering challenges that do not require expertise to solve. Here, we compare a variety of state-of-the-art deep learning models on six benchmark datasets to evaluate their performance in 4mC methylation site detection. We visualize the statistical analysis of the datasets and the performance of different deep-learning models. We conclude that deep learning can greatly expand the potential of methylation site prediction.
Collapse
Affiliation(s)
- Hong Ju
- Heilongjiang Agricultural Engineering Vocational College, Harbin, China
| | - Jie Bai
- Engineering Research Center of Integration and Application of Digital Learning Technology, Ministry of Education, Hangzhou, China
| | - Jing Jiang
- Beidahuang Industry Group General Hospital, Harbin, China
| | - Yusheng Che
- Heilongjiang Agricultural Engineering Vocational College, Harbin, China
| | - Xin Chen
- Department of Neurosurgical Laboratory, The First Affiliated Hospital of Harbin Medical University, Harbin, China
| |
Collapse
|
5
|
Liang Y, Ma X. iACP-GE: accurate identification of anticancer peptides by using gradient boosting decision tree and extra tree. SAR AND QSAR IN ENVIRONMENTAL RESEARCH 2023; 34:1-19. [PMID: 36562289 DOI: 10.1080/1062936x.2022.2160011] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 10/24/2022] [Accepted: 12/12/2022] [Indexed: 06/17/2023]
Abstract
Cancer is one of the main diseases threatening human life, accounting for millions of deaths around the world each year. Traditional physical and chemical methods for cancer treatment are extremely time-consuming, lab-intensive, expensive, inefficient and difficult to be applied in a high-throughput way. Hence, it is an urgent task to develop automated computational methods to enable fast and accurate identification of anticancer peptides (ACPs). In this paper, we develop a novel model named iACP-GE to identify ACPs. Multi-features are extracted by using binary encoding, enhanced grouped amino acid composition and BLOSUM62 encoding based on the N5C5 sequence, as well as detrended forward moving-average auto-cross correlation analysis based on physicochemical properties of 20 natural amino acids. Thus, 835 features are obtained for each sample, in order to avoid information redundancy, gradient boosting decision tree was adopted as the feature selection strategy. Then, the optimal feature subset is input to the extra tree classifier. The accuracies of ACP740 and ACP240 datasets with the 5-fold cross-validation were 90.54% and 91.25%, respectively. Experimental results indicate that iACP-GE significantly outperforms several existing models on ACP740 and ACP240 datasets and can be used as an effective tool for the identification of ACPs. The datasets and source codes for iACP-GE are available at https://github.com/yunyunliang88/iACP-GE.
Collapse
Affiliation(s)
- Y Liang
- School of Science, Xi'an Polytechnic University, Xi'an, P. R. China
| | - X Ma
- School of Science, Xi'an Polytechnic University, Xi'an, P. R. China
| |
Collapse
|
6
|
Cai L, Gao M, Ren X, Fu X, Xu J, Wang P, Chen Y. MILNP: Plant lncRNA-miRNA Interaction Prediction Based on Improved Linear Neighborhood Similarity and Label Propagation. FRONTIERS IN PLANT SCIENCE 2022; 13:861886. [PMID: 35401586 PMCID: PMC8990282 DOI: 10.3389/fpls.2022.861886] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 01/25/2022] [Accepted: 02/21/2022] [Indexed: 06/14/2023]
Abstract
Knowledge of the interactions between long non-coding RNAs (lncRNAs) and microRNAs (miRNAs) is the basis of understanding various biological activities and designing new drugs. Previous computational methods for predicting lncRNA-miRNA interactions lacked for plants, and they suffer from various limitations that affect the prediction accuracy and their applicability. Research on plant lncRNA-miRNA interactions is still in its infancy. In this paper, we propose an accurate predictor, MILNP, for predicting plant lncRNA-miRNA interactions based on improved linear neighborhood similarity measurement and linear neighborhood propagation algorithm. Specifically, we propose a novel similarity measure based on linear neighborhood similarity from multiple similarity profiles of lncRNAs and miRNAs and derive more precise neighborhood ranges so as to escape the limits of the existing methods. We then simultaneously update the lncRNA-miRNA interactions predicted from both similarity matrices based on label propagation. We comprehensively evaluate MILNP on the latest plant lncRNA-miRNA interaction benchmark datasets. The results demonstrate the superior performance of MILNP than the most up-to-date methods. What's more, MILNP can be leveraged for isolated plant lncRNAs (or miRNAs). Case studies suggest that MILNP can identify novel plant lncRNA-miRNA interactions, which are confirmed by classical tools. The implementation is available on https://github.com/HerSwain/gra/tree/MILNP.
Collapse
Affiliation(s)
| | | | | | - Xiangzheng Fu
- College of Computer Science and Electronic Engineering, Hunan University, Changsha, China
| | | | - Peng Wang
- College of Computer Science and Electronic Engineering, Hunan University, Changsha, China
| | | |
Collapse
|