1
Kang Y, Wang H, Qin Y, Liu G, Yu Y, Zhang Y. PSATF-6mA: an integrated learning fusion feature-encoded DNA-6 mA methylcytosine modification site recognition model based on attentional mechanisms. Front Genet 2024; 15:1498884. [PMID: 39600317 PMCID: PMC11588721 DOI: 10.3389/fgene.2024.1498884] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/21/2024] [Accepted: 10/30/2024] [Indexed: 11/29/2024] Open
Abstract
DNA methylation is crucial for gene expression and is implicated in processes such as cell differentiation and tumour formation. Identifying DNA 6mA sites with traditional experimental methods involves cumbersome steps and a large amount of time. Neural network technology has made it possible to identify 6mA sites in DNA across species more effectively. Nevertheless, most current neural network models for 6mA site identification prioritize the design of the identification model, with comparatively little research on the statistical properties of the DNA sequence itself. This paper therefore focuses on a statistical strategy for DNA double-stranded features, applying the multi-head self-attention mechanism in neural networks to positional probability relationships in DNA. A new recognition model, PSATF-6mA, is then constructed by continually adjusting the attentional tendency of feature fusion through an integrated learning framework. Experimental results, obtained through cross-validation on cross-species data, show that the PSATF-6mA model outperforms the baseline model. The Matthews correlation coefficient (MCC) on the cross-species dataset of the rice and M. musculus genomes reaches 0.982. The model is expected to help biologists identify 6mA loci more accurately and formulate new testable biological hypotheses.
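For readers unfamiliar with the metric quoted in this abstract, the following is a minimal sketch of how a Matthews correlation coefficient is computed from a binary confusion matrix. It is not taken from the PSATF-6mA paper; the function name `matthews_corrcoef_from_counts` and the example counts are hypothetical and serve only to illustrate the standard formula MCC = (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)).

```python
import math


def matthews_corrcoef_from_counts(tp: int, tn: int, fp: int, fn: int) -> float:
    """Return the MCC for a binary confusion matrix.

    By a common convention, 0.0 is returned when the denominator is zero
    (for example, when one class is never predicted).
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return numerator / denominator


# Hypothetical counts, chosen only to illustrate the calculation;
# they are not results from the paper.
example_score = matthews_corrcoef_from_counts(tp=90, tn=85, fp=15, fn=10)
print(round(example_score, 3))
```

The MCC ranges from -1 (total disagreement) through 0 (no better than chance) to 1 (perfect prediction), which is why it is often preferred over accuracy when positive and negative sites are imbalanced.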
Affiliation(s)
- Yanmei Kang
  - School of Cyber Science and Engineering, University of International Relations, Beijing, China
- Hongyuan Wang
  - School of Cyber Science and Engineering, University of International Relations, Beijing, China
- Yubo Qin
  - School of Cyber Science and Engineering, University of International Relations, Beijing, China
- Guanlin Liu
  - School of Cyber Science and Engineering, University of International Relations, Beijing, China
- Yi Yu
  - College of Computer Science and Technology, Guangdong University of Technology, Guangzhou, China
- Yongjian Zhang
  - School of Cyber Science and Engineering, University of International Relations, Beijing, China
2
Xiao H, Zou Y, Wang J, Wan S. A Review for Artificial Intelligence Based Protein Subcellular Localization. Biomolecules 2024; 14:409. [PMID: 38672426 PMCID: PMC11048326 DOI: 10.3390/biom14040409] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/29/2024] [Revised: 03/21/2024] [Accepted: 03/25/2024] [Indexed: 04/28/2024] Open
Abstract
Proteins need to be located in appropriate spatiotemporal contexts to carry out their diverse biological functions. Mislocalized proteins may lead to a broad range of diseases, such as cancer and Alzheimer's disease. Knowing where a target protein resides within a cell will give insights into tailored drug design for a disease. As the gold validation standard, the conventional wet lab uses fluorescent microscopy imaging, immunoelectron microscopy, and fluorescent biomarker tags for protein subcellular location identification. However, the booming era of proteomics and high-throughput sequencing generates tons of newly discovered proteins, making protein subcellular localization by wet-lab experiments a mission impossible. To tackle this concern, in the past decades, artificial intelligence (AI) and machine learning (ML), especially deep learning methods, have made significant progress in this research area. In this article, we review the latest advances in AI-based method development in three typical types of approaches, including sequence-based, knowledge-based, and image-based methods. We also elaborately discuss existing challenges and future directions in AI-based method development in this research field.
Affiliation(s)
- Hanyu Xiao
  - Department of Genetics, Cell Biology and Anatomy, College of Medicine, University of Nebraska Medical Center, Omaha, NE 68198, USA
- Yijin Zou
  - College of Veterinary Medicine, China Agricultural University, Beijing 100193, China
- Jieqiong Wang
  - Department of Neurological Sciences, College of Medicine, University of Nebraska Medical Center, Omaha, NE 68198, USA
- Shibiao Wan
  - Department of Genetics, Cell Biology and Anatomy, College of Medicine, University of Nebraska Medical Center, Omaha, NE 68198, USA
3
Zhang M, Gao H, Liao X, Ning B, Gu H, Yu B. DBGRU-SE: predicting drug-drug interactions based on double BiGRU and squeeze-and-excitation attention mechanism. Brief Bioinform 2023:7176312. [PMID: 37225428 DOI: 10.1093/bib/bbad184] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/27/2023] [Revised: 04/03/2023] [Accepted: 04/23/2023] [Indexed: 05/26/2023] Open
Abstract
The prediction of drug-drug interactions (DDIs) is essential for the development and repositioning of new drugs, and DDIs play a vital role in biopharmaceuticals, disease diagnosis and pharmacological treatment. This article proposes a new method called DBGRU-SE for predicting DDIs. Firstly, FP3 fingerprints, MACCS fingerprints, PubChem fingerprints and 1D and 2D molecular descriptors are used to extract the feature information of the drugs. Secondly, Group Lasso is used to remove redundant features. Then, SMOTE-ENN is applied to balance the data to obtain the best feature vectors. Finally, the best feature vectors are fed into a classifier combining BiGRU and squeeze-and-excitation (SE) attention mechanisms to predict DDIs. Under five-fold cross-validation, the ACC values of the DBGRU-SE model on the two datasets are 97.51% and 94.98%, and the AUC values are 99.60% and 98.85%, respectively. The results show that DBGRU-SE has good predictive performance for drug-drug interactions.
Affiliation(s)
- Hongli Gao
  - Qingdao University of Science and Technology, China
- Xin Liao
  - Qingdao University of Science and Technology, China
- Baoxing Ning
  - Qingdao University of Science and Technology, China
- Haiming Gu
  - Qingdao University of Science and Technology, China
- Bin Yu
  - Qingdao University of Science and Technology, China
4
Yu Y, Ding P, Gao H, Liu G, Zhang F, Yu B. Cooperation of local features and global representations by a dual-branch network for transcription factor binding sites prediction. Brief Bioinform 2023; 24:7030619. [PMID: 36748992 DOI: 10.1093/bib/bbad036] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/28/2022] [Revised: 01/03/2023] [Accepted: 01/18/2023] [Indexed: 02/08/2023] Open
Abstract
Interactions between DNA and transcription factors (TFs) play an essential role in understanding transcriptional regulation mechanisms and gene expression. Owing to the large accumulation of training data and their low expense, deep learning methods have shown huge potential in determining the specificity of TF-DNA interactions. Convolutional network-based and self-attention network-based methods have been proposed for transcription factor binding site (TFBS) prediction. Convolutional operations extract local features efficiently but tend to ignore global information, while self-attention mechanisms excel at capturing long-distance dependencies but struggle to attend to local feature details. To discover features for a given sequence as comprehensively as possible, we propose a Dual-branch model combining Self-Attention and Convolution, dubbed DSAC, which fuses local features and global representations in an interactive way. In terms of features, convolution and self-attention contribute to feature extraction collaboratively, enhancing representation learning. In terms of structure, a lightweight but efficient network architecture is designed for the prediction; in particular, the dual-branch structure allows the convolution and the self-attention mechanism to be fully utilized to improve the predictive ability of our model. Experimental results on 165 ChIP-seq datasets show that DSAC clearly outperforms five other deep learning-based methods and demonstrate that our model can effectively predict TFBSs based on sequence features alone. The source code of DSAC is available at https://github.com/YuBinLab-QUST/DSAC/.
Affiliation(s)
- Yutong Yu
  - College of Information Science and Technology, Qingdao University of Science and Technology, China
- Pengju Ding
  - College of Information Science and Technology, Qingdao University of Science and Technology, China
- Hongli Gao
  - College of Mathematics and Physics, Qingdao University of Science and Technology, China
- Guozhu Liu
  - College of Information Science and Technology, Qingdao University of Science and Technology, China
- Fa Zhang
  - School of Medical Technology, Beijing Institute of Technology, China
- Bin Yu
  - College of Information Science and Technology, School of Data Science, Qingdao University of Science and Technology, China
5
Ullah M, Hadi F, Song J, Yu DJ. PScL-2LSAESM: bioimage-based prediction of protein subcellular localization by integrating heterogeneous features with the two-level SAE-SM and mean ensemble method. Bioinformatics 2023; 39:6839969. [PMID: 36413068 PMCID: PMC9947927 DOI: 10.1093/bioinformatics/btac727] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/13/2022] [Revised: 11/02/2022] [Accepted: 11/21/2022] [Indexed: 11/23/2022] Open
Abstract
MOTIVATION: Over the past decades, a variety of in silico methods have been developed to predict protein subcellular localization within cells. However, a common and major challenge in the design and development of such methods is how to effectively utilize the heterogeneous feature sets extracted from bioimages. In this regard, limited efforts have been undertaken.
RESULTS: We propose a new two-level stacked autoencoder network (termed 2L-SAE-SM) to improve prediction performance by integrating the heterogeneous feature sets. In particular, in the first level of 2L-SAE-SM, each optimal heterogeneous feature set is used to train our designed stacked autoencoder network (SAE-SM). Each trained SAE-SM in the first level outputs a decision set based on its respective optimal heterogeneous feature set, known as an 'intermediate decision' set. These intermediate decision sets are then ensembled using the mean ensemble method to generate the 'intermediate feature' set for the second-level SAE-SM. Using the proposed framework, we further develop a novel predictor, referred to as PScL-2LSAESM, to characterize image-based protein subcellular localization. Extensive benchmarking experiments on the latest benchmark training and independent test datasets collected from the Human Protein Atlas databank demonstrate the effectiveness of the proposed 2L-SAE-SM framework for the integration of heterogeneous feature sets. Moreover, performance comparison with current state-of-the-art methods further illustrates that PScL-2LSAESM clearly outperforms them for the task of protein subcellular localization.
AVAILABILITY AND IMPLEMENTATION: https://github.com/csbio-njust-edu/PScL-2LSAESM.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
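The mean ensemble step mentioned in this abstract can be illustrated with a minimal sketch. The code below is not the authors' implementation (their code is at the repository linked above); it only shows, under the assumption that each first-level model emits one score per class for every sample, how such decision sets could be averaged element-wise into a single intermediate feature set. The function name `mean_ensemble` and the example scores are hypothetical.

```python
from typing import List

# decisions[m][i][c] is the score that model m assigns to class c for sample i.
DecisionSet = List[List[float]]


def mean_ensemble(decisions: List[DecisionSet]) -> DecisionSet:
    """Average per-class scores element-wise across all models."""
    if not decisions:
        raise ValueError("at least one decision set is required")
    n_models = len(decisions)
    n_samples = len(decisions[0])
    n_classes = len(decisions[0][0])
    return [
        [
            sum(decisions[m][i][c] for m in range(n_models)) / n_models
            for c in range(n_classes)
        ]
        for i in range(n_samples)
    ]


# Hypothetical decision sets from two first-level models (2 samples, 2 classes).
model_a = [[0.5, 0.5], [1.0, 0.0]]
model_b = [[1.0, 0.0], [0.5, 0.5]]
intermediate_features = mean_ensemble([model_a, model_b])
print(intermediate_features)
```

In a two-level design of this kind, the averaged scores would then serve as the input features of the second-level model rather than as final predictions.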
Affiliation(s)
- Matee Ullah
  - School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
- Fazal Hadi
  - School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
- Dong-Jun Yu
  - To whom correspondence should be addressed.
6
Ding Y, He W, Tang J, Zou Q, Guo F. Laplacian Regularized Sparse Representation Based Classifier for Identifying DNA N4-Methylcytosine Sites via L2,1/2-Matrix Norm. IEEE/ACM Trans Comput Biol Bioinform 2023; 20:500-511. [PMID: 34882559 DOI: 10.1109/tcbb.2021.3133309] [Citation(s) in RCA: 11] [Impact Index Per Article: 5.5] [Indexed: 05/04/2023]
Abstract
N4-methylcytosine (4mC) is one of the important epigenetic modifications in DNA sequences. Detecting 4mC sites experimentally is time-consuming, and computational methods based on machine learning have provided effective help for identifying 4mC. To further improve prediction performance, we propose a Laplacian Regularized Sparse Representation based Classifier with L2,1/2-matrix norm (LapRSRC). We also utilize the kernel trick to derive the kernel LapRSRC for nonlinear modeling. Matrix factorization is employed to solve the sparse representation coefficients of all test samples in the training set, and an efficient iterative algorithm is proposed to solve the objective function. We evaluate our model on six benchmark 4mC datasets and eight UCI datasets. The results show that the performance of our method is better or comparable.
7
Wu L, Gao S, Yao S, Wu F, Li J, Dong Y, Zhang Y. Gm-PLoc: A Subcellular Localization Model of Multi-Label Protein Based on GAN and DeepFM. Front Genet 2022; 13:912614. [PMID: 35783287 PMCID: PMC9240597 DOI: 10.3389/fgene.2022.912614] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 04/04/2022] [Accepted: 05/20/2022] [Indexed: 11/13/2022] Open
Abstract
Identifying the subcellular localization of a given protein is an essential part of biological and medical research, since the protein must be localized in the correct organelle to ensure physiological function. Conventional biological experiments for protein subcellular localization have limitations, such as high cost and low efficiency, so many computational methods have been proposed to solve these problems. However, some of these methods need further improvement for protein subcellular localization under class imbalance. We propose a new model, generating minority samples for protein subcellular localization (Gm-PLoc), to predict the subcellular localization of multi-label proteins. This model includes three steps: using the position-specific scoring matrix to extract distinguishable features of proteins; synthesizing samples of the minority category with revised generative adversarial networks to balance the distribution of categories; and training a classifier with the rebalanced dataset to predict the subcellular localization of multi-label proteins. One benchmark dataset is selected to evaluate the performance of the presented model, and the experimental results demonstrate that Gm-PLoc performs well for multi-label protein subcellular localization.
Affiliation(s)
- Liwen Wu
  - Engineering Research Center of Cyberspace, Yunnan University, Kunming, China
  - School of Software, Yunnan University, Kunming, China
- Song Gao
  - Engineering Research Center of Cyberspace, Yunnan University, Kunming, China
  - School of Software, Yunnan University, Kunming, China
- Shaowen Yao
  - Engineering Research Center of Cyberspace, Yunnan University, Kunming, China
  - School of Software, Yunnan University, Kunming, China
- Feng Wu
  - Engineering Research Center of Cyberspace, Yunnan University, Kunming, China
  - School of Software, Yunnan University, Kunming, China
- Jie Li
  - Engineering Research Center of Cyberspace, Yunnan University, Kunming, China
  - School of Software, Yunnan University, Kunming, China
- Yunyun Dong
  - Engineering Research Center of Cyberspace, Yunnan University, Kunming, China
  - School of Software, Yunnan University, Kunming, China
- Yunqi Zhang
  - Engineering Research Center of Cyberspace, Yunnan University, Kunming, China
  - School of Software, Yunnan University, Kunming, China
  - Yunnan Key Laboratory of Statistical Modeling and Data Analysis, School of Mathematics and Statistics, Yunnan University, Kunming, China
  - *Correspondence: Yunqi Zhang