1
Zhang M, Zhang L, Liu T, Feng H, He Z, Li F, Zhao J, Liu H. CBIL-VHPLI: a model for predicting viral-host protein-lncRNA interactions based on machine learning and transfer learning. Sci Rep 2024; 14:17549. [PMID: 39080344 PMCID: PMC11289117 DOI: 10.1038/s41598-024-68750-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/13/2024] [Accepted: 07/26/2024] [Indexed: 08/02/2024]
Abstract
Virus‒host protein‒lncRNA interaction (VHPLI) predictions are critical for decoding the molecular mechanisms of viral pathogens and host immune processes. Although VHPLIs have been predicted in both plants and animals, they have not been extensively studied in viruses. For the first time, we propose a new deep learning-based approach, named CBIL‒VHPLI, that consists mainly of convolutional neural network and bidirectional long short-term memory network modules in combination with transfer learning to predict viral-host protein‒lncRNA interactions. The models were first trained on large and diverse datasets (including plants, animals, etc.). Protein sequence features were extracted using a k-mer method combined with the one-hot encoding and composition-transition-distribution (CTD) methods, and lncRNA sequence features were extracted using a k-mer method combined with the one-hot encoding and Z curve methods. The results obtained on three independent external validation datasets showed that the pre-trained CBIL‒VHPLI model performed the best, with an accuracy of approximately 0.9. Pretraining was followed by transfer learning on a viral protein-human lncRNA dataset, and the fine-tuning results showed that the accuracy of CBIL‒VHPLI was 0.946, which was significantly greater than that of the previous models. The final case study results showed that CBIL‒VHPLI achieved a prediction reproducibility rate of 91.6% for the RIP-Seq experimental screening results. The model was then used to predict the interactions between human lncRNA PIK3CD-AS2 and the nonstructural protein 1 (NS1) of the H5N1 virus, and RNA pull-down experiments were used to confirm the reliability of the model's predictions. The source code of CBIL‒VHPLI and the datasets used in this work are available at https://github.com/Liu-Lab-Lnu/CBIL-VHPLI for academic usage.
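The k-mer and one-hot encodings mentioned in the abstract can be illustrated with a minimal sketch. The function names `kmer_frequencies` and `one_hot`, the RNA alphabet, and the choice to skip windows containing unknown symbols are assumptions made for illustration; they are not taken from the CBIL‒VHPLI source code, and the CTD and Z curve features are not covered here.

```python
from itertools import product


def kmer_frequencies(seq, k, alphabet="ACGU"):
    """Return normalized k-mer frequencies in a fixed lexicographic order."""
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # assumption: skip windows with unknown symbols
            counts[window] += 1
            total += 1
    return [counts[m] / total if total else 0.0 for m in kmers]


def one_hot(seq, alphabet="ACGU"):
    """Encode each symbol as a 0/1 row; unknown symbols become all zeros."""
    index = {c: i for i, c in enumerate(alphabet)}
    rows = []
    for ch in seq:
        row = [0] * len(alphabet)
        if ch in index:
            row[index[ch]] = 1
        rows.append(row)
    return rows
```

For a protein sequence, the same functions could be called with a 20-letter amino acid alphabet passed as `alphabet`, although the k-mer vector then grows as 20^k.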
Affiliation(s)
- Man Zhang
- School of Life Science, Liaoning University, Shenyang, 110036, China
- Li Zhang
- School of Life Science, Liaoning University, Shenyang, 110036, China
- Technology Innovation Center for Computer Simulating and Information Processing of Bio-Macromolecules of Liaoning Province, Shenyang, 110036, China
- Engineering Laboratory for Molecular Simulation and Designing of Drug Molecules of Liaoning, Shenyang, 110036, China
- Ting Liu
- School of Life Science, Liaoning University, Shenyang, 110036, China
- China Medical University-Queen's University Belfast Joint College, China Medical University, Shenyang, 110036, China
- Huawei Feng
- Technology Innovation Center for Computer Simulating and Information Processing of Bio-Macromolecules of Liaoning Province, Shenyang, 110036, China
- Engineering Laboratory for Molecular Simulation and Designing of Drug Molecules of Liaoning, Shenyang, 110036, China
- School of Pharmacy, Liaoning University, No. 66, Chongshan Zhonglu, Shenyang, 110036, Liaoning, China
- Zhe He
- School of Life Science, Liaoning University, Shenyang, 110036, China
- Feng Li
- School of Life Science, Liaoning University, Shenyang, 110036, China
- Jian Zhao
- School of Life Science, Liaoning University, Shenyang, 110036, China
- Hongsheng Liu
- Technology Innovation Center for Computer Simulating and Information Processing of Bio-Macromolecules of Liaoning Province, Shenyang, 110036, China.
- Engineering Laboratory for Molecular Simulation and Designing of Drug Molecules of Liaoning, Shenyang, 110036, China.
- School of Pharmacy, Liaoning University, No. 66, Chongshan Zhonglu, Shenyang, 110036, Liaoning, China.
2
Wang K, Zeng X, Zhou J, Liu F, Luan X, Wang X. BERT-TFBS: a novel BERT-based model for predicting transcription factor binding sites by transfer learning. Brief Bioinform 2024; 25:bbae195. [PMID: 38701417 PMCID: PMC11066948 DOI: 10.1093/bib/bbae195] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/08/2024] [Revised: 03/26/2024] [Accepted: 04/10/2024] [Indexed: 05/05/2024]
Abstract
Transcription factors (TFs) are proteins essential for regulating genetic transcriptions by binding to transcription factor binding sites (TFBSs) in DNA sequences. Accurate predictions of TFBSs can contribute to the design and construction of metabolic regulatory systems based on TFs. Although various deep-learning algorithms have been developed for predicting TFBSs, the prediction performance needs to be improved. This paper proposes a bidirectional encoder representations from transformers (BERT)-based model, called BERT-TFBS, to predict TFBSs solely based on DNA sequences. The model consists of a pre-trained BERT module (DNABERT-2), a convolutional neural network (CNN) module, a convolutional block attention module (CBAM) and an output module. The BERT-TFBS model utilizes the pre-trained DNABERT-2 module to acquire the complex long-term dependencies in DNA sequences through a transfer learning approach, and applies the CNN module and the CBAM to extract high-order local features. The proposed model is trained and tested based on 165 ENCODE ChIP-seq datasets. We conducted experiments with model variants, cross-cell-line validations and comparisons with other models. The experimental results demonstrate the effectiveness and generalization capability of BERT-TFBS in predicting TFBSs, and they show that the proposed model outperforms other deep-learning models. The source code for BERT-TFBS is available at https://github.com/ZX1998-12/BERT-TFBS.
Affiliation(s)
- Kai Wang
- Key Laboratory of Advanced Process Control for Light Industry (Ministry of Education), School of Internet of Things Engineering, Jiangnan University, 1800 Lihu Road, Wuxi, Jiangsu 214122, China
- Xuan Zeng
- Key Laboratory of Advanced Process Control for Light Industry (Ministry of Education), School of Internet of Things Engineering, Jiangnan University, 1800 Lihu Road, Wuxi, Jiangsu 214122, China
- Jingwen Zhou
- Science Center for Future Foods, Jiangnan University, 1800 Lihu Road, Wuxi, Jiangsu 214122, China
- Key Laboratory of Industrial Biotechnology, Ministry of Education and School of Biotechnology, Jiangnan University, 1800 Lihu Road, Wuxi, Jiangsu 214122, China
- Engineering Research Center of Ministry of Education on Food Synthetic Biotechnology, Jiangnan University, 1800 Lihu Road, Wuxi, Jiangsu 214122, China
- Jiangsu Province Engineering Research Center of Food Synthetic Biotechnology, Jiangnan University, Wuxi 214122, China
- Fei Liu
- Key Laboratory of Advanced Process Control for Light Industry (Ministry of Education), School of Internet of Things Engineering, Jiangnan University, 1800 Lihu Road, Wuxi, Jiangsu 214122, China
- Xiaoli Luan
- Key Laboratory of Advanced Process Control for Light Industry (Ministry of Education), School of Internet of Things Engineering, Jiangnan University, 1800 Lihu Road, Wuxi, Jiangsu 214122, China
- Xinglong Wang
- Science Center for Future Foods, Jiangnan University, 1800 Lihu Road, Wuxi, Jiangsu 214122, China
- Key Laboratory of Industrial Biotechnology, Ministry of Education and School of Biotechnology, Jiangnan University, 1800 Lihu Road, Wuxi, Jiangsu 214122, China
3
Fan Y, Zhang C, Hu X, Huang Z, Xue J, Deng L. SGCLDGA: unveiling drug-gene associations through simple graph contrastive learning. Brief Bioinform 2024; 25:bbae231. [PMID: 38754409 PMCID: PMC11097980 DOI: 10.1093/bib/bbae231] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/31/2024] [Revised: 04/15/2024] [Accepted: 04/30/2024] [Indexed: 05/18/2024]
Abstract
Drug repurposing offers a viable strategy for discovering new drugs and therapeutic targets through the analysis of drug-gene interactions. However, traditional experimental methods are plagued by their costliness and inefficiency. Despite graph convolutional network (GCN)-based models' state-of-the-art performance in prediction, their reliance on supervised learning makes them vulnerable to data sparsity, a common challenge in drug discovery, further complicating model development. In this study, we propose SGCLDGA, a novel computational model leveraging graph neural networks and contrastive learning to predict unknown drug-gene associations. SGCLDGA employs GCNs to extract vector representations of drugs and genes from the original bipartite graph. Subsequently, singular value decomposition (SVD) is employed to enhance the graph and generate multiple views. The model performs contrastive learning across these views, optimizing vector representations through a contrastive loss function to better distinguish positive and negative samples. The final step involves utilizing inner product calculations to determine association scores between drugs and genes. Experimental results on the DGIdb4.0 dataset demonstrate SGCLDGA's superior performance compared with six state-of-the-art methods. Ablation studies and case analyses validate the significance of contrastive learning and SVD, highlighting SGCLDGA's potential in discovering new drug-gene associations. The code and dataset for SGCLDGA are freely available at https://github.com/one-melon/SGCLDGA.
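The two scoring ideas named in the abstract, an inner product as the association score and a contrastive loss that separates positive from negative samples, can be sketched as follows. The abstract does not state which contrastive loss SGCLDGA uses, so the InfoNCE-style form below is an assumption, as are the function names, the temperature value, and the toy vectors.

```python
import math


def inner_product(u, v):
    """Association score between a drug vector and a gene vector."""
    return sum(a * b for a, b in zip(u, v))


def info_nce(anchor, positive, negatives, temperature=0.5):
    """InfoNCE-style loss: low when the anchor scores its positive highest.

    Assumption: this is a generic contrastive loss, not necessarily the
    one used in the paper.
    """
    pos = inner_product(anchor, positive) / temperature
    logits = [pos] + [inner_product(anchor, n) / temperature for n in negatives]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - pos
```

In this sketch the loss is small when the anchor is aligned with its positive view and large when a negative sample scores higher, which is the behavior a contrastive objective is meant to encourage.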
Affiliation(s)
- Yanhao Fan
- School of Computer Science and Engineering, Central South University, 410075, Changsha, China
- Che Zhang
- School of software, Xinjiang University, 830046, Urumqi, China
- Xiaowen Hu
- School of Computer Science and Engineering, Central South University, 410075, Changsha, China
- Zhijian Huang
- School of Computer Science and Engineering, Central South University, 410075, Changsha, China
- Jiameng Xue
- School of Computer Science and Engineering, Central South University, 410075, Changsha, China
- Lei Deng
- School of Computer Science and Engineering, Central South University, 410075, Changsha, China
4
Kabir A, Bhattarai M, Rasmussen KØ, Shehu A, Bishop AR, Alexandrov B, Usheva A. Advancing Transcription Factor Binding Site Prediction Using DNA Breathing Dynamics and Sequence Transformers via Cross Attention. bioRxiv 2024:2024.01.16.575935. [PMID: 38293094 PMCID: PMC10827174 DOI: 10.1101/2024.01.16.575935] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/01/2024]
Abstract
Understanding the impact of genomic variants on transcription factor binding and gene regulation remains a key area of research, with implications for unraveling the complex mechanisms underlying various functional effects. Our study delves into the role of DNA's biophysical properties, including thermodynamic stability, shape, and flexibility in transcription factor (TF) binding. We developed a multi-modal deep learning model integrating these properties with DNA sequence data. Trained on ChIP-Seq (chromatin immunoprecipitation sequencing) data in vivo involving 690 TF-DNA binding events in human genome, our model significantly improves prediction performance in over 660 binding events, with up to 9.6% increase in AUROC metric compared to the baseline model when using no DNA biophysical properties explicitly. Further, we expanded our analysis to in vitro high-throughput Systematic Evolution of Ligands by Exponential enrichment (SELEX) and Protein Binding Microarray (PBM) datasets, comparing our model with established frameworks. The inclusion of DNA breathing features consistently improved TF binding predictions across different cell lines in these datasets. Notably, for complex ChIP-Seq datasets, integrating DNABERT2 with a cross-attention mechanism provided greater predictive capabilities and insights into the mechanisms of disease-related non-coding variants found in genome-wide association studies. This work highlights the importance of DNA biophysical characteristics in TF binding and the effectiveness of multi-modal deep learning models in gene regulation studies.
Affiliation(s)
- Anowarul Kabir
- Theoretical Division, Los Alamos National Laboratory, Los Alamos, 87544, NM, USA
- Department of Computer Science, George Mason University, 4400 University Dr, 22030, VA, USA
- Manish Bhattarai
- Theoretical Division, Los Alamos National Laboratory, Los Alamos, 87544, NM, USA
- Kim Ø Rasmussen
- Theoretical Division, Los Alamos National Laboratory, Los Alamos, 87544, NM, USA
- Amarda Shehu
- Department of Computer Science, George Mason University, 4400 University Dr, 22030, VA, USA
- Alan R Bishop
- Theoretical Division, Los Alamos National Laboratory, Los Alamos, 87544, NM, USA
- Boian Alexandrov
- Theoretical Division, Los Alamos National Laboratory, Los Alamos, 87544, NM, USA
- Anny Usheva
- Department of Surgery, Brown University, 69 Brown St Box 1822, 02912, RI, USA
5
Alatrany AS, Khan W, Hussain AJ, Mustafina J, Al-Jumeily D. Transfer Learning for Classification of Alzheimer's Disease Based on Genome Wide Data. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2023; 20:2700-2711. [PMID: 37018274 DOI: 10.1109/tcbb.2022.3233869] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Indexed: 06/19/2023]
Abstract
Alzheimer's disease (AD) is a brain disorder that is regarded as degenerative because its symptoms worsen over time. Single nucleotide polymorphisms (SNPs) have been identified as relevant biomarkers for this condition. This study aims to identify SNP biomarkers associated with AD in order to perform a reliable classification of AD. In contrast to existing related works, we utilize deep transfer learning with varying experimental analysis for reliable classification of AD. For this purpose, convolutional neural networks (CNNs) are first trained on the genome-wide association studies (GWAS) dataset requested from the AD Neuroimaging Initiative. We then employ deep transfer learning to further train our CNN (as the base model) on a different AD GWAS dataset, to extract the final set of features. The extracted features are then fed into a support vector machine for classification of AD. Detailed experiments are performed using multiple datasets and varying experimental configurations. The statistical outcomes indicate an accuracy of 89%, which is a significant improvement when benchmarked against existing related works.
6
Zhang Q, Xu Y, Wang S, Wu Y, Ye Y, Yuan CA, Gribova V, Filaretov VF, Huang DS. Using Fully Convolutional Network to Locate Transcription Factor Binding Sites Based on DNA Sequence and Conservation Information. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2023; 20:2690-2699. [PMID: 36374878 DOI: 10.1109/tcbb.2022.3219831] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/16/2023]
Abstract
Transcription factors (TFs) play a part in gene expression. By binding to DNA, TFs can form a complex gene expression regulation system. Identifying the binding regions has therefore become an indispensable step for understanding the regulatory mechanism of gene expression. Owing to the great achievements of deep learning (DL) in computer vision and language processing in recent years, many scholars have been inspired to use these methods to predict TF binding sites (TFBSs), achieving extraordinary results. However, these methods mainly focus on whether DNA sequences include TFBSs. In this paper, we propose a fully convolutional network (FCN) coupled with a refinement residual block (RRB) and a global average pooling layer (GAPL), namely FCNARRB. Our model can classify binding sequences at the nucleotide level by outputting a dense label for the input data. Experimental results on human ChIP-seq datasets show that the RRB and GAPL structures are very useful for improving model performance. Adding GAPL improves the performance by 9.32% and 7.61% in terms of IoU (Intersection over Union) and PRAUC (area under the precision-recall curve), and adding RRB improves the performance by 7.40% and 4.64%, respectively. In addition, we find that conservation information can help locate TFBSs.
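The IoU metric reported in the abstract can be computed at the nucleotide level from two dense 0/1 label vectors. The sketch below is illustrative only: the function name `nucleotide_iou` and the convention of returning 1.0 when both vectors contain no positive labels are assumptions, not details taken from the paper.

```python
def nucleotide_iou(predicted, target):
    """Intersection over Union of two per-nucleotide 0/1 label sequences."""
    if len(predicted) != len(target):
        raise ValueError("label sequences must have the same length")
    pairs = list(zip(predicted, target))
    intersection = sum(1 for p, t in pairs if p == 1 and t == 1)
    union = sum(1 for p, t in pairs if p == 1 or t == 1)
    # Assumption: two empty masks are treated as a perfect match.
    return intersection / union if union else 1.0
```

For example, a predicted site covering positions 1 to 3 and a true site covering positions 2 to 4 share two positions out of four covered in total.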
7
Wang LS, Sun ZL. iDHS-FFLG: Identifying DNase I Hypersensitive Sites by Feature Fusion and Local-Global Feature Extraction Network. Interdiscip Sci 2023; 15:155-170. [PMID: 36166165 DOI: 10.1007/s12539-022-00538-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/07/2022] [Revised: 09/12/2022] [Accepted: 09/12/2022] [Indexed: 05/01/2023]
Abstract
The DNase I hypersensitive sites (DHSs) are active regions on chromatin that have been found to be highly sensitive to DNase I. These regions contain various cis-regulatory elements, including promoters, enhancers and silencers. Accurate identification of DHSs helps researchers better understand the transcriptional machinery of DNA and deepen the knowledge of functional DNA elements in non-coding sequences. Researchers have developed many methods based on traditional experiments and machine learning to identify DHSs. However, low prediction accuracy and robustness limit their application in genetics research. In this paper, a novel computational approach based on deep learning is proposed by feature fusion and local-global feature extraction network to identify DHSs in mouse, named iDHS-FFLG. First of all, multiple binary features of nucleotides are fused to better express sequence information. Then, a network consisting of the convolutional neural network (CNN), bidirectional long short-term memory (BiLSTM) and self-attention mechanism is designed to extract local features and global contextual associations. In the end, the prediction module is applied to distinguish between DHSs and non-DHSs. The results of several experiments demonstrate the superior performances of iDHS-FFLG compared to the latest methods.
Affiliation(s)
- Lei-Shan Wang
- Information Materials and Intelligent Sensing Laboratory of Anhui Province, Anhui University, Hefei, 230601, Anhui, China
- School of Electrical Engineering and Automation, Anhui University, Hefei, 230601, Anhui, China
- Zhan-Li Sun
- Information Materials and Intelligent Sensing Laboratory of Anhui Province, Anhui University, Hefei, 230601, Anhui, China.
- School of Electrical Engineering and Automation, Anhui University, Hefei, 230601, Anhui, China.
8
Ma Z, Sun ZL, Liu M. CRBP-HFEF: Prediction of RBP-Binding Sites on circRNAs Based on Hierarchical Feature Expansion and Fusion. Interdiscip Sci 2023:10.1007/s12539-023-00572-0. [PMID: 37233959 DOI: 10.1007/s12539-023-00572-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/16/2022] [Revised: 04/20/2023] [Accepted: 04/21/2023] [Indexed: 05/27/2023]
Abstract
Circular RNAs (circRNAs) participate in the regulation of biological processes by binding to specific proteins and thus influence transcriptional processes. In recent years, circRNAs have become an emerging hotspot in RNA research. Owing to their powerful learning ability, various deep learning frameworks have been used to predict the binding sites of RNA-binding proteins (RBPs) on circRNAs. These methods usually perform only single-level feature extraction of sequence information, which may capture features inadequately. Generally, the features of the deep and shallow layers of a neural network can complement each other, and both are important for binding site prediction tasks. Based on this concept, we propose a method that combines deep and shallow features, namely CRBP-HFEF. Specifically, features are first extracted and expanded for different levels of the network. Then, the expanded deep and shallow features are fused and fed into the classification network, which finally determines whether they are binding sites. Compared to several existing methods, the experimental results on multiple datasets show that the proposed method achieves significant improvements in a number of metrics (with an average AUC of 0.9855). Moreover, extensive ablation experiments are also performed to verify the effectiveness of the hierarchical feature expansion strategy.
Affiliation(s)
- Zheng Ma
- Information Materials and Intelligent Sensing Laboratory of Anhui Province, Anhui University, and School of Electrical Engineering and Automation Anhui University, Hefei, 230601, Anhui, China
- Zhan-Li Sun
- Information Materials and Intelligent Sensing Laboratory of Anhui Province, Anhui University, and School of Electrical Engineering and Automation Anhui University, Hefei, 230601, Anhui, China.
- Mengya Liu
- School of Computer Science and Technology, Anhui University, Hefei, 230601, Anhui, China
9
Bai X, Zhang F, Liu J, Xia F. Quantifying the impact of scientific collaboration and papers via motif-based heterogeneous networks. J Informetr 2023. [DOI: 10.1016/j.joi.2023.101397] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/28/2023]
10
Chen Y, Lin YCD, Luo Y, Cai X, Qiu P, Cui S, Wang Z, Huang HY, Huang HD. Quantitative model for genome-wide cyclic AMP receptor protein binding site identification and characteristic analysis. Brief Bioinform 2023; 24:7145906. [PMID: 37114659 DOI: 10.1093/bib/bbad138] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/14/2022] [Revised: 03/10/2023] [Accepted: 03/16/2023] [Indexed: 04/29/2023]
Abstract
Cyclic AMP receptor proteins (CRPs) are important transcription regulators in many species. The prediction of CRP-binding sites has mainly been based on position weight matrices (PWMs). Traditional prediction methods only considered known binding motifs, and their ability to discover inflexible binding patterns was limited. Thus, a novel CRP-binding site prediction model called CRPBSFinder was developed in this research, which combines the hidden Markov model, knowledge-based PWMs and structure-based binding affinity matrices. We trained this model using validated CRP-binding data from Escherichia coli and evaluated it with computational and experimental methods. The results show that the model not only provides higher prediction performance than a classic method but also quantitatively indicates the binding affinity of transcription factor binding sites through its prediction scores. The prediction results included not only most of the known regulated genes but also 1089 novel CRP-regulated genes. The major regulatory roles of CRPs were divided into four classes: carbohydrate metabolism, organic acid metabolism, nitrogen compound metabolism and cellular transport. Several novel functions were also discovered, including heterocycle metabolism and response to stimulus. Based on the functional similarity of homologous CRPs, we applied the model to 35 other species. The prediction tool and the prediction results are available online at: https://awi.cuhk.edu.cn/∼CRPBSFinder.
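The PWM-based scanning that the abstract describes as the traditional approach can be sketched as a log-odds matrix built from aligned sites and slid along a sequence. This is a generic illustration, not CRPBSFinder itself: the hidden Markov model and structure-based affinity components are not covered, and the function names, pseudocount, uniform background, and toy sites are all assumptions.

```python
import math


def log_odds_matrix(sites, alphabet="ACGT", pseudocount=1.0, background=0.25):
    """Build a per-position log2-odds matrix from equal-length aligned sites."""
    width = len(sites[0])
    total = len(sites) + pseudocount * len(alphabet)
    matrix = []
    for pos in range(width):
        column = [site[pos] for site in sites]
        matrix.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / background)
            for base in alphabet
        })
    return matrix


def best_site(sequence, matrix):
    """Return (score, start) of the highest-scoring window in the sequence.

    Assumes the sequence contains only symbols from the matrix alphabet.
    """
    width = len(matrix)
    scores = [
        (sum(matrix[j][sequence[i + j]] for j in range(width)), i)
        for i in range(len(sequence) - width + 1)
    ]
    return max(scores)


# Toy aligned sites, made up for illustration only.
toy_sites = ["TGTGA", "TGTGA", "TGTGA", "AGTGA"]
toy_matrix = log_odds_matrix(toy_sites)
toy_score, toy_start = best_site("CCCTGTGACCC", toy_matrix)
```

Because the window score is a sum of per-position log-odds, higher scores indicate windows that resemble the training sites more closely than the background, which is the sense in which a score can stand in for relative binding preference.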
Affiliation(s)
- Yigang Chen
- School of Medicine, Life and Health Sciences, The Chinese University of Hong Kong, Shenzhen, Longgang District, Shenzhen, Guangdong Province 518172, China
- Warshel Institute for Computational Biology, School of Medicine, Life and Health Sciences, The Chinese University of Hong Kong, Shenzhen, Longgang District, Shenzhen, Guangdong Province 518172, China
- Yang-Chi-Dung Lin
- School of Medicine, Life and Health Sciences, The Chinese University of Hong Kong, Shenzhen, Longgang District, Shenzhen, Guangdong Province 518172, China
- Warshel Institute for Computational Biology, School of Medicine, Life and Health Sciences, The Chinese University of Hong Kong, Shenzhen, Longgang District, Shenzhen, Guangdong Province 518172, China
- Yijun Luo
- School of Medicine, Life and Health Sciences, The Chinese University of Hong Kong, Shenzhen, Longgang District, Shenzhen, Guangdong Province 518172, China
- Xiaoxuan Cai
- School of Medicine, Life and Health Sciences, The Chinese University of Hong Kong, Shenzhen, Longgang District, Shenzhen, Guangdong Province 518172, China
- Warshel Institute for Computational Biology, School of Medicine, Life and Health Sciences, The Chinese University of Hong Kong, Shenzhen, Longgang District, Shenzhen, Guangdong Province 518172, China
- Peng Qiu
- School of Medicine, Life and Health Sciences, The Chinese University of Hong Kong, Shenzhen, Longgang District, Shenzhen, Guangdong Province 518172, China
- Shidong Cui
- School of Medicine, Life and Health Sciences, The Chinese University of Hong Kong, Shenzhen, Longgang District, Shenzhen, Guangdong Province 518172, China
- Warshel Institute for Computational Biology, School of Medicine, Life and Health Sciences, The Chinese University of Hong Kong, Shenzhen, Longgang District, Shenzhen, Guangdong Province 518172, China
- Zhe Wang
- School of Humanities and Social Science, The Chinese University of Hong Kong, Shenzhen, Longgang District, Shenzhen, Guangdong Province 518172, China
- Hsi-Yuan Huang
- School of Medicine, Life and Health Sciences, The Chinese University of Hong Kong, Shenzhen, Longgang District, Shenzhen, Guangdong Province 518172, China
- Warshel Institute for Computational Biology, School of Medicine, Life and Health Sciences, The Chinese University of Hong Kong, Shenzhen, Longgang District, Shenzhen, Guangdong Province 518172, China
- Hsien-Da Huang
- School of Medicine, Life and Health Sciences, The Chinese University of Hong Kong, Shenzhen, Longgang District, Shenzhen, Guangdong Province 518172, China
- Warshel Institute for Computational Biology, School of Medicine, Life and Health Sciences, The Chinese University of Hong Kong, Shenzhen, Longgang District, Shenzhen, Guangdong Province 518172, China
11
Chen L, Sun ZL. PmliHFM: Predicting Plant miRNA-lncRNA Interactions with Hybrid Feature Mining Network. Interdiscip Sci 2023; 15:44-54. [PMID: 36223068 DOI: 10.1007/s12539-022-00540-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/12/2022] [Revised: 09/27/2022] [Accepted: 09/27/2022] [Indexed: 11/07/2022]
Abstract
Because interactions between microRNAs (miRNAs) and long non-coding RNAs (lncRNAs) play a crucial role in biological processes, the study of their biological functions is necessary. So far, various computational methods have been employed to predict miRNA-lncRNA interactions, which compensate for the inadequacy of biological experiments. However, the existing methods do not consider the differences between miRNA and lncRNA in feature extraction. In this paper, we propose a hybrid feature mining network, named PmliHFM, for predicting plant miRNA-lncRNA interactions. First, miRNAs and lncRNAs, which differ in sequence length, are encoded with different encodings, which can reduce the loss of information caused by using the same coding approach. Then, a hybrid feature mining network is designed to adapt to the different encoding methods and extract more useful feature information than a single network. Finally, an ensemble module is utilized to integrate the training results of the hybrid feature mining network, while a prediction module is employed to determine whether there are interactions. In tests on multiple test sets, PmliHFM outperforms several state-of-the-art approaches. The results show that the AUC of PmliHFM improves by 0.8%, 3.1% and 0.4%, respectively, on three balanced datasets, and by 2.1% and 1.8%, respectively, on two imbalanced datasets. These experiments demonstrate the feasibility of the proposed method.
Affiliation(s)
- Lin Chen
- Information Materials and Intelligent Sensing Laboratory of Anhui Province, Anhui University, Hefei, 230601, Anhui, China
- School of Electrical Engineering and Automation, Anhui University, Hefei, 230601, Anhui, China
- Zhan-Li Sun
- Information Materials and Intelligent Sensing Laboratory of Anhui Province, Anhui University, Hefei, 230601, Anhui, China.
- School of Electrical Engineering and Automation, Anhui University, Hefei, 230601, Anhui, China.
12
Lin X, Quan Z, Wang ZJ, Guo Y, Zeng X, Yu PS. Effectively Identifying Compound-Protein Interaction Using Graph Neural Representation. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2023; 20:932-943. [PMID: 35951570 DOI: 10.1109/tcbb.2022.3198003] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/04/2023]
Abstract
Effectively identifying compound-protein interactions (CPIs) is crucial for new drug design, which is an important step in in silico drug discovery. Current machine learning methods for CPI prediction mainly use one-dimensional (1D) compound/protein strings and/or specific descriptors. However, they often ignore the fact that molecules are essentially modeled by the molecular graph. We observe that in real-world scenarios, the topological structure information of the molecular graph usually provides an overview of how the atoms are connected, and the local chemical context reveals the functionality of the protein sequence in CPI. These two types of information are complementary, and both are significant for modeling compound-protein pairs. Motivated by this, we propose an end-to-end deep learning framework named GraphCPI, which captures the structural information of compounds and leverages the chemical context of protein sequences to solve the CPI prediction task. Our framework can integrate any popular graph neural network for learning compounds, and it is combined with a convolutional neural network for embedding sequences. To compare our method with classic and state-of-the-art deep learning methods, we conduct extensive experiments based on several widely used CPI datasets. The experimental results show the feasibility and competitiveness of our proposed method.
13
Luo S, Xiong D, Zhao X, Duan L. An Attempt of Seeking Favorable Binding Free Energy Prediction Schemes Considering the Entropic Effect on Fis-DNA Binding. J Phys Chem B 2023; 127:1312-1324. [PMID: 36735878 DOI: 10.1021/acs.jpcb.2c07811] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/05/2023]
Abstract
The complex mechanisms of protein-DNA binding are essential for understanding many biological processes. Over the past decades, numerous experiments and calculations have analyzed the specificity of protein-DNA binding. However, the accuracy of binding free energy prediction for multi-base DNA systems still needs to be improved. Fis is a DNA-binding protein that regulates various transcription and recombination reactions. In the present work, we tested several methods of predicting binding free energy on this system to find a favorable prediction scheme and explore the binding mechanism of Fis protein and DNA. Two solvent models (explicit and implicit) were chosen for the dynamics process, and the predicted binding free energy was more accurate under the explicit solvent model. When different Poisson-Boltzmann/Generalized Born (PB/GB) models were tested for DNA force fields (BSC1 and OL15), it was found that the binding free energy predicted with the selected OL15 force field performed better and that the correlation between predicted and experimental values improved with increasing interior dielectric constant (Dk). Finally, using Dk = 8, the GBOBC1 model combined with interaction entropy (IE), which was used to calculate the entropic contribution (GBOBC1_IE_8), was selected for the binding free energy prediction and analysis of the Fis-DNA system, and the validity of the method was further verified by testing the Cren7-DNA system. Conformational analysis of the minor groove showed that mutation of the DNA central sequence from A/T to C/G and deletion of the guanine 2-amino group change the minor groove width and thus affect the formation of the major groove, altering the interaction and atomic contact between the protein and the major groove and thereby changing the binding affinity of Fis and DNA. Hopefully, the series of tests in this work can shed some light on related studies of protein and DNA systems.
Affiliation(s)
- Song Luo
- School of Physics and Electronics, Shandong Normal University, Jinan, Shandong 250014, China
- Danyang Xiong
- School of Physics and Electronics, Shandong Normal University, Jinan, Shandong 250014, China
- Xiaoyu Zhao
- School of Physics and Electronics, Shandong Normal University, Jinan, Shandong 250014, China
- Lili Duan
- School of Physics and Electronics, Shandong Normal University, Jinan, Shandong 250014, China
|
14
|
Tsukiyama S, Hasan MM, Kurata H. CNN6mA: Interpretable neural network model based on position-specific CNN and cross-interactive network for 6mA site prediction. Comput Struct Biotechnol J 2022; 21:644-654. [PMID: 36659917 PMCID: PMC9826936 DOI: 10.1016/j.csbj.2022.12.043] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 10/07/2022] [Revised: 12/26/2022] [Accepted: 12/27/2022] [Indexed: 12/29/2022] Open
Abstract
N6-methyladenine (6mA) plays a critical role in various epigenetic processes including DNA replication, DNA repair, silencing, and transcription, and in diseases such as cancer. To understand such epigenetic mechanisms, 6mA has been detected by high-throughput technologies on a genome-wide scale at single-base resolution, together with conventional methods such as immunoprecipitation, mass spectrometry and capillary electrophoresis, but these experimental approaches are time-consuming and laborious. To complement these approaches, we have developed a CNN-based 6mA site predictor, named CNN6mA, which introduces two new architectures: a position-specific 1-D convolutional layer and a cross-interactive network. In the position-specific 1-D convolutional layer, position-specific filters with different window sizes are applied to an inquiry sequence, instead of sharing the same filters over all positions, in order to extract position-specific features at different levels. The cross-interactive network explores the relationships between all the nucleotide patterns within the inquiry sequence. Consequently, CNN6mA outperformed the existing state-of-the-art models in many species and produced a contribution score vector that intelligibly interprets the prediction mechanism. The source code and web application of CNN6mA are freely accessible at https://github.com/kuratahiroyuki/CNN6mA.git and http://kurata35.bio.kyutech.ac.jp/CNN6mA/, respectively.
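The idea of applying a different filter at each sequence position, rather than sharing one filter across positions, can be illustrated with a small NumPy sketch. This is not the CNN6mA implementation: the function names, the single window size, and the all-ones toy filters are assumptions made only for illustration.

```python
import numpy as np

def one_hot(seq, alphabet="ACGT"):
    """One-hot encode a nucleotide string as a (length x 4) array."""
    index = {base: i for i, base in enumerate(alphabet)}
    encoded = np.zeros((len(seq), len(alphabet)))
    for pos, base in enumerate(seq):
        encoded[pos, index[base]] = 1.0
    return encoded

def position_specific_conv(x, filters):
    """Apply a separate filter at every window position (no weight sharing).

    x:       (length, channels) encoded sequence
    filters: (n_positions, width, channels), one filter per window start
    """
    n_pos, width, _ = filters.shape
    if n_pos != x.shape[0] - width + 1:
        raise ValueError("need exactly one filter per window position")
    return np.array([np.sum(filters[p] * x[p:p + width]) for p in range(n_pos)])

# Toy example (hypothetical values): a 6-nt sequence and width-3 windows.
x = one_hot("ACGTAC")
filters = np.ones((4, 3, 4))       # assumption: all-ones filters for the demo
out = position_specific_conv(x, filters)
```

With all-ones filters, every output equals the window width, because each one-hot row sums to one. A standard convolution would correspond to the special case where all rows of `filters` are identical.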
Key Words
- 6mA, N6-methyladenine
- AUCs, Area under the curves
- BERT, Bidirectional Encoder Representations from Transformers
- CNN
- CNN, Convolutional neural network
- DNA modification
- Deep learning
- Interpretable prediction
- LSTM, Long short-term memory
- MCC, Matthews correlation coefficient
- Machine learning
- N6-methyladenine
- RF, Random forest
- SMRT, Single-molecule real-time
- SN, Sensitivity
- SP, Specificity
- UMAP, Uniform manifold approximation and projection
- t-SNE, t-distributed stochastic neighbor embedding
Affiliation(s)
- Sho Tsukiyama
- Department of Bioscience and Bioinformatics, Kyushu Institute of Technology, 680–4 Kawazu, Iizuka, Fukuoka 820-8502, Japan
- Md Mehedi Hasan
- Tulane Center for Aging and Department of Medicine, Tulane University Health Sciences Center, New Orleans, LA 70112, USA
- Hiroyuki Kurata
- Department of Bioscience and Bioinformatics, Kyushu Institute of Technology, 680–4 Kawazu, Iizuka, Fukuoka 820-8502, Japan. Corresponding author.
|
15
|
Kumar R, Singh D, Srinivasan K, Hu YC. AI-Powered Blockchain Technology for Public Health: A Contemporary Review, Open Challenges, and Future Research Directions. Healthcare (Basel) 2022; 11:healthcare11010081. [PMID: 36611541 PMCID: PMC9819078 DOI: 10.3390/healthcare11010081] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 11/08/2022] [Revised: 12/14/2022] [Accepted: 12/20/2022] [Indexed: 12/29/2022] Open
Abstract
Blockchain technology has grown substantially over the last decade. Introduced as the backbone of cryptocurrencies such as Bitcoin, it soon found applications in other fields because of its security and privacy features. Blockchain has been used in the healthcare industry for several purposes, including secure data logging, transactions, and maintenance using smart contracts. Considerable work has been carried out to make blockchain smart by integrating Artificial Intelligence (AI), combining the best features of the two technologies. This review covers the conceptual and functional aspects of the individual technologies and innovations in the domains of blockchain and artificial intelligence, laying down a strong foundational understanding of each domain. It also rigorously discusses the various ways AI has been used along with blockchain to power the healthcare industry, including areas of great importance such as electronic health record (EHR) management, distant-patient monitoring and telemedicine, genomics, drug research and testing, specialized imaging, and outbreak prediction. It compiles various algorithms from supervised and unsupervised machine learning along with deep learning algorithms such as convolutional/recurrent neural networks, as well as numerous platforms currently being used in AI-powered blockchain systems, and discusses their applications. The review also presents the challenges these systems still face, which they inherit from the AI and blockchain algorithms at their core, and the scope of future work.
Affiliation(s)
- Ritik Kumar
- School of Computer Science and Engineering, Vellore Institute of Technology, Vellore 632014, India
- Divyangi Singh
- School of Computer Science and Engineering, Vellore Institute of Technology, Vellore 632014, India
- Kathiravan Srinivasan
- School of Computer Science and Engineering, Vellore Institute of Technology, Vellore 632014, India
- Yuh-Chung Hu
- Department of Mechanical and Electromechanical Engineering, National ILan University, Yilan 26047, Taiwan
|
16
|
Turner M, Danino YM, Barshai M, Yacovzada NS, Cohen Y, Olender T, Rotkopf R, Monchaud D, Hornstein E, Orenstein Y. rG4detector, a novel RNA G-quadruplex predictor, uncovers their impact on stress granule formation. Nucleic Acids Res 2022; 50:11426-11441. [PMID: 36350614 PMCID: PMC9723610 DOI: 10.1093/nar/gkac950] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Received: 05/16/2022] [Revised: 09/21/2022] [Accepted: 10/14/2022] [Indexed: 11/11/2022] Open
Abstract
RNA G-quadruplexes (rG4s) are RNA secondary structures that are formed by guanine-rich sequences and have important cellular functions. Existing computational tools for rG4 prediction rely on specific sequence features and/or were trained on small datasets, without considering rG4 stability information, and are therefore sub-optimal. Here, we developed rG4detector, a convolutional neural network to identify potential rG4s in transcriptomics data. rG4detector outperforms existing methods both in predicting rG4 stability and in detecting rG4-forming sequences. To demonstrate the biological relevance of rG4detector, we employed it to study RNAs that are bound by the RNA-binding protein G3BP1. G3BP1 is central to the induction of stress granules (SGs), which are cytoplasmic biomolecular condensates that form in response to a variety of cellular stresses. Unexpectedly, rG4detector revealed a dynamic enrichment of rG4s bound by G3BP1 in response to cellular stress. In addition, we experimentally characterized G3BP1 cross-talk with rG4s, demonstrating that G3BP1 is a bona fide rG4-binding protein and that endogenous rG4s are enriched within SGs. Furthermore, we found that reduced rG4 availability impairs SG formation. Hence, we conclude that rG4s play a direct role in SG biology via their interactions with RNA-binding proteins and that rG4detector is a novel and useful tool for rG4 transcriptomics data analyses.
Affiliation(s)
- Mira Barshai
- School of Electrical and Computer Engineering, Ben-Gurion University of the Negev, Be’er-Sheva 8410501, Israel
- Nancy S Yacovzada
- Department of Molecular Genetics, Weizmann Institute of Science, Rehovot 7610001, Israel; Department of Molecular Neuroscience, Weizmann Institute of Science, Rehovot 7610001, Israel
- Yahel Cohen
- Department of Molecular Genetics, Weizmann Institute of Science, Rehovot 7610001, Israel; Department of Molecular Neuroscience, Weizmann Institute of Science, Rehovot 7610001, Israel
- Tsviya Olender
- Department of Molecular Genetics, Weizmann Institute of Science, Rehovot 7610001, Israel
- Ron Rotkopf
- Bioinformatics Unit, Life Sciences Core Facilities, Weizmann Institute of Science, Rehovot 7610001, Israel
- David Monchaud
- Institut de Chimie Moleculaire, ICMUB CNRS UMR 6302, UBFC Dijon, France
- Yaron Orenstein
- Correspondence may also be addressed to Yaron Orenstein. Tel: +972 3 531 7990;
|
17
|
Fang M, He Y, Du Z, Uversky VN. DeepCLD: An Efficient Sequence-Based Predictor of Intrinsically Disordered Proteins. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:3154-3159. [PMID: 34727037 DOI: 10.1109/tcbb.2021.3124273] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 06/13/2023]
Abstract
Intrinsic disorder is common in proteins, plays important roles in protein functionality, and is commonly associated with various human diseases. To provide an accurate tool for the annotation of intrinsic disorder in proteins, this paper proposes a novel algorithm, DeepCLD, for sequence-based prediction of intrinsically disordered proteins. This algorithm uses the amino acid position-specific scoring matrix (PSSM) to capture the intrinsic variability characteristic of sequence patterns, ResNet to preserve the feature space structure, and a bidirectional CudnnLSTM as the recurrent layer to further improve efficiency. Furthermore, DeepCLD also utilizes the attention mechanism to address the problem of vanishing gradients in deep networks. Comparative analyses show that DeepCLD has faster training speed and higher prediction accuracy than comparable methods.
|
18
|
Hu J, Bai YS, Zheng LL, Jia NX, Yu DJ, Zhang GJ. Protein-DNA Binding Residue Prediction via Bagging Strategy and Sequence-Based Cube-Format Feature. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:3635-3645. [PMID: 34714748 DOI: 10.1109/tcbb.2021.3123828] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/13/2023]
Abstract
Protein-DNA interactions play an important role in diverse biological processes. Accurately identifying protein-DNA binding residues is a critical but challenging task for protein function annotation and drug design. Although wet-lab experimental methods are the most accurate way to identify protein-DNA binding residues, they are time-consuming and labor-intensive. There is an urgent need to develop computational methods to rapidly and accurately predict protein-DNA binding residues. In this study, we propose a novel sequence-based method, named PredDBR, for predicting DNA-binding residues. In PredDBR, for each query protein, its position-specific frequency matrix (PSFM), predicted secondary structure (PSS), and predicted probabilities of ligand-binding residues (PPLBR) are first generated as three feature sources. Secondly, for each feature source, the sliding window technique is employed to extract the matrix-format feature of each residue. Then, we design two strategies, i.e., square root (SR) and average (AVE), to separately transform the PSFM-based features and the two predicted feature source-based features, i.e., PSS-based and PPLBR-based, of each residue from matrix format into three corresponding cube-format features. Finally, after serially combining the three cube-format features, the ensemble classifier is generated by applying a bagging strategy to multiple base classifiers built within the framework of a 2D convolutional neural network. The computational experimental results demonstrate that the proposed PredDBR achieves an average overall accuracy of 93.7% and a Matthews correlation coefficient of 0.405 on two independent validation datasets and outperforms several state-of-the-art sequence-based protein-DNA binding residue predictors. The PredDBR web-server is available at https://jun-csbio.github.io/PredDBR/.
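The sliding-window step mentioned in the abstract, which turns a per-residue profile into one matrix per residue, might look roughly like the following. This is a minimal sketch, not the PredDBR code: the function name, the zero padding at the sequence termini, and the toy profile values are assumptions.

```python
import numpy as np

def sliding_window_features(profile, window):
    """Return one (window x n_features) matrix per residue.

    profile: array of shape (L, n_features), e.g. a PSFM with 20 columns.
    window:  odd window size, so that each residue sits at the window centre.
    Zero padding at both termini is an assumption of this sketch.
    """
    if window % 2 == 0:
        raise ValueError("window must be odd")
    half = window // 2
    padded = np.pad(profile, ((half, half), (0, 0)), mode="constant")
    length = profile.shape[0]
    return np.stack([padded[i:i + window] for i in range(length)])

# Toy profile: 5 residues with 3 made-up feature columns each.
toy_profile = np.arange(15, dtype=float).reshape(5, 3)
windows = sliding_window_features(toy_profile, window=3)
```

Here `windows` has shape `(5, 3, 3)`: the matrix for the middle residue is simply rows 1 to 3 of the profile, while the first residue's matrix starts with a zero row from the padding.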
|
19
|
Wu QW, Cao RF, Xia JF, Ni JC, Zheng CH, Su YS. Extra Trees Method for Predicting LncRNA-Disease Association Based On Multi-Layer Graph Embedding Aggregation. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:3171-3178. [PMID: 34529571 DOI: 10.1109/tcbb.2021.3113122] [Citation(s) in RCA: 12] [Impact Index Per Article: 6.0] [Indexed: 06/13/2023]
Abstract
Many experimental studies have revealed significant associations between lncRNAs and diseases. Identifying accurate associations will provide a new perspective for disease therapy. Computational methods have been developed to solve this problem, but these methods have some limitations. In this paper, we propose an accurate method, named MLGCNET, to discover potential lncRNA-disease associations. Firstly, we reconstructed similarity networks for both lncRNAs and diseases using top-k similarity information, and constructed a lncRNA-disease heterogeneous network (LDN). Then, we applied a Multi-Layer Graph Convolutional Network to the LDN to obtain latent feature representations of the nodes. Finally, the Extra Trees classifier was used to calculate the probability of association between disease and lncRNA. The results of extensive 5-fold cross-validation experiments show that MLGCNET has superior prediction performance compared to state-of-the-art methods. Case studies confirm the performance of our model on specific diseases. All the experimental results support the effectiveness and practicality of MLGCNET in predicting potential lncRNA-disease associations.
|
20
|
Zhang Q, Zhang Y, Wang S, Chen ZH, Gribova V, Filaretov VF, Huang DS. Predicting In-Vitro DNA-Protein Binding With a Spatially Aligned Fusion of Sequence and Shape. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:3144-3153. [PMID: 34882561 DOI: 10.1109/tcbb.2021.3133869] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/13/2023]
Abstract
Discovery of transcription factor binding sites (TFBSs) is of primary importance for understanding the underlying binding mechanism and the gene regulation process. Growing evidence indicates that, apart from the primary DNA sequence, the DNA shape landscape has a significant influence on transcription factor binding preference. To effectively model the co-influence of sequence and shape features, we emphasize the importance of the position information of sequence motifs and shape patterns. In this paper, we propose a novel deep learning-based architecture, named hybridShape eDeepCNN, for TFBS prediction, which integrates DNA sequence and shape information in a spatially aligned manner. Our model utilizes the power of the multi-layer convolutional neural network and constructs an independent subnetwork to adapt to the distinct data distributions of heterogeneous features. In addition, we explore the use of continuous embedding vectors as the representation of DNA sequences. Based on experiments on 20 in-vitro datasets derived from universal protein binding microarrays (uPBMs), we demonstrate the superiority of our proposed method and validate the underlying design logic.
|
21
|
Towards a better understanding of TF-DNA binding prediction from genomic features. Comput Biol Med 2022; 149:105993. [DOI: 10.1016/j.compbiomed.2022.105993] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/15/2022] [Revised: 07/12/2022] [Accepted: 08/14/2022] [Indexed: 11/17/2022]
|
22
|
MHDMF: Prediction of miRNA-disease associations based on Deep Matrix Factorization with Multi-source Graph Convolutional Network. Comput Biol Med 2022; 149:106069. [PMID: 36115300 DOI: 10.1016/j.compbiomed.2022.106069] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Received: 05/06/2022] [Revised: 07/31/2022] [Accepted: 08/27/2022] [Indexed: 11/24/2022]
Abstract
A growing number of works have shown that microRNAs (miRNAs) are crucial biomarkers in diverse bioprocesses affecting various diseases. As a good complement to high-cost wet experiment-based methods, numerous computational prediction methods have sprung up. However, challenges remain in dealing with the high rate of false-negative associations and in making effective use of multi-source information for finding potential associations. In this work, we develop an end-to-end computational framework, called MHDMF, which integrates multi-source information on a heterogeneous network to discover latent disease-miRNA associations. Since many false negatives exist in the miRNA-disease associations, MHDMF utilizes a multi-source Graph Convolutional Network (GCN) to correct the false-negative associations by reformulating the miRNA-disease association score matrix. The score matrix reformulation is based on different similarity profiles and known associations between miRNAs, genes, and diseases. Then, MHDMF employs Deep Matrix Factorization (DMF) to predict the miRNA-disease associations based on the reformulated miRNA-disease association score matrix. The experimental results show that the proposed framework outperforms highly related comparison methods by a large margin on miRNA-disease association prediction tasks. Furthermore, case studies suggest that MHDMF could be a convenient and efficient tool and may supply a new way to think about miRNA-disease association prediction.
|
23
|
Liu J, Zhou D. Minimum Functional Length Analysis of K-Mer Based on BPNN. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:2920-2925. [PMID: 34310316 DOI: 10.1109/tcbb.2021.3098512] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/13/2023]
Abstract
The BP neural network (BPNN), a multilayer feed-forward network, can achieve deep cognition of target data and high output accuracy. However, there has been no research on k-mers based on BPNN. In the present study, BPNN was used to train and test binary classification data for each classification mode. All k-mers were divided into two categories according to either the X + Y content or a completely random mode. The results showed that 1) for the X + Y content classification mode, the accuracy of k-mer classification was 100 percent for both k ≤ 6 and k ≥ 7; 2) for the completely random classification mode, the accuracy was 100 percent for k-mers of k ≤ 6, but for k-mers of k ≥ 7 the accuracy was less than 100 percent and gradually decreased toward 50 percent as k increased. The k-mers of k ≥ 7 should be the basic functional fragments of nucleic acid and perform basic nucleic acid functions in the DNA sequence. The k-mers of k ≤ 6 should be the basic component fragments of nucleic acid and no longer perform basic nucleic acid functions.
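To make the "X + Y content" classification mode concrete, the sketch below enumerates all k-mers over the DNA alphabet and assigns a binary label from the combined content of two chosen bases. The abstract does not state the exact labeling rule, so the choice of G and C as the two bases and the "more than half of the positions" threshold are assumptions, as are the function names.

```python
from itertools import product

def all_kmers(k, alphabet="ACGT"):
    """Enumerate every k-mer over the alphabet (4**k strings for DNA)."""
    return ["".join(p) for p in product(alphabet, repeat=k)]

def label_by_content(kmer, bases="GC"):
    """Label 1 if more than half of the positions fall in `bases`, else 0.

    The threshold and the default bases are illustrative assumptions.
    """
    count = sum(1 for b in kmer if b in bases)
    return 1 if 2 * count > len(kmer) else 0

kmers = all_kmers(3)
labels = [label_by_content(kmer) for kmer in kmers]
```

For k = 3 this produces 64 k-mers split evenly between the two classes. A "completely random" mode would instead assign labels independently of sequence content, which is why a classifier can only memorize such labels rather than learn a rule.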
|
24
|
Ahn SY, Kim M, Bae JE, Bang IS, Lee SW. Reliability of the In Silico Prediction Approach to In Vitro Evaluation of Bacterial Toxicity. SENSORS (BASEL, SWITZERLAND) 2022; 22:6557. [PMID: 36081016 PMCID: PMC9459819 DOI: 10.3390/s22176557] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/10/2022] [Revised: 08/26/2022] [Accepted: 08/26/2022] [Indexed: 06/15/2023]
Abstract
Several pathogens that spread through the air are highly contagious, and related infectious diseases are more easily transmitted through the air under indoor conditions, as observed during the COVID-19 pandemic. Indoor air contaminated by microorganisms, including viruses, bacteria, and fungi, or by derived pathogenic substances, can endanger human health. Thus, identifying and analyzing the potential pathogens residing in the air are crucial to preventing disease and maintaining indoor air quality. Here, we applied deep learning technology to analyze and predict the toxicity of bacteria in indoor air. We trained the ProtBert model on toxic bacterial and virulence factor proteins and applied it to predict the potential toxicity of some bacterial species by analyzing their protein sequences. The predictions agree with the results of the in vitro analysis of their toxicity in human cells. The in silico-based simulation and the obtained results demonstrate that it is plausible to find possible toxic sequences in unknown protein sequences.
Affiliation(s)
- Sung-Yoon Ahn
- Pattern Recognition and Machine Learning Lab, Department of AI Software, Gachon University, Seongnam 13557, Korea
- Mira Kim
- Department of Microbiology and Immunology, Chosun University School of Dentistry, Gwangju 61452, Korea
- Ji-Eun Bae
- Department of Microbiology and Immunology, Chosun University School of Dentistry, Gwangju 61452, Korea
- Iel-Soo Bang
- Department of Microbiology and Immunology, Chosun University School of Dentistry, Gwangju 61452, Korea
- Sang-Woong Lee
- Pattern Recognition and Machine Learning Lab, Department of AI Software, Gachon University, Seongnam 13557, Korea
|
25
|
Yin YH, Shen LC, Jiang Y, Gao S, Song J, Yu DJ. Improving the prediction of DNA-protein binding by integrating multi-scale dense convolutional network with fault-tolerant coding. Anal Biochem 2022; 656:114878. [DOI: 10.1016/j.ab.2022.114878] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/20/2022] [Revised: 08/18/2022] [Accepted: 08/23/2022] [Indexed: 11/01/2022]
|
26
|
Pan-cancer identification of the relationship of metabolism-related differentially expressed transcription regulation with non-differentially expressed target genes via a gated recurrent unit network. Comput Biol Med 2022; 148:105883. [PMID: 35878490 DOI: 10.1016/j.compbiomed.2022.105883] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 05/06/2022] [Revised: 07/10/2022] [Accepted: 07/16/2022] [Indexed: 11/20/2022]
Abstract
The transcriptome describes the expression of all genes in a sample. Most studies have investigated the differential patterns or discrimination powers of transcript expression levels. In this study, we hypothesized that the quantitative correlations between the expression levels of transcription factors (TFs) and their regulated target genes (mRNAs) serve as a novel view of healthy status, and that a disease sample exhibits a differential landscape (mqTrans) of transcription regulations compared with healthy status. We formulated quantitative transcription regulation relationships of metabolism-related genes as a multi-input multi-output regression model via a gated recurrent unit (GRU) network. The GRU model was trained using healthy blood transcriptomes, and the expression levels of mRNAs were predicted from those of the TFs. The mqTrans feature of a gene was defined as the difference between its predicted and actual expression levels. A pan-cancer investigation of the differentially expressed mqTrans features was conducted between the early- and late-stage cancers in 26 cancer types of The Cancer Genome Atlas database. This study focused on the differentially expressed mqTrans features that did not show differential expression in the actual expression levels. These genes could not be detected by conventional differential analysis. Such dark biomarkers are worthy of further wet-lab investigation. The experimental data also showed that the proposed mqTrans investigation improved the classification between early- and late-stage samples for some cancer types. Thus, the mqTrans features serve as a complementary view to transcriptomes, an OMIC type with mature high-throughput production technologies and abundant public resources.
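The mqTrans definition in the abstract, the difference between a gene's predicted and actual expression level, can be sketched as a residual computation. The code below is only an illustration: it swaps the GRU network for an ordinary least-squares regression as a stand-in, uses synthetic data, and assumes the sign convention predicted minus actual. All names are hypothetical.

```python
import numpy as np

def fit_linear_stand_in(tf_healthy, mrna_healthy):
    """Fit mRNA ~ TF on healthy samples (least squares, NOT a GRU)."""
    design = np.hstack([tf_healthy, np.ones((tf_healthy.shape[0], 1))])
    weights, *_ = np.linalg.lstsq(design, mrna_healthy, rcond=None)
    return weights

def mqtrans(tf_expr, mrna_expr, weights):
    """Residual feature: predicted minus actual (sign is an assumption)."""
    design = np.hstack([tf_expr, np.ones((tf_expr.shape[0], 1))])
    return design @ weights - mrna_expr

# Synthetic data: 50 healthy samples, 4 TFs, 3 target genes.
rng = np.random.default_rng(0)
tf_healthy = rng.normal(size=(50, 4))
true_w = rng.normal(size=(4, 3))
mrna_healthy = tf_healthy @ true_w + 1.0
weights = fit_linear_stand_in(tf_healthy, mrna_healthy)
healthy_residual = mqtrans(tf_healthy, mrna_healthy, weights)

# A made-up case sample whose third target gene is shifted by +2.
tf_case = rng.normal(size=(1, 4))
mrna_case = tf_case @ true_w + 1.0
mrna_case[0, 2] += 2.0
case_feature = mqtrans(tf_case, mrna_case, weights)
```

In this toy setting the residuals on the healthy samples are numerically zero, while the case sample shows a residual of about -2 for the perturbed gene, which is the kind of deviation from the healthy TF-to-mRNA relationship that the mqTrans feature is meant to capture.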
|
27
|
Zhang Y, Bao W, Cao Y, Cong H, Chen B, Chen Y. A survey on protein–DNA-binding sites in computational biology. Brief Funct Genomics 2022; 21:357-375. [DOI: 10.1093/bfgp/elac009] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 02/02/2022] [Revised: 04/07/2022] [Accepted: 04/22/2022] [Indexed: 01/08/2023] Open
Abstract
Transcription factors are important cellular components of the process of gene expression control. Transcription factor binding sites are locations where transcription factors specifically recognize DNA sequences, targeting gene-specific regions and recruiting transcription factors or chromatin regulators to fine-tune spatiotemporal gene regulation. As common proteins, transcription factors play a meaningful role in life-related activities. With the growing number of protein sequences, there is an urgent need to predict protein structure and function effectively. At present, protein–DNA-binding site prediction methods are based on traditional machine learning algorithms and deep learning algorithms. In the early stage, methods based on traditional machine learning algorithms were usually used to predict protein–DNA-binding sites. In recent years, deep learning-based methods for predicting protein–DNA-binding sites from sequence data have achieved remarkable success. Various statistical and machine learning methods for predicting the function of DNA-binding proteins have been proposed and continuously improved. Existing deep learning methods for predicting protein–DNA-binding sites can be roughly divided into three categories: convolutional neural network (CNN), recurrent neural network (RNN) and hybrid neural network based on CNN–RNN. The purpose of this review is to provide an overview of the computational and experimental methods applied in the field of protein–DNA-binding site prediction today. This paper introduces traditional machine learning and deep learning methods in protein–DNA-binding site prediction in terms of the data processing characteristics of existing learning frameworks and the differences between basic learning model frameworks. Existing methods are relatively simple compared with those in natural language processing, computer vision, computer graphics and other fields. Therefore, this summary of existing protein–DNA-binding site prediction methods will help researchers better understand this field.
|
28
|
Mi JX, Feng J, Huang KY. Designing efficient convolutional neural network structure: A survey. Neurocomputing 2022. [DOI: 10.1016/j.neucom.2021.08.158] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 10/18/2022]
|
29
|
Zou C, Zhang Q, Wei X. Synchronization of Hyper-Lorenz System Based on DNA Strand Displacement. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:1897-1908. [PMID: 33385311 DOI: 10.1109/tcbb.2020.3048753] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 06/12/2023]
Abstract
The Lorenz system is described by the chemical reaction equations of an ideal formal chemical reaction network, and a series of reversible reactions is added to the chemical reaction network in order to construct a family of hyper-Lorenz systems. DNA, as a universal substrate for chemical dynamics, can approximate arbitrary dynamical characteristics of an ideal formal chemical reaction network through auxiliary DNA strands and displacement reactions. Based on Lyapunov stability theory, a novel synchronization strategy is proposed. A 6-dimensional hyper-Lorenz system is taken as an example for simulation, which shows that DNA strand displacement reactions can implement the synchronization of ideal formal chemical reaction networks. Numerical simulations indicate that synchronization based on DNA strand displacement is robust to the detection of DNA strand concentration, control of reaction rate, and noise.
|
30
|
Shen Z, Zhang Q, Han K, Huang DS. A Deep Learning Model for RNA-Protein Binding Preference Prediction Based on Hierarchical LSTM and Attention Network. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:753-762. [PMID: 32750884 DOI: 10.1109/tcbb.2020.3007544] [Citation(s) in RCA: 19] [Impact Index Per Article: 9.5] [Indexed: 06/11/2023]
Abstract
The attention mechanism can find important information in a sequence. The regions of an RNA sequence that can bind to proteins are more important than those that cannot. Neither conventional methods nor deep learning-based methods are good at learning this information. In this study, an LSTM is used to extract the correlation features between different sites in the RNA sequence. We also use an attention mechanism to evaluate the importance of different sites in the RNA sequence. We obtain the optimal combination of k-mer length, k-mer stride window, k-mer sentence length, k-mer sentence stride window, and optimization function through hyperparameter experiments. The results show that the performance of our method is better than that of other methods. We tested the effects of changes in k-mer vector length on model performance and show how model performance changes under various k-mer-related parameter settings. Furthermore, we investigate the effect of the attention mechanism and RNA structure data on model performance.
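As a rough illustration of the k-mer parameters named in this abstract, the sketch below tokenizes an RNA sequence into k-mers with a configurable stride and then groups the tokens into overlapping "sentences". The function names `kmer_tokens` and `kmer_sentences`, the default values, and the reading of "sentence length" and "sentence stride window" as a second sliding window over the tokens are assumptions made for illustration; the abstract does not specify them.

```python
def kmer_tokens(seq, k=3, stride=1):
    """Slide a window of length k over seq, advancing `stride` bases per step.

    Hypothetical helper: k and stride stand in for the "k-mer length" and
    "k-mer stride window" hyperparameters mentioned in the abstract.
    """
    if k <= 0 or stride <= 0:
        raise ValueError("k and stride must be positive")
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def kmer_sentences(tokens, sentence_len=4, sentence_stride=2):
    """Group consecutive k-mer tokens into overlapping fixed-length sentences.

    Assumed interpretation of "k-mer sentence length" and
    "k-mer sentence stride window"; not taken from the paper itself.
    """
    if sentence_len <= 0 or sentence_stride <= 0:
        raise ValueError("sentence_len and sentence_stride must be positive")
    last_start = len(tokens) - sentence_len
    return [tokens[i:i + sentence_len]
            for i in range(0, last_start + 1, sentence_stride)]


tokens = kmer_tokens("AUGGCUAC", k=3, stride=1)
# tokens == ['AUG', 'UGG', 'GGC', 'GCU', 'CUA', 'UAC']
sentences = kmer_sentences(tokens, sentence_len=4, sentence_stride=2)
# sentences == [['AUG', 'UGG', 'GGC', 'GCU'], ['GGC', 'GCU', 'CUA', 'UAC']]
```

Each sentence could then be mapped to embedding vectors and passed to a sequence model; a larger stride shortens the token list at the cost of skipping some overlapping k-mers.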
|
31
|
Base-resolution prediction of transcription factor binding signals by a deep learning framework. PLoS Comput Biol 2022; 18:e1009941. [PMID: 35263332 PMCID: PMC8982852 DOI: 10.1371/journal.pcbi.1009941] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 11/05/2021] [Revised: 04/05/2022] [Accepted: 02/19/2022] [Indexed: 01/13/2023] Open
Abstract
Transcription factors (TFs) play an important role in regulating gene expression; thus, the identification of the sites bound by them has become a fundamental step in molecular and cellular biology. In this paper, we developed a deep learning framework leveraging existing fully convolutional neural networks (FCN) to predict TF-DNA binding signals at the base-resolution level (named FCNsignal). The proposed FCNsignal can simultaneously achieve the following tasks: (i) modeling the base-resolution signals of binding regions; (ii) discriminating binding or non-binding regions; (iii) locating TF-DNA binding regions; (iv) predicting binding motifs. In addition, FCNsignal can be used to predict opening regions across the whole genome. The experimental results on 53 TF ChIP-seq datasets and 6 chromatin accessibility ATAC-seq datasets show that our proposed framework outperforms some existing state-of-the-art methods. We also explored using the trained FCNsignal to locate all potential TF-DNA binding regions on a whole chromosome and to make predictions on DNA sequences of arbitrary length, and the results show that our framework can find most of the known binding regions and accept sequences of arbitrary length. Furthermore, we demonstrated the potential of our framework for discovering causal disease-associated single-nucleotide polymorphisms (SNPs) through a series of experiments.
|
32
|
Zhang Y, Wang Z, Zeng Y, Liu Y, Xiong S, Wang M, Zhou J, Zou Q. A novel convolution attention model for predicting transcription factor binding sites by combination of sequence and shape. Brief Bioinform 2021; 23:6470969. [PMID: 34929739 DOI: 10.1093/bib/bbab525] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Received: 09/14/2021] [Revised: 10/28/2021] [Accepted: 11/13/2021] [Indexed: 12/17/2022] Open
Abstract
The discovery of putative transcription factor binding sites (TFBSs) is important for understanding the underlying binding mechanism and cellular functions. Recently, many computational methods have been proposed to jointly account for DNA sequence and shape properties in TFBS prediction. However, these methods fail to fully utilize the latent features derived from both sequence and shape profiles and have limitations in interpretability and knowledge discovery. To this end, we present a novel Deep Convolution Attention network combining Sequence and Shape, dubbed D-SSCA, for precisely predicting putative TFBSs. Experiments conducted on 165 ENCODE ChIP-seq datasets reveal that D-SSCA significantly outperforms several state-of-the-art methods in predicting TFBSs and justify the utility of the channel attention module for feature refinement. In addition, a thorough analysis of the contribution of five shapes to TFBS prediction demonstrates that shape features can improve the predictive power for transcription factor-DNA binding. Furthermore, D-SSCA can realize cross-cell line prediction of TFBSs, indicating that common interplay patterns concerning both sequence and shape occur across various cell lines. The source code of D-SSCA can be found at https://github.com/MoonLord0525/.
Affiliation(s)
- Yongqing Zhang: School of Computer Science, Chengdu University of Information Technology, 610225, Chengdu, China; School of Computer Science and Engineering, University of Electronic Science and Technology of China, 611731, Chengdu, China
- Zixuan Wang: School of Computer Science, Chengdu University of Information Technology, 610225, Chengdu, China
- Yuanqi Zeng: School of Computer Science, Chengdu University of Information Technology, 610225, Chengdu, China
- Yuhang Liu: School of Computer Science, Chengdu University of Information Technology, 610225, Chengdu, China
- Shuwen Xiong: School of Computer Science, Chengdu University of Information Technology, 610225, Chengdu, China
- Maocheng Wang: School of Computer Science, Chengdu University of Information Technology, 610225, Chengdu, China
- Jiliu Zhou: School of Computer Science, Chengdu University of Information Technology, 610225, Chengdu, China
- Quan Zou: Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, 610054, Chengdu, China
|
33
|
Wang Z, Lei X. Prediction of RBP binding sites on circRNAs using an LSTM-based deep sequence learning architecture. Brief Bioinform 2021; 22:6355419. [PMID: 34415289 DOI: 10.1093/bib/bbab342] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 06/03/2021] [Revised: 07/14/2021] [Accepted: 08/02/2021] [Indexed: 01/22/2023] Open
Abstract
Circular RNAs (circRNAs) are widely expressed in highly diverged eukaryotes. Although circRNAs have been known for many years, their function remains unclear. Interaction with RNA-binding protein (RBP) to influence post-transcriptional regulation is considered to be an important pathway for circRNA function, such as acting as an oncogenic RBP sponge to inhibit cancer. In this study, we design a deep learning framework, CRPBsites, to predict the binding sites of RBPs on circRNAs. In this model, the sequences of variable-length binding sites are transformed into embedding vectors by word2vec model. Bidirectional LSTM is used to encode the embedding vectors of binding sites, and then they are fed into another LSTM decoder for decoding and classification tasks. To train and test the model, we construct four datasets that contain sequences of variable-length binding sites on circRNAs, and each set corresponds to an RBP, which is overexpressed in bladder cancer tissues. Experimental results on four datasets and comparison with other existing models show that CRPBsites has superior performance. Afterwards, we found that there were highly similar binding motifs in the four binding site datasets. Finally, we applied well-trained CRPBsites to identify the binding sites of IGF2BP1 on circCDYL, and the results proved the effectiveness of this method. In conclusion, CRPBsites is an effective prediction model for circRNA-RBP interaction site identification. We hope that CRPBsites can provide valuable guidance for experimental studies on the influence of circRNA on post-transcriptional regulation.
Affiliation(s)
- Zhengfeng Wang: School of Computer Science, Shaanxi Normal University, Xi'an, China; College of Information Science and Engineering, Guilin University of Technology, Guilin, China
- Xiujuan Lei: School of Computer Science, Shaanxi Normal University, Xi'an, China
|
34
|
Tayara H, Chong KT. Improved Predicting of The Sequence Specificities of RNA Binding Proteins by Deep Learning. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2021; 18:2526-2534. [PMID: 32191896 DOI: 10.1109/tcbb.2020.2981335] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Indexed: 06/10/2023]
Abstract
RNA-binding proteins (RBPs) have a significant role in various regulatory tasks. However, the mechanism by which RBPs identify the subsequence target RNAs is still not clear. In recent years, several machine and deep learning-based computational models have been proposed for understanding the binding preferences of RBPs. These methods required integrating multiple features with raw RNA sequences such as secondary structure and their performances can be further improved. In this paper, we propose an efficient and simple convolution neural network, RBPCNN, that relies on the combination of the raw RNA sequence and evolutionary information. We show that conservation scores (evolutionary information) for the RNA sequences can significantly improve the overall performance of the proposed predictor. In addition, the automatic extraction of the binding sequence motifs can enhance our understanding of the binding specificities of RBPs. The experimental results show that RBPCNN outperforms significantly the current state-of-the-art methods. More specifically, the average area under the receiver operator curve was improved by 2.67 percent and the mean average precision was improved by 8.03 percent. The datasets and results can be downloaded from https://home.jbnu.ac.kr/NSCL/RBPCNN.htm.
|
35
|
Han K, Shen LC, Zhu YH, Xu J, Song J, Yu DJ. MAResNet: predicting transcription factor binding sites by combining multi-scale bottom-up and top-down attention and residual network. Brief Bioinform 2021; 23:6399874. [PMID: 34664074 DOI: 10.1093/bib/bbab445] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 07/25/2021] [Revised: 09/06/2021] [Accepted: 09/28/2021] [Indexed: 11/14/2022] Open
Abstract
Accurate identification of transcription factor binding sites is of great significance in understanding gene expression, biological development and drug design. Although a variety of methods based on deep-learning models and large-scale data have been developed to predict transcription factor binding sites in DNA sequences, there is room for further improvement in prediction performance. In addition, effective interpretation of deep-learning models is greatly desirable. Here we present MAResNet, a new deep-learning method, for predicting transcription factor binding sites on 690 ChIP-seq datasets. More specifically, MAResNet combines the bottom-up and top-down attention mechanisms and a state-of-the-art feed-forward network (ResNet), which is constructed by stacking attention modules that generate attention-aware features. In particular, the multi-scale attention mechanism is utilized at the first stage to extract rich and representative sequence features. We further discuss the attention-aware features learned from different attention modules in accordance with the changes as the layers go deeper. The features learned by MAResNet are also visualized through the TMAP tool to illustrate that the method can extract the unique characteristics of transcription factor binding sites. The performance of MAResNet is extensively tested on 690 test subsets with an average AUC of 0.927, which is higher than that of the current state-of-the-art methods. Overall, this study provides a new and useful framework for the prediction of transcription factor binding sites by combining the funnel attention modules with the residual network.
Affiliation(s)
- Ke Han: School of Computer Science and Engineering, Nanjing University of Science and Technology, 200 Xiaolingwei, Nanjing, 210094, China
- Long-Chen Shen: School of Computer Science and Engineering, Nanjing University of Science and Technology, 200 Xiaolingwei, Nanjing, 210094, China
- Yi-Heng Zhu: School of Computer Science and Engineering, Nanjing University of Science and Technology, 200 Xiaolingwei, Nanjing, 210094, China
- Jian Xu: School of Computer Science and Engineering, Nanjing University of Science and Technology, 200 Xiaolingwei, Nanjing, 210094, China
- Jiangning Song: Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia; Monash Centre for Data Science, Faculty of Information Technology, Monash University, Melbourne, VIC 3800, Australia
- Dong-Jun Yu: School of Computer Science and Engineering, Nanjing University of Science and Technology, 200 Xiaolingwei, Nanjing, 210094, China
|
36
|
Huang K, Xiao C, Glass LM, Critchlow CW, Gibson G, Sun J. Machine learning applications for therapeutic tasks with genomics data. Patterns (New York, N.Y.) 2021; 2:100328. [PMID: 34693370 PMCID: PMC8515011 DOI: 10.1016/j.patter.2021.100328] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 12/29/2022]
Abstract
Thanks to the increasing availability of genomics and other biomedical data, many machine learning algorithms have been proposed for a wide range of therapeutic discovery and development tasks. In this survey, we review the literature on machine learning applications for genomics through the lens of therapeutic development. We investigate the interplay among genomics, compounds, proteins, electronic health records, cellular images, and clinical texts. We identify 22 machine learning in genomics applications that span the whole therapeutics pipeline, from discovering novel targets, personalizing medicine, developing gene-editing tools, all the way to facilitating clinical trials and post-market studies. We also pinpoint seven key challenges in this field with potentials for expansion and impact. This survey examines recent research at the intersection of machine learning, genomics, and therapeutic development.
Affiliation(s)
- Kexin Huang: Department of Computer Science, Stanford University, Stanford, CA 94305, USA
- Cao Xiao: Amplitude, San Francisco, CA 94105, USA
- Lucas M. Glass: Analytics Center of Excellence, IQVIA, Cambridge, MA 02139, USA
- Greg Gibson: Center for Integrative Genomics, Georgia Institute of Technology, Atlanta, GA 30332, USA
- Jimeng Sun: Computer Science Department and Carle's Illinois College of Medicine, University of Illinois at Urbana-Champaign, Urbana, IL 61820, USA
|
37
|
Shen LC, Liu Y, Song J, Yu DJ. SAResNet: self-attention residual network for predicting DNA-protein binding. Brief Bioinform 2021; 22:bbab101. [PMID: 33837387 PMCID: PMC8579196 DOI: 10.1093/bib/bbab101] [Citation(s) in RCA: 14] [Impact Index Per Article: 4.7] [Received: 12/22/2020] [Revised: 03/03/2021] [Accepted: 03/08/2021] [Indexed: 11/12/2022] Open
Abstract
Knowledge of the specificity of DNA-protein binding is crucial for understanding the mechanisms of gene expression, regulation and gene therapy. In recent years, deep-learning-based methods for predicting DNA-protein binding from sequence data have achieved significant success. Nevertheless, the current state-of-the-art computational methods have some drawbacks associated with the use of limited datasets with insufficient experimental data. To address this, we propose a novel transfer learning-based method, termed SAResNet, which combines the self-attention mechanism and residual network structure. More specifically, the attention-driven module captures the position information of the sequence, while the residual network structure guarantees that the high-level features of the binding site can be extracted. Meanwhile, the pre-training strategy used by SAResNet improves the learning ability of the network and accelerates the convergence speed of the network during transfer learning. The performance of SAResNet is extensively tested on 690 datasets from the ChIP-seq experiments with an average AUC of 92.0%, which is 4.4% higher than that of the best state-of-the-art method currently available. When tested on smaller datasets, the predictive performance is more clearly improved. Overall, we demonstrate that the superior performance of DNA-protein binding prediction on DNA sequences can be achieved by combining the attention mechanism and residual structure, and a novel pipeline is accordingly developed. The proposed methodology is generally applicable and can be used to address any other sequence classification problems.
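To make the idea of a self-attention module over sequence positions more concrete, here is a minimal sketch of generic scaled dot-product self-attention in plain Python. It is not claimed to be the variant used in SAResNet: the function names are hypothetical, and using the input vectors directly as queries, keys, and values (without learned projections) is a simplifying assumption.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Scaled dot-product self-attention over a list of position vectors.

    x: list of n vectors, each of length d. For simplicity (an assumption),
    the same vectors serve as queries, keys and values; a trained model
    would apply learned projection matrices first.
    """
    d = len(x[0])
    scale = math.sqrt(d)
    output = []
    for query in x:
        # Similarity of this position to every position, scaled by sqrt(d).
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in x]
        weights = softmax(scores)
        # Each output vector is a weighted average of all value vectors.
        output.append([sum(w * value[c] for w, value in zip(weights, x))
                       for c in range(d)])
    return output


# Two toy position vectors; each output row mixes both positions,
# weighted toward the position most similar to the query.
attended = self_attention([[1.0, 0.0], [0.0, 1.0]])
```

Because every output position is a weighted combination of all input positions, the module can relate distant sites in a sequence in a single step, which is the property the abstract attributes to its attention-driven module.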
Affiliation(s)
- Long-Chen Shen: School of Computer Science and Engineering, Nanjing University of Science and Technology, China
- Yan Liu: School of Computer Science and Engineering, Nanjing University of Science and Technology, China
- Jiangning Song: Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Australia
- Dong-Jun Yu: School of Computer Science and Engineering, Nanjing University of Science and Technology, China
|
38
|
Zhang Q, Wang S, Chen Z, He Y, Liu Q, Huang DS. Locating transcription factor binding sites by fully convolutional neural network. Brief Bioinform 2021; 22:bbaa435. [PMID: 33498086 PMCID: PMC8425303 DOI: 10.1093/bib/bbaa435] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Received: 11/13/2020] [Revised: 12/11/2020] [Accepted: 12/26/2020] [Indexed: 12/27/2022] Open
Abstract
Transcription factors (TFs) play an important role in regulating gene expression, thus identification of the regions bound by them has become a fundamental step for molecular and cellular biology. In recent years, an increasing number of deep learning (DL) based methods have been proposed for predicting TF binding sites (TFBSs) and achieved impressive prediction performance. However, these methods mainly focus on predicting the sequence specificity of TF-DNA binding, which is equivalent to a sequence-level binary classification task, and fail to identify motifs and TFBSs accurately. In this paper, we developed a fully convolutional network coupled with global average pooling (FCNA), which by contrast is equivalent to a nucleotide-level binary classification task, to roughly locate TFBSs and accurately identify motifs. Experimental results on human ChIP-seq datasets show that FCNA outperforms other competing methods significantly. Besides, we find that the regions located by FCNA can be used by motif discovery tools to further refine the prediction performance. Furthermore, we observe that FCNA can accurately identify TF-DNA binding motifs across different cell lines and infer indirect TF-DNA bindings.
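The contrast drawn in this abstract between sequence-level and nucleotide-level outputs can be sketched with a toy example: a single convolution filter yields one score per window position (useful for roughly locating a site), and global average pooling collapses those scores into one sequence-level value. The helper names, the hand-set `TATA` filter, and the example sequence are illustrative assumptions, not details of FCNA, which learns its filters from data.

```python
ALPHABET = "ACGT"


def one_hot(seq):
    """Encode a DNA string as a list of 4-element one-hot rows (A, C, G, T)."""
    return [[1.0 if base == letter else 0.0 for letter in ALPHABET]
            for base in seq]


def conv1d_scores(encoded, kernel):
    """Valid 1D cross-correlation: one score per window start position."""
    width = len(kernel)
    return [
        sum(encoded[start + j][c] * kernel[j][c]
            for j in range(width) for c in range(len(ALPHABET)))
        for start in range(len(encoded) - width + 1)
    ]


def global_average_pool(scores):
    """Collapse per-position scores into a single sequence-level score."""
    return sum(scores) / len(scores)


# Hypothetical filter that simply counts matches to the motif "TATA".
kernel = one_hot("TATA")
scores = conv1d_scores(one_hot("GGTATAGG"), kernel)
# scores == [2.0, 0.0, 4.0, 0.0, 2.0]; the peak at index 2 marks the motif
best_position = scores.index(max(scores))
sequence_score = global_average_pool(scores)
```

The per-position scores keep the location information that a purely sequence-level classifier discards, while the pooled value can still be trained against a single binding label per sequence.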
Affiliation(s)
- Qinhu Zhang: Translational Medical Center for Stem Cell Therapy and Institute for Regenerative Medicine, Shanghai East Hospital, Tongji University, Shanghai, China
- Siguo Wang: Computer Science and Technology, Tongji University, China
- Ying He: Computer Science and Technology at Tongji University, China
- Qi Liu: Translational Medical Center for Stem Cell Therapy and Institute for Regenerative Medicine, Shanghai East Hospital, Bioinformatics Department, School of Life Sciences and Technology, Tongji University, Shanghai, China
- De-Shuang Huang: Institute of Machines Learning and Systems Biology, Tongji University, China
|
39
|
Zhang Q, Yu W, Han K, Nandi AK, Huang DS. Multi-Scale Capsule Network for Predicting DNA-Protein Binding Sites. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2021; 18:1793-1800. [PMID: 32960766 DOI: 10.1109/tcbb.2020.3025579] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Indexed: 06/11/2023]
Abstract
Discovering DNA-protein binding sites, also known as motif discovery, is the foundation for further analysis of transcription factors (TFs). Deep learning algorithms such as convolutional neural networks (CNNs) have been introduced to the motif discovery task and have achieved state-of-the-art performance. However, due to the limitations of CNNs, CNN-based motif discovery methods do not take full advantage of the large-scale sequencing data generated by high-throughput sequencing technology. Hence, in this paper we propose a multi-scale capsule network architecture (MSC) integrating a multi-scale CNN, a variant of CNN able to extract motif features of different lengths, and a capsule network, a novel type of artificial neural network architecture aimed at improving on CNNs. The proposed method is tested on real ChIP-seq datasets, and the experimental results show a considerable improvement compared with two well-tested deep learning-based sequence models, DeepBind and DeepSEA.
|
40
|
Li M, Wang Y, Li F, Zhao Y, Liu M, Zhang S, Bin Y, Smith AI, Webb GI, Li J, Song J, Xia J. A Deep Learning-Based Method for Identification of Bacteriophage-Host Interaction. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2021; 18:1801-1810. [PMID: 32813660 PMCID: PMC8703204 DOI: 10.1109/tcbb.2020.3017386] [Citation(s) in RCA: 25] [Impact Index Per Article: 8.3] [Indexed: 06/11/2023]
Abstract
Multi-drug resistance (MDR) has become one of the greatest threats to human health worldwide, and novel treatment methods of infections caused by MDR bacteria are urgently needed. Phage therapy is a promising alternative to solve this problem, to which the key is correctly matching target pathogenic bacteria with the corresponding therapeutic phage. Deep learning is powerful for mining complex patterns to generate accurate predictions. In this study, we develop PredPHI (Predicting Phage-Host Interactions), a deep learning-based tool capable of predicting the host of phages from sequence data. We collect >3000 phage-host pairs along with their protein sequences from PhagesDB and GenBank databases and extract a set of features. Then we select high-quality negative samples based on the K-Means clustering method and construct a balanced training set. Finally, we employ a deep convolutional neural network to build the predictive model. The results indicate that PredPHI can achieve a predictive performance of 81 percent in terms of the area under the receiver operating characteristic curve on the test set, and the clustering-based method is significantly more robust than that based on randomly selecting negative samples. These results highlight that PredPHI is a useful and accurate tool for identifying phage-host interactions from sequence data.
|
41
|
Zheng K, You ZH, Wang L, Li YR, Zhou JR, Zeng HT. MISSIM: An Incremental Learning-Based Model With Applications to the Prediction of miRNA-Disease Association. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2021; 18:1733-1742. [PMID: 32749964 DOI: 10.1109/tcbb.2020.3013837] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Indexed: 06/11/2023]
Abstract
In the past few years, prediction models have shown remarkable performance in most biological association prediction tasks. These tasks traditionally use a fixed dataset, and the model, once trained, is deployed as is. Such models often encounter training issues such as sensitivity to hyperparameter tuning and "catastrophic forgetting" when new data are added. However, with the development of biomedicine and the accumulation of biological data, new predictive models are required to face the challenge of adapting to change. To this end, we propose a computational approach based on the broad learning system (BLS) to predict potential disease-associated miRNAs, which retains the ability to distinguish previously learned associations when it must adapt to new data. In particular, we introduce incremental learning to the field of biological association prediction for the first time and propose a new method for quantifying sequence similarity. In the performance evaluation, the AUC in 5-fold cross-validation was 0.9400 +/- 0.0041. To better assess the effectiveness of MISSIM, we compared it with various classifiers and earlier prediction models; its performance is superior to that of the previous methods. In addition, case studies on identifying miRNAs associated with breast neoplasms, lung neoplasms and esophageal neoplasms show that 34, 36 and 35 of the top 40 associations predicted by MISSIM, respectively, are confirmed by recent biomedical resources. These results provide convincing evidence that this approach has potential value for promoting biomedical research productivity.
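Several abstracts in this list, including the one above, report results from 5-fold cross-validation. The following minimal sketch shows how sample indices can be partitioned into folds so that each sample is held out exactly once. The function name `k_fold_indices` is hypothetical, and the split is deterministic for clarity; in practice the samples would typically be shuffled first, and the papers' exact protocols are not described here.

```python
def k_fold_indices(n_samples, k=5):
    """Partition range(n_samples) into k (train, test) index splits.

    Folds are contiguous and differ in size by at most one sample.
    No shuffling is performed (an assumption made to keep the sketch
    deterministic).
    """
    if k < 2 or k > n_samples:
        raise ValueError("k must be between 2 and n_samples")
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        folds.append((train, test))
        start += size
    return folds


folds = k_fold_indices(12, k=5)
# Test-fold sizes are 3, 3, 2, 2, 2; every index is held out exactly once.
```

A metric such as AUC is then computed on each held-out fold, and the mean and spread across folds give figures of the form "0.9400 +/- 0.0041".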
|
42
|
Li M, Zhang W. PHIAF: prediction of phage-host interactions with GAN-based data augmentation and sequence-based feature fusion. Brief Bioinform 2021; 23:6362109. [PMID: 34472593 DOI: 10.1093/bib/bbab348] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 01/02/2021] [Revised: 07/05/2021] [Accepted: 07/18/2021] [Indexed: 01/01/2023] Open
Abstract
Phage therapy has become one of the most promising alternatives to antibiotics in the treatment of bacterial diseases, and identifying phage-host interactions (PHIs) helps to understand the possible mechanism through which a phage infects bacteria to guide the development of phage therapy. Compared with wet experiments, computational methods of identifying PHIs can reduce costs and save time and are more effective and economic. In this paper, we propose a PHI prediction method with a generative adversarial network (GAN)-based data augmentation and sequence-based feature fusion (PHIAF). First, PHIAF applies a GAN-based data augmentation module, which generates pseudo PHIs to alleviate the data scarcity. Second, PHIAF fuses the features originated from DNA and protein sequences for better performance. Third, PHIAF utilizes an attention mechanism to consider different contributions of DNA/protein sequence-derived features, which also provides interpretability of the prediction model. In computational experiments, PHIAF outperforms other state-of-the-art PHI prediction methods when evaluated via 5-fold cross-validation (AUC and AUPR are 0.88 and 0.86, respectively). An ablation study shows that data augmentation, feature fusion and an attention mechanism are all beneficial to improve the prediction performance of PHIAF. Additionally, four new PHIs with the highest PHIAF score in the case study were verified by recent literature. In conclusion, PHIAF is a promising tool to accelerate the exploration of phage therapy.
Affiliation(s)
- Menglu Li: College of Informatics, Huazhong Agricultural University, Wuhan, 430070, China
- Wen Zhang: College of Informatics, Huazhong Agricultural University, Wuhan, 430070, China
|
43
|
iEnhancer-RD: Identification of enhancers and their strength using RKPK features and deep neural networks. Anal Biochem 2021; 630:114318. [PMID: 34364858 DOI: 10.1016/j.ab.2021.114318] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 01/20/2021] [Revised: 07/02/2021] [Accepted: 07/27/2021] [Indexed: 11/20/2022]
Abstract
Enhancers are regulatory elements involved in gene expression. They are segments of DNA that can enhance the transcription rate of genes. However, identifying enhancers by biological experimental methods is time-consuming and expensive, so there is an urgent need for more efficient methods to identify them. In this study, we propose a new feature extraction method, RKPK, which combines three feature methods and uses the recursive feature elimination algorithm for feature selection, and we apply a deep neural network as the classifier to construct the iEnhancer-RD computational method for enhancer identification. It is a two-layer classification architecture in which the first layer (layer I) identifies enhancers from a set of DNA sequences, and the second layer (layer II) divides the identified enhancers into two subgroups, namely strong and weak enhancers. An independent dataset test indicates that the proposed method is significantly better than most existing methods and attains accuracies of 78.8% and 70.5% in the two layers, respectively. Our iEnhancer-RD architecture is implemented in Python and is available at https://github.com/YangHuan639/iEnhancer-RD.
|
44
|
Li JY, Jin S, Tu XM, Ding Y, Gao G. Identifying complex motifs in massive omics data with a variable-convolutional layer in deep neural network. Brief Bioinform 2021; 22:6312656. [PMID: 34219140 DOI: 10.1093/bib/bbab233] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 02/25/2021] [Revised: 05/25/2021] [Accepted: 05/28/2021] [Indexed: 01/10/2023] Open
Abstract
Motif identification is among the most common and essential computational tasks for bioinformatics and genomics. Here we proposed a novel convolutional layer for deep neural network, named variable convolutional (vConv) layer, for effective motif identification in high-throughput omics data by learning kernel length from data adaptively. Empirical evaluations on DNA-protein binding and DNase footprinting cases well demonstrated that vConv-based networks have superior performance to their convolutional counterparts regardless of model complexity. Meanwhile, vConv could be readily integrated into multi-layer neural networks as an 'in-place replacement' of canonical convolutional layer. All source codes are freely available on GitHub for academic usage.
Affiliation(s)
- Jing-Yi Li: Biomedical Pioneering Innovation Center & Beijing Advanced Innovation Center for Genomics, Center for Bioinformatics, and State Key Laboratory of Protein and Plant Gene Research at School of Life Sciences, Peking University, Beijing 100871, China
- Shen Jin: Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA 15213, USA
- Xin-Ming Tu: Biomedical Pioneering Innovation Center & Beijing Advanced Innovation Center for Genomics, Center for Bioinformatics, and State Key Laboratory of Protein and Plant Gene Research at School of Life Sciences, Peking University, Beijing 100871, China
- Yang Ding: Biomedical Pioneering Innovation Center & Beijing Advanced Innovation Center for Genomics, Center for Bioinformatics, and State Key Laboratory of Protein and Plant Gene Research at School of Life Sciences, Peking University, Beijing 100871, China
- Ge Gao: Biomedical Pioneering Innovation Center & Beijing Advanced Innovation Center for Genomics, Center for Bioinformatics, and State Key Laboratory of Protein and Plant Gene Research at School of Life Sciences, Peking University, Beijing 100871, China
|
45
|
Zhang J, Chen Q, Liu B. DeepDRBP-2L: A New Genome Annotation Predictor for Identifying DNA-Binding Proteins and RNA-Binding Proteins Using Convolutional Neural Network and Long Short-Term Memory. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2021; 18:1451-1463. [PMID: 31722485 DOI: 10.1109/tcbb.2019.2952338] [Citation(s) in RCA: 17] [Impact Index Per Article: 5.7] [Indexed: 06/10/2023]
Abstract
DNA-binding proteins (DBPs) and RNA-binding proteins (RBPs) are two crucial kinds of proteins, which are associated with various cellular activities and some important diseases. Accurate identification of DBPs and RBPs facilitates both theoretical research and real-world applications. Existing sequence-based DBP predictors can accurately identify DBPs but incorrectly predict many RBPs as DBPs, and vice versa, resulting in low prediction precision. Moreover, some proteins (DRBPs) that interact with both DNA and RNA play important roles in gene expression and cannot be identified by existing computational methods. In this study, a two-level predictor named DeepDRBP-2L was proposed by combining a Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM). It is the first computational method that is able to identify DBPs, RBPs and DRBPs. Rigorous cross-validations and independent tests showed that DeepDRBP-2L is able to overcome the shortcomings of the existing methods and can go one step further to identify DRBPs. Application of DeepDRBP-2L to the tomato genome further demonstrated its performance. The webserver of DeepDRBP-2L is freely available at http://bliulab.net/DeepDRBP-2L.
|
46
|
Zhang Q, Shen Z, Huang DS. Predicting in-vitro Transcription Factor Binding Sites Using DNA Sequence + Shape. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2021; 18:667-676. [PMID: 31634140 DOI: 10.1109/tcbb.2019.2947461] [Citation(s) in RCA: 28] [Impact Index Per Article: 9.3] [Indexed: 06/10/2023]
Abstract
Discovery of transcription factor binding sites (TFBSs) is essential for understanding the underlying binding mechanisms and cellular functions. Recently, convolutional neural networks (CNNs) have succeeded in predicting TFBSs from primary DNA sequences. In addition to DNA sequences, several lines of evidence suggest that protein-DNA binding is partly mediated by properties of DNA shape. Although many methods have been proposed to jointly account for DNA sequences and shape properties in predicting TFBSs, they ignore the power of combining deep learning with DNA sequence + shape. Therefore, we develop a deep-learning-based sequence + shape framework (DLBSS) in this paper, which integrates DNA sequences and shape properties to better understand protein-DNA binding preference. This method uses a shared CNN to find common patterns in DNA sequences and their corresponding shape features, which are then concatenated to compute a predicted value. Using 66 in-vitro datasets derived from universal protein binding microarrays (uPBMs), we show that our proposed method DLBSS significantly improves the performance of predicting TFBSs. In addition, through a series of experiments, we explain why the shared CNN should be used and explore the performance of DLBSS when using a deeper CNN.
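The shared-CNN idea in this abstract can be illustrated with a minimal pure-Python sketch. It is not the DLBSS implementation: the function names, the single hand-written kernel, and the assumption that the shape input has four channels (so that the same kernel fits both inputs) are all hypothetical choices made for illustration. The point shown is only that one set of kernel weights scans both the one-hot sequence and the shape matrix, and the pooled outputs are concatenated.

```python
from typing import List

Matrix = List[List[float]]  # rows = positions, columns = channels


def one_hot(seq: str) -> Matrix:
    """One-hot encode a DNA string over the alphabet A, C, G, T."""
    alphabet = "ACGT"
    return [[1.0 if base == a else 0.0 for a in alphabet] for base in seq.upper()]


def conv_relu_maxpool(x: Matrix, kernel: Matrix) -> float:
    """Valid 1D convolution with one kernel, then ReLU and global max pooling."""
    width = len(kernel)
    channels = len(kernel[0])
    best = 0.0  # starting at 0 applies the ReLU floor
    for start in range(len(x) - width + 1):
        score = sum(
            x[start + i][c] * kernel[i][c]
            for i in range(width)
            for c in range(channels)
        )
        best = max(best, score)
    return best


def shared_features(seq_input: Matrix, shape_input: Matrix,
                    kernels: List[Matrix]) -> List[float]:
    """Apply the SAME kernels to both inputs and concatenate the pooled outputs."""
    seq_feats = [conv_relu_maxpool(seq_input, k) for k in kernels]
    shape_feats = [conv_relu_maxpool(shape_input, k) for k in kernels]
    return seq_feats + shape_feats


# Hypothetical kernel that responds to the dinucleotide "AC".
ac_kernel = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
# Hypothetical shape matrix: 4 positions x 4 made-up shape channels.
toy_shape = [[0.5, 0.5, 0.5, 0.5] for _ in range(4)]
features = shared_features(one_hot("GACT"), toy_shape, [ac_kernel])
```

In a trained model the concatenated vector would feed a dense layer that outputs the predicted binding value; here the kernel is fixed by hand, so nothing is learned.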
|
47
|
Yusuf SM, Zhang F, Zeng M, Li M. DeepPPF: A deep learning framework for predicting protein family. Neurocomputing 2021. [DOI: 10.1016/j.neucom.2020.11.062] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Indexed: 11/27/2022]
|
48
|
Du X, Hu J, Li S. Using Chou's 5-Step Rule to Predict DNA-Protein Binding with Multi-scale Complementary Feature. J Proteome Res 2021; 20:1639-1656. [PMID: 33522829 DOI: 10.1021/acs.jproteome.0c00864] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 12/14/2022]
Abstract
It is well known that DNA-protein binding (DPB) prediction is not only beneficial for understanding the regulatory mechanisms of gene expression but also a challenging task in computational biology. Traditional methods for DPB prediction that depend on manually extracted features may lead to classification errors. Recently, deep learning methods such as convolutional neural networks (CNNs) have been successfully applied to classification tasks and have improved DPB prediction performance significantly. Yet, these methods model the original DNA sequence directly, ignoring the hidden complex dependency and complementarity between multiple sequence features. To address this problem, we propose a method to fuse different sequence features and analyze them systematically through a multi-scale CNN. First, sliding windows of specified lengths are set on distinct DNA sequences to generate multiple sequence features with unequal lengths. Second, multiple feature sequences are fused and encoded for feature representation. Third, a multi-scale CNN with different binding motif lengths is used to automatically learn and mine the influence of internal attributes and hidden complex relations between the fused sequence features and to make full use of the complementary advantages of the extracted CNN features to predict DPB. When our model is applied to 690 ChIP-seq datasets, it achieves an average AUC of 0.9112, which is significantly better than the latest methods. The results show that our method is effective for DPB prediction; it is freely available at http://121.5.71.120/mscDPB/.
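The first step described in this abstract (sliding windows of several lengths producing feature sequences of unequal length) might look roughly like the sketch below. It is an assumption-laden illustration, not the authors' code: the window lengths, the stride of 1, the base-4 integer encoding, and all function names are hypothetical.

```python
from typing import Dict, List, Tuple

ALPHABET = "ACGT"


def kmer_tokens(seq: str, k: int) -> List[str]:
    """Slide a window of length k (stride 1) over seq and collect the k-mers."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def kmer_index(kmer: str) -> int:
    """Encode a k-mer as a base-4 integer (A=0, C=1, G=2, T=3)."""
    index = 0
    for base in kmer:
        index = index * 4 + ALPHABET.index(base)
    return index


def multi_scale_encode(seq: str,
                       window_lengths: Tuple[int, ...] = (1, 2, 3)) -> Dict[int, List[int]]:
    """Return one integer-encoded feature sequence per window length.

    A window of length k yields len(seq) - k + 1 tokens, so the feature
    sequences have unequal lengths, as the abstract describes.
    """
    return {k: [kmer_index(t) for t in kmer_tokens(seq, k)] for k in window_lengths}


encoded = multi_scale_encode("ACGT")
# encoded[1] has 4 tokens, encoded[2] has 3, and encoded[3] has 2.
```

Each encoded sequence could then be embedded and passed to a convolution branch whose kernel width matches a candidate motif length, which is one plausible reading of "multi-scale CNN" here.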
Affiliation(s)
- Xiuquan Du
  - Key Laboratory of Intelligent Computing and Signal Processing of Ministry of Education, Anhui University, Hefei 230601, Anhui, China; School of Computer Science and Technology, Anhui University, Hefei 230601, Anhui, China
- Jiajia Hu
  - School of Computer Science and Technology, Anhui University, Hefei 230601, Anhui, China
- Shuo Li
  - Department of Medical Imaging, Western University, London, ON N6A 3K7, Canada
|
49
|
He Y, Shen Z, Zhang Q, Wang S, Huang DS. A survey on deep learning in DNA/RNA motif mining. Brief Bioinform 2020; 22:5916939. [PMID: 33005921 PMCID: PMC8293829 DOI: 10.1093/bib/bbaa229] [Citation(s) in RCA: 34] [Impact Index Per Article: 8.5] [Received: 07/18/2020] [Revised: 08/19/2020] [Accepted: 08/24/2020] [Indexed: 01/18/2023] Open
Abstract
DNA/RNA motif mining is the foundation of gene function research. It plays an extremely important role in identifying DNA- or RNA-protein binding sites, which helps to understand the mechanisms of gene regulation and management. For the past few decades, researchers have been working on designing new efficient and accurate algorithms for motif mining. These algorithms can be roughly divided into two categories: the enumeration approach and the probabilistic method. In recent years, machine learning methods have made great progress, and deep learning algorithms in particular have achieved good performance. Existing deep learning methods in motif mining can be roughly divided into three types of models: convolutional neural network (CNN) based models, recurrent neural network (RNN) based models, and hybrid CNN–RNN based models. We introduce the application of deep learning in the field of motif mining in terms of data preprocessing and the features of existing deep learning architectures, and compare the differences between the basic deep learning models. Through the analysis and comparison of existing deep learning methods, we found that more complex models tend to perform better than simple ones when data are sufficient, and that the current methods are relatively simple compared with those in other fields such as computer vision, natural language processing (NLP), computer games, etc. Therefore, it is necessary to summarize deep learning in motif mining, which can help researchers understand this field.
Affiliation(s)
- Ying He
  - computer science and technology at Tongji University, China
- Zhen Shen
  - computer science and technology at Tongji University, China
- Qinhu Zhang
  - computer science and technology at Tongji University, China
- Siguo Wang
  - computer science and technology at Tongji University, China
- De-Shuang Huang
  - Institute of Machine Learning and Systems Biology, Tongji University
|
50
|
Khan F, Khan M, Iqbal N, Khan S, Muhammad Khan D, Khan A, Wei DQ. Prediction of Recombination Spots Using Novel Hybrid Feature Extraction Method via Deep Learning Approach. Front Genet 2020; 11:539227. [PMID: 33093842 PMCID: PMC7527634 DOI: 10.3389/fgene.2020.539227] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Received: 03/10/2020] [Accepted: 08/13/2020] [Indexed: 01/20/2023] Open
Abstract
Meiotic recombination is the driving force of evolutionary development and an important source of genetic variation. Meiotic recombination does not take place randomly in a chromosome but occurs in certain regions of the chromosome. Chromosomal regions with higher rates of meiotic recombination events are considered hotspots, and regions where the frequencies of recombination events are lower are called coldspots. Prediction of meiotic recombination spots provides useful information about the basic functionality of inheritance and genome diversity. This study proposes an intelligent computational predictor called iRSpots-DNN for the identification of recombination spots. The proposed predictor is based on a novel feature extraction method and an optimized deep neural network (DNN). The DNN was employed as a classification engine, whereas the novel feature extraction method was developed to extract meaningful features for the identification of hotspots and coldspots across the yeast genome. Unlike previous algorithms, the proposed feature extraction avoids bias among different selected features and preserves the sequence discriminant properties along with the sequence-structure information simultaneously. This study also considered other effective classifiers, namely support vector machine (SVM), K-nearest neighbor (KNN), and random forest (RF), to predict recombination spots. Experimental results on a benchmark dataset with 10-fold cross-validation showed that iRSpots-DNN achieved the highest accuracy, i.e., 95.81%. Additionally, the performance of the proposed iRSpots-DNN is significantly better than that of the existing predictors on a benchmark dataset. The relevant benchmark dataset and source code are freely available at: https://github.com/Fatima-Khan12/iRspot_DNN/tree/master/iRspot_DNN.
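For readers unfamiliar with the evaluation protocol mentioned in this abstract, the following is a generic sketch of k-fold cross-validation in plain Python. It does not reproduce iRSpots-DNN or its features: the toy threshold classifier, the synthetic data, and all function names are hypothetical stand-ins used only to show how the folds are built and how per-fold accuracy is averaged.

```python
import random
from typing import Callable, List, Sequence


def k_fold_indices(n: int, k: int = 10, seed: int = 0) -> List[List[int]]:
    """Shuffle the indices 0..n-1 and deal them into k disjoint folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(samples: Sequence, labels: Sequence[int],
                   train_fn: Callable, predict_fn: Callable,
                   k: int = 10, seed: int = 0) -> float:
    """Return the mean accuracy over k held-out folds."""
    folds = k_fold_indices(len(samples), k, seed)
    accuracies = []
    for held_out, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != held_out for j in fold]
        model = train_fn([samples[j] for j in train_idx],
                         [labels[j] for j in train_idx])
        correct = sum(predict_fn(model, samples[j]) == labels[j] for j in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / len(accuracies)


def train_threshold(xs: List[float], ys: List[int]) -> float:
    """Toy classifier: threshold halfway between the two class means."""
    mean0 = sum(x for x, y in zip(xs, ys) if y == 0) / ys.count(0)
    mean1 = sum(x for x, y in zip(xs, ys) if y == 1) / ys.count(1)
    return (mean0 + mean1) / 2


def predict_threshold(threshold: float, x: float) -> int:
    return int(x >= threshold)


# Synthetic, well-separated data: class 0 in 0..49, class 1 in 100..149.
toy_x = list(range(50)) + list(range(100, 150))
toy_y = [0] * 50 + [1] * 50
mean_accuracy = cross_validate(toy_x, toy_y, train_threshold, predict_threshold)
```

Every sample is held out exactly once, so the averaged accuracy reflects performance on data the model did not see during training in that fold.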
Affiliation(s)
- Fatima Khan
  - Department of Computer Science, Abdul Wali Khan University Mardan, Mardan, Pakistan
- Mukhtaj Khan
  - Department of Computer Science, Abdul Wali Khan University Mardan, Mardan, Pakistan
- Nadeem Iqbal
  - Department of Computer Science, Abdul Wali Khan University Mardan, Mardan, Pakistan
- Salman Khan
  - Department of Computer Science, Abdul Wali Khan University Mardan, Mardan, Pakistan
- Dost Muhammad Khan
  - Department of Statistics, Abdul Wali Khan University Mardan, Mardan, Pakistan
- Abbas Khan
  - Department of Bioinformatics and Biological Statistics, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai, China
- Dong-Qing Wei
  - Department of Bioinformatics and Biological Statistics, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai, China; State Key Laboratory of Microbial Metabolism, Shanghai-Islamabad-Belgrade Joint Innovation Center on Antibacterial Resistances, Joint Laboratory of International Cooperation in Metabolic and Developmental Sciences, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Ministry of Education, Shanghai, China; Peng Cheng Laboratory, Shenzhen, China
|