1
|
Zhang T, Sun H, Wu Z, Zhao Z, Zhao X, Zhang H, Gao B, Wang G. GAADE: identification spatially variable genes based on adaptive graph attention network. Brief Bioinform 2024; 26:bbae669. [PMID: 39701602 DOI: 10.1093/bib/bbae669] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/03/2024] [Revised: 11/16/2024] [Accepted: 12/06/2024] [Indexed: 12/21/2024] Open
Abstract
The rapid advancement of spatial transcriptomics (ST) sequencing technology has made it possible to capture gene expression with spatial coordinate information at the cellular level. Although many methods in ST data analysis can detect spatially variable genes (SVGs), these methods often fail to identify genes with explicit spatial expression patterns due to the lack of consideration for spatial domains. Considering spatial domains is crucial for identifying SVGs as it focuses the analysis of gene expression changes on biologically relevant regions, aiding in the more accurate identification of SVGs associated with specific cell types. Existing methods for identifying SVGs based on spatial domains predefine spot similarity before training, which prevents adaptive learning and limits generalizability across different tissues or samples. This limitation may also lead to inaccurate identification of specific genes at boundary regions. To address these issues, we present GAADE, an unsupervised neural network architecture based on graph-structured data representation learning. GAADE stacks encoder/decoder layers and integrates a self-attention mechanism to reconstruct node attributes and graph structure, effectively capturing spatial domain structures of different sections. Consequently, we confine the identification of SVGs within spatial domains. By performing differential expression analysis on spots within the target spatial domain and their multi-order neighbors, GAADE detects genes with enriched expression patterns within defined domains. Comparative evaluations with five other popular methods on ST datasets across four different species, regions and tissues demonstrate that GAADE exhibits superior performance in detecting SVGs and capturing the extent of spatial gene expression variation.
Collapse
Affiliation(s)
- Tianjiao Zhang
- College of Computer and Control Engineering, Northeast Forestry University, No. 26 Hexing Road, Xiangfang District, Harbin 150040, China
| | - Hao Sun
- College of Computer and Control Engineering, Northeast Forestry University, No. 26 Hexing Road, Xiangfang District, Harbin 150040, China
| | - Zhenao Wu
- College of Computer and Control Engineering, Northeast Forestry University, No. 26 Hexing Road, Xiangfang District, Harbin 150040, China
| | - Zhongqian Zhao
- College of Computer and Control Engineering, Northeast Forestry University, No. 26 Hexing Road, Xiangfang District, Harbin 150040, China
| | - Xingjie Zhao
- College of Computer and Control Engineering, Northeast Forestry University, No. 26 Hexing Road, Xiangfang District, Harbin 150040, China
| | - Hongfei Zhang
- College of Computer and Control Engineering, Northeast Forestry University, No. 26 Hexing Road, Xiangfang District, Harbin 150040, China
| | - Bo Gao
- Department of Radiology, The Second Affiliated Hospital of Harbin Medical University, No. 246 Xuefu Road, Nangang District, Harbin 150081, China
| | - Guohua Wang
- College of Computer and Control Engineering, Northeast Forestry University, No. 26 Hexing Road, Xiangfang District, Harbin 150040, China
- Faculty of Computing, Harbin Institute of Technology, No. 92 West Da Zhi Street, Nangang District, Harbin 150001, China
| |
Collapse
|
2
|
Luo X, Chi ASY, Lin AH, Ong TJ, Wong L, Rahman CR. Benchmarking recent computational tools for DNA-binding protein identification. Brief Bioinform 2024; 26:bbae634. [PMID: 39657630 PMCID: PMC11630855 DOI: 10.1093/bib/bbae634] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/05/2024] [Revised: 10/29/2024] [Accepted: 11/20/2024] [Indexed: 12/12/2024] Open
Abstract
Identification of DNA-binding proteins (DBPs) is a crucial task in genome annotation, as it aids in understanding gene regulation, DNA replication, transcriptional control, and various cellular processes. In this paper, we conduct an unbiased benchmarking of 11 state-of-the-art computational tools as well as traditional tools such as ScanProsite, BLAST, and HMMER for identifying DBPs. We highlight the data leakage issue in conventional datasets leading to inflated performance. We introduce new evaluation datasets to support further development. Through a comprehensive evaluation pipeline, we identify potential limitations in models, feature extraction techniques, and training methods, and recommend solutions regarding these issues. We show that combining the predictions of the two best computational tools with BLAST-based prediction significantly enhances DBP identification capability. We provide this consensus method as user-friendly software. The datasets and software are available at https://github.com/Rafeed-bot/DNA_BP_Benchmarking.
Collapse
Affiliation(s)
- Xizi Luo
- School of Computing, National University of Singapore, Singapore 119077, Singapore
| | - Amadeus Song Yi Chi
- School of Computing, National University of Singapore, Singapore 119077, Singapore
| | - Andre Huikai Lin
- School of Computing, National University of Singapore, Singapore 119077, Singapore
| | - Tze Jet Ong
- School of Computing, National University of Singapore, Singapore 119077, Singapore
| | - Limsoon Wong
- School of Computing, National University of Singapore, Singapore 119077, Singapore
| | | |
Collapse
|
3
|
Qiao G, Wang G, Li Y. Causal enhanced drug-target interaction prediction based on graph generation and multi-source information fusion. Bioinformatics 2024; 40:btae570. [PMID: 39312682 PMCID: PMC11639159 DOI: 10.1093/bioinformatics/btae570] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/18/2024] [Revised: 08/17/2024] [Accepted: 09/20/2024] [Indexed: 09/25/2024] Open
Abstract
MOTIVATION The prediction of drug-target interaction is a vital task in the biomedical field, aiding in the discovery of potential molecular targets of drugs and the development of targeted therapy methods with higher efficacy and fewer side effects. Although there are various methods for drug-target interaction (DTI) prediction based on heterogeneous information networks, these methods face challenges in capturing the fundamental interaction between drugs and targets and ensuring the interpretability of the model. Moreover, they need to construct meta-paths artificially or a lot of feature engineering (prior knowledge), and graph generation can fuse information more flexibly without meta-path selection. RESULTS We propose a causal enhanced method for drug-target interaction (CE-DTI) prediction that integrates graph generation and multi-source information fusion. First, we represent drugs and targets by modeling the fusion of their multi-source information through automatic graph generation. Once drugs and targets are combined, a network of drug-target pairs is constructed, transforming the prediction of drug-target interactions into a node classification problem. Specifically, the influence of surrounding nodes on the central node is separated into two groups: causal and non-causal variable nodes. Causal variable nodes significantly impact the central node's classification, while non-causal variable nodes do not. Causal invariance is then used to enhance the contrastive learning of the drug-target pairs network. Our method demonstrates excellent performance compared with other competitive benchmark methods across multiple datasets. At the same time, the experimental results also show that the causal enhancement strategy can explore the potential causal effects between DTPs, and discover new potential targets. Additionally, case studies demonstrate that this method can identify potential drug targets. AVAILABILITY AND IMPLEMENTATION The source code of AdaDR is available at: https://github.com/catly/CE-DTI.
Collapse
Affiliation(s)
- Guanyu Qiao
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China
| | - Guohua Wang
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China
- College of Computer and Control Engineering, Northeast Forestry University, Harbin 150040, China
| | - Yang Li
- College of Computer and Control Engineering, Northeast Forestry University, Harbin 150040, China
| |
Collapse
|
4
|
Pradhan UK, Meher PK, Naha S, Sharma NK, Agarwal A, Gupta A, Parsad R. DBPMod: a supervised learning model for computational recognition of DNA-binding proteins in model organisms. Brief Funct Genomics 2024; 23:363-372. [PMID: 37651627 DOI: 10.1093/bfgp/elad039] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/08/2023] [Revised: 08/09/2023] [Accepted: 08/15/2023] [Indexed: 09/02/2023] Open
Abstract
DNA-binding proteins (DBPs) play critical roles in many biological processes, including gene expression, DNA replication, recombination and repair. Understanding the molecular mechanisms underlying these processes depends on the precise identification of DBPs. In recent times, several computational methods have been developed to identify DBPs. However, because of the generic nature of the models, these models are unable to identify species-specific DBPs with higher accuracy. Therefore, a species-specific computational model is needed to predict species-specific DBPs. In this paper, we introduce the computational DBPMod method, which makes use of a machine learning approach to identify species-specific DBPs. For prediction, both shallow learning algorithms and deep learning models were used, with shallow learning models achieving higher accuracy. Additionally, the evolutionary features outperformed sequence-derived features in terms of accuracy. Five model organisms, including Caenorhabditis elegans, Drosophila melanogaster, Escherichia coli, Homo sapiens and Mus musculus, were used to assess the performance of DBPMod. Five-fold cross-validation and independent test set analyses were used to evaluate the prediction accuracy in terms of area under receiver operating characteristic curve (auROC) and area under precision-recall curve (auPRC), which was found to be ~89-92% and ~89-95%, respectively. The comparative results demonstrate that the DBPMod outperforms 12 current state-of-the-art computational approaches in identifying the DBPs for all five model organisms. We further developed the web server of DBPMod to make it easier for researchers to detect DBPs and is publicly available at https://iasri-sg.icar.gov.in/dbpmod/. DBPMod is expected to be an invaluable tool for discovering DBPs, supplementing the current experimental and computational methods.
Collapse
Affiliation(s)
- Upendra K Pradhan
- Division of Statistical Genetics, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi 110012, India
| | - Prabina K Meher
- Division of Statistical Genetics, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi 110012, India
| | - Sanchita Naha
- Division of Computer Applications, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi 110012, India
| | - Nitesh K Sharma
- Titus Family Department of Clinical Pharmacy, USC Alfred E. Mann School of Pharmacy and Pharmaceutical Sciences, University of Southern California, 1540 Alcazar Street, Los Angeles, CA 90033, USA
| | - Aarushi Agarwal
- Amity Institute of Biotechnology, Amity University, Noida, Uttar Pradesh 201313, India
| | - Ajit Gupta
- Division of Statistical Genetics, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi 110012, India
| | - Rajender Parsad
- ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi 110012, India
| |
Collapse
|
5
|
Shi J, Chen Y, Wang Y. Deep learning and machine learning approaches to classify stomach distant metastatic tumors using DNA methylation profiles. Comput Biol Med 2024; 175:108496. [PMID: 38657466 DOI: 10.1016/j.compbiomed.2024.108496] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/09/2024] [Revised: 04/14/2024] [Accepted: 04/21/2024] [Indexed: 04/26/2024]
Abstract
Distant metastasis of cancer is a significant contributor to cancer-related complications, and early identification of unidentified stomach adenocarcinoma is crucial for a positive prognosis. Changes inDNA methylation are being increasingly recognized as a crucial factor in predicting cancer progression. Within this research, we developed machine learning and deep learning models for distinguishing distant metastasis in samples of stomach adenocarcinoma based on DNA methylation profile. Employing deep neural networks (DNN), support vector machines (SVM), random forest (RF), Naive Bayes (NB) and decision tree (DT), and models for forecasting distant metastasis in stomach adenocarcinoma. The results show that the performance of DNN is better than that of other models, AUC and AUPR achieving 99.9 % and 99.5 % respectively. Additionally, a weighted random sampling technique was utilized to address the issue of imbalanced datasets, enabling the identification of crucial methylation markers associated with functionally significant genes in stomach distant metastasis tumors with greater performance.
Collapse
Affiliation(s)
- Jing Shi
- Department of Medical Oncology, The First Hospital of China Medical University, Shenyang, China
| | - Ying Chen
- Department of Medical Oncology, The First Hospital of China Medical University, Shenyang, China
| | - Ying Wang
- Department of Endoscopy, The First Hospital of China Medical University, Shenyang, China.
| |
Collapse
|
6
|
Datta S, Nabeel Asim M, Dengel A, Ahmed S. NTpred: a robust and precise machine learning framework for in silico identification of Tyrosine nitration sites in protein sequences. Brief Funct Genomics 2024; 23:163-179. [PMID: 37248673 DOI: 10.1093/bfgp/elad018] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/14/2023] [Revised: 04/12/2023] [Accepted: 05/02/2023] [Indexed: 05/31/2023] Open
Abstract
Post-translational modifications (PTMs) either enhance a protein's activity in various sub-cellular processes, or degrade their activity which leads toward failure of intracellular processes. Tyrosine nitration (NT) modification degrades protein's activity that initiates and propagates various diseases including neurodegenerative, cardiovascular, autoimmune diseases and carcinogenesis. Identification of NT modification supports development of novel therapies and drug discoveries for associated diseases. Identification of NT modification in biochemical labs is expensive, time consuming and error-prone. To supplement this process, several computational approaches have been proposed. However these approaches fail to precisely identify NT modification, due to the extraction of irrelevant, redundant and less discriminative features from protein sequences. This paper presents the NTpred framework that is competent in extracting comprehensive features from raw protein sequences using four different sequence encoders. To reap the benefits of different encoders, it generates four additional feature spaces by fusing different combinations of individual encodings. Furthermore, it eradicates irrelevant and redundant features from eight different feature spaces through a Recursive Feature Elimination process. Selected features of four individual encodings and four feature fusion vectors are used to train eight different Gradient Boosted Tree classifiers. The probability scores from the trained classifiers are utilized to generate a new probabilistic feature space, which is used to train a Logistic Regression classifier. On the BD1 benchmark dataset, the proposed framework outperforms the existing best-performing predictor in 5-fold cross validation and independent test evaluation with combined improvement of 13.7% in MCC and 20.1% in AUC. Similarly, on the BD2 benchmark dataset, the proposed framework outperforms the existing best-performing predictor with combined improvement of 5.3% in MCC and 1.0% in AUC. NTpred is publicly available for further experimentation and predictive use at: https://sds_genetic_analysis.opendfki.de/PredNTS/.
Collapse
Affiliation(s)
- Sourajyoti Datta
- Department of Computer Science, Rheinland Pfälzische Technische Universität, Kaiserslautern, 67663, Germany
| | - Muhammad Nabeel Asim
- German Research Center for Artificial Intelligence, Kaiserslautern, 67663, Germany
| | - Andreas Dengel
- Department of Computer Science, Rheinland Pfälzische Technische Universität, Kaiserslautern, 67663, Germany
- German Research Center for Artificial Intelligence, Kaiserslautern, 67663, Germany
| | - Sheraz Ahmed
- German Research Center for Artificial Intelligence, Kaiserslautern, 67663, Germany
| |
Collapse
|
7
|
Li W, Li G, Sun Y, Zhang L, Cui X, Jia Y, Zhao T. Prediction of SARS-CoV-2 Infection Phosphorylation Sites and Associations of these Modifications with Lung Cancer Development. Curr Gene Ther 2024; 24:239-248. [PMID: 37957848 DOI: 10.2174/0115665232268074231026111634] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/20/2023] [Revised: 10/05/2023] [Accepted: 10/07/2023] [Indexed: 11/15/2023]
Abstract
INTRODUCTION Since the emergence of SARS-CoV-2 viruses, multiple mutant strains have been identified. Infection with SARS-CoV-2 virus leads to alterations in host cell phosphorylation signal, which systematically modulates the immune response. METHODS Identification and analysis of SARS-CoV-2 virus infection phosphorylation sites enable insight into the mechanisms of viral infection and effects on host cells, providing important fundamental data for the study and development of potent drugs for the treatment of immune inflammatory diseases. In this paper, we have analyzed the SARS-CoV-2 virus-infected phosphorylation region and developed a transformer-based deep learning-assisted identification method for the specific identification of phosphorylation sites in SARS-CoV-2 virus-infected host cells. RESULTS Furthermore, through association analysis with lung cancer, we found that SARS-CoV-2 infection may affect the regulatory role of the immune system, leading to an abnormal increase or decrease in the immune inflammatory response, which may be associated with the development and progression of cancer. CONCLUSION We anticipate that this study will provide an important reference for SARS-CoV-2 virus evolution as well as immune-related studies and provide a reliable complementary screening tool for anti-SARS-CoV-2 virus drug and vaccine design.
Collapse
Affiliation(s)
- Wei Li
- Institute of Bioinformatics, Harbin Institute of Technology, Harbin, China
| | - Gen Li
- Department of Radiation Oncology, Harbin Medical University Cancer Hospital, Harbin, China
| | - Yuzhi Sun
- Institute for Bioinformatics, School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
| | - Liyuan Zhang
- Institute for Bioinformatics, School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
| | - Xinran Cui
- Institute for Bioinformatics, School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
| | - Yuran Jia
- Institute for Bioinformatics, School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
| | - Tianyi Zhao
- School of Medicine and Health, Harbin Institute of Technology, Harbin, China
| |
Collapse
|
8
|
Hu J, Zeng WW, Jia NX, Arif M, Yu DJ, Zhang GJ. Improving DNA-Binding Protein Prediction Using Three-Part Sequence-Order Feature Extraction and a Deep Neural Network Algorithm. J Chem Inf Model 2023; 63:1044-1057. [PMID: 36719781 DOI: 10.1021/acs.jcim.2c00943] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/01/2023]
Abstract
Identification of the DNA-binding protein (DBP) helps dig out information embedded in the DNA-protein interaction, which is significant to understanding the mechanisms of DNA replication, transcription, and repair. Although existing computational methods for predicting the DBPs based on protein sequences have obtained great success, there is still room for improvement since the sequence-order information is not fully mined in these methods. In this study, a new three-part sequence-order feature extraction (called TPSO) strategy is developed to extract more discriminative information from protein sequences for predicting the DBPs. For each query protein, TPSO first divides its primary sequence features into N- and C-terminal fragments and then extracts the numerical pseudo features of three parts including the full sequence and these two fragments, respectively. Based on TPSO, a novel deep learning-based method, called TPSO-DBP, is proposed, which employs the sequence-based single-view features, the bidirectional long short-term memory (BiLSTM) and fully connected (FC) neural networks to learn the DBP prediction model. Empirical outcomes reveal that TPSO-DBP can achieve an accuracy of 87.01%, covering 85.30% of all DBPs, while achieving a Matthew's correlation coefficient value (0.741) that is significantly higher than most existing state-of-the-art DBP prediction methods. Detailed data analyses have indicated that the advantages of TPSO-DBP lie in the utilization of TPSO, which helps extract more concealed prominent patterns, and the deep neural network framework composed of BiLSTM and FC that learns the nonlinear relationships between input features and DBPs. The standalone package and web server of TPSO-DBP are freely available at https://jun-csbio.github.io/TPSO-DBP/.
Collapse
Affiliation(s)
- Jun Hu
- College of Information Engineering, Zhejiang University of Technology, Hangzhou310023, China
| | - Wen-Wu Zeng
- College of Information Engineering, Zhejiang University of Technology, Hangzhou310023, China
| | - Ning-Xin Jia
- College of Information Engineering, Zhejiang University of Technology, Hangzhou310023, China
| | - Muhammad Arif
- School of Systems and Technology, Department of Informatics and Systems, University of Management and Technology, Lahore54770, Pakistan
| | - Dong-Jun Yu
- School of Computer Science and Engineering, Nanjing University of Science and Technology, 200 Xiaolingwei, Nanjing210094, China
| | - Gui-Jun Zhang
- College of Information Engineering, Zhejiang University of Technology, Hangzhou310023, China
| |
Collapse
|
9
|
Pradhan UK, Meher PK, Naha S, Pal S, Gupta A, Parsad R. PlDBPred: a novel computational model for discovery of DNA binding proteins in plants. Brief Bioinform 2023; 24:6840070. [PMID: 36416116 DOI: 10.1093/bib/bbac483] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/26/2022] [Revised: 10/10/2022] [Accepted: 10/11/2022] [Indexed: 11/24/2022] Open
Abstract
DNA-binding proteins (DBPs) play crucial roles in numerous cellular processes including nucleotide recognition, transcriptional control and the regulation of gene expression. Majority of the existing computational techniques for identifying DBPs are mainly applicable to human and mouse datasets. Even though some models have been tested on Arabidopsis, they produce poor accuracy when applied to other plant species. Therefore, it is imperative to develop an effective computational model for predicting plant DBPs. In this study, we developed a comprehensive computational model for plant specific DBPs identification. Five shallow learning and six deep learning models were initially used for prediction, where shallow learning methods outperformed deep learning algorithms. In particular, support vector machine achieved highest repeated 5-fold cross-validation accuracy of 94.0% area under receiver operating characteristic curve (AUC-ROC) and 93.5% area under precision recall curve (AUC-PR). With an independent dataset, the developed approach secured 93.8% AUC-ROC and 94.6% AUC-PR. While compared with the state-of-art existing tools by using an independent dataset, the proposed model achieved much higher accuracy. Overall results suggest that the developed computational model is more efficient and reliable as compared to the existing models for the prediction of DBPs in plants. For the convenience of the majority of experimental scientists, the developed prediction server PlDBPred is publicly accessible at https://iasri-sg.icar.gov.in/pldbpred/.The source code is also provided at https://iasri-sg.icar.gov.in/pldbpred/source_code.php for prediction using a large-size dataset.
Collapse
Affiliation(s)
- Upendra Kumar Pradhan
- Division of Statistical Genetics, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi 110012, India
| | - Prabina Kumar Meher
- Division of Statistical Genetics, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi 110012, India
| | - Sanchita Naha
- Division of Computer Applications, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi-110012, India
| | - Soumen Pal
- Division of Computer Applications, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi-110012, India
| | - Ajit Gupta
- Division of Statistical Genetics, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi 110012, India
| | - Rajender Parsad
- ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi-110012, India
| |
Collapse
|
10
|
DNAPred_Prot: Identification of DNA-Binding Proteins Using Composition- and Position-Based Features. Appl Bionics Biomech 2022; 2022:5483115. [PMID: 35465187 PMCID: PMC9020926 DOI: 10.1155/2022/5483115] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/15/2021] [Revised: 12/25/2021] [Accepted: 02/05/2022] [Indexed: 12/29/2022] Open
Abstract
In the domain of genome annotation, the identification of DNA-binding protein is one of the crucial challenges. DNA is considered a blueprint for the cell. It contained all necessary information for building and maintaining the trait of an organism. It is DNA, which makes a living thing, a living thing. Protein interaction with DNA performs an essential role in regulating DNA functions such as DNA repair, transcription, and regulation. Identification of these proteins is a crucial task for understanding the regulation of genes. Several methods have been developed to identify the binding sites of DNA and protein depending upon the structures and sequences, but they were costly and time-consuming. Therefore, we propose a methodology named “DNAPred_Prot”, which uses various position and frequency-dependent features from protein sequences for efficient and effective prediction of DNA-binding proteins. Using testing techniques like 10-fold cross-validation and jackknife testing an accuracy of 94.95% and 95.11% was yielded, respectively. The results of SVM and ANN were also compared with those of a random forest classifier. The robustness of the proposed model was evaluated by using the independent dataset PDB186, and an accuracy of 91.47% was achieved by it. From these results, it can be predicted that the suggested methodology performs better than other extant methods for the identification of DNA-binding proteins.
Collapse
|