1
|
Abbasi AF, Asim MN, Ahmed S, Dengel A. Long extrachromosomal circular DNA identification by fusing sequence-derived features of physicochemical properties and nucleotide distribution patterns. Sci Rep 2024; 14:9466. [PMID: 38658614 PMCID: PMC11043385 DOI: 10.1038/s41598-024-57457-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/19/2023] [Accepted: 03/18/2024] [Indexed: 04/26/2024] Open
Abstract
Long extrachromosomal circular DNA (leccDNA) regulates several biological processes such as genomic instability, gene amplification, and oncogenesis. The identification of leccDNA holds significant importance to investigate its potential associations with cancer, autoimmune, cardiovascular, and neurological diseases. In addition, understanding these associations can provide valuable insights about disease mechanisms and potential therapeutic approaches. Conventionally, wet lab-based methods are utilized to identify leccDNA, which are hindered by the need for prior knowledge, and resource-intensive processes, potentially limiting their broader applicability. To empower the process of leccDNA identification across multiple species, the paper in hand presents the very first computational predictor. The proposed iLEC-DNA predictor makes use of SVM classifier along with sequence-derived nucleotide distribution patterns and physicochemical properties-based features. In addition, the study introduces a set of 12 benchmark leccDNA datasets related to three species, namely Homo sapiens (HM), Arabidopsis Thaliana (AT), and Saccharomyces cerevisiae (SC/YS). It performs large-scale experimentation across 12 benchmark datasets under different experimental settings using the proposed predictor, more than 140 baseline predictors, and 858 encoder ensembles. The proposed predictor outperforms baseline predictors and encoder ensembles across diverse leccDNA datasets by producing average performance values of 81.09%, 62.2% and 81.08% in terms of ACC, MCC and AUC-ROC across all the datasets. The source code of the proposed and baseline predictors is available at https://github.com/FAhtisham/Extrachrosmosomal-DNA-Prediction . To facilitate the scientific community, a web application for leccDNA identification is available at https://sds_genetic_analysis.opendfki.de/iLEC_DNA/.
Collapse
Affiliation(s)
- Ahtisham Fazeel Abbasi
- Department of Computer Science, Rhineland-Palatinate Technical University of Kaiserslautern-Landau, 67663, Kaiserslautern, Germany.
- German Research Center for Artificial Intelligence GmbH, 67663, Kaiserslautern, Germany.
| | - Muhammad Nabeel Asim
- German Research Center for Artificial Intelligence GmbH, 67663, Kaiserslautern, Germany.
| | - Sheraz Ahmed
- German Research Center for Artificial Intelligence GmbH, 67663, Kaiserslautern, Germany
| | - Andreas Dengel
- Department of Computer Science, Rhineland-Palatinate Technical University of Kaiserslautern-Landau, 67663, Kaiserslautern, Germany
- German Research Center for Artificial Intelligence GmbH, 67663, Kaiserslautern, Germany
| |
Collapse
|
2
|
Semisupervised Bacterial Heuristic Feature Selection Algorithm for High-Dimensional Classification with Missing Labels. INT J INTELL SYST 2023. [DOI: 10.1155/2023/4196920] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/24/2023]
Abstract
Feature selection is a crucial method for discovering relevant features in high-dimensional data. However, most studies primarily focus on completely labeled data, ignoring the frequent occurrence of missing labels in real-world problems. To address high-dimensional and label-missing problems in data classification simultaneously, we proposed a semisupervised bacterial heuristic feature selection algorithm. To track the label-missing problem, a k-nearest neighbor semisupervised learning strategy is designed to reconstruct missing labels. In addition, the bacterial heuristic algorithm is improved using hierarchical population initialization, dynamic learning, and elite population evolution strategies to enhance the search capacity for various feature combinations. To verify the effectiveness of the proposed algorithm, three groups of comparison experiments based on eight datasets are employed, including two traditional feature selection methods, four bacterial heuristic feature selection algorithms, and two swarm-based heuristic feature selection algorithms. Experimental results demonstrate that the proposed algorithm has obvious advantages in terms of classification accuracy and selected feature numbers.
Collapse
|
3
|
Wang Z, Qin C, Wan B, Song WW. A Comparative Study of Common Nature-Inspired Algorithms for Continuous Function Optimization. ENTROPY (BASEL, SWITZERLAND) 2021; 23:874. [PMID: 34356415 PMCID: PMC8304592 DOI: 10.3390/e23070874] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 04/30/2021] [Revised: 06/25/2021] [Accepted: 06/30/2021] [Indexed: 11/21/2022]
Abstract
Over previous decades, many nature-inspired optimization algorithms (NIOAs) have been proposed and applied due to their importance and significance. Some survey studies have also been made to investigate NIOAs and their variants and applications. However, these comparative studies mainly focus on one single NIOA, and there lacks a comprehensive comparative and contrastive study of the existing NIOAs. To fill this gap, we spent a great effort to conduct this comprehensive survey. In this survey, more than 120 meta-heuristic algorithms have been collected and, among them, the most popular and common 11 NIOAs are selected. Their accuracy, stability, efficiency and parameter sensitivity are evaluated based on the 30 black-box optimization benchmarking (BBOB) functions. Furthermore, we apply the Friedman test and Nemenyi test to analyze the performance of the compared NIOAs. In this survey, we provide a unified formal description of the 11 NIOAs in order to compare their similarities and differences in depth and a systematic summarization of the challenging problems and research directions for the whole NIOAs field. This comparative study attempts to provide a broader perspective and meaningful enlightenment to understand NIOAs.
Collapse
Affiliation(s)
- Zhenwu Wang
- Department of Computer Science and Technology, China University of Mining and Technology, Beijing 100083, China;
| | - Chao Qin
- Department of Computer Science and Technology, China University of Mining and Technology, Beijing 100083, China;
| | - Benting Wan
- School of Software and IoT Engineering, Jiangxi University of Finance & Economics, Nanchang 330013, China;
| | - William Wei Song
- School of Software and IoT Engineering, Jiangxi University of Finance & Economics, Nanchang 330013, China;
- Department of Information Systems, Dalarna University, S-791 88 Falun, Sweden
| |
Collapse
|
4
|
Zhang Q, Liu P, Wang X, Zhang Y, Han Y, Yu B. StackPDB: Predicting DNA-binding proteins based on XGB-RFE feature optimization and stacked ensemble classifier. Appl Soft Comput 2021. [DOI: 10.1016/j.asoc.2020.106921] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/08/2023]
|
5
|
Nayak J, Naik B, Dinesh P, Vakula K, Dash PB. Firefly Algorithm in Biomedical and Health Care: Advances, Issues and Challenges. SN COMPUTER SCIENCE 2020; 1:311. [PMID: 33063057 PMCID: PMC7519705 DOI: 10.1007/s42979-020-00320-x] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 05/21/2020] [Accepted: 09/05/2020] [Indexed: 11/30/2022]
Abstract
Since the past decades, most of the nature inspired optimization algorithms (NIOA) have been developed and become admired due to their effectiveness for resolving a variety of complex problems of dissimilar domain. Firefly algorithm (FA) is well-known, yet efficient nature inspired swarm intelligence (SI) based metaheuristic algorithm. Since from its initiation, FA has become well-liked between the researchers due to its competence and turn out to be an interesting technique for the practitioners as well as researchers for solving the problems of numerous fields of research such as classifications, clustering, neural networks, biomedical engineering, healthcare as well as other research domain. Moreover, there is an outstanding track record of FA in solving biomedical engineering (BME) and healthcare (HC) problems. Abundant complexities have been worked out with the assist of FA and its variants. By taking these particulars into concern, in this paper, a first ever in-depth analysis has been addressed on the variants, importance, applications as well as enhancements of FA in BME as well as HC. The major intention behind this investigative work is to motivate the researchers to improve and innovate new solutions for multifaceted problems of healthcare and biomedical engineering using FA.
Collapse
Affiliation(s)
- Janmenjoy Nayak
- Department of Computer Science and Engineering, Aditya Institute of Technology and Management (AITAM), K Kotturu, Tekkali, 532201 Andhra Pradesh India
| | - Bighnaraj Naik
- Department of Computer Application, Veer Surendra Sai University of Technology, Burla, 768018 Odisha India
| | - Paidi Dinesh
- Department of Computer Science and Engineering, Sri Sivani College of Engineering, Srikakulam, 532402 Andhra Pradesh India
| | - Kanithi Vakula
- Department of Computer Science and Engineering, Sri Sivani College of Engineering, Srikakulam, 532402 Andhra Pradesh India
| | - Pandit Byomakesha Dash
- Department of Computer Application, Veer Surendra Sai University of Technology, Burla, 768018 Odisha India
| |
Collapse
|
6
|
He H, Tan Y, Ying J, Zhang W. Strengthen EEG-based emotion recognition using firefly integrated optimization algorithm. Appl Soft Comput 2020. [DOI: 10.1016/j.asoc.2020.106426] [Citation(s) in RCA: 30] [Impact Index Per Article: 7.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/16/2022]
|
7
|
Du X, Diao Y, Liu H, Li S. MsDBP: Exploring DNA-Binding Proteins by Integrating Multiscale Sequence Information via Chou’s Five-Step Rule. J Proteome Res 2019; 18:3119-3132. [DOI: 10.1021/acs.jproteome.9b00226] [Citation(s) in RCA: 58] [Impact Index Per Article: 11.6] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/07/2023]
Affiliation(s)
- Xiuquan Du
- The School of Computer Science and Technology, Anhui University, Hefei, Anhui, China
| | - Yanyu Diao
- The School of Computer Science and Technology, Anhui University, Hefei, Anhui, China
| | - Heng Liu
- Department of Gastroenterology, The First Affiliated Hospital of Anhui Medical University, Hefei, Anhui, China
| | - Shuo Li
- Department of Medical Imaging, Western University, London, ON N6A 3K7, Canada
| |
Collapse
|
8
|
Lu C, Liu Z, Zhang E, He F, Ma Z, Wang H. MPLs-Pred: Predicting Membrane Protein-Ligand Binding Sites Using Hybrid Sequence-Based Features and Ligand-Specific Models. Int J Mol Sci 2019; 20:ijms20133120. [PMID: 31247932 PMCID: PMC6651575 DOI: 10.3390/ijms20133120] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/08/2019] [Revised: 06/23/2019] [Accepted: 06/23/2019] [Indexed: 02/07/2023] Open
Abstract
Membrane proteins (MPs) are involved in many essential biomolecule mechanisms as a pivotal factor in enabling the small molecule and signal transport between the two sides of the biological membrane; this is the reason that a large portion of modern medicinal drugs target MPs. Therefore, accurately identifying the membrane protein-ligand binding sites (MPLs) will significantly improve drug discovery. In this paper, we propose a sequence-based MPLs predictor called MPLs-Pred, where evolutionary profiles, topology structure, physicochemical properties, and primary sequence segment descriptors are combined as features applied to a random forest classifier, and an under-sampling scheme is used to enhance the classification capability with imbalanced samples. Additional ligand-specific models were taken into consideration in refining the prediction. The corresponding experimental results based on our method achieved an appreciable performance, with 0.63 MCC (Matthews correlation coefficient) as the overall prediction precision, and those values were 0.604, 0.7, and 0.692, respectively, for the three main types of ligands: drugs, metal ions, and biomacromolecules. MPLs-Pred is freely accessible at http://icdtools.nenu.edu.cn/.
Collapse
Affiliation(s)
- Chang Lu
- School of Information Science and Technology, Northeast Normal University, Changchun 130117, China
- Institute of Computational Biology, Northeast Normal University, Changchun 130117, China
| | - Zhe Liu
- School of Information Science and Technology, Northeast Normal University, Changchun 130117, China
- Institute of Computational Biology, Northeast Normal University, Changchun 130117, China
| | - Enju Zhang
- School of Information Science and Technology, Northeast Normal University, Changchun 130117, China
- Institute of Computational Biology, Northeast Normal University, Changchun 130117, China
| | - Fei He
- School of Information Science and Technology, Northeast Normal University, Changchun 130117, China.
- Institute of Computational Biology, Northeast Normal University, Changchun 130117, China.
| | - Zhiqiang Ma
- School of Information Science and Technology, Northeast Normal University, Changchun 130117, China.
- Institute of Computational Biology, Northeast Normal University, Changchun 130117, China.
| | - Han Wang
- School of Information Science and Technology, Northeast Normal University, Changchun 130117, China.
- Institute of Computational Biology, Northeast Normal University, Changchun 130117, China.
| |
Collapse
|
9
|
Brain storm optimization for feature selection using new individual clustering and updating mechanism. APPL INTELL 2019. [DOI: 10.1007/s10489-019-01513-5] [Citation(s) in RCA: 16] [Impact Index Per Article: 3.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/25/2022]
|
10
|
Ali F, Ahmed S, Swati ZNK, Akbar S. DP-BINDER: machine learning model for prediction of DNA-binding proteins by fusing evolutionary and physicochemical information. J Comput Aided Mol Des 2019; 33:645-658. [PMID: 31123959 DOI: 10.1007/s10822-019-00207-x] [Citation(s) in RCA: 40] [Impact Index Per Article: 8.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/08/2019] [Accepted: 05/18/2019] [Indexed: 12/28/2022]
Abstract
DNA-binding proteins (DBPs) participate in various biological processes including DNA replication, recombination, and repair. In the human genome, about 6-7% of these proteins are utilized for genes encoding. DBPs shape the DNA into a compact structure known chromatin while some of these proteins regulate the chromosome packaging and transcription process. In the pharmaceutical industry, DBPs are used as a key component of antibiotics, steroids, and cancer drugs. These proteins also involve in biophysical, biological, and biochemical studies of DNA. Due to the crucial role in various biological activities, identification of DBPs is a hot issue in protein science. A series of experimental and computational methods have been proposed, however, some methods didn't achieve the desired results while some are inadequate in its accuracy and authenticity. Still, it is highly desired to present more intelligent computational predictors. In this work, we introduce an innovative computational method namely DP-BINDER based on physicochemical and evolutionary information. We captured local highly decisive features from physicochemical properties of primary protein sequences via normalized Moreau-Broto autocorrelation (NMBAC) and evolutionary information by position specific scoring matrix-transition probability composition (PSSM-TPC) and pseudo-position specific scoring matrix (PsePSSM) using training and independent datasets. The optimal features were selected by the support vector machine-recursive feature elimination and correlation bias reduction (SVM-RFE + CBR) from fused features and were fed into random forest (RF) and support vector machine (SVM). Our method attained 92.46% and 89.58% accuracy with jackknife and ten-fold cross-validation, respectively on the training dataset, while 81.17% accuracy on the independent dataset for prediction of DBPs. These results demonstrate that our method attained the highest success rate in the literature. The superiority of DP-BINDER over existing approaches due to several reasons including abstraction of local dominant features via effective feature descriptors, utilization of appropriate feature selection algorithms and effective classifier.
Collapse
Affiliation(s)
- Farman Ali
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, 210094, China.
| | - Saeed Ahmed
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, 210094, China
| | - Zar Nawab Khan Swati
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, 210094, China
- Department of Computer Science, Karakoram International University, Gilgit, Gilgit-Baltistan, 15100, Pakistan
| | - Shahid Akbar
- Department of Computer Science, Abdul Wali Khan University, Mardan, Pakistan
| |
Collapse
|
11
|
Prediction of DNA-binding proteins by interaction fusion feature representation and selective ensemble. Knowl Based Syst 2019. [DOI: 10.1016/j.knosys.2018.09.023] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/22/2022]
|
12
|
Poursheikhali Asghari M, Abdolmaleki P. Prediction of RNA- and DNA-Binding Proteins Using Various Machine Learning Classifiers. Avicenna J Med Biotechnol 2019; 11:104-111. [PMID: 30800250 PMCID: PMC6359699] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Nucleic acid-binding proteins play major roles in different biological processes, such as transcription, splicing and translation. Therefore, the nucleic acid-binding function prediction of proteins is a step toward full functional annotation of proteins. The aim of our research was the improvement of nucleic-acid binding function prediction. METHODS In the current study, nine machine-learning algorithms were used to predict RNA- and DNA-binding proteins and also to discriminate between RNA-binding proteins and DNA-binding proteins. The electrostatic features were utilized for prediction of each function in corresponding adapted protein datasets. The leave-one-out cross-validation process was used to measure the performance of employed classifiers. RESULTS Radial basis function classifier gave the best results in predicting RNA- and DNA-binding proteins in comparison with other classifiers applied. In discriminating between RNA- and DNA-binding proteins, multilayer perceptron classifier was the best one. CONCLUSION Our findings show that the prediction of nucleic acid-binding function based on these simple electrostatic features can be improved by applied classifiers. Moreover, a reasonable progress to distinguish between RNA- and DNA-binding proteins has been achieved.
Collapse
Affiliation(s)
| | - Parviz Abdolmaleki
- Corresponding author: Parviz Abdolmaleki, Ph.D., Department of Biophysics, Faculty of Biological Sciences, Tarbiat Modares University, Tehran, Iran, Tel: +98 21 82883404, Fax: +98 21 82884457, E-mail:,
| |
Collapse
|
13
|
Zhang J, Chai H, Guo S, Guo H, Li Y. High-Throughput Identification of Mammalian Secreted Proteins Using Species-Specific Scheme and Application to Human Proteome. Molecules 2018; 23:molecules23061448. [PMID: 29903999 PMCID: PMC6099666 DOI: 10.3390/molecules23061448] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/13/2018] [Revised: 05/29/2018] [Accepted: 05/30/2018] [Indexed: 02/02/2023] Open
Abstract
Secreted proteins are widely spread in living organisms and cells. Since secreted proteins are easy to be detected in body fluids, urine, and saliva in clinical diagnosis, they play important roles in biomarkers for disease diagnosis and vaccine production. In this study, we propose a novel predictor for accurate high-throughput identification of mammalian secreted proteins that is based on sequence-derived features. We combine the features of amino acid composition, sequence motifs, and physicochemical properties to encode collected proteins. Detailed feature analyses prove the effectiveness of the considered features. Based on the differences across various species of secreted proteins, we introduce the species-specific scheme, which is expected to further explore the intrinsic attributes of specific secreted proteins. Experiments on benchmark datasets prove the effectiveness of our proposed method. The test on independent testing dataset also promises a good generalization capability. When compared with the traditional universal model, we experimentally demonstrate that the species-specific scheme is capable of significantly improving the prediction performance. We use our method to make predictions on unreviewed human proteome, and find 272 potential secreted proteins with probabilities that are higher than 99%. A user-friendly web server, named iMSPs (identification of Mammalian Secreted Proteins), which implements our proposed method, is designed and is available for free for academic use at: http://www.inforstation.com/webservers/iMSP/.
Collapse
Affiliation(s)
- Jian Zhang
- School of Computer and Information Technology, Xinyang Normal University, Xinyang 464000, China.
| | - Haiting Chai
- College of Medical, Veterinary and Life Sciences, University of Glasgow, Glasgow G12 8QQ, UK.
| | - Song Guo
- School of Computer and Information Technology, Xinyang Normal University, Xinyang 464000, China.
| | - Huaping Guo
- School of Computer and Information Technology, Xinyang Normal University, Xinyang 464000, China.
| | - Yanling Li
- School of Computer and Information Technology, Xinyang Normal University, Xinyang 464000, China.
| |
Collapse
|
14
|
Zhao XS, Bao LL, Ning Q, Ji JC, Zhao XW. An Improved Binary Differential Evolution Algorithm for Feature Selection in Molecular Signatures. Mol Inform 2017; 37:e1700081. [PMID: 29106044 DOI: 10.1002/minf.201700081] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/09/2017] [Accepted: 10/18/2017] [Indexed: 11/08/2022]
Abstract
The discovery of biomarkers from high-dimensional data is a very challenging task in cancer diagnoses. On the one hand, biomarker discovery is the so-called high-dimensional small-sample problem. On the other hand, these data are redundant and noisy. In recent years, biomarker discovery from high-throughput biological data has become an increasingly important emerging topic in the field of bioinformatics. In this study, we propose a binary differential evolution algorithm for feature selection. Firstly, we suggest using a two-stage approach, where three filter methods including the Fisher score, T-statistics, and Information gain are used to generate the feature pool for input to differential evolution (DE). Secondly, in order to improve the performance of differential evolution algorithm for feature selection, a new variant of binary DE called BDE is proposed. Three optimization strategies are incorporated into the BDE. The first strategy is the heuristic method in initial stage, the second one is the self-adaptive parameter control, and the third one is the minimum change value to improve the exploration behaviour thus enhance the diversity. Finally, Support vector machine (SVM) is used as the classifier in 10 fold cross-validation method. The experimental results of our proposed algorithm on some benchmark datasets demonstrate the effectiveness of our algorithm. In addition, the BDE forged in this study will be of great potential in feature selection problems.
Collapse
Affiliation(s)
- X S Zhao
- School of Computer Science and Information Technology, Northeast Normal University, Changchun, 130000, P.R.China
| | - L L Bao
- School of Computer Science and Information Technology, Northeast Normal University, Changchun, 130000, P.R.China
| | - Q Ning
- School of Computer Science and Information Technology, Northeast Normal University, Changchun, 130000, P.R.China
| | - J C Ji
- School of Computer Science and Information Technology, Northeast Normal University, Changchun, 130000, P.R.China
| | - X W Zhao
- School of Computer Science and Information Technology, Northeast Normal University, Changchun, 130000, P.R.China.,Key Laboratory of symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, 130012, China
| |
Collapse
|
15
|
Zhang J, Liu B. PSFM-DBT: Identifying DNA-Binding Proteins by Combing Position Specific Frequency Matrix and Distance-Bigram Transformation. Int J Mol Sci 2017; 18:ijms18091856. [PMID: 28841194 PMCID: PMC5618505 DOI: 10.3390/ijms18091856] [Citation(s) in RCA: 55] [Impact Index Per Article: 7.9] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/28/2017] [Revised: 08/19/2017] [Accepted: 08/22/2017] [Indexed: 12/30/2022] Open
Abstract
DNA-binding proteins play crucial roles in various biological processes, such as DNA replication and repair, transcriptional regulation and many other biological activities associated with DNA. Experimental recognition techniques for DNA-binding proteins identification are both time consuming and expensive. Effective methods for identifying these proteins only based on protein sequences are highly required. The key for sequence-based methods is to effectively represent protein sequences. It has been reported by various previous studies that evolutionary information is crucial for DNA-binding protein identification. In this study, we employed four methods to extract the evolutionary information from Position Specific Frequency Matrix (PSFM), including Residue Probing Transformation (RPT), Evolutionary Difference Transformation (EDT), Distance-Bigram Transformation (DBT), and Trigram Transformation (TT). The PSFMs were converted into fixed length feature vectors by these four methods, and then respectively combined with Support Vector Machines (SVMs); four predictors for identifying these proteins were constructed, including PSFM-RPT, PSFM-EDT, PSFM-DBT, and PSFM-TT. Experimental results on a widely used benchmark dataset PDB1075 and an independent dataset PDB186 showed that these four methods achieved state-of-the-art-performance, and PSFM-DBT outperformed other existing methods in this field. For practical applications, a user-friendly webserver of PSFM-DBT was established, which is available at http://bioinformatics.hitsz.edu.cn/PSFM-DBT/.
Collapse
Affiliation(s)
- Jun Zhang
- School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen 518055, China.
| | - Bin Liu
- School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen 518055, China.
| |
Collapse
|
16
|
Zhang J, Chai H, Yang G, Ma Z. Prediction of bioluminescent proteins by using sequence-derived features and lineage-specific scheme. BMC Bioinformatics 2017; 18:294. [PMID: 28583090 PMCID: PMC5460367 DOI: 10.1186/s12859-017-1709-6] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/26/2016] [Accepted: 05/25/2017] [Indexed: 11/10/2022] Open
Abstract
Background Bioluminescent proteins (BLPs) widely exist in many living organisms. As BLPs are featured by the capability of emitting lights, they can be served as biomarkers and easily detected in biomedical research, such as gene expression analysis and signal transduction pathways. Therefore, accurate identification of BLPs is important for disease diagnosis and biomedical engineering. In this paper, we propose a novel accurate sequence-based method named PredBLP (Prediction of BioLuminescent Proteins) to predict BLPs. Results We collect a series of sequence-derived features, which have been proved to be involved in the structure and function of BLPs. These features include amino acid composition, dipeptide composition, sequence motifs and physicochemical properties. We further prove that the combination of four types of features outperforms any other combinations or individual features. To remove potential irrelevant or redundant features, we also introduce Fisher Markov Selector together with Sequential Backward Selection strategy to select the optimal feature subsets. Additionally, we design a lineage-specific scheme, which is proved to be more effective than traditional universal approaches. Conclusion Experiment on benchmark datasets proves the robustness of PredBLP. We demonstrate that lineage-specific models significantly outperform universal ones. We also test the generalization capability of PredBLP based on independent testing datasets as well as newly deposited BLPs in UniProt. PredBLP is proved to be able to exceed many state-of-art methods. A web server named PredBLP, which implements the proposed method, is free available for academic use. Electronic supplementary material The online version of this article (doi:10.1186/s12859-017-1709-6) contains supplementary material, which is available to authorized users.
Collapse
Affiliation(s)
- Jian Zhang
- School of Computer Science and Information Technology, Northeast Normal University, Changchun, Jilin Province, 130117, People's Republic of China.,School of Computer and Information Technology, Xinyang Normal University, Xinyang, Henan Province, 464000, People's Republic of China
| | - Haiting Chai
- School of Computer Science and Information Technology, Northeast Normal University, Changchun, Jilin Province, 130117, People's Republic of China
| | - Guifu Yang
- School of Computer Science and Information Technology, Northeast Normal University, Changchun, Jilin Province, 130117, People's Republic of China
| | - Zhiqiang Ma
- School of Computer Science and Information Technology, Northeast Normal University, Changchun, Jilin Province, 130117, People's Republic of China.
| |
Collapse
|
17
|
Chai H, Zhang J, Yang G, Ma Z. An evolution-based DNA-binding residue predictor using a dynamic query-driven learning scheme. MOLECULAR BIOSYSTEMS 2016; 12:3643-3650. [PMID: 27730230 DOI: 10.1039/c6mb00626d] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/23/2022]
Abstract
DNA-binding proteins play a pivotal role in various biological activities. Identification of DNA-binding residues (DBRs) is of great importance for understanding the mechanism of gene regulations and chromatin remodeling. Most traditional computational methods usually construct their predictors on static non-redundant datasets. They excluded many homologous DNA-binding proteins so as to guarantee the generalization capability of their models. However, those ignored samples may potentially provide useful clues when studying protein-DNA interactions, which have not obtained enough attention. In view of this, we propose a novel method, namely DQPred-DBR, to fill the gap of DBR predictions. First, a large-scale extensible sample pool was compiled. Second, evolution-based features in the form of a relative position specific score matrix and covariant evolutionary conservation descriptors were used to encode the feature space. Third, a dynamic query-driven learning scheme was designed to make more use of proteins with known structure and functions. In comparison with a traditional static model, the introduction of dynamic models could obviously improve the prediction performance. Experimental results from the benchmark and independent datasets proved that our DQPred-DBR had promising generalization capability. It was capable of producing decent predictions and outperforms many state-of-the-art methods. For the convenience of academic use, our proposed method was also implemented as a web server at .
Collapse
Affiliation(s)
- H Chai
- School of Computer Science and Information Technology, Northeast Normal University, Changchun, 130117, P. R. China.
| | - J Zhang
- School of Computer Science and Information Technology, Northeast Normal University, Changchun, 130117, P. R. China.
| | - G Yang
- School of Computer Science and Information Technology, Northeast Normal University, Changchun, 130117, P. R. China. and Office of Informatization Management and Planning, Northeast Normal University, Changchun, 130117, P. R. China
| | - Z Ma
- School of Computer Science and Information Technology, Northeast Normal University, Changchun, 130117, P. R. China.
| |
Collapse
|