1
Ye DX, Yu JW, Li R, Hao YD, Wang TY, Yang H, Ding H. The Prediction of Recombination Hotspot Based on Automated Machine Learning. J Mol Biol 2024:168653. [PMID: 38871176 DOI: 10.1016/j.jmb.2024.168653] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/10/2024] [Revised: 05/12/2024] [Accepted: 06/06/2024] [Indexed: 06/15/2024]
Abstract
Meiotic recombination plays a pivotal role in genetic evolution. Genetic variation induced by recombination is a crucial factor in generating biodiversity and a driving force for evolution. At present, the development of recombination hotspot prediction methods faces challenges related to insufficient feature extraction and limited generalization capabilities. This paper focuses on recombination hotspot prediction methods. We explored deep learning-based recombination hotspot prediction and scrutinized the shortcomings of prevalent models in addressing this task. To address these deficiencies, an automated machine learning approach was used to construct a recombination hotspot prediction model. The model combined sequence information with physicochemical properties by employing TF-IDF-Kmer and DNA composition components to acquire more effective feature data. Experimental results validate the effectiveness of the feature extraction method and automated machine learning technology used in this study. The final model was validated on three distinct datasets and yielded accuracy rates of 97.14%, 79.71%, and 98.73%, surpassing the current leading models by 2%, 2.56%, and 4%, respectively. In addition, we incorporated tools such as SHAP and AutoGluon to analyze the interpretability of black-box models, examined the impact of individual features on the results, and investigated the reasons behind the misclassification of samples. Finally, an application for recombination hotspot prediction was established to give researchers easy access to the necessary information and tools. The research outcomes of this paper underscore the enormous potential of automated machine learning methods in gene sequence prediction.
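As a rough illustration of the TF-IDF-Kmer idea named in the abstract above, the following minimal sketch weights each overlapping k-mer by its term frequency within a sequence and its inverse document frequency across sequences. The function names, the toy sequences, and the unsmoothed `log(N / df)` formula are assumptions made for illustration; the paper's actual feature pipeline is not described here and may differ.

```python
import math
from collections import Counter


def kmers(seq, k):
    """Return all overlapping k-mers of a DNA sequence (uppercased)."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def tfidf_kmer(seqs, k=3):
    """Return one {k-mer: TF-IDF weight} dict per input sequence.

    TF  = count of the k-mer in the sequence / number of k-mers in it
    IDF = log(N / number of sequences containing the k-mer)  (assumed form)
    """
    docs = [Counter(kmers(s, k)) for s in seqs]
    n_docs = len(docs)
    doc_freq = Counter()
    for d in docs:
        doc_freq.update(d.keys())  # each sequence counts once per k-mer
    features = []
    for d in docs:
        total = sum(d.values())
        features.append({
            kmer: (count / total) * math.log(n_docs / doc_freq[kmer])
            for kmer, count in d.items()
        })
    return features


# Hypothetical toy sequences, not taken from any benchmark dataset.
toy = ["ACGTACGT", "ACGTTTTT", "GGGGACGT"]
weights = tfidf_kmer(toy, k=3)
```

A k-mer that occurs in every sequence (here `ACG`) gets weight zero, while a k-mer concentrated in one sequence (here `TTT`) gets a large weight, which is the property that makes TF-IDF useful for separating classes.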
Affiliation(s)
- Dong-Xin Ye
- School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, China
- Jun-Wen Yu
- School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, China
- Rui Li
- School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, China
- Yu-Duo Hao
- School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, China
- Tian-Yu Wang
- School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, China
- Hui Yang
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, Zhejiang, China
- Hui Ding
- School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, China
2
Yu J, Jiang W, Zhu SB, Liao Z, Dou X, Liu J, Guo FB, Dong C. Prediction of protein-coding small ORFs in multi-species using integrated sequence-derived features and the random forest model. Methods 2023; 210:10-19. [PMID: 36621557 DOI: 10.1016/j.ymeth.2022.12.003] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/07/2022] [Revised: 12/27/2022] [Accepted: 12/30/2022] [Indexed: 01/07/2023]
Abstract
Proteins encoded by small open reading frames (sORFs) can serve as functional elements playing important roles in vivo. Such sORFs also constitute a potential pool for de novo gene birth, driving evolutionary innovation and species diversity. Therefore, their theoretical and experimental identification has become a critical issue. Herein, we proposed a protein-coding sORF prediction method based solely on integrative sequence-derived features. Its prediction performance is better than or comparable to that of nine other prevalent methods, which shows that our method can provide a relatively reliable research tool for the prediction of protein-coding sORFs. Our method allows users to estimate the potential expression of a queried sORF, as demonstrated by the correlation analysis between our probability estimate and the codon adaptation index (CAI). Based on the features that we used, we demonstrated that the sequence features of protein-coding sORFs in the two domains differ significantly, implying that cross-domain prediction might be a relatively hard task. Hence, domain-specific models were developed, which allow users to predict protein-coding sORFs in both eukaryotes and prokaryotes. Finally, a web server was developed to facilitate research in the related field, which is freely available at http://guolab.whu.edu.cn/codingCapacity/index.html.
Affiliation(s)
- Jiafeng Yu
- Shandong Key Laboratory of Biophysics, Institute of Biophysics, Dezhou University, Dezhou 253023, China
- Wenwen Jiang
- Department of Bioinformatics, Nanjing University of Posts and Telecommunications, Nanjing 210023, China
- Sen-Bin Zhu
- School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 611731, China
- Zhen Liao
- School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 611731, China
- Xianghua Dou
- Shandong Key Laboratory of Biophysics, Institute of Biophysics, Dezhou University, Dezhou 253023, China
- Jian Liu
- Shandong Key Laboratory of Biophysics, Institute of Biophysics, Dezhou University, Dezhou 253023, China
- Feng-Biao Guo
- School of Pharmaceutical Sciences, Wuhan University, Wuhan 430071, China
- Chuan Dong
- School of Pharmaceutical Sciences, Wuhan University, Wuhan 430071, China
3
Shi H, Wu C, Bai T, Chen J, Li Y, Wu H. Identify essential genes based on clustering based synthetic minority oversampling technique. Comput Biol Med 2023; 153:106523. [PMID: 36652869 DOI: 10.1016/j.compbiomed.2022.106523] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/22/2022] [Revised: 12/13/2022] [Accepted: 12/31/2022] [Indexed: 01/03/2023]
Abstract
Prediction of essential genes in a living organism is one of the central tasks in synthetic biology. Computational predictors are desired because experimental data are often unavailable. Recently, some sequence-based predictors have been constructed to identify essential genes. However, their predictive performance should be further improved. One key problem is how to effectively extract sequence-based features that are able to discriminate the essential genes. Another problem is the imbalanced training set. The number of essential genes in human cell lines is lower than that of non-essential genes. Therefore, predictors trained with such an imbalanced training set tend to identify an unseen sequence as a non-essential gene. Here, a new over-sampling strategy called Clustering based Synthetic Minority Oversampling Technique (CSMOTE) was proposed to overcome the imbalanced data issue. Combining CSMOTE with the Z curve, the global features, and Support Vector Machines, a new protocol called iEsGene-CSMOTE was proposed to identify essential genes. The rigorous jackknife cross-validation results indicated that iEsGene-CSMOTE is better than the other competing methods. The proposed method outperformed the λ-interval Z curve by 35.48% and 11.25% in terms of Sn and BACC, respectively.
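To make the oversampling idea concrete, here is a minimal sketch of plain SMOTE-style interpolation, in which a synthetic minority sample is placed on the segment between a minority sample and one of its nearest minority neighbours. This is not the CSMOTE procedure from the paper: the clustering step is omitted, and the function name, parameters, and toy points are all hypothetical.

```python
import random


def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (plain SMOTE idea).

    Assumes `minority` holds at least two points of equal dimension.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        t = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(a + t * (b - a) for a, b in zip(base, neighbour))
        )
    return synthetic


# Hypothetical 2-D minority class (e.g. two features of essential genes).
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_like(minority, n_new=5)
```

Because every synthetic point lies between two existing minority samples, the minority class grows without simply duplicating records, which is what lets a classifier trained afterwards stop defaulting to the majority label.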
Affiliation(s)
- Hua Shi
- School of Opto-electronic and Communication Engineering, Xiamen University of Technology, Xiamen, China
- Chenjin Wu
- School of Opto-electronic and Communication Engineering, Xiamen University of Technology, Xiamen, China
- Tao Bai
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing, 100081, China; School of Mathematics & Computer Science, Yanan University, Shanxi, 716000, China
- Jiahai Chen
- Xiamen Sankuai Online Technology Co., Ltd, Xiamen, China
- Yan Li
- School of Opto-electronic and Communication Engineering, Xiamen University of Technology, Xiamen, China
- Hao Wu
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing, 100081, China
4
Liu X, Liu Z, Mao X, Li Q. m7GPredictor: An improved machine learning-based model for predicting internal m7G modifications using sequence properties. Anal Biochem 2020; 609:113905. [DOI: 10.1016/j.ab.2020.113905] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Received: 02/07/2020] [Revised: 07/24/2020] [Accepted: 08/05/2020] [Indexed: 12/21/2022]
5
Khan F, Khan M, Iqbal N, Khan S, Muhammad Khan D, Khan A, Wei DQ. Prediction of Recombination Spots Using Novel Hybrid Feature Extraction Method via Deep Learning Approach. Front Genet 2020; 11:539227. [PMID: 33093842 PMCID: PMC7527634 DOI: 10.3389/fgene.2020.539227] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Received: 03/10/2020] [Accepted: 08/13/2020] [Indexed: 01/20/2023]
Abstract
Meiotic recombination is the driving force of evolutionary development and an important source of genetic variation. Meiotic recombination does not take place randomly in a chromosome but occurs in certain regions of the chromosome. Regions with a higher rate of meiotic recombination events are considered hotspots, and regions where recombination events are less frequent are called coldspots. Prediction of meiotic recombination spots provides useful information about the basic functionality of inheritance and genome diversity. This study proposes an intelligent computational predictor called iRSpots-DNN for the identification of recombination spots. The proposed predictor is based on a novel feature extraction method and an optimized deep neural network (DNN). The DNN was employed as a classification engine, whereas the novel feature extraction method was developed to extract meaningful features for the identification of hotspots and coldspots across the yeast genome. Unlike previous algorithms, the proposed feature extraction avoids bias among different selected features and preserves the sequence discriminant properties along with the sequence-structure information simultaneously. This study also considered other effective classifiers, namely support vector machine (SVM), K-nearest neighbor (KNN), and random forest (RF), to predict recombination spots. Experimental results on a benchmark dataset with 10-fold cross-validation showed that iRSpots-DNN achieved the highest accuracy, i.e., 95.81%. Additionally, the performance of the proposed iRSpots-DNN is significantly better than that of the existing predictors on a benchmark dataset. The relevant benchmark dataset and source code are freely available at: https://github.com/Fatima-Khan12/iRspot_DNN/tree/master/iRspot_DNN.
Affiliation(s)
- Fatima Khan
- Department of Computer Science, Abdul Wali Khan University Mardan, Mardan, Pakistan
- Mukhtaj Khan
- Department of Computer Science, Abdul Wali Khan University Mardan, Mardan, Pakistan
- Nadeem Iqbal
- Department of Computer Science, Abdul Wali Khan University Mardan, Mardan, Pakistan
- Salman Khan
- Department of Computer Science, Abdul Wali Khan University Mardan, Mardan, Pakistan
- Dost Muhammad Khan
- Department of Statistics, Abdul Wali Khan University Mardan, Mardan, Pakistan
- Abbas Khan
- Department of Bioinformatics and Biological Statistics, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai, China
- Dong-Qing Wei
- Department of Bioinformatics and Biological Statistics, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai, China; State Key Laboratory of Microbial Metabolism, Shanghai-Islamabad-Belgrade Joint Innovation Center on Antibacterial Resistances, Joint Laboratory of International Cooperation in Metabolic and Developmental Sciences, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Ministry of Education, Shanghai, China; Peng Cheng Laboratory, Shenzhen, China
6
Yang H, Yang W, Dao FY, Lv H, Ding H, Chen W, Lin H. A comparison and assessment of computational method for identifying recombination hotspots in Saccharomyces cerevisiae. Brief Bioinform 2019; 21:1568-1580. [PMID: 31633777 DOI: 10.1093/bib/bbz123] [Citation(s) in RCA: 67] [Impact Index Per Article: 13.4] [Received: 01/13/2019] [Revised: 05/03/2019] [Accepted: 08/19/2019] [Indexed: 12/27/2022]
Abstract
Meiotic recombination is one of the most important driving forces of biological evolution, which is initiated by double-strand DNA breaks. Recombination has important roles in genome diversity and evolution. This review firstly provides a comprehensive survey of the 15 computational methods developed for identifying recombination hotspots in Saccharomyces cerevisiae. These computational methods were discussed and compared in terms of underlying algorithms, extracted features, predictive capability and practical utility. Subsequently, a more objective benchmark data set was constructed to develop a new predictor iRSpot-Pse6NC2.0 (http://lin-group.cn/server/iRSpot-Pse6NC2.0). To further demonstrate the generalization ability of these methods, we compared iRSpot-Pse6NC2.0 with existing methods on the chromosome XVI of S. cerevisiae. The results of the independent data set test demonstrated that the new predictor is superior to existing tools in the identification of recombination hotspots. The iRSpot-Pse6NC2.0 will become an important tool for identifying recombination hotspot.
Affiliation(s)
- Hui Yang
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
- Wuritu Yang
- Development and Planning Department, Inner Mongolia University, Hohhot 010021, China
- Fu-Ying Dao
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
- Hao Lv
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
- Hui Ding
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
- Wei Chen
- Innovative Institute of Chinese Medicine and Pharmacy, Chengdu University of Traditional Chinese Medicine, Chengdu 611730, China
- Hao Lin
- Center for Genomics and Computational Biology, School of Life Sciences, North China University of Science and Technology, Tangshan 063000, China
7
Zhang L, Kong L. iRSpot-PDI: Identification of recombination spots by incorporating dinucleotide property diversity information into Chou's pseudo components. Genomics 2019; 111:457-464. [DOI: 10.1016/j.ygeno.2018.03.003] [Citation(s) in RCA: 27] [Impact Index Per Article: 5.4] [Received: 10/22/2017] [Revised: 02/27/2018] [Accepted: 03/03/2018] [Indexed: 12/11/2022]
8
Zhang KY, Gao YZ, Du MZ, Liu S, Dong C, Guo FB. Vgas: A Viral Genome Annotation System. Front Microbiol 2019; 10:184. [PMID: 30814982 PMCID: PMC6381048 DOI: 10.3389/fmicb.2019.00184] [Citation(s) in RCA: 31] [Impact Index Per Article: 6.2] [Received: 07/07/2018] [Accepted: 01/23/2019] [Indexed: 11/13/2022]
Abstract
The in-depth study of viral genomes is of great help in many aspects, especially in the treatment of human diseases caused by viral infections. With the rapid accumulation of viral sequencing data, improved, or alternative gene-finding systems have become necessary to process and mine these data. In this article, we present Vgas, a system combining an ab initio method and a similarity-based method to automatically find viral genes and perform gene function annotation. Vgas was compared with existing programs, such as Prodigal, GeneMarkS, and Glimmer. Through testing 5,705 virus genomes downloaded from RefSeq, Vgas demonstrated its superiority with the highest average precision and recall (both indexes were 1% higher or more than the other programs); particularly for small virus genomes (≤ 10 kb), it showed significantly improved performance (precision was 6% higher, and recall was 2% higher). Moreover, Vgas presents an annotation module to provide functional information for predicted genes based on BLASTp alignment. This characteristic may be specifically useful in some cases. When combining Vgas with GeneMarkS and Prodigal, better prediction results could be obtained than with each of the three individual programs, suggesting that collaborative prediction using several different software programs is an alternative for gene prediction. Vgas is freely available at http://cefg.uestc.cn/vgas/ or http://121.48.162.133/vgas/. We hope that Vgas could be an alternative virus gene finder to annotate new genomes or reannotate existing genome.
Affiliation(s)
- Kai-Yue Zhang
- Centre for Informational Biology, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, China
- Yi-Zhou Gao
- Centre for Informational Biology, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, China
- Meng-Ze Du
- Centre for Informational Biology, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, China
- Shuo Liu
- Centre for Informational Biology, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, China
- Chuan Dong
- Centre for Informational Biology, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, China
- Feng-Biao Guo
- Centre for Informational Biology, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, China
9
Zhang W, Yue X, Tang G, Wu W, Huang F, Zhang X. SFPEL-LPI: Sequence-based feature projection ensemble learning for predicting LncRNA-protein interactions. PLoS Comput Biol 2018; 14:e1006616. [PMID: 30533006 PMCID: PMC6331124 DOI: 10.1371/journal.pcbi.1006616] [Citation(s) in RCA: 93] [Impact Index Per Article: 15.5] [Received: 06/19/2018] [Revised: 01/14/2019] [Accepted: 11/02/2018] [Indexed: 01/12/2023]
Abstract
LncRNA-protein interactions play important roles in post-transcriptional gene regulation, poly-adenylation, splicing and translation. Identification of lncRNA-protein interactions helps to understand lncRNA-related activities. Existing computational methods utilize multiple lncRNA features or multiple protein features to predict lncRNA-protein interactions, but features are not available for all lncRNAs or proteins; most of existing methods are not capable of predicting interacting proteins (or lncRNAs) for new lncRNAs (or proteins), which don’t have known interactions. In this paper, we propose the sequence-based feature projection ensemble learning method, “SFPEL-LPI”, to predict lncRNA-protein interactions. First, SFPEL-LPI extracts lncRNA sequence-based features and protein sequence-based features. Second, SFPEL-LPI calculates multiple lncRNA-lncRNA similarities and protein-protein similarities by using lncRNA sequences, protein sequences and known lncRNA-protein interactions. Then, SFPEL-LPI combines multiple similarities and multiple features with a feature projection ensemble learning frame. In computational experiments, SFPEL-LPI accurately predicts lncRNA-protein associations and outperforms other state-of-the-art methods. More importantly, SFPEL-LPI can be applied to new lncRNAs (or proteins). The case studies demonstrate that our method can find out novel lncRNA-protein interactions, which are confirmed by literature. Finally, we construct a user-friendly web server, available at http://www.bioinfotech.cn/SFPEL-LPI/.
Author summary
LncRNA-protein interactions play important roles in post-transcriptional gene regulation, poly-adenylation, splicing and translation. Identification of lncRNA-protein interactions helps to understand lncRNA-related activities. In this paper, we propose a novel computational method “SFPEL-LPI” to predict lncRNA-protein interactions. SFPEL-LPI makes use of lncRNA sequences, protein sequences and known lncRNA-protein associations to extract features and calculate similarities for lncRNAs and proteins, and then combines them with a feature projection ensemble learning frame. SFPEL-LPI can predict unobserved interactions between lncRNAs and proteins, and also can make predictions for new lncRNAs (or proteins), which have no interactions with any proteins (or lncRNAs). SFPEL-LPI produces high-accuracy performances on the benchmark dataset when evaluated by five-fold cross validation, and outperforms state-of-the-art methods. The case studies demonstrate that SFPEL-LPI can find out novel associations, which are confirmed by literature. To facilitate the lncRNA-protein interaction prediction, we develop a user-friendly web server, available at http://www.bioinfotech.cn/SFPEL-LPI/.
Affiliation(s)
- Wen Zhang
- College of Informatics, Huazhong Agricultural University, Wuhan, China
- School of Computer Science, Wuhan University, Wuhan, China
- Xiang Yue
- Department of Computer Science and Engineering, The Ohio State University, Columbus, United States of America
- Guifeng Tang
- School of Computer Science, Wuhan University, Wuhan, China
- Wenjian Wu
- Electronic Information School, Wuhan University, Wuhan, China
- Feng Huang
- School of Computer Science, Wuhan University, Wuhan, China
- Xining Zhang
- School of Computer Science, Wuhan University, Wuhan, China
10
Tahir M, Hayat M, Khan SA. iNuc-ext-PseTNC: an efficient ensemble model for identification of nucleosome positioning by extending the concept of Chou's PseAAC to pseudo-tri-nucleotide composition. Mol Genet Genomics 2018; 294:199-210. [PMID: 30291426 DOI: 10.1007/s00438-018-1498-2] [Citation(s) in RCA: 25] [Impact Index Per Article: 4.2] [Received: 04/30/2018] [Accepted: 09/28/2018] [Indexed: 10/28/2022]
Abstract
The nucleosome is a central element of eukaryotic chromatin, which is composed of histone proteins and DNA molecules. It performs vital roles in many eukaryotic intra-nuclear processes, for instance, chromatin structure formation and transcriptional regulation. Identification of nucleosome positioning in the wet lab is difficult, so attention has turned towards accurate, intelligent automated prediction. In this regard, a novel intelligent automated model, "iNuc-ext-PseTNC", is developed to identify nucleosome positioning in genomes accurately. In this predictor, DNA sequences are mathematically represented by two different discrete feature extraction techniques, namely pseudo-tri-nucleotide composition (PseTNC) and pseudo-di-nucleotide composition. Several contemporary machine learning algorithms were examined. Further, the predictions of individual classifiers were integrated through an evolutionary genetic algorithm. The success rates of the ensemble model are higher than those of the individual classifiers. Analysis of the prediction results shows that the iNuc-ext-PseTNC model achieved its best performance in combination with the PseTNC feature space, with accuracies of 94.3%, 93.14%, and 88.60% using the six-fold cross-validation test for the three benchmark datasets S1, S2, and S3, respectively. These outcomes show that the iNuc-ext-PseTNC model outperforms the existing methods reported so far in the literature. The proposed model may therefore be a useful and practical tool for basic academic research.
Affiliation(s)
- Muhammad Tahir
- Department of Computer Science, Abdul Wali Khan University Mardan, Mardan, KP, Pakistan
- Maqsood Hayat
- Department of Computer Science, Abdul Wali Khan University Mardan, Mardan, KP, Pakistan
- Sher Afzal Khan
- Department of Computer Science, Abdul Wali Khan University Mardan, Mardan, KP, Pakistan
11
Al Maruf MA, Shatabda S. iRSpot-SF: Prediction of recombination hotspots by incorporating sequence based features into Chou's Pseudo components. Genomics 2018; 111:966-972. [PMID: 29935224 DOI: 10.1016/j.ygeno.2018.06.003] [Citation(s) in RCA: 30] [Impact Index Per Article: 5.0] [Received: 04/04/2018] [Revised: 06/09/2018] [Accepted: 06/13/2018] [Indexed: 11/28/2022]
Abstract
Recombination hotspots in a genome are unevenly distributed. Hotspots are regions in a genome that show higher rates of meiotic recombination. Computational methods for recombination hotspot prediction often use sophisticated features that are derived from physico-chemical or structure-based properties of nucleotides. In this paper, we propose iRSpot-SF, which uses sequence-based features that are computationally cheap to generate. Four feature groups are used in our method: k-mer composition, gapped k-mer composition, TF-IDF of k-mers and reverse complement k-mer composition. We have used recursive feature elimination to select the 17 top features for hotspot prediction. Our analysis shows the superiority of gapped k-mer composition and reverse complement k-mer composition features over the others. We have used an SVM with an RBF kernel as the classification algorithm. We have tested our algorithm on standard benchmark datasets. Compared to other methods, iRSpot-SF is able to produce significantly better results in terms of accuracy, Matthews Correlation Coefficient and sensitivity, which are 84.58%, 0.6941 and 84.57%, respectively. We have made our method readily available as a Python-based tool and made the datasets and source code available at: https://github.com/abdlmaruf/iRSpot-SF. A web application based on iRSpot-SF has been developed and is freely available at: http://irspot.pythonanywhere.com/server.html.
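Three of the feature groups listed in the abstract above can be sketched in a few lines. This is an illustrative reading of the terms, not the iRSpot-SF implementation: the function names are hypothetical, the gapped variant shown is the simplest one (two nucleotides separated by a fixed gap), and merging a k-mer with its reverse complement under the lexicographically smaller key is an assumed convention.

```python
from collections import Counter
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]


def kmer_composition(seq, k):
    """Normalised frequency of each of the 4**k possible k-mers."""
    n = len(seq) - k + 1
    counts = Counter(seq[i:i + k] for i in range(n))
    total = max(n, 1)
    return {"".join(p): counts["".join(p)] / total
            for p in product("ACGT", repeat=k)}


def gapped_pair_composition(seq, gap):
    """Frequency of nucleotide pairs separated by `gap` skipped positions."""
    n = len(seq) - gap - 1
    counts = Counter(seq[i] + seq[i + gap + 1] for i in range(n))
    total = max(n, 1)
    return {a + b: counts[a + b] / total for a in "ACGT" for b in "ACGT"}


def rc_kmer_composition(seq, k):
    """Count each k-mer together with its reverse complement."""
    n = len(seq) - k + 1
    merged = Counter()
    for i in range(n):
        kmer = seq[i:i + k]
        merged[min(kmer, reverse_complement(kmer))] += 1
    total = max(n, 1)
    return {kmer: c / total for kmer, c in merged.items()}


seq = "ACGTAC"  # hypothetical toy sequence
plain = kmer_composition(seq, 2)
gapped = gapped_pair_composition(seq, 1)
merged = rc_kmer_composition(seq, 2)
```

All three functions need only counting, which is consistent with the abstract's point that such features are computationally cheap compared with physico-chemical descriptors.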
Affiliation(s)
- Md Abdullah Al Maruf
- Department of Computer Science and Engineering, United International University, Madani Avenue, Satarkul, Badda, Dhaka 1212, Bangladesh
- Swakkhar Shatabda
- Department of Computer Science and Engineering, United International University, Madani Avenue, Satarkul, Badda, Dhaka 1212, Bangladesh
12
Yang H, Qiu WR, Liu G, Guo FB, Chen W, Chou KC, Lin H. iRSpot-Pse6NC: Identifying recombination spots in Saccharomyces cerevisiae by incorporating hexamer composition into general PseKNC. Int J Biol Sci 2018; 14:883-891. [PMID: 29989083 PMCID: PMC6036749 DOI: 10.7150/ijbs.24616] [Citation(s) in RCA: 135] [Impact Index Per Article: 22.5] [Received: 12/28/2017] [Accepted: 02/04/2018] [Indexed: 02/06/2023]
Abstract
Meiotic recombination is caused by meiotic double-strand DNA breaks. In some regions the frequency of DNA recombination is relatively higher, while in other regions the frequency is lower: the former is usually called a "recombination hotspot", while the latter is called a "recombination coldspot". Information on the hot and cold spots may provide important clues for understanding the mechanism of genome evolution. Therefore, it is important to accurately predict these spots. In this study, we rebuilt the benchmark dataset by unifying its samples to the same length (131 bp). On this foundation, and using an SVM (Support Vector Machine) classifier, a new predictor called "iRSpot-Pse6NC" was developed by incorporating the key hexamer features into the general PseKNC (Pseudo K-tuple Nucleotide Composition) via the binomial distribution approach. Rigorous cross-validation showed that the proposed predictor is superior to its counterparts in overall accuracy, stability, sensitivity and specificity. For the convenience of most experimental scientists, the web server for iRSpot-Pse6NC has been established at http://lin-group.cn/server/iRSpot-Pse6NC, by which users can easily obtain their desired results without the need to go through the detailed mathematical equations involved.
Affiliation(s)
- Hui Yang
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
- Wang-Ren Qiu
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China; Computer Department, Jingdezhen Ceramic Institute, Jingdezhen, 333403, China
- Guoqing Liu
- School of Life Science and Technology, Inner Mongolia University of Science and Technology, Baotou, 014010, China
- Feng-Biao Guo
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
- Wei Chen
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China; Department of Physics, School of Sciences, and Center for Genomics and Computational Biology, North China University of Science and Technology, Tangshan 063000, China; Gordon Life Science Institute, Boston, MA 02478, USA
- Kuo-Chen Chou
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China; Gordon Life Science Institute, Boston, MA 02478, USA
- Hao Lin
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China; Gordon Life Science Institute, Boston, MA 02478, USA
| |
Collapse
|
13
|
Guo FB, Dong C, Hua HL, Liu S, Luo H, Zhang HW, Jin YT, Zhang KY. Accurate prediction of human essential genes using only nucleotide composition and association information. Bioinformatics 2017; 33:1758-1764. [PMID: 28158612 PMCID: PMC7110051 DOI: 10.1093/bioinformatics/btx055] [Citation(s) in RCA: 36] [Impact Index Per Article: 6.0] [Received: 11/22/2016] [Accepted: 01/25/2017] [Indexed: 12/20/2022] Open
Abstract
Motivation: Previously constructed classifiers for predicting eukaryotic essential genes integrated a variety of features, including experimental ones. Satisfactory prediction from nucleotide (sequence) information alone would be more promising. Three groups recently identified essential genes in human cancer cell lines through wet-lab experiments, which provided an excellent opportunity to test this idea. Here we extended the Z curve method into a λ-interval form to represent nucleotide composition and association information, and used it to construct an SVM classification model.
Results: Our model accurately predicted human gene essentiality, with an AUC higher than 0.88 in both 5-fold cross-validation and jackknife tests. These results demonstrate that the essentiality of human genes can be reliably inferred from sequence information alone. We re-predicted the negative dataset with our Pheg server, and 118 genes were additionally predicted as essential. Among them, 20 were found to be homologues of mouse essential genes, indicating that some of the 118 genes are indeed essential although previous experiments overlooked them. As the first available server of its kind, Pheg can predict essentiality for anonymous human gene sequences. It is also hoped that the λ-interval Z curve method can be extended to classification problems for other DNA elements.
Availability and Implementation: http://cefg.uestc.edu.cn/Pheg
Contact: fbguo@uestc.edu.cn
Supplementary information: Supplementary data are available at Bioinformatics online.
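For orientation, the classic Z curve summarizes base composition as three coordinates: purine versus pyrimidine, amino versus keto, and weak versus strong hydrogen bonding. The Python sketch below computes these frequency-based components for a whole sequence. It is not the λ-interval variant described in the abstract, whose exact definition is not given here, and the function name `z_curve_components` is hypothetical.

```python
def z_curve_components(seq):
    """Return the (x, y, z) Z-curve components from base frequencies.

    x: purines (A, G) minus pyrimidines (C, T)
    y: amino bases (A, C) minus keto bases (G, T)
    z: weak H-bond bases (A, T) minus strong H-bond bases (G, C)
    Frequencies are taken over the full sequence length, so ambiguous
    symbols such as N dilute all three components.
    """
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        return (0.0, 0.0, 0.0)
    a, c, g, t = (seq.count(base) / n for base in "ACGT")
    x = (a + g) - (c + t)
    y = (a + c) - (g + t)
    z = (a + t) - (g + c)
    return (x, y, z)
```

Applying the same transform to codon positions or to dinucleotide frequencies yields longer feature vectors; how the authors combine such terms across intervals is left to the paper itself.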
Affiliation(s)
- Feng-Biao Guo
  - School of Life Science and Technology, Center for Informational Biology and Key Laboratory for Neuro-information of the Ministry of Education, University of Electronic Science and Technology of China, Chengdu, China
- Chuan Dong
  - School of Life Science and Technology, Center for Informational Biology and Key Laboratory for Neuro-information of the Ministry of Education, University of Electronic Science and Technology of China, Chengdu, China
- Hong-Li Hua
  - School of Life Science and Technology, Center for Informational Biology and Key Laboratory for Neuro-information of the Ministry of Education, University of Electronic Science and Technology of China, Chengdu, China
- Shuo Liu
  - School of Life Science and Technology, Center for Informational Biology and Key Laboratory for Neuro-information of the Ministry of Education, University of Electronic Science and Technology of China, Chengdu, China
- Hao Luo
  - Department of Physics, Tianjin University, Tianjin, China
- Hong-Wan Zhang
  - School of Life Science and Technology, Center for Informational Biology and Key Laboratory for Neuro-information of the Ministry of Education, University of Electronic Science and Technology of China, Chengdu, China
- Yan-Ting Jin
  - School of Life Science and Technology, Center for Informational Biology and Key Laboratory for Neuro-information of the Ministry of Education, University of Electronic Science and Technology of China, Chengdu, China
- Kai-Yue Zhang
  - School of Life Science and Technology, Center for Informational Biology and Key Laboratory for Neuro-information of the Ministry of Education, University of Electronic Science and Technology of China, Chengdu, China
14
Zhang L, Kong L. iRSpot-ADPM: Identify recombination spots by incorporating the associated dinucleotide product model into Chou's pseudo components. J Theor Biol 2018; 441:1-8. [PMID: 29305179 DOI: 10.1016/j.jtbi.2017.12.025] [Citation(s) in RCA: 44] [Impact Index Per Article: 7.3] [Received: 09/19/2017] [Revised: 12/18/2017] [Accepted: 12/24/2017] [Indexed: 10/18/2022]
Abstract
Gene recombination is a key process that produces hereditary differences. Recombination spot identification plays an important role in revealing genome evolution and promoting the study of DNA function. However, traditional experiments cannot keep pace with the huge number of DNA sequences generated by sequencing. Several machine learning methods have been proposed to speed up identification, but they often ignore the correlations between nucleotide pairs at different positions along the DNA sequence, even though these correlations carry important sequence-order information. To address this, this study proposes a novel feature extraction method, called iRSpot-ADPM, based on DNA properties in a given DNA sequence. 85 features are selected from the original feature set according to the weights calculated by a support vector machine. Five-fold cross-validation tests on two widely used benchmark datasets indicate that the proposed method outperforms its existing counterparts in specificity (Spec), Matthews correlation coefficient (MCC) and overall accuracy (OA). The experimental results show that the proposed method is effective for accurate recombination spot identification. Moreover, it is anticipated that the proposed method could be extended to other biological sequences and be helpful in future research. The datasets and Matlab source code can be downloaded from http://stxy.neuq.edu.cn/info/1095/1157.htm.
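The three measures named in this abstract have standard definitions in terms of confusion-matrix counts. As a minimal Python sketch (the function name `binary_metrics` and the choice to return 0.0 for undefined ratios are assumptions, not taken from the paper):

```python
from math import sqrt


def binary_metrics(tp, tn, fp, fn):
    """Return sensitivity, specificity, overall accuracy and MCC.

    tp/fn: hotspots predicted correctly/incorrectly.
    tn/fp: coldspots predicted correctly/incorrectly.
    Ratios with a zero denominator are reported as 0.0.
    """
    total = tp + tn + fp + fn
    sens = tp / (tp + fn) if (tp + fn) else 0.0
    spec = tn / (tn + fp) if (tn + fp) else 0.0
    acc = (tp + tn) / total if total else 0.0
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sens": sens, "Spec": spec, "OA": acc, "MCC": mcc}
```

In a five-fold cross-validation these counts would be accumulated over the five held-out folds before the measures are computed, or the measures would be averaged per fold; the abstract does not say which convention was used.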
Affiliation(s)
- Lichao Zhang
  - School of Mathematics and Statistics, Northeastern University at Qinhuangdao, Qinhuangdao 066004, PR China
- Liang Kong
  - School of Mathematics and Information Science & Technology, Hebei Normal University of Science & Technology, Qinhuangdao 066004, PR China
15
Du PF, Zhao W, Miao YY, Wei LY, Wang L. UltraPse: A Universal and Extensible Software Platform for Representing Biological Sequences. Int J Mol Sci 2017; 18:2400. [PMID: 29135934 PMCID: PMC5713368 DOI: 10.3390/ijms18112400] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.1] [Received: 10/10/2017] [Revised: 11/01/2017] [Accepted: 11/03/2017] [Indexed: 01/12/2023] Open
Abstract
With the avalanche of biological sequences in public databases, one of the most challenging problems in computational biology is to predict their biological functions and cellular attributes. Most existing prediction algorithms can only handle fixed-length numerical vectors, so it is important to be able to represent biological sequences of various lengths as fixed-length numerical vectors. Although several algorithms and software implementations have been developed to address this problem, these programs provide only a fixed number of representation modes: every time a new sequence representation mode is developed, a new program is needed. In this paper, we propose UltraPse as a universal software platform for this problem. UltraPse not only generates various existing sequence representation modes but also simplifies future programming work in developing novel ones. Its extensibility is particularly enhanced: it allows users to define their own representation modes, their own physicochemical properties, or even their own types of biological sequences. Moreover, UltraPse is also the fastest software of its kind. The source code package, as well as executables for both Linux and Windows platforms, can be downloaded from the GitHub repository.
Affiliation(s)
- Pu-Feng Du
  - School of Computer Science and Technology, Tianjin University, Tianjin 300350, China
- Wei Zhao
  - School of Computer Science and Technology, Tianjin University, Tianjin 300350, China
- Yang-Yang Miao
  - School of Computer Science and Technology, Tianjin University, Tianjin 300350, China
  - School of Chemical Engineering, Tianjin University, Tianjin 300350, China
- Le-Yi Wei
  - School of Computer Science and Technology, Tianjin University, Tianjin 300350, China
- Likun Wang
  - Institute of Systems Biomedicine, Beijing Key Laboratory of Tumor Systems Biology, Department of Pathology, School of Basic Medical Sciences, Peking University Health Science Center, Beijing 100191, China
16
Tahir M, Hayat M, Kabir M. Sequence based predictor for discrimination of enhancer and their types by applying general form of Chou's trinucleotide composition. Comput Methods Programs Biomed 2017; 146:69-75. [PMID: 28688491 DOI: 10.1016/j.cmpb.2017.05.008] [Citation(s) in RCA: 31] [Impact Index Per Article: 4.4] [Received: 08/04/2016] [Revised: 05/05/2017] [Accepted: 05/19/2017] [Indexed: 06/07/2023]
Abstract
BACKGROUND AND OBJECTIVES: Enhancers are pivotal DNA elements that are widely used in eukaryotes to activate gene transcription. On the basis of their strength, they are further classified into two groups: strong enhancers and weak enhancers. Given the huge number of DNA sequences now available, fast, reliable and robust computational methods are needed that not only identify enhancers but also determine their strength. Considerable progress has been achieved in this regard; however, timely and precise identification of enhancers is still a challenging task.
METHODS: A two-level computational model for identifying enhancers and their subgroups is proposed. Two feature extraction techniques, di-nucleotide composition and tri-nucleotide composition, were adopted to extract numerical descriptors. Four classification methods, namely probabilistic neural network, support vector machine, k-nearest neighbor and random forest, were used for classification.
RESULTS: Using the jackknife cross-validation test, the proposed method yielded an accuracy of 77.25% on dataset S1, which contains enhancers and non-enhancers, and 64.70% on dataset S2, which comprises strong and weak enhancer sequences.
CONCLUSION: The predictive results indicate that the proposed method performs better than the existing approaches reported in the literature so far. It is thus anticipated that the developed method will be useful for basic research and academia.
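Di-nucleotide and tri-nucleotide composition both reduce a DNA sequence to the normalized frequencies of its overlapping k-mers (k = 2 gives 16 values, k = 3 gives 64). The Python sketch below illustrates that idea; the function names are hypothetical, and skipping windows that contain ambiguous symbols is an assumption rather than something the abstract specifies.

```python
from itertools import product


def kmer_composition(seq, k):
    """Return normalized overlapping k-mer frequencies in ACGT order."""
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:  # skip windows with N or other ambiguous symbols
            counts[word] += 1
            total += 1
    return [counts[w] / total if total else 0.0 for w in kmers]


def di_tri_features(seq):
    """Concatenate di- and tri-nucleotide composition (16 + 64 values)."""
    return kmer_composition(seq, 2) + kmer_composition(seq, 3)
```

Because the output length depends only on k, sequences of different lengths map to vectors of the same size, which is what classifiers such as SVM or random forest require.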
Affiliation(s)
- Muhammad Tahir
  - Department of Computer Science, Abdul Wali Khan University Mardan, KP, Pakistan
- Maqsood Hayat
  - Department of Computer Science, Abdul Wali Khan University Mardan, KP, Pakistan
- Muhammad Kabir
  - School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, 210094, China