1. Du Q, Guo Y, Zhang J, Lu F, Peng C, Zhou C. Predicting Promoters in Multiple Prokaryotes with Prompt. Interdiscip Sci 2024; 16:814-828. [PMID: 39110340 DOI: 10.1007/s12539-024-00637-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/25/2023] [Revised: 05/17/2024] [Accepted: 05/21/2024] [Indexed: 10/27/2024]
Abstract
Promoters are important cis-regulatory elements for the regulation of gene expression, and their accurate prediction is crucial for elucidating the biological functions and potential mechanisms of genes. Many previous prokaryotic promoter prediction methods show encouraging prediction performance, but most of them focus on recognizing promoters in only one or a few bacterial species. Moreover, because they ignore promoter sequence motifs, the interpretability of predictions from existing methods is limited. In this work, we present a generalized method, Prompt (Promoters in multiple prokaryotes), to predict promoters in 16 prokaryotes and improve the interpretability of prediction results. Prompt integrates three methods, RSK (Regression based on Selected k-mer), CL (Contrastive Learning) and MLP (Multilayer Perceptron), and employs a voting strategy to divide the datasets into high-confidence and low-confidence categories. Results on the promoter prediction tasks show that the performance of Prompt (accuracy and Matthews correlation coefficient) is greater than 80% on the high-confidence datasets of all 16 prokaryotes and greater than 90% in 12 of them, and that Prompt performs best compared with other existing methods. Moreover, by identifying promoter sequence motifs, Prompt can improve the interpretability of the predictions. Prompt is freely available at https://github.com/duqimeng/PromptPrompt and will contribute to research on promoters in prokaryotes.
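The voting strategy mentioned in this abstract can be illustrated with a minimal sketch. It assumes binary labels from three models and treats a prediction as high-confidence only when all models agree; that unanimity rule, the function name `vote`, and the example outputs are assumptions of this sketch, not details confirmed by the abstract.

```python
from collections import Counter


def vote(predictions):
    """Combine binary labels (1 = promoter, 0 = non-promoter) from several models.

    Returns (label, confidence). Confidence is "high" only when every model
    agrees; this unanimity rule is an assumption of the sketch.
    """
    counts = Counter(predictions)
    label, n_votes = counts.most_common(1)[0]
    confidence = "high" if n_votes == len(predictions) else "low"
    return label, confidence


# Hypothetical outputs of three models (for example RSK, CL, MLP) on four sequences.
model_outputs = [(1, 1, 1), (1, 0, 1), (0, 0, 0), (0, 1, 0)]
results = [vote(p) for p in model_outputs]
```

With an odd number of binary voters there is never a tie, so the majority label is always well defined.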
Affiliation(s)
- Qimeng Du
  - School of Engineering, Air-Space-Ground Integrated Intelligence and Big Data Application Engineering Research Center of Yunnan Provincial Department of Education, Dali University, Dali, 671003, China
- Yixue Guo
  - College of Biotechnology, Tianjin University of Science & Technology, Tianjin, 300457, China
- Junpeng Zhang
  - School of Engineering, Air-Space-Ground Integrated Intelligence and Big Data Application Engineering Research Center of Yunnan Provincial Department of Education, Dali University, Dali, 671003, China
- Fuping Lu
  - College of Biotechnology, Tianjin University of Science & Technology, Tianjin, 300457, China
- Chong Peng
  - College of Biotechnology, Tianjin University of Science & Technology, Tianjin, 300457, China
- Chichun Zhou
  - School of Engineering, Air-Space-Ground Integrated Intelligence and Big Data Application Engineering Research Center of Yunnan Provincial Department of Education, Dali University, Dali, 671003, China
2. Amjad A, Ahmed S, Kabir M, Arif M, Alam T. A novel deep learning identifier for promoters and their strength using heterogeneous features. Methods 2024; 230:119-128. [PMID: 39168294 DOI: 10.1016/j.ymeth.2024.08.005] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/25/2024] [Revised: 07/24/2024] [Accepted: 08/17/2024] [Indexed: 08/23/2024] Open
Abstract
Promoters, which are short (50-1500 base-pair) DNA regions, play a critical role in the regulation of gene transcription. Numerous dangerous diseases, such as cancer, cardiovascular disease, and inflammatory bowel disease, are caused by genetic variations in promoters. Consequently, the correct identification and characterization of promoters are significant for drug discovery. However, experimental approaches to recognizing promoters and their strengths are challenging in terms of cost, time, and resources. Therefore, computational techniques are highly desirable for the correct characterization of promoters from unannotated genomic data. Here, we designed a powerful bi-layer deep-learning-based predictor named "PROCABLES", which classifies DNA samples as promoters or non-promoters in the first phase and as strong or weak promoters in the second phase. The proposed method utilizes five distinct feature types (word2vec, k-spaced nucleotide pairs, trinucleotide propensity-based features, trinucleotide composition, and electron-ion interaction pseudopotentials) to extract the hidden patterns from the DNA sequence. Afterwards, a stacked framework is formed by integrating a convolutional neural network (CNN) with a bidirectional long short-term memory (LSTM) network, using multi-view attributes to train the proposed model. Using ten-fold cross-validation, the PROCABLES model achieved an accuracy of 0.971 and 0.920 and an MCC of 0.940 and 0.840 for the first and second layer, respectively. The results indicate that the proposed PROCABLES protocol outperforms advanced computational predictors targeting promoters and their types. In summary, this research will provide useful hints for the recognition of large-scale promoters in particular and other DNA problems in general.
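One of the feature types named in this abstract, k-spaced nucleotide pairs, can be sketched as follows. The sketch counts pairs of nucleotides separated by k intervening positions and normalizes the counts to frequencies; the function name `kspaced_pair_freqs` and the normalization are assumptions, and the exact variant used by PROCABLES may differ.

```python
from itertools import product


def kspaced_pair_freqs(seq, k):
    """Frequencies of the 16 nucleotide pairs separated by k intervening bases.

    k = 0 gives ordinary dinucleotide composition. Pairs containing
    characters other than A, C, G, T are skipped.
    """
    seq = seq.upper()
    pairs = ["".join(p) for p in product("ACGT", repeat=2)]
    counts = dict.fromkeys(pairs, 0)
    total = 0
    for i in range(len(seq) - k - 1):
        pair = seq[i] + seq[i + k + 1]
        if pair in counts:
            counts[pair] += 1
            total += 1
    # Return the frequencies in a fixed order (AA, AC, AG, ..., TT).
    return [counts[p] / total if total else 0.0 for p in pairs]


features = kspaced_pair_freqs("ACGT", 1)
```

For the sequence ACGT with k = 1 the only pairs are A_G and C_T, so each has frequency 0.5.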
Affiliation(s)
- Aqsa Amjad
  - School of Systems and Technology, University of Management and Technology, Lahore 54770, Pakistan
- Saeed Ahmed
  - School of Systems and Technology, University of Management and Technology, Lahore 54770, Pakistan
- Muhammad Kabir
  - School of Systems and Technology, University of Management and Technology, Lahore 54770, Pakistan
- Muhammad Arif
  - College of Science and Engineering, Hamad Bin Khalifa University, Doha 34110, Qatar
- Tanvir Alam
  - College of Science and Engineering, Hamad Bin Khalifa University, Doha 34110, Qatar
3. Ren R, Yu H, Teng J, Mao S, Bian Z, Tao Y, Yau SST. CAPE: a deep learning framework with Chaos-Attention net for Promoter Evolution. Brief Bioinform 2024; 25:bbae398. [PMID: 39120645 PMCID: PMC11311715 DOI: 10.1093/bib/bbae398] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/15/2024] [Revised: 07/13/2024] [Accepted: 07/27/2024] [Indexed: 08/10/2024] Open
Abstract
Predicting the strength of promoters and guiding their directed evolution is a crucial task in synthetic biology. This approach significantly reduces the experimental costs in conventional promoter engineering. Previous studies employing machine learning or deep learning methods have shown some success in this task, but their outcomes were not satisfactory enough, primarily due to the neglect of evolutionary information. In this paper, we introduce the Chaos-Attention net for Promoter Evolution (CAPE) to address the limitations of existing methods. We comprehensively extract evolutionary information within promoters using merged chaos game representation and process the overall information with modified DenseNet and Transformer structures. Our model achieves state-of-the-art results on two kinds of distinct tasks related to prokaryotic promoter strength prediction. The incorporation of evolutionary information enhances the model's accuracy, with transfer learning further extending its adaptability. Furthermore, experimental results confirm CAPE's efficacy in simulating in silico directed evolution of promoters, marking a significant advancement in predictive modeling for prokaryotic promoter strength. Our paper also presents a user-friendly website for the practical implementation of in silico directed evolution on promoters. The source code implemented in this study and the instructions on accessing the website can be found in our GitHub repository https://github.com/BobYHY/CAPE.
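For readers unfamiliar with chaos game representation (CGR), the following is a minimal sketch of the standard construction, not of the merged variant used by CAPE. The corner assignment in `CORNERS` is one common convention and is an assumption here, as are the function names.

```python
# One common corner convention for CGR on the unit square (an assumption).
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(seq):
    """Standard CGR: start at the centre and move halfway to each base's corner."""
    x, y = 0.5, 0.5
    points = []
    for base in seq.upper():
        if base not in CORNERS:
            continue  # skip ambiguous characters such as N
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2, (y + cy) / 2
        points.append((x, y))
    return points


def fcgr(seq, k):
    """Count CGR points in a 2^k x 2^k grid (frequency CGR).

    The first k - 1 points are dropped because they do not yet encode a
    full k-mer.
    """
    n = 2 ** k
    grid = [[0] * n for _ in range(n)]
    for x, y in cgr_points(seq)[k - 1:]:
        grid[min(int(y * n), n - 1)][min(int(x * n), n - 1)] += 1
    return grid


points = cgr_points("AC")
grid = fcgr("ACGT", 1)
```

The resulting grid is an image-like matrix, which is why CGR-style encodings pair naturally with convolutional architectures.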
Affiliation(s)
- Ruohan Ren
  - Zhili College, Tsinghua University, Beijing 100084, China
- Hongyu Yu
  - Department of Mathematical Sciences, Tsinghua University, Beijing 100084, China
- Jiahao Teng
  - School of Life Sciences, Tsinghua University, Beijing 100084, China
- Sihui Mao
  - Zhili College, Tsinghua University, Beijing 100084, China
- Zixuan Bian
  - Weiyang College, Tsinghua University, Beijing 100084, China
- Yangtianze Tao
  - Department of Mathematical Sciences, Tsinghua University, Beijing 100084, China
- Stephen S-T Yau
  - Department of Mathematical Sciences, Tsinghua University, Beijing 100084, China
  - Beijing Institute of Mathematical Sciences and Applications (Bimsa), Beijing 101408, China
4. Lei R, Jia J, Qin L, Wei X. iPro2L-DG: Hybrid network based on improved densenet and global attention mechanism for identifying promoter sequences. Heliyon 2024; 10:e27364. [PMID: 38510021 PMCID: PMC10950492 DOI: 10.1016/j.heliyon.2024.e27364] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/26/2023] [Revised: 02/24/2024] [Accepted: 02/28/2024] [Indexed: 03/22/2024] Open
Abstract
The promoter is a key DNA sequence whose primary function is to control the timing of transcription initiation and the level of gene expression. Accurate identification of promoters is essential for understanding gene expression. Traditional sequencing techniques for identifying promoters are costly and time-consuming. Therefore, the development of computational methods to identify promoters has become critical. Since deep learning methods show great potential in identifying promoters, this study proposes a new promoter prediction model, called iPro2L-DG. The iPro2L-DG predictor is based on an improved Densely Connected Convolutional Network (DenseNet) and a Global Attention Mechanism (GAM). The promoter sequences are encoded by combining C2 encoding with nucleotide chemical property (NCP) encoding. An improved DenseNet extracts advanced feature information from the combined feature encoding. GAM evaluates the importance of the advanced feature information along the channel and spatial dimensions, and a fully connected neural network (FNN) finally derives the prediction probabilities. The experimental results showed that the accuracy of iPro2L-DG in the first layer (promoter identification) was 94.10% with a Matthews correlation coefficient of 0.8833. In the second layer (promoter strength prediction), the accuracy was 89.42% with a Matthews correlation coefficient of 0.7915. The iPro2L-DG predictor significantly outperforms other existing predictors in promoter identification and promoter strength prediction. Therefore, our proposed model iPro2L-DG is the most advanced promoter prediction tool. The source code of the iPro2L-DG model can be found at https://github.com/leirufeng/iPro2L-DG.
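The nucleotide chemical property (NCP) encoding mentioned in this abstract can be sketched as below. The three-bit table follows the convention commonly used in the literature (ring structure, functional group, hydrogen-bond strength); whether iPro2L-DG uses exactly this table is an assumption, and the C2 encoding is not shown.

```python
# Bits: (purine ring, amino group, weak hydrogen bonding).
# Commonly used convention; assumed here, not taken from the paper.
NCP = {
    "A": (1, 1, 1),
    "C": (0, 1, 0),
    "G": (1, 0, 0),
    "T": (0, 0, 1),
}


def ncp_encode(seq):
    """Map a DNA sequence to a list of 3-bit NCP vectors.

    Unknown characters (for example N) are encoded as all zeros.
    """
    return [NCP.get(base, (0, 0, 0)) for base in seq.upper()]


encoded = ncp_encode("acgt")
```

A combined encoding would concatenate these three bits with the bits of a second scheme at each position, giving one row per nucleotide.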
Affiliation(s)
- Rufeng Lei
  - School of Information Engineering, Jingdezhen Ceramic University, Jingdezhen, 333403, China
- Jianhua Jia
  - School of Information Engineering, Jingdezhen Ceramic University, Jingdezhen, 333403, China
- Lulu Qin
  - School of Information Engineering, Jingdezhen Ceramic University, Jingdezhen, 333403, China
- Xin Wei
  - Business School, Jiangxi Institute of Fashion Technology, Nanchang, 330044, China
5. Dwivedi K, Rajpal A, Rajpal S, Kumar V, Agarwal M, Kumar N. XL1R-Net: Explainable AI-driven improved L1-regularized deep neural architecture for NSCLC biomarker identification. Comput Biol Chem 2024; 108:107990. [PMID: 38000327 DOI: 10.1016/j.compbiolchem.2023.107990] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/31/2023] [Revised: 10/29/2023] [Accepted: 11/21/2023] [Indexed: 11/26/2023]
Abstract
BACKGROUND AND OBJECTIVE Non-small cell lung cancer (NSCLC) exhibits intrinsic molecular heterogeneity, primarily driven by the mutation of specific biomarkers. Identification of these biomarkers would assist not only in distinguishing NSCLC into its major subtypes, Adenocarcinoma and Squamous Cell Carcinoma, but also in developing targeted therapy. Medical practitioners use one or more types of omic data to identify these biomarkers, copy number variation (CNV) being one such type. CNV provides a measure of genomic instability, which is considered a hallmark of carcinoma. However, CNV data has not received much attention for biomarker identification. This paper aims to identify biomarkers for NSCLC using CNV data. METHODS An eXplainable AI (XAI)-driven L1-regularized deep learning architecture, XL1R-Net, is proposed that introduces a novel modification of the standard L1-regularized gradient descent algorithm to arrive at an improved deep neural classifier for NSCLC subtyping. Further, XAI-based feature identification has been used to leverage the trained classifier to uncover a set of twenty NSCLC-relevant biomarkers. RESULTS The identified biomarkers are evaluated based on their classification performance and clinical relevance. Using a Multilayer Perceptron (MLP)-based model, a classification accuracy of 84.95% is achieved with 10-fold cross-validation. Moreover, a statistical significance test on the classification performance also revealed the superiority of the MLP model over competing machine learning models. Further, the publicly available Drug-Gene Interaction Database reveals twelve of the identified biomarkers as potentially druggable. The K-M Plotter tool was used to verify eighteen of the identified biomarkers with a high probability of predicting NSCLC patients' likelihood of survival. While nine of the identified biomarkers confirm the recent literature, five find mention in the OncoKB Gene List.
CONCLUSION A set of seven novel biomarkers that have not been reported in the literature could be investigated for their potential contribution towards NSCLC therapy. Given NSCLC's genetic diversity, using only one omics data type may not adequately capture the tumor's complexity. Multiomics data and its integration with other sources will be examined in the future to better understand NSCLC heterogeneity.
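As background for the METHODS section of this abstract, the sketch below shows one standard form of an L1-regularized gradient step, the proximal (soft-thresholding) update. It is not the authors' modified algorithm, which the abstract does not specify; the function names and parameters are assumptions of this sketch.

```python
def soft_threshold(w, t):
    """Shrink w toward zero by t, and set it to exactly zero if |w| <= t."""
    if w > t:
        return w - t
    if w < -t:
        return w + t
    return 0.0


def l1_gradient_step(weights, grads, lr, lam):
    """One proximal gradient step for a loss with an L1 penalty lam * sum(|w|).

    First take a plain gradient step on the smooth part of the loss,
    then apply soft-thresholding with threshold lr * lam.
    """
    return [soft_threshold(w - lr * g, lr * lam) for w, g in zip(weights, grads)]


updated = l1_gradient_step([1.0, -0.05, 0.3], [0.0, 0.0, 1.0], lr=0.1, lam=1.0)
```

Weights that fall inside the threshold become exactly zero, which is what makes L1 regularization useful for selecting a small set of input features.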
Affiliation(s)
- Kountay Dwivedi
  - Department of Computer Science, University of Delhi, Delhi, India
- Ankit Rajpal
  - Department of Computer Science, University of Delhi, Delhi, India
- Sheetal Rajpal
  - Department of Computer Science, Dyal Singh College, Delhi, India
- Virendra Kumar
  - Department of Nuclear Magnetic Resonance, All India Institute of Medical Sciences, New Delhi, India
- Manoj Agarwal
  - Department of Computer Science, Hans Raj College, University of Delhi, Delhi, India
- Naveen Kumar
  - Department of Computer Science, University of Delhi, Delhi, India
6. Lin Y, Sun M, Zhang J, Li M, Yang K, Wu C, Zulfiqar H, Lai H. Computational identification of promoters in Klebsiella aerogenes by using support vector machine. Front Microbiol 2023; 14:1200678. [PMID: 37250059 PMCID: PMC10215528 DOI: 10.3389/fmicb.2023.1200678] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/05/2023] [Accepted: 04/18/2023] [Indexed: 05/31/2023] Open
Abstract
Promoters are the basic functional cis-elements to which RNA polymerase binds to initiate the process of gene transcription. A comprehensive understanding of gene expression and regulation depends on the precise identification of promoters, as they are the most important component of gene expression. This study aimed to develop a machine learning-based model to predict promoters in Klebsiella aerogenes (K. aerogenes). In the prediction model, the promoter sequences in the K. aerogenes genome were encoded by pseudo k-tuple nucleotide composition (PseKNC) and a position-correlation scoring function (PCSF). Numerical features were obtained and then optimized using mRMR in combination with a support vector machine (SVM) and 5-fold cross-validation (CV). Subsequently, these optimized features were fed into an SVM-based classifier to discriminate promoter sequences from non-promoter sequences in K. aerogenes. Results of 10-fold CV showed that the model yields an overall accuracy of 96.0% and an area under the ROC curve (AUC) of 0.990. We hope that this model will help the study of promoters and gene regulation in K. aerogenes.
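The k-fold cross-validation referred to in this abstract (5-fold for feature optimization, 10-fold for the reported results) partitions the samples as in the sketch below. The function name `kfold_indices` and the seeded shuffle are assumptions; the classifier itself is not shown.

```python
import random


def kfold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) for k-fold cross-validation.

    Indices are shuffled once with a fixed seed, then dealt into k folds;
    each fold serves as the test set exactly once.
    """
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test


splits = list(kfold_indices(10, 5))
```

The reported accuracy is then the average of the per-fold accuracies, each computed on samples the model did not see during training.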
Affiliation(s)
- Yan Lin
  - Key Laboratory for Animal Disease-Resistance Nutrition of the Ministry of Agriculture, Animal Nutrition Institute, Sichuan Agricultural University, Chengdu, China
- Meili Sun
  - Beidahuang Industry Group General Hospital, Harbin, China
- Junjie Zhang
  - Key Laboratory for Animal Disease-Resistance Nutrition of the Ministry of Agriculture, Animal Nutrition Institute, Sichuan Agricultural University, Chengdu, China
- Mingyan Li
  - Chifeng Product Quality Inspection and Testing Centre, Chifeng, China
- Keli Yang
  - Nonlinear Research Institute, Baoji University of Arts and Sciences, Baoji, China
- Chengyan Wu
  - Baotou Teacher’s College, Inner Mongolia University of Science and Technology, Baotou, China
- Hasan Zulfiqar
  - Yangtze Delta Region Institute (Huzhou), University of Electronic Science and Technology of China, Huzhou, Zhejiang, China
- Hongyan Lai
  - Chongqing Key Laboratory of Big Data for Bio Intelligence, Chongqing University of Posts and Telecommunications, Chongqing, China
7. Shujaat M, Kim H, Tayara H, Chong KT. iProm-Sigma54: A CNN Base Prediction Tool for σ54 Promoters. Cells 2023; 12:cells12060829. [PMID: 36980170 PMCID: PMC10047130 DOI: 10.3390/cells12060829] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/09/2022] [Revised: 02/23/2023] [Accepted: 02/23/2023] [Indexed: 03/11/2023] Open
Abstract
The sigma (σ) factor of RNA polymerase holoenzymes is essential for identifying and binding to promoter regions during gene transcription in prokaryotes. σ54 promoters are involved in various ancillary and environmentally responsive processes; therefore, it is crucial to accurately identify σ54 promoter sequences to comprehend the underlying process of gene regulation. Herein, we propose a convolutional neural network (CNN) based prediction tool named “iProm-Sigma54” for the prediction of σ54 promoters. The CNN consists of two one-dimensional convolutional layers, which are followed by max pooling layers and dropout layers. A one-hot encoding scheme was used to build the input matrix. To determine the prediction performance of iProm-Sigma54, we employed four assessment metrics and five-fold cross-validation; performance was measured using a benchmark dataset and a test dataset. According to the findings of this comparison, iProm-Sigma54 outperformed existing methodologies for identifying σ54 promoters. Additionally, a publicly accessible web server was constructed.
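The building blocks named in this abstract (one-hot input, one-dimensional convolution, max pooling) can be sketched in plain Python as follows. The channel order A, C, G, T, the hand-written kernel, and the function names are assumptions of this sketch; a trained network would learn its kernels and add dropout and further layers.

```python
ONE_HOT = {"A": [1, 0, 0, 0], "C": [0, 1, 0, 0], "G": [0, 0, 1, 0], "T": [0, 0, 0, 1]}


def one_hot(seq):
    """Encode a DNA sequence as an L x 4 matrix (channel order A, C, G, T)."""
    return [list(ONE_HOT.get(base, [0, 0, 0, 0])) for base in seq.upper()]


def conv1d(x, kernel):
    """Valid 1D cross-correlation of an L x 4 input with a W x 4 kernel."""
    w = len(kernel)
    return [
        sum(x[i + j][c] * kernel[j][c] for j in range(w) for c in range(4))
        for i in range(len(x) - w + 1)
    ]


def max_pool(values, size):
    """Non-overlapping max pooling; an incomplete trailing window is dropped."""
    return [max(values[i:i + size]) for i in range(0, len(values) - size + 1, size)]


# Hypothetical kernel that responds most strongly to the dinucleotide "AC".
kernel_ac = [[1, 0, 0, 0], [0, 1, 0, 0]]
activations = conv1d(one_hot("ACGTAC"), kernel_ac)
pooled = max_pool(activations, 2)
```

On one-hot input, a convolution kernel acts like a position weight matrix sliding along the sequence, which is why such filters can be read as motif detectors.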
Affiliation(s)
- Muhammad Shujaat
  - Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju 54896, Republic of Korea
- Hoonjoo Kim (corresponding author)
  - School of Pharmacy, Jeonbuk National University, Jeonju 54896, Republic of Korea
- Hilal Tayara
  - School of International Engineering and Science, Jeonbuk National University, Jeonju 54896, Republic of Korea
- Kil To Chong (corresponding author)
  - Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju 54896, Republic of Korea
  - Advanced Electronics and Information Research Center, Jeonbuk National University, Jeonju 54896, Republic of Korea
8. Shi H, Wu C, Bai T, Chen J, Li Y, Wu H. Identify essential genes based on clustering based synthetic minority oversampling technique. Comput Biol Med 2023; 153:106523. [PMID: 36652869 DOI: 10.1016/j.compbiomed.2022.106523] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/22/2022] [Revised: 12/13/2022] [Accepted: 12/31/2022] [Indexed: 01/03/2023]
Abstract
Prediction of essential genes in a living organism is one of the central tasks in synthetic biology. Computational predictors are desired because experimental data is often unavailable. Recently, some sequence-based predictors have been constructed to identify essential genes. However, their predictive performance should be further improved. One key problem is how to effectively extract sequence-based features that are able to discriminate the essential genes. Another problem is the imbalanced training set. The number of essential genes in human cell lines is lower than that of non-essential genes. Therefore, predictors trained with such an imbalanced training set tend to identify an unseen sequence as a non-essential gene. Here, a new over-sampling strategy called Clustering based Synthetic Minority Oversampling Technique (CSMOTE) was proposed to overcome the imbalanced data issue. Combining CSMOTE with the Z curve, the global features, and Support Vector Machines, a new protocol called iEsGene-CSMOTE was proposed to identify essential genes. The rigorous jackknife cross-validation results indicated that iEsGene-CSMOTE is better than the other competing methods. The proposed method outperformed the λ-interval Z curve by 35.48% and 11.25% in terms of Sn and BACC, respectively.
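The oversampling idea behind this abstract can be illustrated with the interpolation step of standard SMOTE, sketched below. The clustering stage that distinguishes CSMOTE is not described in the abstract and is not shown; the function name `smote_sample` and its parameters are assumptions.

```python
import random


def smote_sample(minority, k, rng):
    """Create one synthetic minority sample (standard SMOTE interpolation).

    Pick a random minority point, pick one of its k nearest minority
    neighbours, and return a random point on the segment between them.
    """
    base = rng.choice(minority)
    neighbours = sorted(
        (p for p in minority if p is not base),
        key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
    )[:k]
    other = rng.choice(neighbours)
    gap = rng.random()  # interpolation factor in [0, 1)
    return [a + gap * (b - a) for a, b in zip(base, other)]


# Hypothetical two-dimensional feature vectors of the minority class.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
rng = random.Random(0)
synthetic = [smote_sample(minority, 2, rng) for _ in range(5)]
```

Adding such synthetic points until the classes are balanced reduces the bias toward the majority class that the abstract describes.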
Affiliation(s)
- Hua Shi
  - School of Opto-electronic and Communication Engineering, Xiamen University of Technology, Xiamen, China
- Chenjin Wu
  - School of Opto-electronic and Communication Engineering, Xiamen University of Technology, Xiamen, China
- Tao Bai
  - School of Computer Science and Technology, Beijing Institute of Technology, Beijing, 100081, China
  - School of Mathematics & Computer Science, Yanan University, Shanxi, 716000, China
- Jiahai Chen
  - Xiamen Sankuai Online Technology Co., Ltd, Xiamen, China
- Yan Li
  - School of Opto-electronic and Communication Engineering, Xiamen University of Technology, Xiamen, China
- Hao Wu
  - School of Computer Science and Technology, Beijing Institute of Technology, Beijing, 100081, China
9. Patiyal S, Singh N, Ali MZ, Pundir DS, Raghava GPS. Sigma70Pred: A highly accurate method for predicting sigma70 promoter in Escherichia coli K-12 strains. Front Microbiol 2022; 13:1042127. [PMID: 36452927 PMCID: PMC9701712 DOI: 10.3389/fmicb.2022.1042127] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 09/12/2022] [Accepted: 10/27/2022] [Indexed: 12/01/2023] Open
Abstract
The sigma70 factor plays a crucial role in prokaryotes and regulates the transcription of most housekeeping genes. One of the major challenges is to predict the sigma70 promoter, or sigma70 factor binding site, with high precision. In this study, we trained and evaluated our models on a dataset consisting of 741 sigma70 promoters and 1,400 non-promoters. We generated around 8,000 features, including Dinucleotide Auto-Correlation, Dinucleotide Cross-Correlation, Dinucleotide Auto Cross-Correlation, Moran Auto-Correlation, Normalized Moreau-Broto Auto-Correlation, Parallel Correlation Pseudo Tri-Nucleotide Composition, etc. Our SVM-based model achieved a maximum accuracy of 97.38% with an AUROC of 0.99 on the training dataset, using the 200 most relevant features. To check the robustness of the model, we tested it on an independent dataset built from RegulonDB10.8, which included 1,134 sigma70 promoters and 638 non-promoters, and achieved an accuracy of 90.41% with an AUROC of 0.95. Our model successfully predicted constitutive promoters with an accuracy of 81.46% on an independent dataset. We have developed a method, Sigma70Pred, which is available as a webserver and as standalone packages at https://webs.iiitd.edu.in/raghava/sigma70pred/. The services are freely accessible.
Affiliation(s)
- Sumeet Patiyal
  - Department of Computational Biology, Indraprastha Institute of Information Technology Delhi, New Delhi, India
- Nitindeep Singh
  - Department of Computer Science and Engineering, Indraprastha Institute of Information Technology Delhi, New Delhi, India
- Mohd Zartab Ali
  - Department of Computer Science and Engineering, Indraprastha Institute of Information Technology Delhi, New Delhi, India
- Dhawal Singh Pundir
  - Department of Computer Science and Engineering, Indraprastha Institute of Information Technology Delhi, New Delhi, India
- Gajendra P. S. Raghava
  - Department of Computational Biology, Indraprastha Institute of Information Technology Delhi, New Delhi, India
10. Zhang ZM, Zhao JP, Wei PJ, Zheng CH. iPromoter-CLA: Identifying promoters and their strength by deep capsule networks with bidirectional long short-term memory. Comput Methods Programs Biomed 2022; 226:107087. [PMID: 36099675 DOI: 10.1016/j.cmpb.2022.107087] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 01/24/2022] [Revised: 05/14/2022] [Accepted: 08/23/2022] [Indexed: 06/15/2023]
Abstract
BACKGROUND AND OBJECTIVE The promoter is a DNA fragment with a specific sequence that has a transcriptional regulation function. Promoters are located upstream of the transcription start site and are used to initiate downstream gene expression. So far, promoter identification has mainly been achieved by biological methods, which often require more effort. Identifying promoter types through computational methods has become a more effective approach to classification and prediction. METHODS In this study, we proposed a new hybrid model combining a capsule network and a recurrent neural network to identify promoters and predict their strength. Firstly, we used one-hot encoding to encode DNA sequences. Secondly, we used three one-dimensional convolutional layers, a one-dimensional convolutional capsule layer and a digit capsule layer to learn local features. Thirdly, a bidirectional long short-term memory network was utilized to extract global features. Finally, we adopted the self-attention mechanism to increase the contribution of relatively important features, which further enhances the performance of the model. RESULTS Our model attains cross-validation accuracies of 86% and 73.46% in prokaryotic promoter recognition and promoter strength prediction, respectively, which is better than existing approaches in both the first-layer promoter identification and the second-layer promoter strength prediction. CONCLUSIONS Our model not only combines a convolutional neural network and a capsule layer but also uses a self-attention mechanism to better capture hidden information features from the perspective of sequence. Thus, we hope that our model can be widely applied to other components.
Affiliation(s)
- Zhi-Min Zhang
  - College of Mathematics and System Sciences, Xinjiang University, Urumqi, China
- Jian-Ping Zhao
  - College of Mathematics and System Sciences, Xinjiang University, Urumqi, China
- Pi-Jing Wei
  - Institutes of Physical Science and Information Technology, Anhui University, Hefei, China
- Chun-Hou Zheng
  - College of Mathematics and System Sciences, Xinjiang University, Urumqi, China
  - School of Artificial Intelligence, Anhui University, Hefei, China
11. Zhang P, Zhang H, Wu H. iPro-WAEL: a comprehensive and robust framework for identifying promoters in multiple species. Nucleic Acids Res 2022; 50:10278-10289. [PMID: 36161334 PMCID: PMC9561371 DOI: 10.1093/nar/gkac824] [Citation(s) in RCA: 34] [Impact Index Per Article: 17.0] [Received: 07/18/2022] [Revised: 08/24/2022] [Accepted: 09/14/2022] [Indexed: 11/27/2022] Open
Abstract
Promoters are consensus DNA sequences located near the transcription start sites and they play an important role in transcription initiation. Due to their importance in biological processes, the identification of promoters is important for characterizing the expression of genes. Numerous computational methods have been proposed to predict promoters. However, it is difficult for these methods to achieve satisfactory performance in multiple species. In this study, we propose a novel weighted average ensemble learning model, termed iPro-WAEL, for identifying promoters in multiple species, including Human, Mouse, E. coli, Arabidopsis, B. amyloliquefaciens, B. subtilis and R. capsulatus. Extensive benchmarking experiments illustrate that iPro-WAEL has optimal performance and is superior to the current methods in promoter prediction. The experimental results also demonstrate a satisfactory prediction ability of iPro-WAEL across cell lines, on promoters annotated by other methods, and in distinguishing between promoters and enhancers. Moreover, we identify the most important transcription factor binding site (TFBS) motif in promoter regions to facilitate the study of important motifs in the promoter regions. The source code of iPro-WAEL is freely available at https://github.com/HaoWuLab-Bioinformatics/iPro-WAEL.
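A weighted average ensemble, the general technique named in this abstract, combines the probability outputs of several base models as in the sketch below. How iPro-WAEL chooses its weights is not stated in the abstract; the weights, the function name `weighted_average`, and the example values are assumptions.

```python
def weighted_average(probabilities, weights):
    """Combine per-model promoter probabilities into one score per sample.

    probabilities: one list per model, each holding one probability per sample.
    weights: one non-negative weight per model (normalized here by their sum).
    """
    total = sum(weights)
    n_samples = len(probabilities[0])
    return [
        sum(w * p[i] for w, p in zip(weights, probabilities)) / total
        for i in range(n_samples)
    ]


# Hypothetical outputs of two base models on two sequences.
scores = weighted_average([[0.9, 0.2], [0.5, 0.4]], weights=[3, 1])
labels = [1 if s >= 0.5 else 0 for s in scores]
```

Giving a larger weight to the more reliable model lets the ensemble lean on it while still using the other models to correct its errors.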
Affiliation(s)
- Pengyu Zhang
  - School of Software, Shandong University, Jinan, 250101, Shandong, China
  - College of Information Engineering, Northwest A&F University, Yangling, 712100, Shaanxi, China
- Hongming Zhang
  - College of Information Engineering, Northwest A&F University, Yangling, 712100, Shaanxi, China
- Hao Wu
  - School of Software, Shandong University, Jinan, 250101, Shandong, China
12. Asim MN, Ibrahim MA, Imran Malik M, Dengel A, Ahmed S. Circ-LocNet: A Computational Framework for Circular RNA Sub-Cellular Localization Prediction. Int J Mol Sci 2022; 23:ijms23158221. [PMID: 35897818 PMCID: PMC9329987 DOI: 10.3390/ijms23158221] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/16/2022] [Revised: 07/15/2022] [Accepted: 07/20/2022] [Indexed: 02/04/2023] Open
Abstract
Circular ribonucleic acids (circRNAs) are novel non-coding RNAs that emanate from alternative splicing of precursor mRNA in reversed order across exons. Despite the abundant presence of circRNAs in human genes and their involvement in diverse physiological processes, the functionality of most circRNAs remains a mystery. As with other non-coding RNAs, knowledge of the sub-cellular localization of circRNAs can help demystify their influence on protein synthesis, degradation, and destination, their association with different diseases, and their potential for drug development. To date, wet-lab experimental approaches are being used to detect sub-cellular locations of circular RNAs. These approaches help to elucidate the role of circRNAs as protein scaffolds, RNA-binding protein (RBP) sponges, micro-RNA (miRNA) sponges, parental gene expression modifiers, alternative splicing regulators, and transcription regulators. To complement wet-lab experiments, and considering the progress made by machine learning approaches in determining the sub-cellular localization of other non-coding RNAs, this paper develops a computational framework, Circ-LocNet, to precisely detect circRNA sub-cellular localization. Circ-LocNet performs a comprehensive extrinsic evaluation of 7 residue frequency-based, residue order and frequency-based, and physico-chemical property-based sequence descriptors using the five most widely used machine learning classifiers. Further, it explores the performance impact of K-order sequence descriptor fusion, where it ensembles similar as well as dissimilar genres of statistical representation learning approaches to reap the combined benefits. Considering the diversity of statistical representation learning schemes, it assesses the performance of second-order, third-order, and so on up to seventh-order sequence descriptor fusion.
A comprehensive empirical evaluation of Circ-LocNet over a newly developed benchmark dataset using different settings reveals that standalone residue frequency-based sequence descriptors and tree-based classifiers are more suitable to predict sub-cellular localization of circular RNAs. Further, K-order heterogeneous sequence descriptors fusion in combination with tree-based classifiers most accurately predict sub-cellular localization of circular RNAs. We anticipate this study will act as a rich baseline and push the development of robust computational methodologies for the accurate sub-cellular localization determination of novel circRNAs.
Affiliation(s)
- Muhammad Nabeel Asim
- German Research Center for Artificial Intelligence (DFKI), 67663 Kaiserslautern, Germany; (M.A.I.); (A.D.); (S.A.)
- Department of Computer Science, Technical University of Kaiserslautern, 67663 Kaiserslautern, Germany
- Correspondence:
| | - Muhammad Ali Ibrahim
- German Research Center for Artificial Intelligence (DFKI), 67663 Kaiserslautern, Germany; (M.A.I.); (A.D.); (S.A.)
- Department of Computer Science, Technical University of Kaiserslautern, 67663 Kaiserslautern, Germany
| | - Muhammad Imran Malik
- School of Computer Science & Electrical Engineering, National University of Sciences and Technology, Islamabad 44000, Pakistan;
| | - Andreas Dengel
- German Research Center for Artificial Intelligence (DFKI), 67663 Kaiserslautern, Germany; (M.A.I.); (A.D.); (S.A.)
- Department of Computer Science, Technical University of Kaiserslautern, 67663 Kaiserslautern, Germany
| | - Sheraz Ahmed
- German Research Center for Artificial Intelligence (DFKI), 67663 Kaiserslautern, Germany; (M.A.I.); (A.D.); (S.A.)
- DeepReader GmbH, Trippstadter Str. 122, 67663 Kaiserslautern, Germany
| |
|
13
|
Prokaryotic and eukaryotic promoters identification based on residual network transfer learning. Bioprocess Biosyst Eng 2022; 45:955-967. [DOI: 10.1007/s00449-022-02716-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/07/2022] [Accepted: 02/27/2022] [Indexed: 11/26/2022]
|
14
|
Qiao H, Zhang S, Xue T, Wang J, Wang B. iPro-GAN: A novel model based on generative adversarial learning for identifying promoters and their strength. Comput Methods Programs Biomed 2022; 215:106625. [PMID: 35038653 DOI: 10.1016/j.cmpb.2022.106625] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 06/11/2021] [Revised: 12/13/2021] [Accepted: 01/06/2022] [Indexed: 06/14/2023]
Abstract
BACKGROUND AND OBJECTIVE: A promoter is a component of the gene that binds specifically to RNA polymerase and determines both where transcription starts and the transcription efficiency of the gene. Promoters can be divided into strong promoters and weak promoters because their structures and the interaction time interval are quite different. The functional variation of the promoter can lead to a variety of diseases. Therefore, identifying promoters and their strength is necessary and has important biological significance. A novel and promising model based on deep learning is proposed to achieve it. METHODS: In this work, we build a powerful model named iPro-GAN for the identification of promoters and their strength. First, we collect benchmark datasets and independent datasets for training and testing. Then, a Moran-based spatial auto-cross correlation method is used for feature extraction. Finally, a deep convolutional generative adversarial network with 10-fold cross-validation is applied for classification. The first layer of the model is used to identify the promoter and the second layer is used to determine its type. RESULTS: On the benchmark dataset, the accuracy of the first-layer predictor is 93.15%, and the accuracy of the second-layer predictor is 92.30%. On the independent dataset, the accuracy of the first-layer predictor is 86.77%, and the accuracy of the second-layer predictor is 91.66%. In particular, breakthrough progress has been made in the identification of promoters' strength. CONCLUSIONS: These results are far higher than those of the existing best predictor, which indicates that our model is serviceable and practicable for identifying promoters and their strength. Furthermore, the datasets and source codes are available from this link: https://github.com/Bovbene/iPro-GAN.
Affiliation(s)
- Huijuan Qiao
- School of Mathematics and Statistics, Xidian University, Xi'an, 710071, PR China
| | - Shengli Zhang
- School of Mathematics and Statistics, Xidian University, Xi'an, 710071, PR China.
| | - Tian Xue
- School of Mathematics and Statistics, Xidian University, Xi'an, 710071, PR China
| | - Jinyue Wang
- School of Mathematics and Statistics, Xidian University, Xi'an, 710071, PR China
| | - Bowei Wang
- School of Mathematics and Statistics, Xidian University, Xi'an, 710071, PR China
| |
|
15
|
Zhang M, Jia C, Li F, Li C, Zhu Y, Akutsu T, Webb GI, Zou Q, Coin LJM, Song J. Critical assessment of computational tools for prokaryotic and eukaryotic promoter prediction. Brief Bioinform 2022; 23:6502561. [PMID: 35021193 PMCID: PMC8921625 DOI: 10.1093/bib/bbab551] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Received: 08/09/2021] [Revised: 11/12/2021] [Accepted: 11/30/2021] [Indexed: 01/13/2023]
Abstract
Promoters are crucial regulatory DNA regions for gene transcriptional activation. Rapid advances in next-generation sequencing technologies have accelerated the accumulation of genome sequences, providing increased training data to inform computational approaches for both prokaryotic and eukaryotic promoter prediction. However, it remains a significant challenge to accurately identify species-specific promoter sequences using computational approaches. To advance computational support for promoter prediction, in this study, we curated 58 comprehensive, up-to-date, benchmark datasets for 7 different species (i.e. Escherichia coli, Bacillus subtilis, Homo sapiens, Mus musculus, Arabidopsis thaliana, Zea mays and Drosophila melanogaster) to assist the research community to assess the relative functionality of alternative approaches and support future research on both prokaryotic and eukaryotic promoters. We revisited 106 predictors published since 2000 for promoter identification (40 for prokaryotic promoter, 61 for eukaryotic promoter, and 5 for both). We systematically evaluated their training datasets, computational methodologies, calculated features, performance and software usability. On the basis of these benchmark datasets, we benchmarked 19 predictors with functioning webservers/local tools and assessed their prediction performance. We found that deep learning and traditional machine learning-based approaches generally outperformed scoring function-based approaches. Taken together, the curated benchmark dataset repository and the benchmarking analysis in this study serve to inform the design and implementation of computational approaches for promoter prediction and facilitate more rigorous comparison of new techniques in the future.
Affiliation(s)
| | - Cangzhi Jia
- Corresponding authors: Jiangning Song, Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia. E-mail: ; Lachlan J.M. Coin, Department of Microbiology and Immunology, The Peter Doherty Institute for Infection and Immunity, The University of Melbourne, 792 Elizabeth Street, Melbourne, Victoria 3000, Australia. E-mail: ; Quan Zou, Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China. E-mail: ; Cangzhi Jia, School of Science, Dalian Maritime University, Dalian 116026, China. E-mail:
| | | | | | | | | | - Geoffrey I Webb
- Department of Data Science and Artificial Intelligence, Monash University, Melbourne, VIC 3800, Australia,Monash Data Futures Institute, Monash University, Melbourne, VIC 3800, Australia
| | - Quan Zou
- Corresponding authors: Jiangning Song, Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia. E-mail: ; Lachlan J.M. Coin, Department of Microbiology and Immunology, The Peter Doherty Institute for Infection and Immunity, The University of Melbourne, 792 Elizabeth Street, Melbourne, Victoria 3000, Australia. E-mail: ; Quan Zou, Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China. E-mail: ; Cangzhi Jia, School of Science, Dalian Maritime University, Dalian 116026, China. E-mail:
| | - Lachlan J M Coin
- Corresponding authors: Jiangning Song, Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia. E-mail: ; Lachlan J.M. Coin, Department of Microbiology and Immunology, The Peter Doherty Institute for Infection and Immunity, The University of Melbourne, 792 Elizabeth Street, Melbourne, Victoria 3000, Australia. E-mail: ; Quan Zou, Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China. E-mail: ; Cangzhi Jia, School of Science, Dalian Maritime University, Dalian 116026, China. E-mail:
| | - Jiangning Song
- Corresponding authors: Jiangning Song, Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia. E-mail: ; Lachlan J.M. Coin, Department of Microbiology and Immunology, The Peter Doherty Institute for Infection and Immunity, The University of Melbourne, 792 Elizabeth Street, Melbourne, Victoria 3000, Australia. E-mail: ; Quan Zou, Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China. E-mail: ; Cangzhi Jia, School of Science, Dalian Maritime University, Dalian 116026, China. E-mail:
| |
|
16
|
Chevez-Guardado R, Peña-Castillo L. Promotech: a general tool for bacterial promoter recognition. Genome Biol 2021; 22:318. [PMID: 34789306 PMCID: PMC8597233 DOI: 10.1186/s13059-021-02514-9] [Citation(s) in RCA: 18] [Impact Index Per Article: 6.0] [Received: 10/31/2020] [Accepted: 10/11/2021] [Indexed: 12/14/2022]
Abstract
Promoters are genomic regions where the transcription machinery binds to initiate the transcription of specific genes. Computational tools for identifying bacterial promoters have been around for decades. However, most of these tools were designed to recognize promoters in one or a few bacterial species. Here, we present Promotech, a machine-learning-based method for promoter recognition in a wide range of bacterial species. We compare Promotech's performance with that of five other promoter prediction methods. Promotech outperforms these other programs in terms of area under the precision-recall curve (AUPRC) or precision at the same level of recall. Promotech is available at https://github.com/BioinformaticsLabAtMUN/PromoTech.
Affiliation(s)
- Ruben Chevez-Guardado
- Department of Computer Science, Memorial University of Newfoundland, 230 Elizabeth Ave, St. John's, Newfoundland, A1C 5S7, Canada
| | - Lourdes Peña-Castillo
- Department of Computer Science, Memorial University of Newfoundland, 230 Elizabeth Ave, St. John's, Newfoundland, A1C 5S7, Canada; Department of Biology, Memorial University of Newfoundland, 230 Elizabeth Ave, St. John's, Newfoundland, A1C 5S7, Canada.
| |
|
17
|
Liang Y, Zhang S, Qiao H, Yao Y. iPromoter-ET: Identifying promoters and their strength by extremely randomized trees-based feature selection. Anal Biochem 2021; 630:114335. [PMID: 34389299 DOI: 10.1016/j.ab.2021.114335] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 04/17/2021] [Revised: 07/24/2021] [Accepted: 08/09/2021] [Indexed: 10/20/2022]
Abstract
A promoter is a region of DNA that determines the transcription of a particular gene. RNA polymerase has several σ factors, which identify the promoter and facilitate the binding of the RNA polymerase to it. Owing to the importance of promoters in genome research, and facing the avalanche of DNA sequences discovered in the post-genomic age, it is an urgent task to develop computational tools for effectively identifying promoters and their strength. In this paper, we develop a model named iPromoter-ET using the k-mer nucleotide composition, binary encoding and dinucleotide property matrix-based distance transformation for feature extraction, and extremely randomized trees (extra trees) for feature selection. Its 1st layer identifies whether a DNA sequence is a promoter or not, while its 2nd layer classifies promoter samples as strong or weak promoters. A support vector machine and five-fold cross-validation are used to perform identification and assess performance, respectively. The results indicate that our model remarkably outperforms the existing models in both the 1st and 2nd layers for accuracy and stability. We anticipate that our proposed model will become a very effective intelligent tool, or at the least, a complementary tool to the existing modes of identifying promoters and their strength. Moreover, the datasets and codes for iPromoter-ET are freely available at https://github.com/shengli0201/iPromoter-ET.
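The k-mer nucleotide composition named in this abstract can be illustrated with a minimal Python sketch. This is not the iPromoter-ET code: the function name `kmer_composition`, the lexicographic feature order, and the choice to skip windows containing ambiguous bases are all assumptions made for illustration.

```python
from itertools import product


def kmer_composition(seq, k=2, alphabet="ACGT"):
    """Return normalized k-mer frequencies in lexicographic k-mer order.

    Hypothetical helper for illustration; windows containing characters
    outside `alphabet` (for example N) are skipped.
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:
            counts[window] += 1
            total += 1
    # Divide by the number of counted windows so the vector sums to 1.
    return [counts[m] / total if total else 0.0 for m in kmers]


# Dinucleotide composition of a -10 box-like hexamer.
features = kmer_composition("TATAAT", k=2)
```

For k = 2 the vector has 4^2 = 16 entries; larger k grows the feature space as 4^k, which is one reason such pipelines follow feature extraction with a feature selection step.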
Affiliation(s)
- Yunyun Liang
- School of Science, Xi'an Polytechnic University, Xi'an, 710048, PR China.
| | - Shengli Zhang
- School of Mathematics and Statistics, Xidian University, Xi'an, 710071, PR China
| | - Huijuan Qiao
- School of Mathematics and Statistics, Xidian University, Xi'an, 710071, PR China
| | - Yingying Yao
- School of Mathematics and Statistics, Xidian University, Xi'an, 710071, PR China
| |
|
18
|
Shujaat M, Wahab A, Tayara H, Chong KT. pcPromoter-CNN: A CNN-Based Prediction and Classification of Promoters. Genes (Basel) 2020; 11:genes11121529. [PMID: 33371507 PMCID: PMC7767505 DOI: 10.3390/genes11121529] [Citation(s) in RCA: 20] [Impact Index Per Article: 5.0] [Received: 11/19/2020] [Revised: 12/11/2020] [Accepted: 12/18/2020] [Indexed: 01/13/2023]
Abstract
A promoter is a small region within the DNA structure that has an important role in initiating transcription of a specific gene in the genome. Different types of promoters are recognized by their different functions. Due to the importance of promoter functions, computational tools for the prediction and classification of promoters are highly desired. Promoters resemble each other; therefore, their precise classification is an important challenge. In this study, we propose a convolutional neural network (CNN)-based tool, pcPromoter-CNN, for the prediction of promoters and their classification into subclasses σ70, σ54, σ38, σ32, σ28 and σ24. This CNN-based tool uses a one-hot encoding scheme for promoter classification. The tool's architecture was trained and tested on a benchmark dataset. To evaluate its classification performance, we used four evaluation metrics. The model exhibited notable improvement over existing state-of-the-art tools.
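The one-hot encoding scheme mentioned in this abstract can be sketched as follows. The abstract does not give the channel order or the handling of ambiguous bases, so the A, C, G, T order and the all-zero row for unknown characters are assumptions, and `one_hot_encode` is a hypothetical name.

```python
# Assumed channel order: A, C, G, T.
ONE_HOT = {
    "A": (1, 0, 0, 0),
    "C": (0, 1, 0, 0),
    "G": (0, 0, 1, 0),
    "T": (0, 0, 0, 1),
}


def one_hot_encode(seq):
    """Encode a DNA string as a len(seq) x 4 matrix (list of rows).

    Characters outside A/C/G/T map to an all-zero row (an assumption).
    """
    return [list(ONE_HOT.get(base, (0, 0, 0, 0))) for base in seq.upper()]


matrix = one_hot_encode("TATAAT")
```

A matrix of this shape (sequence length by 4) is the usual input to a 1D convolutional layer, with the four nucleotide channels playing the role that color channels play for images.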
Affiliation(s)
- Muhammad Shujaat
- Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju 54896, Korea or (M.S.); (A.W.)
- Department of Computer Sciences, Bahria University, Lahore 54000, Pakistan
| | - Abdul Wahab
- Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju 54896, Korea or (M.S.); (A.W.)
| | - Hilal Tayara
- School of International Engineering and Science, Jeonbuk National University, Jeonju 54896, Korea
- Correspondence: (H.T.); (K.T.C.)
| | - Kil To Chong
- Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju 54896, Korea or (M.S.); (A.W.)
- Advanced Electronics and Information Research Center, Jeonbuk National University, Jeonju 54896, Korea
- Correspondence: (H.T.); (K.T.C.)
| |
|
19
|
Abstract
The promoter region is a key element required for the production of RNA in bacteria. While new high-throughput technology allows massively parallel mapping of promoter elements, we still mainly rely on bioinformatics tools to predict such elements in bacterial genomes. Additionally, despite many different prediction tools having become popular to identify bacterial promoters, no systematic comparison of such tools has been performed. Here, we performed a systematic comparison between several widely used promoter prediction tools (BPROM, bTSSfinder, BacPP, CNNProm, IBBP, Virtual Footprint, iPro70-FMWin, 70ProPred, iPromoter-2L, and MULTiPly) using well-defined sequence data sets and standardized metrics to determine how well those tools performed relative to each other. For this, we used data sets of experimentally validated promoters from Escherichia coli and a control data set composed of randomly generated sequences with similar nucleotide distributions. We compared the performance of the tools using metrics such as specificity, sensitivity, accuracy, and Matthews correlation coefficient (MCC). We show that the widely used BPROM presented the worst performance among the compared tools, while four tools (CNNProm, iPro70-FMWin, 70ProPred, and iPromoter-2L) offered high predictive power. Of these tools, iPro70-FMWin exhibited the best results for most of the metrics used. We present here some potentials and limitations of available tools, and we hope that future work can build upon our effort to systematically characterize this useful class of bioinformatics tools.
IMPORTANCE: The correct mapping of promoter elements is a crucial step in microbial genomics. Also, when combining new DNA elements into synthetic sequences, predicting the potential generation of new promoter sequences is critical. Over the last years, many bioinformatics tools have been created to allow users to predict promoter elements in a sequence or genome of interest. Here, we assess the predictive power of some of the main prediction tools available using well-defined promoter data sets. Using Escherichia coli as a model organism, we demonstrated that while some tools are biased toward AT-rich sequences, others are very efficient in identifying real promoters with low false-negative rates. We hope the potentials and limitations presented here will help the microbiology community to choose promoter prediction tools among many available alternatives.
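The four metrics this benchmark names (specificity, sensitivity, accuracy, and MCC) all follow from the counts of a binary confusion matrix. The sketch below uses the standard definitions; the function name `binary_metrics` and the example counts are hypothetical and are not taken from the study.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Compute sensitivity, specificity, accuracy and MCC from counts.

    tp/fn count real promoters predicted as promoter / non-promoter;
    tn/fp count control sequences predicted as non-promoter / promoter.
    Undefined ratios (zero denominators) are reported as 0.0.
    """
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# Made-up counts: 100 promoters and 100 control sequences.
scores = binary_metrics(tp=90, fp=10, tn=90, fn=10)
```

MCC is informative here because it uses all four cells of the confusion matrix: a tool biased toward AT-rich sequences can reach high sensitivity while its specificity, and therefore its MCC, stays low.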
|
20
|
Amin R, Rahman CR, Ahmed S, Sifat MHR, Liton MNK, Rahman MM, Khan MZH, Shatabda S. iPromoter-BnCNN: a novel branched CNN-based predictor for identifying and classifying sigma promoters. Bioinformatics 2020; 36:4869-4875. [DOI: 10.1093/bioinformatics/btaa609] [Citation(s) in RCA: 21] [Impact Index Per Article: 5.3] [Received: 12/25/2019] [Revised: 05/19/2020] [Accepted: 06/24/2020] [Indexed: 11/14/2022]
Abstract
Motivation
A promoter is a short region of DNA that is responsible for initiating transcription of specific genes. Computational tools for the automatic identification of promoters are in high demand. Promoters can be divided into different types according to their functions. Promoters may show both intra- and interclass variation and similarity in terms of consensus sequences. Accurate classification of the various types of sigma promoters remains a challenge.
Results
We present iPromoter-BnCNN for the identification and accurate classification of six types of promoters: σ24, σ28, σ32, σ38, σ54 and σ70. It is a CNN-based classifier which combines local features related to monomer nucleotide sequence, trimer nucleotide sequence, dimer structural properties and trimer structural properties through the use of parallel branching. We conducted experiments on a benchmark dataset and compared our tool with six state-of-the-art tools to show its superiority under 5-fold cross-validation. Moreover, we tested our classifier on an independent test dataset.
Availability and implementation
Our proposed tool iPromoter-BnCNN web server is freely available at http://103.109.52.8/iPromoter-BnCNN. The runnable source code can be found at https://colab.research.google.com/drive/1yWWh7BXhsm8U4PODgPqlQRy23QGjF2DZ.
Supplementary information
Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Ruhul Amin
- Department of Computer Science and Engineering, United International University, Dhaka 1207, Bangladesh
| | - Chowdhury Rafeed Rahman
- Department of Computer Science and Engineering, United International University, Dhaka 1207, Bangladesh
| | - Sajid Ahmed
- Department of Computer Science and Engineering, United International University, Dhaka 1207, Bangladesh
| | - Md Habibur Rahman Sifat
- Department of Computer Science and Engineering, United International University, Dhaka 1207, Bangladesh
| | - Md Nazmul Khan Liton
- Department of Computer Science and Engineering, United International University, Dhaka 1207, Bangladesh
| | - Md Moshiur Rahman
- Department of Computer Science and Engineering, United International University, Dhaka 1207, Bangladesh
| | - Md Zahid Hossain Khan
- Department of Computer Science and Engineering, United International University, Dhaka 1207, Bangladesh
| | - Swakkhar Shatabda
- Department of Computer Science and Engineering, United International University, Dhaka 1207, Bangladesh
| |
|
21
|
Li F, Chen J, Ge Z, Wen Y, Yue Y, Hayashida M, Baggag A, Bensmail H, Song J. Computational prediction and interpretation of both general and specific types of promoters in Escherichia coli by exploiting a stacked ensemble-learning framework. Brief Bioinform 2020; 22:2126-2140. [PMID: 32363397 DOI: 10.1093/bib/bbaa049] [Citation(s) in RCA: 48] [Impact Index Per Article: 12.0] [Received: 12/11/2019] [Revised: 02/25/2020] [Accepted: 03/11/2020] [Indexed: 12/12/2022]
Abstract
Promoters are short consensus sequences of DNA, which are responsible for transcription activation or the repression of all genes. There are many types of promoters in bacteria with important roles in initiating gene transcription. Therefore, solving promoter-identification problems has important implications for improving the understanding of their functions. To this end, computational methods targeting promoter classification have been established; however, their performance remains unsatisfactory. In this study, we present a novel stacked-ensemble approach (termed SELECTOR) for both identifying promoters and classifying them into their respective types. SELECTOR combined the composition of k-spaced nucleic acid pairs, parallel correlation pseudo-dinucleotide composition, position-specific trinucleotide propensity based on single-strand, and DNA strand features, and used five popular tree-based ensemble learning algorithms to build a stacked model. Both 5-fold cross-validation tests using benchmark datasets and independent tests using the newly collected independent test dataset showed that SELECTOR outperformed state-of-the-art methods in both general and specific types of promoter prediction in Escherichia coli. Furthermore, this novel framework provides essential interpretations that aid understanding of model success by leveraging the powerful Shapley Additive exPlanation algorithm, thereby highlighting the most important features relevant for predicting both general and specific types of promoters and overcoming the limitations of existing 'black-box' approaches that are unable to reveal causal relationships from large amounts of initially encoded features.
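Of the descriptors listed in this abstract, the composition of k-spaced nucleic acid pairs is the simplest to illustrate: for a gap of k residues, it counts every pair of nucleotides separated by exactly k positions and normalizes the 16 counts. The sketch below is not the SELECTOR implementation; the function name `cksnap`, the feature order, and normalization by the number of windows are assumptions.

```python
from itertools import product


def cksnap(seq, gap=1, alphabet="ACGT"):
    """Frequencies of nucleotide pairs separated by `gap` residues.

    Hypothetical helper for illustration. Returns 16 values in
    lexicographic pair order (AA, AC, ..., TT), each divided by the
    number of windows, len(seq) - gap - 1.
    """
    seq = seq.upper()
    pairs = ["".join(p) for p in product(alphabet, repeat=2)]
    counts = dict.fromkeys(pairs, 0)
    n_windows = len(seq) - gap - 1
    for i in range(max(n_windows, 0)):
        pair = seq[i] + seq[i + gap + 1]
        if pair in counts:  # ignore pairs containing ambiguous bases
            counts[pair] += 1
    return [counts[p] / n_windows if n_windows > 0 else 0.0 for p in pairs]


# One residue sits between the two members of each pair.
gapped = cksnap("ACGTAC", gap=1)
```

In practice the vectors for several gap values (for example 0 through 5) are concatenated, so the descriptor captures short-range spacing patterns that plain k-mer counts miss.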
Affiliation(s)
- Fuyi Li
- Northwest A&F University, China.,Department of Biochemistry and Molecular Biology and the Infection and Immunity Program, Biomedicine Discovery Institute, Monash University, Australia
| | - Jinxiang Chen
- Biomedicine Discovery Institute and the Department of Biochemistry and Molecular Biology, Monash University from the College of Information Engineering, Northwest A&F University, China
| | - Zongyuan Ge
- Monash University and also serves as a Deep Learning Specialist at NVIDIA AI Technology Centre. Before joining Monash, he was a research scientist at IBM Research Australia doing research in medical AI during 2016-2018. His research interests are AI, computer vision, medical image, robotics and deep learning
| | - Ya Wen
- computer technology from Ningxia University, China
| | - Yanwei Yue
- medical science from Southern Medical University, China
| | - Morihiro Hayashida
- informatics from Kyoto University, Japan, in 2005. He is an Assistant Professor in the Department of Electrical Engineering and Computer Science, National Institute of Technology, Matsue College, Japan
| | - Abdelkader Baggag
- computer science from the University of Minnesota. He is a Senior Scientist at the Qatar Computing Research Institute (QCRI) and has a joint appointment as an Associate Professor at Hamad Bin Khalifa University (HBKU) in the Division of Information and Computing Technology. His research interests include data mining, linear algebra and machine learning
| | - Halima Bensmail
- University of Pierre & Marie Currie (Paris 6) in France. She is currently a Principal Scientist at QCRI-HBKU and a joint Associate Professor at the College of Computer and Science Engineering, HBKU
| | - Jiangning Song
- Monash Biomedicine Discovery Institute, Monash University, Australia. He is also affiliated with the Monash Centre for Data Science, Faculty of Information Technology, Monash University. His research interests include bioinformatics, computational biology, machine learning, data mining, and pattern recognition
| |
|
22
|
Identification of prokaryotic promoters and their strength by integrating heterogeneous features. Genomics 2020; 112:1396-1403. [DOI: 10.1016/j.ygeno.2019.08.009] [Citation(s) in RCA: 27] [Impact Index Per Article: 6.8] [Received: 06/21/2019] [Revised: 07/31/2019] [Accepted: 08/14/2019] [Indexed: 12/21/2022]
|
23
|
Wang HT, Xiao FH, Li GH, Kong QP. Identification of DNA N6-methyladenine sites by integration of sequence features. Epigenetics Chromatin 2020; 13:8. [PMID: 32093759 PMCID: PMC7038560 DOI: 10.1186/s13072-020-00330-2] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 10/14/2019] [Accepted: 02/03/2020] [Indexed: 02/21/2023]
Abstract
Background: An increasing number of nucleic acid modifications have been profiled with the development of sequencing technologies. DNA N6-methyladenine (6mA), which is a prevalent epigenetic modification, plays important roles in a series of biological processes. So far, identification of DNA 6mA relies primarily on time-consuming and expensive experimental approaches. However, in silico methods can be implemented to conduct preliminary screening to save experimental resources and time, especially given the rapid accumulation of sequencing data. Results: In this study, we constructed a 6mA predictor, p6mA, from a series of sequence-based features, including physicochemical properties, position-specific triple-nucleotide propensity (PSTNP), and electron-ion interaction pseudopotential (EIIP). We performed maximum relevance maximum distance (MRMD) analysis to select key features and used the Extreme Gradient Boosting (XGBoost) algorithm to build our predictor. Results demonstrated that p6mA outperformed other existing predictors using different datasets. Conclusions: p6mA can predict the methylation status of DNA adenines using only sequence files. It may be used as a tool to help the study of 6mA distribution patterns. Users can download it from https://github.com/Konglab404/p6mA.
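The electron-ion interaction pseudopotential (EIIP) descriptor named in this abstract replaces each nucleotide with a fixed numeric value. The sketch below uses the EIIP values commonly cited in the literature (A 0.1260, C 0.1340, G 0.0806, T 0.1335). How p6mA combines these values into features is not stated in the abstract, so the function names `eiip_encode` and `mean_eiip` and the choice to skip ambiguous bases are assumptions made for illustration.

```python
# Commonly cited EIIP values for the four DNA nucleotides.
EIIP = {"A": 0.1260, "C": 0.1340, "G": 0.0806, "T": 0.1335}


def eiip_encode(seq):
    """Map a DNA string to its EIIP numeric series.

    Hypothetical helper; characters outside A/C/G/T are skipped.
    """
    return [EIIP[base] for base in seq.upper() if base in EIIP]


def mean_eiip(seq):
    """Average EIIP over the sequence, one simple summary feature."""
    values = eiip_encode(seq)
    return sum(values) / len(values) if values else 0.0


signal = eiip_encode("GATC")
```

The resulting numeric series can be fed to a classifier directly or summarized further, for instance per position around the candidate adenine.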
Affiliation(s)
- Hao-Tian Wang
- State Key Laboratory of Genetic Resources and Evolution/Key Laboratory of Healthy Aging Research of Yunnan Province, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming, 650223, China.,Center for Excellence in Animal Evolution and Genetics, Chinese Academy of Sciences, Kunming, 650223, China.,Kunming Key Laboratory of Healthy Aging Study, Kunming, 650223, China.,Kunming College of Life Science, University of Chinese Academy of Sciences, Beijing, 100049, China
| | - Fu-Hui Xiao
- State Key Laboratory of Genetic Resources and Evolution/Key Laboratory of Healthy Aging Research of Yunnan Province, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming, 650223, China.,Center for Excellence in Animal Evolution and Genetics, Chinese Academy of Sciences, Kunming, 650223, China.,Kunming Key Laboratory of Healthy Aging Study, Kunming, 650223, China
| | - Gong-Hua Li
- State Key Laboratory of Genetic Resources and Evolution/Key Laboratory of Healthy Aging Research of Yunnan Province, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming, 650223, China.,Center for Excellence in Animal Evolution and Genetics, Chinese Academy of Sciences, Kunming, 650223, China.,Kunming Key Laboratory of Healthy Aging Study, Kunming, 650223, China
| | - Qing-Peng Kong
- State Key Laboratory of Genetic Resources and Evolution/Key Laboratory of Healthy Aging Research of Yunnan Province, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming, 650223, China. .,Center for Excellence in Animal Evolution and Genetics, Chinese Academy of Sciences, Kunming, 650223, China. .,Kunming Key Laboratory of Healthy Aging Study, Kunming, 650223, China. .,KIZ/CUHK Joint Laboratory of Bioresources and Molecular Research in Common Diseases, Kunming, 650223, China.
| |
|
24
|
Dong YM, Qin LD, Tong YF, He QE, Wang L, Song K. Multiple genome pattern analysis and signature gene identification for the Caucasian lung adenocarcinoma patients with different tobacco exposure patterns. PeerJ 2020; 8:e8349. [PMID: 32030321 PMCID: PMC6995662 DOI: 10.7717/peerj.8349] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 06/05/2019] [Accepted: 12/04/2019] [Indexed: 11/20/2022]
Abstract
Background When considering therapies for lung adenocarcinoma (LUAD) patients, the carcinogenic mechanisms of smokers are believed to differ from those who have never smoked. The rising trend in the proportion of nonsmokers in LUAD urgently requires the understanding of such differences at a molecular level for the development of precision medicine. Methods Three independent LUAD tumor sample sets—TCGA, SPORE and EDRN—were used. Genome patterns of expression (GE), copy number variation (CNV) and methylation (ME) were reviewed to discover the differences between them for both smokers and nonsmokers. Tobacco-related signature genes distinguishing these two groups of LUAD were identified using the GE, ME and CNV values of the whole genome. To do this, a novel iterative multi-step selection method based on the partial least squares (PLS) algorithm was proposed to overcome the high variable dimension and high noise inherent in the data. This method can thoroughly evaluate the importance of genes according to their statistical differences, biological functions and contributions to the tobacco exposure classification model. The kernel partial least squares (KPLS) method was used to further optimize the accuracies of the classification models. Results Forty-three, forty-eight and seventy-five genes were identified as GE, ME and CNV signatures, respectively, to distinguish smokers from nonsmokers. Using only the gene expression values of these 43 GE signature genes, ME values of the 48 ME signature genes or copy numbers of the 75 CNV signature genes, the accuracies of TCGA training and SPORE/EDRN independent validation datasets all exceed 76%. More importantly, the focal amplicon in Telomerase Reverse Transcriptase in nonsmokers, the broad deletion in ChrY in male nonsmokers and the greater amplification of MDM2 in female nonsmokers may explain why nonsmokers of both genders tend to suffer LUAD. 
These pattern analysis results may have clear biological interpretation in the molecular mechanism of tumorigenesis. Meanwhile, the identified signature genes may serve as potential drug targets for the precision medicine of LUAD.
Affiliation(s)
- Yan-mei Dong
- School of Chemical Engineering and Technology, Tianjin University, Tianjin, China
| | - Li-da Qin
- School of Chemical Engineering and Technology, Tianjin University, Tianjin, China
| | - Yi-fan Tong
- School of Chemical Engineering and Technology, Tianjin University, Tianjin, China
| | - Qi-en He
- School of Chemical Engineering and Technology, Tianjin University, Tianjin, China
| | - Ling Wang
- The First Affiliated Hospital Oncology, Dalian Medical University, Dalian, Liaoning, China
| | - Kai Song
- School of Chemical Engineering and Technology, Tianjin University, Tianjin, China
| |
|
25
|
Chen YL, Guo DH, Li QZ. An energy model for recognizing the prokaryotic promoters based on molecular structure. Genomics 2019; 112:2072-2079. [PMID: 31809797 DOI: 10.1016/j.ygeno.2019.12.001] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 08/27/2019] [Revised: 11/06/2019] [Accepted: 12/01/2019] [Indexed: 11/19/2022]
Abstract
Promoters are important functional elements of DNA sequences that are in charge of gene transcription initiation. Recognizing promoters is important for understanding the related life phenomena. Based on the concept that a promoter is mainly determined by its sequence and structure, a novel statistical physics model for predicting promoters in Escherichia coli K-12 is proposed. The total energies of the DNA local structure of sequence segments in three benchmark promoter sequence datasets, the sole prediction parameter, are calculated using principles from statistical physics and information theory. Better results are obtained. A web server, PhysMPrePro, for predicting promoters is established at http://202.207.14.87:8032/bioinformation/PhysMPrePro/index.asp, so that other scientists can easily obtain their desired results.
Affiliation(s)
- Ying-Li Chen
- Laboratory of Theoretical Biophysics, School of Physical Science and Technology, Inner Mongolia University, Hohhot 010021, China; The State key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, Inner Mongolia University, Hohhot 010070, China.
| | - Dong-Hua Guo
- Laboratory of Theoretical Biophysics, School of Physical Science and Technology, Inner Mongolia University, Hohhot 010021, China
| | - Qian-Zhong Li
- Laboratory of Theoretical Biophysics, School of Physical Science and Technology, Inner Mongolia University, Hohhot 010021, China; The State key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, Inner Mongolia University, Hohhot 010070, China.
| |
|
26
|
Le NQK, Yapp EKY, Nagasundaram N, Yeh HY. Classifying Promoters by Interpreting the Hidden Information of DNA Sequences via Deep Learning and Combination of Continuous FastText N-Grams. Front Bioeng Biotechnol 2019; 7:305. [PMID: 31750297 PMCID: PMC6848157 DOI: 10.3389/fbioe.2019.00305] [Citation(s) in RCA: 53] [Impact Index Per Article: 10.6] [Received: 08/13/2019] [Accepted: 10/17/2019] [Indexed: 01/16/2023] Open
Abstract
A promoter is a short region of DNA (100-1,000 bp) where transcription of a gene by RNA polymerase begins. It is typically located directly upstream or at the 5' end of the transcription initiation site. DNA promoters have been proven to be the primary cause of many human diseases, especially diabetes, cancer, or Huntington's disease. Therefore, classifying promoters has become an interesting problem that has attracted the attention of many researchers in the bioinformatics field. A variety of studies have been conducted to resolve this problem; however, their performance still requires further improvement. In this study, we present an innovative approach that interprets DNA sequences as a combination of continuous FastText N-grams, which are then fed into a deep neural network for classification. Our approach attains cross-validation accuracies of 85.41% and 73.1% in the two layers, respectively. Our results outperform the state-of-the-art methods on the same dataset, especially in the second layer (strength classification). Through this study, promoter regions can be identified with high accuracy, which supports further biological research as well as precision medicine. In addition, this study opens new paths for natural language processing applications in omics data in general and DNA sequences in particular.
Affiliation(s)
- Nguyen Quoc Khanh Le
- Professional Master Program in Artificial Intelligence in Medicine, Taipei Medical University, Taipei, Taiwan
| | | | - N. Nagasundaram
- Medical Humanities Research Cluster, School of Humanities, Nanyang Technological University, Singapore, Singapore
| | - Hui-Yuan Yeh
- Medical Humanities Research Cluster, School of Humanities, Nanyang Technological University, Singapore, Singapore
| |
|
27
|
Zhang M, Li F, Marquez-Lago TT, Leier A, Fan C, Kwoh CK, Chou KC, Song J, Jia C. MULTiPly: a novel multi-layer predictor for discovering general and specific types of promoters. Bioinformatics 2019; 35:2957-2965. [PMID: 30649179 PMCID: PMC6736106 DOI: 10.1093/bioinformatics/btz016] [Citation(s) in RCA: 75] [Impact Index Per Article: 15.0] [Received: 10/30/2018] [Revised: 12/09/2018] [Accepted: 01/05/2019] [Indexed: 12/22/2022] Open
Abstract
MOTIVATION Promoters are short DNA consensus sequences that are localized proximal to the transcription start sites of genes, allowing transcription initiation of particular genes. However, the precise prediction of promoters remains a challenging task because individual promoters often differ from the consensus at one or more positions. RESULTS In this study, we present a new multi-layer computational approach, called MULTiPly, for recognizing promoters and their specific types. MULTiPly took into account the sequences themselves, including both local information such as k-tuple nucleotide composition, dinucleotide-based auto covariance and global information of the entire samples based on bi-profile Bayes and k-nearest neighbour feature encodings. Specifically, the F-score feature selection method was applied to identify the best unique type of feature prediction results, in combination with other types of features that were subsequently added to further improve the prediction performance of MULTiPly. Benchmarking experiments on the benchmark dataset and comparisons with five state-of-the-art tools show that MULTiPly can achieve a better prediction performance on 5-fold cross-validation and jackknife tests. Moreover, the superiority of MULTiPly was also validated on a newly constructed independent test dataset. MULTiPly is expected to be used as a useful tool that will facilitate the discovery of both general and specific types of promoters in the post-genomic era. AVAILABILITY AND IMPLEMENTATION The MULTiPly webserver and curated datasets are freely available at http://flagshipnt.erc.monash.edu/MULTiPly/. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Meng Zhang
- School of Science, Dalian Maritime University, Dalian, China
| | - Fuyi Li
- Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology
- Monash Centre for Data Science, Faculty of Information Technology, Monash University, Melbourne, VIC, Australia
| | - Tatiana T Marquez-Lago
- Department of Genetics, Developmental and Integrative Biology, School of Medicine, University of Alabama at Birmingham, Birmingham, AL, USA
- Department of Cell, Developmental and Integrative Biology, School of Medicine, University of Alabama at Birmingham, Birmingham, AL, USA
| | - André Leier
- Department of Genetics, Developmental and Integrative Biology, School of Medicine, University of Alabama at Birmingham, Birmingham, AL, USA
- Department of Cell, Developmental and Integrative Biology, School of Medicine, University of Alabama at Birmingham, Birmingham, AL, USA
| | - Cunshuo Fan
- College of Information Engineering, Northwest A&F University, Yangling, China
| | - Chee Keong Kwoh
- School of Computer Science and Engineering, Nanyang Technological University, Singapore, Singapore
| | | | - Jiangning Song
- Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology
- Monash Centre for Data Science, Faculty of Information Technology, Monash University, Melbourne, VIC, Australia
- ARC Centre of Excellence in Advanced Molecular Imaging, Monash University, Melbourne, VIC, Australia
| | - Cangzhi Jia
- School of Science, Dalian Maritime University, Dalian, China
- College of Information Engineering, Northwest A&F University, Yangling, China
| |
|
28
|
iPromoter-2L2.0: Identifying Promoters and Their Types by Combining Smoothing Cutting Window Algorithm and Sequence-Based Features. Mol Ther Nucleic Acids 2019; 18:80-87. [PMID: 31536883 PMCID: PMC6796744 DOI: 10.1016/j.omtn.2019.08.008] [Citation(s) in RCA: 46] [Impact Index Per Article: 9.2] [Received: 06/07/2019] [Revised: 07/17/2019] [Accepted: 08/02/2019] [Indexed: 11/23/2022]
Abstract
Promoters are short regions at specific locations of DNA sequences that play key roles in directing gene transcription. They can be grouped into six types (σ24, σ28, σ32, σ38, σ54, σ70). Recently, a predictor called "iPromoter-2L" was constructed to predict promoters and their six types; it was the first approach to predict all six types of promoters. However, its predictive quality still needs to be improved to meet real-world application requirements. In this study, we proposed the smoothing cutting window algorithm to find window fragments of the DNA sequences based on conservation scores, in order to capture the sequence patterns of promoters. For each window fragment, discriminative features were extracted using kmer and PseKNC. Combined with support vector machines (SVMs), different predictors were constructed and then clustered into several groups based on their distances. Finally, a new predictor called iPromoter-2L2.0 was constructed to identify promoters and their six types; it was developed by ensemble learning based on the key predictors selected from the cluster groups. The results showed that iPromoter-2L2.0 outperformed other existing methods for both promoter prediction and identification of the six types, indicating that iPromoter-2L2.0 will be helpful for genomics analysis.
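The k-mer feature extraction mentioned in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `kmer_frequencies`, the choice to skip windows containing non-ACGT symbols, and the normalization by the number of valid windows are all assumptions made for illustration. The smoothing cutting window step and PseKNC are not covered.

```python
from itertools import product


def kmer_frequencies(seq, k=2):
    """Return normalized k-mer frequencies over ACGT, in lexicographic order.

    Hypothetical helper for illustration only; windows containing symbols
    other than A, C, G, T (for example N) are skipped.
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:
            counts[window] += 1
            total += 1
    # Guard against empty or too-short sequences.
    return [counts[m] / total if total else 0.0 for m in kmers]


# Example: a 16-dimensional dinucleotide feature vector for a short fragment.
features = kmer_frequencies("TATAAT", k=2)
```

Each window fragment would yield one such vector, which could then be passed to a classifier such as an SVM.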
|
29
|
Lin H, Liang ZY, Tang H, Chen W. Identifying Sigma70 Promoters with Novel Pseudo Nucleotide Composition. IEEE/ACM Trans Comput Biol Bioinform 2019; 16:1316-1321. [PMID: 28186907 DOI: 10.1109/tcbb.2017.2666141] [Citation(s) in RCA: 95] [Impact Index Per Article: 19.0] [Indexed: 06/06/2023]
Abstract
Promoters are DNA regulatory elements located directly upstream or at the 5' end of the transcription initiation site (TSS), which are in charge of gene transcription initiation. With the completion of a large number of microorganism genomics projects, it is urgent to predict promoters accurately in bacteria using computational methods. In this work, a sequence-based predictor named "iPro70-PseZNC" was designed for identifying sigma70 promoters in prokaryotes. In the predictor, the DNA sequence samples are formulated by a novel pseudo nucleotide composition, called PseZNC, into which the multi-window Z-curve composition and six local DNA structural properties are incorporated. In 5-fold cross-validation, an area under the receiver operating characteristic curve of 0.909 was obtained on our benchmark dataset, indicating that the proposed predictor is promising and will provide an important guide in this area. Further studies showed that the performance of PseZNC is better than that of the multi-window Z-curve composition. For the convenience of researchers, a user-friendly online service was established and is freely accessible at http://lin.uestc.edu.cn/server/iPro70-PseZNC. The PseZNC approach can also be extended to other DNA-related problems.
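As background for the Z-curve composition named in this abstract, the sketch below computes the three standard mononucleotide Z-curve components from base frequencies: purine versus pyrimidine (x), amino versus keto (y), and weak versus strong hydrogen bonding (z). It is a single-window simplification under stated assumptions: the function name `z_curve_components` is hypothetical, and the multi-window scheme and the six structural properties used in PseZNC are not reproduced here.

```python
def z_curve_components(seq):
    """Return the (x, y, z) Z-curve components from base frequencies.

    x = (A + G) - (C + T)   purine vs pyrimidine
    y = (A + C) - (G + T)   amino vs keto
    z = (A + T) - (G + C)   weak vs strong hydrogen bonding
    Hypothetical single-window helper; non-ACGT symbols are ignored.
    """
    seq = seq.upper()
    n = sum(seq.count(b) for b in "ACGT")
    if n == 0:
        return (0.0, 0.0, 0.0)
    a, c, g, t = (seq.count(b) / n for b in "ACGT")
    x = (a + g) - (c + t)
    y = (a + c) - (g + t)
    z = (a + t) - (g + c)
    return (x, y, z)


# Example: an AT-rich fragment gives a positive z component.
x, y, z = z_curve_components("TATAAT")
```

A multi-window variant would apply the same transform to several sub-windows of the sequence and concatenate the results.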
|
30
|
Liu B, Han L, Liu X, Wu J, Ma Q. Computational Prediction of Sigma-54 Promoters in Bacterial Genomes by Integrating Motif Finding and Machine Learning Strategies. IEEE/ACM Trans Comput Biol Bioinform 2019; 16:1211-1218. [PMID: 29993815 DOI: 10.1109/tcbb.2018.2816032] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.0] [Indexed: 06/08/2023]
Abstract
The sigma factor, as a unit of the RNA polymerase holoenzyme, is a critical factor in the process of gene transcriptional regulation. It recognizes specific DNA sites and brings the core enzyme of RNA polymerase to the upstream regions of target genes. Therefore, the prediction of promoters for a particular sigma factor is essential for interpreting functional genomic data and observations. This paper develops a new method to predict sigma-54 promoters in bacterial genomes. The new method integrates motif finding and machine learning strategies to capture the intrinsic features of sigma-54 promoters. Experiments on an E. coli benchmark test set show that our method has good capability to distinguish sigma-54 promoters from surrounding or randomly selected DNA sequences. Applications to three other bacterial genomes indicate the potential robustness and applicability of our method to a large number of bacterial genomes. The source code of our method can be freely downloaded at https://github.com/maqin2001/PromotePredictor.
|
31
|
Lai HY, Zhang ZY, Su ZD, Su W, Ding H, Chen W, Lin H. iProEP: A Computational Predictor for Predicting Promoter. Mol Ther Nucleic Acids 2019; 17:337-346. [PMID: 31299595 PMCID: PMC6616480 DOI: 10.1016/j.omtn.2019.05.028] [Citation(s) in RCA: 103] [Impact Index Per Article: 20.6] [Received: 02/23/2019] [Revised: 05/18/2019] [Accepted: 05/19/2019] [Indexed: 11/29/2022]
Abstract
A promoter is a fundamental DNA element located around the transcription start site (TSS) that can regulate gene transcription. Promoter recognition is of great significance in determining transcription units, studying gene structure, analyzing gene regulation mechanisms, and annotating gene functional information. Many models have already been proposed to predict promoters. However, the performance of these methods still needs to be improved. In this work, we combined pseudo k-tuple nucleotide composition (PseKNC) with a position-correlation scoring function (PCSF) to formulate promoter sequences of Homo sapiens (H. sapiens), Drosophila melanogaster (D. melanogaster), Caenorhabditis elegans (C. elegans), Bacillus subtilis (B. subtilis), and Escherichia coli (E. coli). The Minimum Redundancy Maximum Relevance (mRMR) algorithm and an increment feature selection strategy were then adopted to find optimal feature subsets. A support vector machine (SVM) was used to distinguish between promoters and non-promoters. In the 10-fold cross-validation test, accuracies of 93.3%, 93.9%, 95.7%, 95.2%, and 93.1% were obtained for H. sapiens, D. melanogaster, C. elegans, B. subtilis, and E. coli, with areas under the receiver operating characteristic curve (AUCs) of 0.974, 0.975, 0.981, 0.988, and 0.976, respectively. Comparative results demonstrated that our method outperforms existing methods for identifying promoters. An online web server was established that can be freely accessed (http://lin-group.cn/server/iProEP/).
Affiliation(s)
- Hong-Yan Lai
- Key Laboratory for NeuroInformation of Ministry of Education, School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Zhao-Yue Zhang
- Key Laboratory for NeuroInformation of Ministry of Education, School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Zhen-Dong Su
- Key Laboratory for NeuroInformation of Ministry of Education, School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Wei Su
- Key Laboratory for NeuroInformation of Ministry of Education, School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Hui Ding
- Key Laboratory for NeuroInformation of Ministry of Education, School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Wei Chen
- Key Laboratory for NeuroInformation of Ministry of Education, School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China; Innovative Institute of Chinese Medicine and Pharmacy, Chengdu University of Traditional Chinese Medicine, Chengdu 611730, China; Center for Genomics and Computational Biology, School of Life Sciences, North China University of Science and Technology, Tangshan 063000, China.
| | - Hao Lin
- Key Laboratory for NeuroInformation of Ministry of Education, School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China.
| |
|
32
|
Xiao X, Xu ZC, Qiu WR, Wang P, Ge HT, Chou KC. iPSW(2L)-PseKNC: A two-layer predictor for identifying promoters and their strength by hybrid features via pseudo K-tuple nucleotide composition. Genomics 2018; 111:1785-1793. [PMID: 30529532 DOI: 10.1016/j.ygeno.2018.12.001] [Citation(s) in RCA: 44] [Impact Index Per Article: 7.3] [Received: 10/31/2018] [Revised: 11/20/2018] [Accepted: 12/04/2018] [Indexed: 12/20/2022]
Abstract
The promoter is a regulatory DNA region about 81-1000 base pairs long, usually located near the transcription start site (TSS), upstream of a given gene. By binding proteins called transcription factors, the promoter provides the starting point for regulated gene transcription, and hence plays a vitally important role in gene transcriptional regulation. With the explosive growth of DNA sequences in the post-genomic age, it has become an urgent challenge to develop computational methods for effectively identifying promoters, because the information thus obtained is very useful for both basic research and drug development. Although some prediction methods have been developed in this regard, most of them are limited to merely identifying whether a query DNA sequence is a promoter or not. However, based on their distinct strength levels for transcriptional activation and expression, promoters should be divided into two categories: strong and weak. Here a new two-layer predictor, called "iPSW(2L)-PseKNC", was developed by fusing the physicochemical properties of nucleotides and their nucleotide density into PseKNC (pseudo K-tuple nucleotide composition). Its 1st layer predicts whether a query DNA sequence sample is a promoter or not, while its 2nd layer predicts the strength of promoters. It has been observed through rigorous cross-validations that the 1st-layer sub-predictor is remarkably superior to the existing state-of-the-art predictors in distinguishing promoters from non-promoters, and that the 2nd-layer sub-predictor can do what is beyond the reach of the existing predictors. Moreover, the web server for iPSW(2L)-PseKNC has been established at http://www.jci-bioinfo.cn/iPSW(2L)-PseKNC, by which the majority of experimental scientists can easily get the results they need.
Affiliation(s)
- Xuan Xiao
- Computer Department, Jingdezhen Ceramic Institute, Jingdezhen, China; The Gordon Life Science Institute, Boston, MA 02478, USA.
| | - Zhao-Chun Xu
- Computer Department, Jingdezhen Ceramic Institute, Jingdezhen, China.
| | - Wang-Ren Qiu
- Computer Department, Jingdezhen Ceramic Institute, Jingdezhen, China; The Gordon Life Science Institute, Boston, MA 02478, USA
| | - Peng Wang
- Computer Department, Jingdezhen Ceramic Institute, Jingdezhen, China
| | - Hui-Ting Ge
- Computer Department, Jingdezhen Ceramic Institute, Jingdezhen, China
| | - Kuo-Chen Chou
- The Gordon Life Science Institute, Boston, MA 02478, USA; Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China.
| |
|
33
|
Rahman MS, Aktar U, Jani MR, Shatabda S. iPro70-FMWin: identifying Sigma70 promoters using multiple windowing and minimal features. Mol Genet Genomics 2018; 294:69-84. [PMID: 30187132 DOI: 10.1007/s00438-018-1487-5] [Citation(s) in RCA: 27] [Impact Index Per Article: 4.5] [Received: 07/13/2018] [Accepted: 08/29/2018] [Indexed: 01/16/2023]
Abstract
In bacterial DNA, there are specific sequences of nucleotides called promoters that can bind to the RNA polymerase. Sigma70 (σ70) is one of the most important promoter sequences due to its presence in most of the DNA regulatory functions. In this paper, we identify the most effective and optimal sequence-based features for prediction of σ70 promoter sequences in a bacterial genome. We used both short-range and long-range DNA sequences in our proposed method. A very small number of effective features are selected from a large number of the extracted features using multi-window of different sizes within the DNA sequences. We call our prediction method iPro70-FMWin and made it freely accessible online via a web application established at http://ipro70.pythonanywhere.com/server for the sake of convenience of the researchers. We have tested our method using a standard benchmark dataset. In the experiments, iPro70-FMWin has achieved an area under the curve of the receiver operating characteristic and accuracy of 0.959 and 90.57%, respectively, which significantly outperforms the state-of-the-art predictors.
Affiliation(s)
- Md Siddiqur Rahman
- Department of Computer Science and Engineering, United International University, Madani Avenue, Satarkul, Badda, Dhaka, 1212, Bangladesh
| | - Usma Aktar
- Department of Computer Science and Engineering, United International University, Madani Avenue, Satarkul, Badda, Dhaka, 1212, Bangladesh
| | - Md Rafsan Jani
- Department of Computer Science and Engineering, United International University, Madani Avenue, Satarkul, Badda, Dhaka, 1212, Bangladesh
| | - Swakkhar Shatabda
- Department of Computer Science and Engineering, United International University, Madani Avenue, Satarkul, Badda, Dhaka, 1212, Bangladesh.
| |
|
34
|
Rahman MS, Aktar U, Jani MR, Shatabda S. iPromoter-FSEn: Identification of bacterial σ70 promoter sequences using feature subspace based ensemble classifier. Genomics 2018; 111:1160-1166. [PMID: 30059731 DOI: 10.1016/j.ygeno.2018.07.011] [Citation(s) in RCA: 24] [Impact Index Per Article: 4.0] [Received: 04/25/2018] [Revised: 07/07/2018] [Accepted: 07/12/2018] [Indexed: 10/28/2022]
Abstract
Sigma promoter sequences in bacterial genomes are important due to their role in transcription initiation. Sigma 70 is one of the most important and crucial sigma factors. In this paper, we address the problem of identifying σ70 promoter sequences in bacterial genomes. We propose iPromoter-FSEn, a novel predictor for identification of σ70 promoter sequences. Our proposed method is based on a feature subspace based ensemble classifier. A large set of features extracted from the nucleotide sequence is divided into subsets, and each subset is given to an individual classifier to learn. Based on the decisions of the individual classifiers, an aggregate decision is made by the ensemble voting classifier. We tested our method on a standard benchmark dataset extracted from experimentally validated results. Experimental results show that iPromoter-FSEn significantly improves over the state-of-the-art σ70 promoter sequence predictors. The accuracy and area under the receiver operating characteristic curve of iPromoter-FSEn are 86.32% and 0.9319, respectively. We have also made our method readily available for use as a web application at http://ipromoterfsen.pythonanywhere.com/server.
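The general idea of a feature subspace ensemble with majority voting, as described in this abstract, can be sketched in plain Python. Everything below is an illustrative assumption rather than the authors' method: the interleaved partition in `split_features`, the toy `NearestCentroid` base learner, and the tie-breaking rule in `ensemble_predict` (smallest label wins) are all hypothetical choices.

```python
def split_features(n_features, n_subsets):
    """Partition feature indices into n_subsets interleaved groups (assumed scheme)."""
    return [list(range(i, n_features, n_subsets)) for i in range(n_subsets)]


class NearestCentroid:
    """Toy base classifier that sees only the feature indices in `idx`."""

    def __init__(self, idx):
        self.idx = idx
        self.centroids = {}

    def fit(self, X, y):
        for label in set(y):
            rows = [[x[j] for j in self.idx] for x, lab in zip(X, y) if lab == label]
            # Column-wise mean of the rows belonging to this class.
            self.centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
        return self

    def predict_one(self, x):
        sub = [x[j] for j in self.idx]

        def sq_dist(center):
            return sum((u - v) ** 2 for u, v in zip(sub, center))

        return min(sorted(self.centroids), key=lambda lab: sq_dist(self.centroids[lab]))


def ensemble_predict(models, x):
    """Majority vote over the base classifiers; ties go to the smallest label."""
    votes = [m.predict_one(x) for m in models]
    return max(sorted(set(votes)), key=votes.count)


# Toy data: three features, two classes (0 = non-promoter, 1 = promoter).
X = [[0, 0, 0], [1, 0, 1], [5, 5, 5], [6, 5, 6]]
y = [0, 0, 1, 1]
models = [NearestCentroid(idx).fit(X, y) for idx in split_features(3, 3)]
label = ensemble_predict(models, [0, 0, 9])
```

In the toy example, two of the three base classifiers vote for class 0 while the one that sees only the third feature votes for class 1, so the ensemble returns 0.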
Affiliation(s)
- Md Siddiqur Rahman
- Department of Computer Science and Engineering, United International University Madani Avenue, Satarkul, Badda, Dhaka 1212, Bangladesh
| | - Usma Aktar
- Department of Computer Science and Engineering, United International University Madani Avenue, Satarkul, Badda, Dhaka 1212, Bangladesh
| | - Md Rafsan Jani
- Department of Computer Science and Engineering, United International University Madani Avenue, Satarkul, Badda, Dhaka 1212, Bangladesh
| | - Swakkhar Shatabda
- Department of Computer Science and Engineering, United International University Madani Avenue, Satarkul, Badda, Dhaka 1212, Bangladesh.
| |
|
35
|
He W, Jia C, Duan Y, Zou Q. 70ProPred: a predictor for discovering sigma70 promoters based on combining multiple features. BMC Syst Biol 2018; 12:44. [PMID: 29745856 PMCID: PMC5998878 DOI: 10.1186/s12918-018-0570-1] [Citation(s) in RCA: 56] [Impact Index Per Article: 9.3] [Indexed: 12/12/2022]
Abstract
BACKGROUND A promoter is an important sequence regulation element that is in charge of gene transcription initiation. In prokaryotes, σ70 promoters regulate the transcription of most genes. Promoter recognition has been a crucial part of gene structure recognition. It is also the core issue in constructing gene transcriptional regulation networks. With the successful completion of genome sequencing for an increasing number of microbial species, accurately identifying σ70 promoter regions in DNA sequences remains difficult. RESULTS In order to improve the prediction accuracy of sigma70 promoters in prokaryotes, a promoter recognition model, 70ProPred, was established. In this work, two sequence-based features, position-specific trinucleotide propensity based on single-stranded characteristic (PSTNPss) and electron-ion potential values for trinucleotides (PseEIIP), were assessed to build the best prediction model. It was found that 79 features of PSTNPss combined with 64 features of PseEIIP obtained the best performance for sigma70 promoter identification, with an accuracy and Matthews correlation coefficient (MCC) of 95.56% and 0.90, respectively. CONCLUSION The jackknife tests showed that 70ProPred outperforms the existing sigma70 promoter prediction approaches in terms of accuracy and stability. Additionally, this approach can be extended to predict promoters of other species. To assist experimental biologists, an online web server for the proposed method was established, which is freely available at http://server.malab.cn/70ProPred/ .
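For readers unfamiliar with the two metrics reported in this abstract, the sketch below computes accuracy and the Matthews correlation coefficient (MCC) from the four cells of a binary confusion matrix using their standard definitions. The function name `accuracy_and_mcc` and the example counts are hypothetical; they are not taken from the paper.

```python
import math


def accuracy_and_mcc(tp, tn, fp, fn):
    """Return (accuracy, MCC) for a binary confusion matrix.

    MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)),
    defined here as 0.0 when the denominator is zero.
    """
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return accuracy, mcc


# Hypothetical counts on a balanced set of 100 sequences.
acc, mcc = accuracy_and_mcc(tp=45, tn=45, fp=5, fn=5)
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, which is why it is often reported alongside accuracy when positive and negative sets differ in size.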
Affiliation(s)
- Wenying He
- School of Computer Science and Technology, Tianjin University, Tianjin, 300072 China
| | - Cangzhi Jia
- Department of Mathematics, Dalian Maritime University, Dalian, 116026 China
| | - Yucong Duan
- College of Information and Technology, Hainan University, Haikou, 570228 China
| | - Quan Zou
- School of Computer Science and Technology, Tianjin University, Tianjin, 300072 China
| |
|
36
|
Ryasik A, Orlov M, Zykova E, Ermak T, Sorokin A. Bacterial promoter prediction: Selection of dynamic and static physical properties of DNA for reliable sequence classification. J Bioinform Comput Biol 2018; 16:1840003. [DOI: 10.1142/s0219720018400036] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Indexed: 11/18/2022]
Abstract
Predicting the promoter activity of a DNA fragment is an important task in computational biology. Approaches using physical properties of DNA to predict bacterial promoters have recently gained a lot of attention. To select an adequate set of physical properties for training a classifier, various characteristics of the DNA molecule should be taken into consideration. Here, we present a systematic approach that allows us to select less correlated properties for classification by means of both correlation and cophenetic coefficients as well as concordance matrices. To prove this concept, we have developed the first classifier that uses not only the sequence and static physical properties of a DNA fragment, but also dynamic properties of DNA open states. As a result, the best-performing models, with accuracy values up to 90% for all types of sequences, were obtained. Furthermore, we have demonstrated that the classifier can serve as a reliable tool enabling promoter DNA fragments to be distinguished from promoter islands despite the similarity of their nucleotide sequences.
Affiliation(s)
- Artem Ryasik
- Mechanism of Cell Genome Functioning Laboratory, Institute of Cell Biophysics, ul. Institutskaya 3, Pushchino 142290, Russia
| | - Mikhail Orlov
- Mechanism of Cell Genome Functioning Laboratory, Institute of Cell Biophysics, ul. Institutskaya 3, Pushchino 142290, Russia
| | - Evgenia Zykova
- Mechanism of Cell Genome Functioning Laboratory, Institute of Cell Biophysics, ul. Institutskaya 3, Pushchino 142290, Russia
- Department of Applied Research Informatization, State Institute of Information Technologies and Telecommunications (SIIT&T Informika), per. Brusov 21 st.2, Moscow, 125009, Russia
| | - Timofei Ermak
- Laboratory of Molecular Genetics Systems, Institute of Cytology and Genetics, pr. Akademika Lavrentyeva 10, Novosibirsk 630090, Russia
| | - Anatoly Sorokin
- Mechanism of Cell Genome Functioning Laboratory, Institute of Cell Biophysics, ul. Institutskaya 3, Pushchino 142290, Russia
|
37
|
Characterization of a Minimal Type of Promoter Containing the -10 Element and a Guanine at the -14 or -13 Position in Mycobacteria. J Bacteriol 2017; 199:JB.00385-17. [PMID: 28784819 DOI: 10.1128/jb.00385-17] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.6] [Received: 06/14/2017] [Accepted: 08/03/2017] [Indexed: 11/20/2022]
Abstract
Three key promoter elements, i.e., -10, -35, and T-15G-14N, are recognized by the σ subunit of RNA polymerase. Among them, promoters with the -10 element and either -35 or T-15G-14N are known to initiate transcription efficiently, but recent systematic analyses have identified a large group of promoters in Mycobacterium tuberculosis that contain only a -10 consensus. How these promoters initiate transcription remains poorly understood. Here, we show that promoters containing the -10 element and an upstream G located at the -14 or -13 position can successfully initiate transcription in mycobacteria. Importantly, this new type of promoter is active in the absence of other promoter consensuses, suggesting that it is a minimal promoter type. Mutation of the upstream G in promoters decreased the efficiencies of their binding with RNA polymerase and their abilities to initiate transcription in both in vitro and in vivo analyses. A glutamic acid in σ region 3.0 is essential for recognizing G-14 and G-13 and is conserved in both principal and principal-like σ factors in mycobacteria, indicating that recognition of this minimal type of promoter might be a common mechanism for transcription initiation. Consistently, more than 70% of the identified promoters in M. tuberculosis contained G-14 or G-13 upstream of the conserved -10 element, and thousands of promoters in representative mycobacterial species have been predicted using the -10 consensus and G-14 or G-13. Altogether, our study presents a universal mechanism for transcription initiation from a minimal promoter in mycobacteria, which might also be applicable to other bacteria.
IMPORTANCE In contrast to the detailed information for recognizing classic promoters in the model organism Escherichia coli, very little is known about how transcription is initiated in the human pathogen Mycobacterium tuberculosis. In this study, we characterized a new type of promoter in mycobacteria that requires only a -10 consensus and an upstream G-14 or G-13. Residues important for recognizing the -10 element and the upstream G are conserved in σA and σB from mycobacterial species. According to such features, thousands of promoters in mycobacteria can be predicted using the -10 consensus and G-14 or G-13, which suggests that transcription from this new type of promoter might be widespread. Our findings provide insightful information for characterizing promoters in mycobacteria.
|
38
|
Liu B, Yang F, Huang DS, Chou KC. iPromoter-2L: a two-layer predictor for identifying promoters and their types by multi-window-based PseKNC. Bioinformatics 2017; 34:33-40. [DOI: 10.1093/bioinformatics/btx579] [Citation(s) in RCA: 235] [Impact Index Per Article: 33.6] [Received: 07/13/2017] [Accepted: 09/13/2017] [Indexed: 12/30/2022]
Affiliation(s)
- Bin Liu
- School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong, China
- The Gordon Life Science Institute, Boston, MA, USA
| | - Fan Yang
- School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong, China
| | - De-Shuang Huang
- Institute of Machine Learning and Systems Biology, School of Electronics and Information Engineering, Tongji University, Shanghai, China
| | - Kuo-Chen Chou
- The Gordon Life Science Institute, Boston, MA, USA
- Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
- Faculty of Computing and Information Technology in Rabigh, King Abdulaziz University, Jeddah, Saudi Arabia
|
39
|
Dong C, Yuan YZ, Zhang FZ, Hua HL, Ye YN, Labena AA, Lin H, Chen W, Guo FB. Combining pseudo dinucleotide composition with the Z curve method to improve the accuracy of predicting DNA elements: a case study in recombination spots. Mol Biosyst 2017; 12:2893-900. [PMID: 27410247 DOI: 10.1039/c6mb00374e] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.3] [Indexed: 11/21/2022]
Abstract
Pseudo dinucleotide composition (PseDNC) and the Z curve have shown excellent performance in classifying nucleotide sequences in bioinformatics. Inspired by the principle of Z curve theory, we improved PseDNC to give the phase-specific PseDNC (psPseDNC). In this study, we used the prediction of recombination spots as a case to illustrate the capability of psPseDNC and of PseDNC fused with Z curve theory, based on a novel machine learning method named the large margin distribution machine (LDM). We verified that combining the two widely used approaches could generate better performance than using PseDNC alone with a support vector machine based (SVM-based) model. The best Matthews correlation coefficient (MCC) achieved by our LDM-based model was 0.7037 in the rigorous jackknife test, an improvement of ∼6.6%, ∼3.2%, and ∼2.4% over three previous studies. Similarly, the accuracy was improved by 3.2% compared with our previous iRSpot-PseDNC web server in an independent data test. These results demonstrate that the joint use of PseDNC and the Z curve enhances performance and can extract more information from a biological sequence. To facilitate research in this area, we constructed a user-friendly web server for predicting hot/cold spots, HcsPredictor, which can be freely accessed from . In summary, we provided a unified algorithm by integrating the Z curve with PseDNC. We hope this unified algorithm can be extended to other classification problems in DNA elements.
Affiliation(s)
- Chuan Dong
- Center of Bioinformatics, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, China. and Center of Information in Biomedicine, University of Electronic Science and Technology of China, Chengdu, China and Key Laboratory for Neuro-information of the Ministry of Education, University of Electronic Science and Technology of China, Chengdu, China
| | - Ya-Zhou Yuan
- Center of Bioinformatics, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, China. and Center of Information in Biomedicine, University of Electronic Science and Technology of China, Chengdu, China and Key Laboratory for Neuro-information of the Ministry of Education, University of Electronic Science and Technology of China, Chengdu, China
| | - Fa-Zhan Zhang
- Center of Bioinformatics, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, China. and Center of Information in Biomedicine, University of Electronic Science and Technology of China, Chengdu, China and Key Laboratory for Neuro-information of the Ministry of Education, University of Electronic Science and Technology of China, Chengdu, China
| | - Hong-Li Hua
- Center of Bioinformatics, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, China. and Center of Information in Biomedicine, University of Electronic Science and Technology of China, Chengdu, China and Key Laboratory for Neuro-information of the Ministry of Education, University of Electronic Science and Technology of China, Chengdu, China
| | - Yuan-Nong Ye
- School of Biology and Engineering, Guizhou Medical University, Guiyang, China
| | - Abraham Alemayehu Labena
- Center of Bioinformatics, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, China. and Center of Information in Biomedicine, University of Electronic Science and Technology of China, Chengdu, China and Key Laboratory for Neuro-information of the Ministry of Education, University of Electronic Science and Technology of China, Chengdu, China
| | - Hao Lin
- Center of Bioinformatics, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, China. and Center of Information in Biomedicine, University of Electronic Science and Technology of China, Chengdu, China and Key Laboratory for Neuro-information of the Ministry of Education, University of Electronic Science and Technology of China, Chengdu, China
| | - Wei Chen
- Department of Physics, School of Sciences, Center for Genomics and Computational Biology, North China University of Science and Technology, Tangshan, China
| | - Feng-Biao Guo
- Center of Bioinformatics, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, China. and Center of Information in Biomedicine, University of Electronic Science and Technology of China, Chengdu, China and Key Laboratory for Neuro-information of the Ministry of Education, University of Electronic Science and Technology of China, Chengdu, China
|
40
|
Shahmuradov IA, Mohamad Razali R, Bougouffa S, Radovanovic A, Bajic VB. bTSSfinder: a novel tool for the prediction of promoters in cyanobacteria and Escherichia coli. Bioinformatics 2017; 33:334-340. [PMID: 27694198 PMCID: PMC5408793 DOI: 10.1093/bioinformatics/btw629] [Citation(s) in RCA: 33] [Impact Index Per Article: 4.7] [Received: 07/21/2016] [Accepted: 09/27/2016] [Indexed: 12/01/2022]
Abstract
Motivation: The computational search for promoters in prokaryotes remains an attractive problem in bioinformatics. Despite the attention it has received for many years, the problem has not been addressed satisfactorily. In any bacterial genome, the transcription start site is chosen mostly by the sigma (σ) factor proteins, which control gene activation. The majority of published bacterial promoter prediction tools target σ70 promoters in Escherichia coli. Moreover, no σ-specific classification of promoters is available for prokaryotes other than E. coli. Results: Here, we introduce bTSSfinder, a novel tool that predicts putative promoters for five classes of σ factors in Cyanobacteria (σA, σC, σH, σG and σF) and for five classes of σ factors in E. coli (σ70, σ38, σ32, σ28 and σ24). Compared with currently available tools, bTSSfinder achieves higher accuracy (MCC = 0.86, F1-score = 0.93, versus MCC = 0.59, F1-score = 0.79 for the next best tool) and covers multiple classes of promoters. Availability and Implementation: bTSSfinder is available standalone and online at http://www.cbrc.kaust.edu.sa/btssfinder. Supplementary information: Supplementary data are available at Bioinformatics online.
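For readers comparing the MCC and F1-score figures quoted in this abstract, the following minimal sketch computes both metrics from a binary confusion matrix using their standard definitions. The counts in the usage lines are made-up illustrative numbers, not data from bTSSfinder, and the function names are hypothetical.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient for a binary confusion matrix."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0.0 when any marginal total is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


def f1_score(tp, fp, fn):
    """F1-score: harmonic mean of precision and recall."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical counts for a promoter / non-promoter classifier.
print(round(mcc(tp=90, tn=85, fp=15, fn=10), 3))
print(round(f1_score(tp=90, fp=15, fn=10), 3))
```

MCC uses all four cells of the confusion matrix, which is why it is often reported alongside F1 when promoter and non-promoter sets differ in size.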
Affiliation(s)
- Ilham Ayub Shahmuradov
- Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), 4700 King Abdullah University of Science and Technology (KAUST), Thuwal, Kingdom of Saudi Arabia
| | - Rozaimi Mohamad Razali
- Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), 4700 King Abdullah University of Science and Technology (KAUST), Thuwal, Kingdom of Saudi Arabia
| | - Salim Bougouffa
- Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), 4700 King Abdullah University of Science and Technology (KAUST), Thuwal, Kingdom of Saudi Arabia
| | - Aleksandar Radovanovic
- Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), 4700 King Abdullah University of Science and Technology (KAUST), Thuwal, Kingdom of Saudi Arabia
| | - Vladimir B Bajic
- Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), 4700 King Abdullah University of Science and Technology (KAUST), Thuwal, Kingdom of Saudi Arabia
|
41
|
Qiu ZW, Bi JH, Gazdar AF, Song K. Genome-wide copy number variation pattern analysis and a classification signature for non-small cell lung cancer. Genes Chromosomes Cancer 2017; 56:559-569. [PMID: 28379620 DOI: 10.1002/gcc.22460] [Citation(s) in RCA: 36] [Impact Index Per Article: 5.1] [Received: 11/10/2016] [Revised: 03/25/2017] [Accepted: 03/26/2017] [Indexed: 02/06/2023]
Abstract
The accurate classification of non-small cell lung carcinoma (NSCLC) into lung adenocarcinoma (LUAD) and lung squamous cell carcinoma (LUSC) is essential for both clinical practice and lung cancer research. Although the standard WHO diagnosis of NSCLC on biopsy material is rapid and economic, more than 13% of NSCLC tumors in the USA are not further classified. The purpose of this study was to analyze the genome-wide pattern differences in copy number variations (CNVs) and to develop a CNV signature as an adjunct test for the routine histopathologic classification of NSCLCs. We investigated the genome-wide CNV differences between these two tumor types using three independent patient datasets. Approximately half of the genes examined exhibited significant differences between LUAD and LUSC tumors and the corresponding non-malignant tissues. A new classifier was developed to identify signature genes out of 20 000 genes. Thirty-three genes were identified as a CNV signature of NSCLC. Using only their CNV values, the classification model separated the LUADs from the LUSCs with an accuracy of 0.88 and 0.84, respectively, in the training and validation datasets. The same signature also classified NSCLC tumors from their corresponding non-malignant samples with an accuracy of 0.96 and 0.98, respectively. We also compared the CNV patterns of NSCLC tumors with those of histologically similar tumors arising at other sites, such as the breast, head, and neck, and four additional tumors. Of greater importance, the significant differences between these tumors may offer the possibility of identifying the origin of tumors whose origin is unknown.
Affiliation(s)
- Zhe-Wei Qiu
- School of Chemical Engineering and Technology, Tianjin University, 300072 Tianjin, People's Republic of China
| | - Jia-Hao Bi
- School of Chemical Engineering and Technology, Tianjin University, 300072 Tianjin, People's Republic of China
| | - Adi F Gazdar
- Hamon Center for Therapeutic Oncology, University of Texas Southwestern Medical Center, Dallas, Texas, 75390, USA.,Department of Pathology, University of Texas Southwestern Medical Center, Dallas, Texas, 75390, USA
| | - Kai Song
- School of Chemical Engineering and Technology, Tianjin University, 300072 Tianjin, People's Republic of China.,Hamon Center for Therapeutic Oncology, University of Texas Southwestern Medical Center, Dallas, Texas, 75390, USA
|
42
|
Tang B, Xie F, Zhao W, Wang J, Dai S, Zheng H, Ding X, Cen X, Liu H, Yu Y, Zhou H, Zhou Y, Zhang L, Goodfellow M, Zhao GP. A systematic study of the whole genome sequence of Amycolatopsis methanolica strain 239T provides an insight into its physiological and taxonomic properties which correlate with its position in the genus. Synth Syst Biotechnol 2016; 1:169-186. [PMID: 29062941 PMCID: PMC5640789 DOI: 10.1016/j.synbio.2016.05.001] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.3] [Received: 10/29/2015] [Revised: 04/01/2016] [Accepted: 05/18/2016] [Indexed: 12/31/2022]
Abstract
The complete genome of methanol-utilizing Amycolatopsis methanolica strain 239T was generated, revealing a single 7,237,391 nucleotide circular chromosome with 7074 annotated protein-coding sequences (CDSs). Comparative analyses against the complete genome sequences of Amycolatopsis japonica strain MG417-CF17T, Amycolatopsis mediterranei strain U32 and Amycolatopsis orientalis strain HCCB10007 revealed a broad spectrum of genomic structures, including various genome sizes, core/quasi-core/non-core configurations and different kinds of episomes. Although polyketide synthase gene clusters were absent from the A. methanolica genome, 12 gene clusters related to the biosynthesis of other specialized (secondary) metabolites were identified. Complete pathways attributable to the facultative methylotrophic physiology of A. methanolica strain 239T, including both the mdo/mscR encoded methanol oxidation and the hps/hpi encoded formaldehyde assimilation via the ribulose monophosphate cycle, were identified together with evidence that the latter might be the result of horizontal gene transfer. Phylogenetic analyses based on 16S rDNA or orthologues of AMETH_3452, a novel actinobacterial class-specific conserved gene against 62 or 18 Amycolatopsis type strains, respectively, revealed three major phyletic lineages, namely the mesophilic or moderately thermophilic A. orientalis subclade (AOS), the mesophilic Amycolatopsis taiwanensis subclade (ATS) and the thermophilic A. methanolica subclade (AMS). The distinct growth temperatures of members of the subclades correlated with corresponding genetic variations in their encoded compatible solutes. This study shows the value of integrating conventional taxonomic with whole genome sequence data.
Affiliation(s)
- Biao Tang
- State Key Laboratory of Genetic Engineering, Department of Microbiology, School of Life Sciences and Institute of Biomedical Sciences, Fudan University, Shanghai, 200438, China.,CAS-Key Laboratory of Synthetic Biology, Institute of Plant Physiology and Ecology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, 200031, China
| | - Feng Xie
- CAS-Key Laboratory of Pathogenic Microbiology and Immunology, Institute of Microbiology, Chinese Academy of Sciences, No. 1 Beichen West Road, Chaoyang District, Beijing, 100101, China
| | - Wei Zhao
- CAS-Key Laboratory of Synthetic Biology, Institute of Plant Physiology and Ecology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, 200031, China
| | - Jian Wang
- CAS-Key Laboratory of Pathogenic Microbiology and Immunology, Institute of Microbiology, Chinese Academy of Sciences, No. 1 Beichen West Road, Chaoyang District, Beijing, 100101, China
| | - Shengwang Dai
- CAS-Key Laboratory of Pathogenic Microbiology and Immunology, Institute of Microbiology, Chinese Academy of Sciences, No. 1 Beichen West Road, Chaoyang District, Beijing, 100101, China
| | - Huajun Zheng
- Shanghai-MOST Key Laboratory of Disease and Health Genomics, Chinese National Human Genome Center at Shanghai, Shanghai, 201203, China
| | - Xiaoming Ding
- State Key Laboratory of Genetic Engineering, Department of Microbiology, School of Life Sciences and Institute of Biomedical Sciences, Fudan University, Shanghai, 200438, China
| | - Xufeng Cen
- CAS-Key Laboratory of Synthetic Biology, Institute of Plant Physiology and Ecology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, 200031, China
| | - Haican Liu
- State Key Laboratory for Infectious Diseases Prevention and Control, Collaborative Innovation Center for Diagnosis and Treatment of Infectious Diseases, National Institute for Communicable Disease Control and Prevention, Chinese Center for Disease Control and Prevention, Beijing, China
| | - Yucong Yu
- State Key Laboratory of Genetic Engineering, Department of Microbiology, School of Life Sciences and Institute of Biomedical Sciences, Fudan University, Shanghai, 200438, China
| | - Haokui Zhou
- Department of Microbiology and Li Ka Shing Institute of Health Sciences, The Chinese University of Hong Kong, Prince of Wales Hospital, Shatin, New Territories, Hong Kong SAR, China
| | - Yan Zhou
- State Key Laboratory of Genetic Engineering, Department of Microbiology, School of Life Sciences and Institute of Biomedical Sciences, Fudan University, Shanghai, 200438, China.,Shanghai-MOST Key Laboratory of Disease and Health Genomics, Chinese National Human Genome Center at Shanghai, Shanghai, 201203, China
| | - Lixin Zhang
- CAS-Key Laboratory of Pathogenic Microbiology and Immunology, Institute of Microbiology, Chinese Academy of Sciences, No. 1 Beichen West Road, Chaoyang District, Beijing, 100101, China
| | - Michael Goodfellow
- School of Biology, University of Newcastle, Newcastle upon Tyne, NE1 7RU, UK
| | - Guo-Ping Zhao
- State Key Laboratory of Genetic Engineering, Department of Microbiology, School of Life Sciences and Institute of Biomedical Sciences, Fudan University, Shanghai, 200438, China.,CAS-Key Laboratory of Synthetic Biology, Institute of Plant Physiology and Ecology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, 200031, China.,Shanghai-MOST Key Laboratory of Disease and Health Genomics, Chinese National Human Genome Center at Shanghai, Shanghai, 201203, China.,Department of Microbiology and Li Ka Shing Institute of Health Sciences, The Chinese University of Hong Kong, Prince of Wales Hospital, Shatin, New Territories, Hong Kong SAR, China
|
43
|
Abbas MM, Mohie-Eldin MM, EL-Manzalawy Y. Assessing the effects of data selection and representation on the development of reliable E. coli sigma 70 promoter region predictors. PLoS One 2015; 10:e0119721. [PMID: 25803493 PMCID: PMC4372424 DOI: 10.1371/journal.pone.0119721] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.9] [Received: 06/12/2014] [Accepted: 01/26/2015] [Indexed: 11/27/2022]
Abstract
As the number of sequenced bacterial genomes increases, the need for rapid and reliable tools for the annotation of functional elements (e.g., transcriptional regulatory elements) becomes more pressing. Promoters are the key regulatory elements, which recruit the transcriptional machinery through binding to a variety of regulatory proteins (known as sigma factors). The identification of promoter regions is very challenging because these regions do not adhere to specific sequence patterns or motifs and are difficult to determine experimentally. Machine learning represents a promising and cost-effective approach for computational identification of prokaryotic promoter regions. However, the quality of the predictors depends on several factors including: i) training data; ii) data representation; iii) classification algorithms; iv) evaluation procedures. In this work, we create several variants of E. coli promoter data sets and utilize them to experimentally examine the effect of these factors on the predictive performance of E. coli σ70 promoter models. Our results suggest that under some combinations of the first three criteria, a prediction model might perform very well on cross-validation experiments while its performance on independent test data is drastically worse. This emphasizes the importance of evaluating promoter region predictors using independent test data, which corrects for the over-optimistic performance that might be estimated using the cross-validation procedure. Our analysis of the tested models shows that good prediction models often perform well regardless of how the non-promoter data was obtained. On the other hand, poor prediction models seem to be more sensitive to the choice of non-promoter sequences. Interestingly, the best performing sequence-based classifiers outperform the best performing structure-based classifiers on both cross-validation and independent test performance evaluation experiments. Finally, we propose a meta-predictor method combining two top performing sequence-based and structure-based classifiers and compare its performance with some of the state-of-the-art E. coli σ70 promoter prediction methods.
Affiliation(s)
- Mostafa M. Abbas
- KINDI Center for Computing Research, College of Engineering, Qatar University, Doha, Qatar
| | | | - Yasser EL-Manzalawy
- Systems and Computer Engineering, Al-Azhar University, Cairo, Egypt
- College of Information Sciences, Penn State University, University Park, United States of America
|
44
|
Lin H, Deng EZ, Ding H, Chen W, Chou KC. iPro54-PseKNC: a sequence-based predictor for identifying sigma-54 promoters in prokaryote with pseudo k-tuple nucleotide composition. Nucleic Acids Res 2014; 42:12961-72. [PMID: 25361964 PMCID: PMC4245931 DOI: 10.1093/nar/gku1019] [Citation(s) in RCA: 398] [Impact Index Per Article: 39.8] [Indexed: 11/14/2022]
Abstract
The σ54 promoters are unique in the prokaryotic genome and are responsible for transcribing carbon- and nitrogen-related genes. With the avalanche of genome sequences generated in the postgenomic age, it is highly desirable to develop automated methods for rapidly and effectively identifying σ54 promoters. Here, a predictor called ‘iPro54-PseKNC’ was developed. In the predictor, the DNA sequence samples were formulated by a novel feature vector called ‘pseudo k-tuple nucleotide composition’, which was further optimized by the incremental feature selection procedure. The performance of iPro54-PseKNC was examined by rigorous jackknife cross-validation tests on a stringent benchmark data set. As a user-friendly web server, iPro54-PseKNC is freely accessible at http://lin.uestc.edu.cn/server/iPro54-PseKNC. For the convenience of the vast majority of experimental scientists, a step-by-step protocol guide was provided on how to use the web server to get the desired results without the need to follow the complicated mathematics that were presented in this paper just for its integrity. Meanwhile, we also discovered through an in-depth statistical analysis that the distribution of distances between the transcription start sites and the translation initiation sites is governed by the gamma distribution, which may provide a fundamental physical principle for studying σ54 promoters.
Affiliation(s)
- Hao Lin
- Key Laboratory for Neuro-Information of Ministry of Education, Center of Bioinformatics, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, China Gordon Life Science Institute, Belmont, MA, USA
| | - En-Ze Deng
- Key Laboratory for Neuro-Information of Ministry of Education, Center of Bioinformatics, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Hui Ding
- Key Laboratory for Neuro-Information of Ministry of Education, Center of Bioinformatics, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Wei Chen
- Department of Physics, School of Sciences, and Center for Genomics and Computational Biology, Hebei United University, Tangshan 063000, China Gordon Life Science Institute, Belmont, MA, USA
| | - Kuo-Chen Chou
- Gordon Life Science Institute, Belmont, MA, USA Center of Excellence in Genomic Medicine Research (CEGMR), King Abdulaziz University, Jeddah, Saudi Arabia
|
45
|
Guo FB, Lin Y, Chen LL. Recognition of Protein-coding Genes Based on Z-curve Algorithms. Curr Genomics 2014; 15:95-103. [PMID: 24822027 PMCID: PMC4009845 DOI: 10.2174/1389202915999140328162724] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 09/15/2013] [Revised: 11/19/2013] [Accepted: 11/20/2013] [Indexed: 01/18/2023]
Abstract
Recognition of protein-coding genes, a classical bioinformatics issue, is an absolutely needed step for annotating newly sequenced genomes. The Z-curve algorithm, as one of the most effective methods on this issue, has been successfully applied in annotating or re-annotating many genomes, including those of bacteria, archaea and viruses. Two Z-curve based ab initio gene-finding programs have been developed: ZCURVE (for bacteria and archaea) and ZCURVE_V (for viruses and phages). ZCURVE_C (for 57 bacteria) and Zfisher (for any bacterium) are web servers for re-annotation of bacterial and archaeal genomes. The above four tools can be used for genome annotation or re-annotation, either independently or combined with the other gene-finding programs. In addition to recognizing protein-coding genes and exons, Z-curve algorithms are also effective in recognizing promoters and translation start sites. Here, we summarize the applications of Z-curve algorithms in gene finding and genome annotation.
Affiliation(s)
- Feng-Biao Guo
- Center of Bioinformatics and Key Laboratory for NeuroInformation of the Ministry of Education, University of Electronic Science and Technology of China, Chengdu, 610054, China
| | - Yan Lin
- Department of Physics, Tianjin University, Tianjin 300072, China
| | - Ling-Ling Chen
- College of Life Science and Technology, Huazhong Agricultural University, Wuhan, 430070, China
|
46
|
P U, Dubey JK, Rv K, Cherian BS, Gopalakrishnan G, Nair AS. A novel sequence and context based method for promoter recognition. Bioinformation 2014; 10:175-9. [PMID: 24966516 PMCID: PMC4070045 DOI: 10.6026/97320630010175] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 03/12/2014] [Revised: 03/17/2014] [Accepted: 03/18/2014] [Indexed: 11/23/2022]
Abstract
Identification of promoters in DNA sequences using computational techniques is a significant research area because of its direct association with transcription regulation. A wide range of algorithms is available for promoter prediction. Most of them are polymerase dependent and cannot handle eukaryotes and prokaryotes alike. This study proposes a polymerase-independent algorithm that predicts whether a given DNA fragment is a promoter, based on sequence features and statistical elements. The algorithm considers all possible pentamers formed from the nucleotides A, C, G, and T, along with CpG islands, the TATA box, initiator elements, and downstream promoter elements. The highlight of the algorithm is that it is not polymerase specific and can predict promoters for both eukaryotes and prokaryotes in the same computational manner, even though the underlying biological mechanisms of promoter recognition differ greatly. The proposed method, Promoter Prediction System (PPS-CBM), achieved sensitivity, specificity, and accuracy of 75.08%, 83.58%, and 79.33% on the E. coli data set and 86.67%, 88.41%, and 87.58% on the human data set. We have developed a tool based on PPS-CBM with which multiple sequences of varying lengths can be tested simultaneously, and the result is reported in a comprehensive tabular format. The tool also reports the strength of the prediction. AVAILABILITY The tool and source code of PPS-CBM are available at http://keralabs.org.
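The pentamer features described in this abstract can be pictured as a fixed-length count vector over all 4^5 = 1024 possible 5-mers. The sketch below shows one plausible way to build such a profile; it is an assumption-laden illustration (the function name `kmer_profile` and the use of raw counts are hypothetical) and is not the PPS-CBM code.

```python
from collections import Counter
from itertools import product


def kmer_profile(seq, k=5):
    """Count every possible k-mer over A, C, G, T in seq (overlapping windows)."""
    seq = seq.upper()
    observed = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Enumerate all 4**k k-mers so every sequence maps to the same feature order;
    # windows containing other symbols (e.g. N) are simply not reported.
    all_kmers = ("".join(p) for p in product("ACGT", repeat=k))
    return {kmer: observed.get(kmer, 0) for kmer in all_kmers}


profile = kmer_profile("TTGACATATAATGC")
print(len(profile))
print(profile["TATAA"])
```

Because the dictionary always has the same 1024 keys in the same order, its values can be fed directly to a classifier as a feature vector.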
Affiliation(s)
- Umesh P
  - Department of Computational Biology and Bioinformatics, University of Kerala, Thiruvananthapuram - 695581, Kerala, India
- Jitendra Kumar Dubey
  - Department of Computer Science and Engineering, National Institute of Technology, Calicut - 673601, Kerala, India
- Karthika Rv
  - Department of Computational Biology and Bioinformatics, University of Kerala, Thiruvananthapuram - 695581, Kerala, India
- Gopakumar Gopalakrishnan
  - Department of Computer Science and Engineering, National Institute of Technology, Calicut - 673601, Kerala, India
- Achuthsankar Sukumaran Nair
  - Department of Computational Biology and Bioinformatics, University of Kerala, Thiruvananthapuram - 695581, Kerala, India
|
47
|
Zhang R, Zhang CT. A Brief Review: The Z-curve Theory and its Application in Genome Analysis. Curr Genomics 2014; 15:78-94. [PMID: 24822026 PMCID: PMC4009844 DOI: 10.2174/1389202915999140328162433] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.4] [Received: 09/08/2013] [Revised: 10/16/2013] [Accepted: 10/16/2013] [Indexed: 11/22/2022] Open
Abstract
In theoretical physics, there exist two basic mathematical approaches, algebraic and geometrical methods, which, in most cases, are complementary. In the area of genome sequence analysis, however, algebraic approaches have been widely used, while geometrical approaches have been less explored for a long time. The Z-curve theory is a geometrical approach to genome analysis. The Z-curve is a three-dimensional curve that represents a given DNA sequence in the sense that each can be uniquely reconstructed given the other. The Z-curve, therefore, contains all the information that the corresponding DNA sequence carries. The analysis of a DNA sequence can then be performed by studying the corresponding Z-curve. The Z-curve method has found applications in a wide range of areas in the past two decades, including the identification of protein-coding genes, replication origins, horizontally transferred genomic islands, promoters, translational start sites and isochores, as well as studies on phylogenetics, genome visualization and comparative genomics. Here, we review the progress of Z-curve studies from aspects of both theory and applications in genome analysis.
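The one-to-one correspondence between a DNA sequence and its Z-curve, as stated in the abstract above, can be sketched as follows. The sketch assumes the standard Z-curve definition, in which the three coordinates accumulate purine vs. pyrimidine (x), amino vs. keto (y), and weak vs. strong hydrogen-bond (z) counts; the function names `z_curve` and `reconstruct` are hypothetical and not taken from any Z-curve software.

```python
# Per-base step in (x, y, z), assuming the standard Z-curve definition:
#   x: purines (A, G)  minus pyrimidines (C, T)
#   y: amino (A, C)    minus keto (G, T)
#   z: weak (A, T)     minus strong (C, G)
STEP = {
    "A": (1, 1, 1),
    "C": (-1, 1, -1),
    "G": (1, -1, -1),
    "T": (-1, -1, 1),
}
BASE_OF_STEP = {step: base for base, step in STEP.items()}


def z_curve(seq):
    """Return the cumulative (x, y, z) points of the Z-curve for seq."""
    x = y = z = 0
    points = []
    for base in seq.upper():
        dx, dy, dz = STEP[base]  # raises KeyError on non-ACGT characters
        x, y, z = x + dx, y + dy, z + dz
        points.append((x, y, z))
    return points


def reconstruct(points):
    """Recover the DNA sequence from Z-curve points via successive differences."""
    seq = []
    prev = (0, 0, 0)
    for point in points:
        step = tuple(c - p for c, p in zip(point, prev))
        seq.append(BASE_OF_STEP[step])
        prev = point
    return "".join(seq)


curve = z_curve("ACGT")
print(curve)
print(reconstruct(curve))
```

Because each base maps to a distinct step vector, taking successive differences along the curve recovers the sequence exactly, which is the sense in which the curve carries all of the sequence's information.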
Affiliation(s)
- Ren Zhang
  - Center for Molecular Medicine and Genetics, Wayne State University Medical School, Detroit, MI 48201, USA
- Chun-Ting Zhang
  - Department of Physics, Tianjin University, Tianjin 300072, China
|
48
|
Song K, Tong T, Wu F. Predicting essential genes in prokaryotic genomes using a linear method: ZUPLS. Integr Biol (Camb) 2014; 6:460-9. [PMID: 24603751 DOI: 10.1039/c3ib40241j] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.6] [Indexed: 11/21/2022]
Abstract
An effective linear method, ZUPLS, was developed to improve the accuracy and speed of prokaryotic essential gene identification. ZUPLS uses only the Z-curve and other sequence-based features. Such features can be calculated readily from the DNA/amino acid sequences. Therefore, no well-studied biological network knowledge is required for using ZUPLS. This significantly simplifies essential gene identification, especially for newly sequenced species. ZUPLS can also select necessary features automatically by embedding the uninformative variable elimination tool into the partial least squares classifier. No optimized modelling parameters are needed. ZUPLS has been used, herein, to predict essential genes of 12 remotely related prokaryotes to test its performance. The cross-organism predictions yielded AUC (Area Under the Curve) scores between 0.8042 and 0.9319 when using E. coli genes as the training samples. Similarly, ZUPLS achieved AUC scores between 0.8111 and 0.9371 when using B. subtilis genes as the training samples. We also compared it with the best available results of the existing approaches for further testing. The improvement of the AUC score in predicting B. subtilis essential genes using E. coli genes was 0.13. Additionally, in predicting E. coli essential genes using P. aeruginosa genes, the improvement was 0.10. Similarly, the improvement of the average accuracy of M. pulmonis using M. genitalium and M. pulmonis genes was 14.7%. The combined feature extraction and selection power of ZUPLS enables it to give reliable predictions of essential genes for both Gram-positive/negative organisms and rich/poor culture media.
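The AUC scores quoted in the abstract above measure how well predicted scores rank essential genes above non-essential ones. A minimal sketch of that metric, using the pairwise (Mann-Whitney) formulation, is shown below; the function name `auc` and the toy scores and labels are made up for illustration and have no connection to the ZUPLS data or code.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive example
    receives a higher score than a randomly chosen negative one
    (ties count as half). Quadratic in the number of examples, which
    is fine for a small illustration.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data: label 1 = essential gene, 0 = non-essential (hypothetical values).
toy_scores = [0.9, 0.8, 0.3, 0.1]
toy_labels = [1, 0, 1, 0]
print(auc(toy_scores, toy_labels))  # 3 of 4 positive/negative pairs ranked correctly
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which gives a scale for reading the 0.80 to 0.94 range reported above.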
Affiliation(s)
- Kai Song
  - School of Chemical Engineering and Technology, Tianjin University, 92 Weijin Road, Nankai district, Tianjin, 300072, China.
|
49
|
Huang WL, Tung CW, Liaw C, Huang HL, Ho SY. Rule-based knowledge acquisition method for promoter prediction in human and Drosophila species. ScientificWorldJournal 2014; 2014:327306. [PMID: 24955394 PMCID: PMC3927563 DOI: 10.1155/2014/327306] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 08/31/2013] [Accepted: 10/10/2013] [Indexed: 01/08/2023] Open
Abstract
The rapid and reliable identification of promoter regions is important because the number of genomes being sequenced is increasing rapidly. Various methods have been developed, but few investigate the effectiveness of sequence-based features in promoter prediction. This study proposes a knowledge acquisition method (named PromHD) based on if-then rules for promoter prediction in human and Drosophila species. PromHD utilizes an effective feature-mining algorithm and a reference feature set of 167 DNA sequence descriptors (DNASDs), comprising three descriptors of physicochemical properties (absorption maxima, molecular weight, and molar absorption coefficient), 128 top-ranked descriptors of 4-mer motifs, and 36 global sequence descriptors. PromHD identifies two feature subsets with 99 and 74 DNASDs and yields test accuracies of 96.4% and 97.5% in human and Drosophila species, respectively. Based on the 99- and 74-dimensional feature vectors, PromHD generates several if-then rules by using the decision tree mechanism for promoter prediction. The top-ranked informative rules with high certainty grades reveal that the global sequence descriptor, the length of nucleotide A at the first position of the sequence, and two physicochemical properties, absorption maxima and molecular weight, are effective in distinguishing promoters from non-promoters in human and Drosophila species, respectively.
Affiliation(s)
- Wen-Lin Huang
  - Department of Management Information System, Asia Pacific Institute of Creativity, Miaoli 351, Taiwan
- Chun-Wei Tung
  - School of Pharmacy, College of Pharmacy, Kaohsiung Medical University, Kaohsiung 807, Taiwan
- Chyn Liaw
  - Institute of Bioinformatics and Systems Biology, National Chiao Tung University, Hsinchu 300, Taiwan
- Hui-Ling Huang
  - Institute of Bioinformatics and Systems Biology, National Chiao Tung University, Hsinchu 300, Taiwan
  - Department of Biological Science and Technology, National Chiao Tung University, Hsinchu 300, Taiwan
- Shinn-Ying Ho
  - Institute of Bioinformatics and Systems Biology, National Chiao Tung University, Hsinchu 300, Taiwan
  - Department of Biological Science and Technology, National Chiao Tung University, Hsinchu 300, Taiwan
|
50
|
Wu X, Liu H, Liu H, Su J, Lv J, Cui Y, Wang F, Zhang Y. Z curve theory-based analysis of the dynamic nature of nucleosome positioning in Saccharomyces cerevisiae. Gene 2013; 530:8-18. [DOI: 10.1016/j.gene.2013.08.018] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Received: 03/28/2013] [Revised: 07/30/2013] [Accepted: 08/03/2013] [Indexed: 01/01/2023]
|