1
Cotroneo CE, Gormley IC, Shields DC, Salter-Townshend M. Computational modelling of chromosomally clustering protein domains in bacteria. BMC Bioinformatics 2021; 22:593. [PMID: 34906073 PMCID: PMC8670047 DOI: 10.1186/s12859-021-04512-x] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 06/22/2021] [Accepted: 11/16/2021] [Indexed: 01/19/2023] Open
Abstract
BACKGROUND In bacteria, genes with related functions, such as those involved in the metabolism of the same compound or in infection processes, are often physically close on the genome and form groups called clusters. The enrichment of such clusters over various distantly related bacteria can be used to predict the roles of genes of unknown function that cluster with characterised genes. There is no obvious rule to define a cluster, given their variability in size and intergenic distances and the ambiguity in what comprises a "gene", since genes can gain and lose domains over time. Protein domains can cluster within a gene, or in adjacent genes of related function, and in both cases these are chromosomally clustered. Here, we model the distances between pairs of protein domain coding regions across a wide range of bacteria and archaea via a probabilistic two-component mixture model, without imposing arbitrary thresholds in terms of gene numbers or distances.
RESULTS We trained our model using matched gene ontology terms to label functionally related pairs and assessed the stability of the parameters of the model across 14,178 archaeal and bacterial strains. We found that the parameters of our mixture model are remarkably stable across bacteria and archaea, except for endosymbionts and obligate intracellular pathogens. Obligate pathogens have smaller genomes and, although they vary, on average do not show noticeably different clustering distances; the main difference in the parameter estimates is that a far greater proportion of the genes sharing ontology terms are clustered. This may reflect that these genomes are enriched for complexes encoded by clustered core housekeeping genes, as a proportion of the total genes. Given the overall stability of the parameter estimates, we then used the mean parameter estimates across the entire dataset to investigate which gene ontology terms are most frequently associated with clustered genes.
CONCLUSIONS Given the stability of the mixture model across species, it may be used to predict bacterial gene clusters that are shared across multiple species, in addition to giving insights into the evolutionary pressures on the chromosomal locations of genes in different species.
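As an illustration of the two-component mixture idea in the abstract above, here is a minimal sketch in Python. The abstract does not state which component distributions the authors use, so this sketch assumes two exponential components (one for "clustered" pairs with short distances, one for "unclustered" pairs) fitted by expectation-maximisation. The function names, starting values and toy distances are all hypothetical.

```python
import math


def posterior_clustered(distance, weight_close, mean_close, mean_far):
    """Posterior probability that a distance comes from the 'clustered' component."""
    p_close = weight_close / mean_close * math.exp(-distance / mean_close)
    p_far = (1.0 - weight_close) / mean_far * math.exp(-distance / mean_far)
    total = p_close + p_far
    if total == 0.0:  # both densities underflowed: treat as unclustered
        return 0.0
    return p_close / total


def fit_two_component_exponential(distances, mean_close=100.0, mean_far=10000.0,
                                  weight_close=0.5, iterations=200):
    """EM for a mixture of two exponential distributions (an assumed model)."""
    n = len(distances)
    for _ in range(iterations):
        # E-step: responsibility of the clustered component for each distance.
        resp = [posterior_clustered(d, weight_close, mean_close, mean_far)
                for d in distances]
        total_close = sum(resp)
        total_far = n - total_close
        if total_close == 0.0 or total_far == 0.0:
            break  # one component has collapsed; keep the last estimates
        # M-step: update the mixing weight and the two component means.
        weight_close = total_close / n
        mean_close = max(sum(r * d for r, d in zip(resp, distances)) / total_close, 1e-9)
        mean_far = max(sum((1.0 - r) * d for r, d in zip(resp, distances)) / total_far, 1e-9)
    return weight_close, mean_close, mean_far


# Toy distances in base pairs (made up for illustration, not from the paper).
close_pairs = [5, 10, 20, 40, 80, 120, 60, 30, 15, 25]
far_pairs = [8000, 12000, 20000, 5000, 30000, 15000, 9000, 40000, 25000, 18000]
weight, mean_close, mean_far = fit_two_component_exponential(close_pairs + far_pairs)
print(weight, mean_close, mean_far)
```

Because the model assigns each pair a posterior probability rather than a yes/no label, no hard distance threshold is needed, which is the property the abstract emphasises. Nothing in this sketch forces the first component to stay the short-distance one; the starting values are what keep the two apart here.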
Affiliation(s)
- Chiara E Cotroneo
  - School of Medicine, University College Dublin, Dublin, Ireland; Conway Institute of Biomolecular and Biomedical Research, University College Dublin, Dublin, Ireland
- Denis C Shields
  - School of Medicine, University College Dublin, Dublin, Ireland; Conway Institute of Biomolecular and Biomedical Research, University College Dublin, Dublin, Ireland
2
Kuo RJ, Lin JY, Nguyen TPQ. An application of sine cosine algorithm-based fuzzy possibilistic c-ordered means algorithm to cluster analysis. Soft Comput 2021. [DOI: 10.1007/s00500-020-05380-y] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 10/23/2022]
3
Assaf R, Xia F, Stevens R. Detecting operons in bacterial genomes via visual representation learning. Sci Rep 2021; 11:2124. [PMID: 33483546 PMCID: PMC7822928 DOI: 10.1038/s41598-021-81169-9] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 09/01/2020] [Accepted: 12/30/2020] [Indexed: 12/05/2022] Open
Abstract
Contiguous genes in prokaryotes are often arranged into operons. Detecting operons plays a critical role in inferring gene functionality and regulatory networks. Human experts annotate operons by visually inspecting gene neighborhoods across pileups of related genomes. These visual representations capture the inter-genic distance, strand direction, gene size, functional relatedness, and gene neighborhood conservation, which are the most prominent operon features mentioned in the literature. By studying these features, an expert can then decide whether a genomic region is part of an operon. We propose a deep learning based method named Operon Hunter that uses visual representations of genomic fragments to make operon predictions. Using transfer learning and data augmentation techniques facilitates leveraging the powerful neural networks trained on image datasets by re-training them on a more limited dataset of extensively validated operons. Our method outperforms the previously reported state-of-the-art tools, especially when it comes to predicting full operons and their boundaries accurately. Furthermore, our approach makes it possible to visually identify the features influencing the network’s decisions to be subsequently cross-checked by human experts.
Affiliation(s)
- Rida Assaf
  - Department of Computer Science, University of Chicago, Chicago, 60637, USA
- Fangfang Xia
  - Computing Environment and Life Sciences Division, Argonne National Laboratory, Lemont, 60439, USA; Data Science and Learning Division, Argonne National Laboratory, Lemont, 60439, USA
- Rick Stevens
  - The University of Chicago Consortium for Advanced Science and Engineering, University of Chicago, Chicago, 60637, USA; Computing Environment and Life Sciences Division, Argonne National Laboratory, Lemont, 60439, USA
4
Zaidi SSA, Kayani MUR, Zhang X, Ouyang Y, Shamsi IH. Prediction and analysis of metagenomic operons via MetaRon: a pipeline for prediction of Metagenome and whole-genome opeRons. BMC Genomics 2021; 22:60. [PMID: 33468056 PMCID: PMC7814594 DOI: 10.1186/s12864-020-07357-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/21/2020] [Accepted: 12/27/2020] [Indexed: 11/10/2022] Open
Abstract
Background Efficient regulation of bacterial genes in response to environmental stimuli results in unique gene clusters known as operons. The lack of complete operonic reference and functional information makes the prediction of metagenomic operons a challenging task, yet such predictions could open new perspectives on the interpretation of host-microbe interactions.
Results In this work, we identified whole-genome and metagenomic operons via MetaRon (Metagenome and whole-genome opeRon prediction pipeline). MetaRon identifies operons without any experimental or functional information. MetaRon was applied to datasets with different levels of complexity and information: a whole genome, a simulated mixture of three whole genomes (E. coli MG1655, Mycobacterium tuberculosis H37Rv and Bacillus subtilis str. 16), an E. coli c20 draft genome extracted from chicken gut, and finally 145 whole-metagenome data samples from human gut. MetaRon consistently achieved high operon prediction sensitivity, specificity and accuracy across the E. coli whole genome (97.8, 94.1 and 92.4%), the simulated genome (93.7, 75.5 and 88.1%) and E. coli c20 (87, 91 and 88%), respectively. Finally, we identified 1,232,407 unique operons from 145 paired-end human gut metagenome samples. We also report a strong association of type 2 diabetes with maltose phosphorylase (K00691), 3-deoxy-D-glycero-D-galacto-nononate 9-phosphate synthase (K21279) and an uncharacterized protein (K07101).
Conclusion With MetaRon, we were able to remove two notable limitations of existing whole-genome operon prediction methods: (1) generalizability (ability to predict operons in unrelated bacterial genomes), and (2) whole-genome and metagenomic data management. We also demonstrate the use of operons as a subset to represent the trends of secondary metabolites in whole-metagenome data and the role of secondary metabolites in the occurrence of disease conditions. Using operonic data from metagenomes to study secondary metabolic trends will significantly reduce the data volume to a more precise subset. Furthermore, the identification of metabolic pathways associated with the occurrence of type 2 diabetes (T2D) presents another dimension for analyzing the human gut metagenome. To our knowledge, this study is the first organized effort to predict metagenomic operons and perform a detailed analysis in association with a disease, in this case type 2 diabetes. The application of MetaRon to metagenomic data at diverse scales will be beneficial for understanding gene regulation and for therapeutic metagenomics.
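The sensitivity, specificity and accuracy figures quoted in the abstract above are standard confusion-matrix summaries. A minimal Python sketch of how such figures are typically computed follows. The abstract does not say at what level the predictions were scored, so counting over adjacent gene pairs is an assumption here, and the function name and counts are hypothetical.

```python
def operon_prediction_metrics(tp, fp, tn, fn):
    """Return (sensitivity, specificity, accuracy) from confusion-matrix counts.

    tp: operonic pairs predicted as operonic
    fp: non-operonic pairs predicted as operonic
    tn: non-operonic pairs predicted as non-operonic
    fn: operonic pairs predicted as non-operonic
    """
    sensitivity = tp / (tp + fn)   # fraction of true operonic pairs recovered
    specificity = tn / (tn + fp)   # fraction of true boundaries recovered
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return sensitivity, specificity, accuracy


# Made-up counts for illustration only (not taken from the paper).
print(operon_prediction_metrics(tp=90, fp=20, tn=80, fn=10))
```

Accuracy alone can look high when one class dominates, which is one reason to report all three figures together.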
Affiliation(s)
- Syed Shujaat Ali Zaidi
  - Bioinformatics Division, Beijing National Research Institute for Information Science and Technology (BNRIST), Department of Automation, Tsinghua University, Beijing, 100084, People's Republic of China; Bioscience Department, COMSATS Institute of Information Technology, Islamabad, 44000, Pakistan; Center for Innovation in Brain Science, University of Arizona, Tucson, 85719, USA
- Masood Ur Rehman Kayani
  - Center for Microbiota and Immunological Diseases, Shanghai General Hospital, Shanghai Institute of Immunology, Shanghai Jiao Tong University, School of Medicine, Shanghai, 2000025, People's Republic of China
- Xuegong Zhang
  - Bioinformatics Division, Beijing National Research Institute for Information Science and Technology (BNRIST), Department of Automation, Tsinghua University, Beijing, 100084, People's Republic of China
- Younan Ouyang
  - China National Rice Research Institute (CNRRI), 28 Shuidaosuo rd, Fuyang, Hangzhou, 311400, People's Republic of China
- Imran Haider Shamsi
  - Department of Agronomy, College of Agriculture and Biotechnology, Key Laboratory of Crop Germplasm Resource, Zhejiang University, Hangzhou, 310058, People's Republic of China
5
Zaidi SSA, Zhang X. Computational operon prediction in whole-genomes and metagenomes. Brief Funct Genomics 2018; 16:181-193. [PMID: 27659221 DOI: 10.1093/bfgp/elw034] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Indexed: 12/13/2022] Open
Abstract
Microbial diversity in unique environmental settings enables abrupt responses catalysed by altering the gene regulation and formation of gene clusters called operons. Operons increase bacterial adaptability, which in turn increases their survival. This review article presents the emergence of computational operon prediction methods for whole microbial genomes and metagenomes, and discusses their strengths and limitations. Most whole-genome operon prediction methods struggle to generalize to unrelated genomes. The applicability of universal whole-genome operon prediction methods to metagenomic data is an interesting yet less investigated question. We have evaluated the potential of various operon prediction features for genomic and metagenomic data. Most operon prediction methods with high accuracy have been compiled into databases. Despite the high predictive performance, the data among many databases are not completely consistent for similar species. We performed a correlation analysis between the computationally predicted operon databases and experimentally validated data for Escherichia coli, Bacillus subtilis and Mycobacterium tuberculosis. Operon predictions for most of the less characterized microbes cannot be verified owing to the absence of experimentally validated operons. The generation of validated information for other microbes would test the authenticity of operon databases for other less annotated microbes as well. Advances in sequencing technologies and the development of better analysis methods will help researchers to overcome the technological hurdles (such as long sequencing reads and improved contig size), further improve operon predictions and better utilize operonic information.
6
Kalantari A, Kamsin A, Shamshirband S, Gani A, Alinejad-Rokny H, Chronopoulos AT. Computational intelligence approaches for classification of medical data: State-of-the-art, future challenges and research directions. Neurocomputing 2018. [DOI: 10.1016/j.neucom.2017.01.126] [Citation(s) in RCA: 70] [Impact Index Per Article: 10.0] [Indexed: 01/13/2023]
7
Chuang LY, Yang CH, Tsai JH, Yang CH. Operon prediction using chaos embedded particle swarm optimization. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2013; 10:1299-1309. [PMID: 24384714 DOI: 10.1109/tcbb.2013.63] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Indexed: 06/03/2023]
Abstract
Operons contain valuable information for drug design and determining protein functions. Genes within an operon are co-transcribed to a single-strand mRNA and must be coregulated. The identification of operons is, thus, critical for a detailed understanding of the gene regulations. However, currently used experimental methods for operon detection are generally difficult to implement and time consuming. In this paper, we propose a chaotic binary particle swarm optimization (CBPSO) to predict operons in bacterial genomes. The intergenic distance, participation in the same metabolic pathway and the cluster of orthologous groups (COG) properties of the Escherichia coli genome are used to design a fitness function. Furthermore, the Bacillus subtilis, Pseudomonas aeruginosa PA01, Staphylococcus aureus and Mycobacterium tuberculosis genomes are tested and evaluated for accuracy, sensitivity, and specificity. The computational results indicate that the proposed method works effectively in terms of enhancing the performance of the operon prediction. The proposed method also achieved a good balance between sensitivity and specificity when compared to methods from the literature.
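To make the idea of a chaotic binary particle swarm optimization more concrete, here is a minimal Python sketch. It is not the authors' implementation. It assumes that the "chaos" enters through a logistic map that updates the inertia weight each iteration, that bits are set with the usual sigmoid transfer rule of binary PSO, and that each bit could stand for "this adjacent gene pair lies in the same operon". The toy fitness function simply counts agreement with a hypothetical target labelling, whereas the paper builds its fitness from intergenic distance, metabolic pathway and COG properties.

```python
import math
import random


def logistic_map(value):
    """One step of the logistic map x -> 4x(1 - x), chaotic on (0, 1)."""
    return 4.0 * value * (1.0 - value)


def chaotic_binary_pso(fitness, n_bits, n_particles=20, iterations=50, seed=0):
    """Maximise `fitness` over bit vectors with a chaos-driven inertia weight."""
    rng = random.Random(seed)
    positions = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(n_particles)]
    velocities = [[0.0] * n_bits for _ in range(n_particles)]
    personal_best = [p[:] for p in positions]
    personal_score = [fitness(p) for p in positions]
    best_index = max(range(n_particles), key=personal_score.__getitem__)
    global_best = personal_best[best_index][:]
    global_score = personal_score[best_index]

    inertia = 0.48          # start away from 0, 0.25, 0.5, 0.75 and 1
    c1 = c2 = 2.0           # cognitive and social acceleration constants
    v_max = 6.0             # velocity clamp keeps the sigmoid away from 0 and 1

    for _ in range(iterations):
        inertia = logistic_map(inertia)  # chaotic update of the inertia weight
        for i in range(n_particles):
            for j in range(n_bits):
                r1, r2 = rng.random(), rng.random()
                v = (inertia * velocities[i][j]
                     + c1 * r1 * (personal_best[i][j] - positions[i][j])
                     + c2 * r2 * (global_best[j] - positions[i][j]))
                v = max(-v_max, min(v_max, v))
                velocities[i][j] = v
                # Sigmoid transfer: velocity becomes the probability of a 1 bit.
                probability_one = 1.0 / (1.0 + math.exp(-v))
                positions[i][j] = 1 if rng.random() < probability_one else 0
            score = fitness(positions[i])
            if score > personal_score[i]:
                personal_score[i] = score
                personal_best[i] = positions[i][:]
                if score > global_score:
                    global_score = score
                    global_best = positions[i][:]
    return global_best, global_score


# Hypothetical target labelling of 8 adjacent gene pairs (1 = same operon).
TARGET = [1, 1, 0, 1, 0, 0, 1, 1]


def toy_fitness(bits):
    """Number of positions that agree with the hypothetical target."""
    return sum(1 for b, t in zip(bits, TARGET) if b == t)


best_bits, best_score = chaotic_binary_pso(toy_fitness, n_bits=len(TARGET))
print(best_bits, best_score)
```

The global best score can only stay the same or improve from one iteration to the next, because it is replaced only when a particle finds a strictly better bit vector.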
Affiliation(s)
- Cheng-Huei Yang
  - National Kaohsiung Institute of Marine Technology, Kaohsiung
- Jui-Hung Tsai
  - National Kaohsiung University of Applied Sciences, Kaohsiung
- Cheng-Hong Yang
  - National Kaohsiung University of Applied Sciences, Kaohsiung
8
Sahu TK, Rao AR, Vasisht S, Singh N, Singh UP. Computational approaches, databases and tools for in silico motif discovery. Interdiscip Sci 2012; 4:239-255. [PMID: 23354813 DOI: 10.1007/s12539-012-0141-x] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 10/17/2011] [Revised: 04/12/2012] [Accepted: 06/13/2012] [Indexed: 06/01/2023]
Abstract
Motifs are biologically significant fragments of nucleotide or peptide sequences that follow a specific pattern. Motifs are categorized as structural motifs and sequence motifs. These are discovered by phylogenetic studies of similar genes across species. Structural motifs are formed by three-dimensional arrangements of amino acids consisting of two or more α helices or β strands, whereas sequence motifs are formed by the nucleotide fragments appearing in the exons of a gene. The arrangement of residues in structural motifs may not be continuous, while it is continuous in sequence motifs. Sequence motifs may encode structural motifs. The algorithms used for motif discovery are an important part of bio-computational studies. The purpose of motif discovery is to identify patterns in biopolymer (nucleotide or protein) sequences in order to understand the structure and function of the molecules and their evolutionary aspects. The main aim of this paper is to provide a systematic review of the different approaches, databases and tools used in motif discovery.
Affiliation(s)
- Tanmaya Kumar Sahu
  - Centre for Agricultural Bioinformatics, Indian Agricultural Statistics Research Institute, New Delhi, India
9
A global analysis of adaptive evolution of operons in cyanobacteria. Antonie van Leeuwenhoek 2012; 103:331-46. [PMID: 22987250 DOI: 10.1007/s10482-012-9813-0] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.4] [Received: 03/31/2012] [Accepted: 09/06/2012] [Indexed: 01/04/2023]
Abstract
Operons are an important feature of prokaryotic genomes. Evolution of operons is hypothesized to be adaptive and has contributed significantly towards coordinated optimization of functions. Two conflicting theories, based on (i) in situ formation to achieve co-regulation and (ii) horizontal gene transfer of functionally linked gene clusters, are generally considered to explain why and how operons have evolved. Furthermore, effects of operon evolution on genomic traits such as intergenic spacing, operon size and co-regulation are relatively less explored. Based on the conservation level in a set of diverse prokaryotes, we categorize the operonic gene pair associations and in turn the operons as ancient and recently formed. This allowed us to perform a detailed analysis of operonic structure in cyanobacteria, a morphologically and physiologically diverse group of photoautotrophs. Clustering based on operon conservation showed significant similarity with the 16S rRNA-based phylogeny, which groups the cyanobacterial strains into three clades. Clade C, dominated by strains that are believed to have undergone genome reduction, shows a larger fraction of operonic genes that are tightly packed in larger sized operons. Ancient operons are in general larger, more tightly packed, better optimized for co-regulation and part of key cellular processes. A sub-clade within Clade B, which includes Synechocystis sp. PCC 6803, shows a reverse trend in intergenic spacing. Our results suggest that while in situ formation and vertical descent may be a dominant mechanism of operon evolution in cyanobacteria, optimization of intergenic spacing and co-regulation are part of an ongoing process in the life-cycle of operons.
10
Chuang LY, Chang HW, Tsai JH, Yang CH. Features for computational operon prediction in prokaryotes. Brief Funct Genomics 2012; 11:291-9. [PMID: 22753776 DOI: 10.1093/bfgp/els024] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.2] [Indexed: 11/14/2022] Open
Abstract
Accurate prediction of operons can improve the functional annotation and application of genes within operons in prokaryotes. Here, we review several features: (i) intergenic distance, (ii) metabolic pathways, (iii) homologous genes, (iv) promoters and terminators, (v) gene order conservation, (vi) microarray, (vii) clusters of orthologous groups, (viii) gene length ratio, (ix) phylogenetic profiles, (x) operon length/size and (xi) STRING database scores, as well as some other features, which have been applied in recent operon prediction methods in prokaryotes in the literature. Based on a comparison of the prediction performances of these features, we conclude that other, as yet undiscovered features, or feature selection with a receiver operating characteristic analysis before algorithm processing can improve operon prediction in prokaryotes.
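Of the features listed in the abstract above, intergenic distance is the simplest to compute. The following Python sketch shows one common way to derive it for adjacent same-strand gene pairs. The `Gene` record, the 1-based inclusive coordinates and the choice to skip pairs on opposite strands are assumptions made for illustration, not details taken from the review.

```python
from typing import List, NamedTuple, Tuple


class Gene(NamedTuple):
    name: str
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    strand: str  # "+" or "-"


def adjacent_pair_distances(genes: List[Gene]) -> List[Tuple[str, str, int]]:
    """Intergenic distance in bp for each adjacent pair on the same strand.

    A negative distance means the two coding regions overlap.
    """
    ordered = sorted(genes, key=lambda g: g.start)
    pairs = []
    for upstream, downstream in zip(ordered, ordered[1:]):
        if upstream.strand != downstream.strand:
            continue  # genes on opposite strands cannot share a transcript
        distance = downstream.start - upstream.end - 1
        pairs.append((upstream.name, downstream.name, distance))
    return pairs


# Hypothetical gene coordinates for illustration.
example_genes = [
    Gene("A", 1, 100, "+"),
    Gene("B", 111, 200, "+"),
    Gene("C", 190, 300, "+"),
    Gene("D", 400, 500, "-"),
]
print(adjacent_pair_distances(example_genes))
```

Short or negative distances between same-strand neighbours are the usual signal that two genes may belong to one operon, which is why this feature appears first in the list.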
Affiliation(s)
- Li-Yeh Chuang
  - Department of Chemical Engineering & Institute of Biotechnology and Chemical Engineering, I-Shou University, Taiwan
11
Roberts EW, Cai F, Kerfeld CA, Cannon GC, Heinhorst S. Isolation and characterization of the Prochlorococcus carboxysome reveal the presence of the novel shell protein CsoS1D. J Bacteriol 2012; 194:787-95. [PMID: 22155772 PMCID: PMC3272956 DOI: 10.1128/jb.06444-11] [Citation(s) in RCA: 61] [Impact Index Per Article: 4.7] [Received: 10/28/2011] [Accepted: 11/29/2011] [Indexed: 11/20/2022] Open
Abstract
Cyanobacteria, including members of the genus Prochlorococcus, contain icosahedral protein microcompartments known as carboxysomes that encapsulate multiple copies of the CO(2)-fixing enzyme ribulose 1,5-bisphosphate carboxylase/oxygenase (RubisCO) in a thin protein shell that enhances the catalytic performance of the enzyme in part through the action of a shell-associated carbonic anhydrase. However, the exact mechanism by which compartmentation provides a catalytic advantage to the enzyme is not known. Complicating the study of cyanobacterial carboxysomes has been the inability to obtain homogeneous carboxysome preparations. This study describes the first successful purification and characterization of carboxysomes from the marine cyanobacterium Prochlorococcus marinus MED4. Because the isolated P. marinus MED4 carboxysomes were free from contaminating membrane proteins, their protein complement could be assessed. In addition to the expected shell proteins, the CsoS1D protein that is not encoded by the canonical cso gene clusters of α-cyanobacteria was found to be a low-abundance shell component. This finding and supporting comparative genomic evidence have important implications for carboxysome composition, structure, and function. Our study indicates that carboxysome composition is probably more complex than was previously assumed based on the gene complements of the classical cso gene clusters.
Affiliation(s)
- Evan W. Roberts
  - Department of Chemistry and Biochemistry, The University of Southern Mississippi, Hattiesburg, Mississippi, USA
- Fei Cai
  - DOE Joint Genome Institute, Walnut Creek, California, USA
- Cheryl A. Kerfeld
  - DOE Joint Genome Institute, Walnut Creek, California, USA
  - Department of Plant and Microbial Biology, University of California, Berkeley, California, USA
- Gordon C. Cannon
  - Department of Chemistry and Biochemistry, The University of Southern Mississippi, Hattiesburg, Mississippi, USA
- Sabine Heinhorst
  - Department of Chemistry and Biochemistry, The University of Southern Mississippi, Hattiesburg, Mississippi, USA
12
Abstract
An operon is a fundamental unit of transcription and contains specific functional genes for the construction and regulation of networks at the entire genome level. The correct prediction of operons is vital for understanding gene regulations and functions in newly sequenced genomes. As experimental methods for operon detection tend to be nontrivial and time consuming, various methods for operon prediction have been proposed in the literature. In this study, a binary particle swarm optimization is used for operon prediction in bacterial genomes. The intergenic distance, participation in the same metabolic pathway, the cluster of orthologous groups, the gene length ratio and the operon length are used to design a fitness function. We trained the proper values on the Escherichia coli genome, and used the above five properties to implement feature selection. Finally, our study used the intergenic distance, metabolic pathway and the gene length ratio property to predict operons. Experimental results show that the prediction accuracy of this method reached 92.1%, 93.3% and 95.9% on the Bacillus subtilis genome, the Pseudomonas aeruginosa PA01 genome and the Staphylococcus aureus genome, respectively. This method has enabled us to predict operons with high accuracy for these three genomes, for which only limited data on the properties of the operon structure exists.
Affiliation(s)
- Li-Yeh Chuang
  - Department of Chemical Engineering, I-Shou University, Kaohsiung, Taiwan
13
Taboada B, Verde C, Merino E. High accuracy operon prediction method based on STRING database scores. Nucleic Acids Res 2010; 38:e130. [PMID: 20385580 PMCID: PMC2896540 DOI: 10.1093/nar/gkq254] [Citation(s) in RCA: 51] [Impact Index Per Article: 3.4] [Indexed: 11/18/2022] Open
Abstract
We present a simple and highly accurate computational method for operon prediction, based on intergenic distances and functional relationships between the protein products of contiguous genes, as defined by the STRING database (Jensen,L.J., Kuhn,M., Stark,M., Chaffron,S., Creevey,C., Muller,J., Doerks,T., Julien,P., Roth,A., Simonovic,M. et al. (2009) STRING 8–a global view on proteins and their functional interactions in 630 organisms. Nucleic Acids Res., 37, D412–D416). These two parameters were used to train a neural network on a subset of experimentally characterized Escherichia coli and Bacillus subtilis operons. Our predictive model was successfully tested on the set of experimentally defined operons in E. coli and B. subtilis, with accuracies of 94.6 and 93.3%, respectively. As far as we know, these are the highest accuracies ever obtained for predicting bacterial operons. Furthermore, in order to evaluate the predictive accuracy of our model when using one organism's data set for training and a different organism's data set for testing, we repeated the E. coli operon prediction analysis using a neural network trained with B. subtilis data, and a B. subtilis analysis using a neural network trained with E. coli data. Even for these cases, the accuracies reached with our method were outstandingly high, 91.5 and 93%, respectively. These results show the potential use of our method for accurately predicting the operons of any other organism. Our operon predictions for fully-sequenced genomes are available at http://operons.ibt.unam.mx/OperonPredictor/.
Affiliation(s)
- Blanca Taboada
  - Centro de Ciencias Aplicadas y Desarrollo Tecnológico, Universidad Nacional Autónoma de México, México, D.F., México
14
Improved Prediction of Protein Binding Sites from Sequences Using Genetic Algorithm. Protein J 2009; 28:273-80. [DOI: 10.1007/s10930-009-9192-1] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.4] [Indexed: 10/20/2022]
15
Li G, Che D, Xu Y. A universal operon predictor for prokaryotic genomes. J Bioinform Comput Biol 2009; 7:19-38. [PMID: 19226658 DOI: 10.1142/s0219720009003984] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.0] [Received: 10/01/2007] [Revised: 02/21/2008] [Accepted: 04/22/2008] [Indexed: 11/18/2022]
Abstract
Identification of operons at the genome scale of prokaryotic organisms represents a key step in deciphering of their transcriptional regulation machinery, biological pathways, and networks. While numerous computational methods have been shown to be effective in predicting operons for well-studied organisms such as Escherichia coli K12 and Bacillus subtilis 168, these methods generally do not generalize well to genomes other than the ones used to train the methods, or closely related genomes because they rely on organism-specific information. Several methods have been explored to address this problem through utilizing only genomic structural information conserved across multiple organisms, but they all suffer from the issue of low prediction sensitivity. In this paper, we report a novel operon prediction method that is applicable to any prokaryotic genome with high prediction accuracy. The key idea of the method is to predict operons through identification of conserved gene clusters across multiple genomes and through deriving a key parameter relevant to the distribution of intergenic distances in genomes. We have implemented this method using a graph-theoretic approach, to calculate a set of maximum gene clusters in the target genome that are conserved across multiple reference genomes. Our computational results have shown that this method has higher prediction sensitivity as well as specificity than most of the published methods. We have carried out a preliminary study on operons unique to archaea and bacteria, respectively, and derived a number of interesting new insights about operons between these two kingdoms. The software and predicted operons of 365 prokaryotic genomes are available at http://csbl.bmb.uga.edu/~dongsheng/UNIPOP.
Affiliation(s)
- Guojun Li
  - CSBL, Department of Biochemistry and Molecular Biology, Department of Computer Science, University of Georgia, Athens, GA 30602, USA
16
Jacob E, Nair KR, Sasikumar R. A fuzzy-driven genetic algorithm for sequence segmentation applied to genomic sequences. Appl Soft Comput 2009. [DOI: 10.1016/j.asoc.2008.07.004] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 10/21/2022]
17
Gómez A, Cedano J, Espadaler J, Hermoso A, Piñol J, Querol E. Prediction of protein function improving sequence remote alignment search by a fuzzy logic algorithm. Protein J 2008; 27:130-9. [PMID: 18066655 DOI: 10.1007/s10930-007-9116-x] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Indexed: 10/22/2022]
Abstract
The functional annotation of new protein sequences represents a major bottleneck for genomic science. The best way to suggest the function of a protein from its sequence is by finding a related one for which biological information is available. Current alignment algorithms display a list of protein sequence stretches presenting significant similarity to different protein targets, ordered by their respective mathematical scores. However, statistical and biological significance do not always coincide; therefore, rearranging the program output according to characteristics that are more biological than the mathematical score would help functional annotation. A new method is described that predicts the putative function of a protein by integrating the results from the PSI-BLAST program with a fuzzy logic algorithm. Several protein sequence characteristics have been tested for their ability to rearrange a PSI-BLAST profile more in line with biological function. Four of them (amino acid content, matched segment length, and hydropathic and flexibility profiles) contributed positively, upon being integrated by a fuzzy logic algorithm into a program, BYPASS, to the accurate prediction of the function of a protein from its sequence.
Affiliation(s)
- Antonio Gómez
  - Institut de Biotecnologia i Biomedicina, Departament de Bioquímica i Biologia Molecular de la, Universitat Autònoma de Barcelona, Bellaterra, Barcelona, 08193, Spain
18
Brouwer RWW, Kuipers OP, van Hijum SAFT. The relative value of operon predictions. Brief Bioinform 2008; 9:367-75. [PMID: 18420711 DOI: 10.1093/bib/bbn019] [Citation(s) in RCA: 78] [Impact Index Per Article: 4.6] [Indexed: 11/14/2022] Open
Abstract
For most organisms, computational operon predictions are the only source of genome-wide operon information. Operon prediction methods described in literature are based on (a combination of) the following five criteria: (i) intergenic distance, (ii) conserved gene clusters, (iii) functional relation, (iv) sequence elements and (v) experimental evidence. The performance estimates of operon predictions reported in literature cannot directly be compared due to differences in methods and data used in these studies. Here, we survey the current status of operon prediction methods. Based on a comparison of the performance of operon predictions on Escherichia coli and Bacillus subtilis we conclude that there is still room for improvement. We expect that existing and newly generated genomics and transcriptomics data will further improve accuracy of operon prediction methods.
19
Wang S, Wang Y, Du W, Sun F, Wang X, Zhou C, Liang Y. A multi-approaches-guided genetic algorithm with application to operon prediction. Artif Intell Med 2007; 41:151-9. [PMID: 17869072 DOI: 10.1016/j.artmed.2007.07.010] [Citation(s) in RCA: 32] [Impact Index Per Article: 1.8] [Received: 11/30/2006] [Revised: 07/30/2007] [Accepted: 07/30/2007] [Indexed: 11/24/2022]
Abstract
OBJECTIVE The prediction of operons is critical to the reconstruction of regulatory networks at the whole-genome level. Multiple genome features have been used for predicting operons. However, in the literature these features are usually all handled with a single method. The aim of this paper is to develop a combined method for operon prediction that preprocesses each genome feature with a different method in order to exploit its unique characteristics. METHODS A novel multi-approach-guided genetic algorithm for operon prediction is presented. We exploit different methods for intergenic distance, cluster of orthologous groups (COG) gene functions, metabolic pathway and microarray expression data. A novel local-entropy-minimization method is proposed to partition intergenic distance. Our program can be used for other newly sequenced genomes by transferring the knowledge that has been obtained from Escherichia coli data. We calculate the log-likelihood for COG gene functions and the Pearson correlation coefficient for microarray expression data. The genetic algorithm is used for integrating the four types of data. RESULTS The proposed method is examined on the E. coli K12, Bacillus subtilis, and Pseudomonas aeruginosa PAO1 genomes. The prediction accuracies for these three genomes are 85.9987%, 88.296%, and 81.2384%, respectively. CONCLUSION Simulated experimental results demonstrate that, in the genetic algorithm, preprocessing the genome data using multiple approaches ensures the effective utilization of different biological characteristics. Experimental results also show that the proposed method is applicable to predicting operons in prokaryotes.
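To make the expression-based feature named in this abstract concrete, the following minimal Python sketch computes a Pearson correlation coefficient between the expression profiles of two adjacent genes. It is an illustration only, not the authors' implementation: the function name `pearson` and the example profiles are hypothetical, and the abstract does not say how the coefficient is combined with the other features.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products and sums of squares of the centred values.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical expression values for two adjacent genes across four arrays.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [2.0, 4.0, 6.0, 8.0]
r = pearson(gene_a, gene_b)
```

A coefficient close to 1 for an adjacent gene pair would be consistent with co-transcription, although on its own it is only one line of evidence.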
Affiliation(s)
- Shuqin Wang
- College of Computer Science and Technology, Jilin University, Key Laboratory of Symbol Computation and Knowledge Engineering of the Ministry of Education, Changchun 130012, China
20
Cai F, Heinhorst S, Shively JM, Cannon GC. Transcript analysis of the Halothiobacillus neapolitanus cso operon. Arch Microbiol 2007; 189:141-50. [PMID: 17899012 DOI: 10.1007/s00203-007-0305-y] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.0] [Received: 05/17/2007] [Revised: 08/13/2007] [Accepted: 08/31/2007] [Indexed: 10/22/2022]
Abstract
Carboxysomes are polyhedral microcompartments that sequester the CO2-fixing enzyme ribulose 1,5-bisphosphate carboxylase/oxygenase in many autotrophic bacteria. Their protein constituents are encoded by a set of tightly clustered genes that are thought to form an operon (the cso operon). This study is the first to systematically address transcriptional regulation of carboxysome protein expression. Quantification of transcript levels derived from the cso operon of Halothiobacillus neapolitanus, the sulfur oxidizer that has emerged as the model organism for carboxysome structural and functional studies, indicated that all cso genes are transcribed, albeit at different levels. Combined with comparative genomic evidence, this study supports the premise that the cso gene cluster constitutes an operon. Characterization of transcript 5'- and 3'-ends and examination of likely regulatory sequences and secondary structure elements within the operon suggested potential strategies by which the vastly different levels of individual carboxysome proteins in the microcompartment could have arisen.
Affiliation(s)
- Fei Cai
- Department of Chemistry and Biochemistry, The University of Southern Mississippi, Hattiesburg, MS 39406-0001, USA
21
Characterization of relationships between transcriptional units and operon structures in Bacillus subtilis and Escherichia coli. BMC Genomics 2007; 8:48. [PMID: 17298663 PMCID: PMC1808063 DOI: 10.1186/1471-2164-8-48] [Citation(s) in RCA: 43] [Impact Index Per Article: 2.4] [Received: 08/31/2006] [Accepted: 02/13/2007] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Operon structures play an important role in transcriptional regulation in prokaryotes. However, there have been fewer studies on complicated operon structures in which the transcriptional units vary with changing environmental conditions. Information about such complicated operons is helpful for predicting and analyzing operon structures, as well as understanding gene functions and transcriptional regulation. RESULTS We systematically analyzed the experimentally verified transcriptional units (TUs) in Bacillus subtilis and Escherichia coli obtained from ODB and RegulonDB. To understand the relationships between TUs and operons, we defined a new classification system for adjacent gene pairs, divided into three groups according to the level of gene co-regulation: operon pairs (OP) that belong to the same TU, sub-operon pairs (SOP) that are at the transcriptional boundaries within an operon, and non-operon pairs (NOP) that belong to different operons. We found that the levels of gene co-regulation were correlated with intergenic distances and gene expression levels. Additional analysis revealed that they were also correlated with the levels of conservation across about 200 prokaryotic genomes. Most interestingly, we found that functional associations in SOPs were observed more often in environmental and genetic information processes. CONCLUSION Complicated operon structures were correlated with genome organization and gene expression profiles. Such intricately regulated operons allow functional differences depending on environmental conditions. These regulatory mechanisms are helpful in accommodating the variety of changes that happen around the cell. In addition, such differences may play an important role in the evolution of gene order across genomes.
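The three-way classification of adjacent gene pairs described in this abstract can be sketched as follows. This is one possible reading of the definitions, written in Python under stated assumptions: operons and TUs are given as sets of gene identifiers, a pair counts as SOP when both genes share an operon but some TU contains only one of them, and the names `classify_pair`, `operons` and `tus` are hypothetical. The paper's exact rules are not given in the abstract.

```python
def classify_pair(gene_a, gene_b, operons, tus):
    """Label an adjacent gene pair as "OP", "SOP" or "NOP".

    operons: iterable of sets of gene identifiers (one set per operon)
    tus:     iterable of sets of gene identifiers (one set per transcriptional unit)
    """
    same_operon = any(gene_a in op and gene_b in op for op in operons)
    if not same_operon:
        # The two genes belong to different operons.
        return "NOP"
    # Assumed rule: a TU boundary separates the pair when some TU
    # contains exactly one of the two genes.
    boundary_between = any((gene_a in tu) != (gene_b in tu) for tu in tus)
    return "SOP" if boundary_between else "OP"


# Hypothetical toy data: one three-gene operon with an internal TU
# covering only gene "c", plus a single-gene operon "d".
operons = [{"a", "b", "c"}, {"d"}]
tus = [{"a", "b", "c"}, {"c"}, {"d"}]
labels = [classify_pair(x, y, operons, tus)
          for x, y in [("a", "b"), ("b", "c"), ("c", "d")]]
```

With real data the gene sets would come from curated resources such as those named in the abstract; the toy sets here only exercise the three branches.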
22
Dam P, Olman V, Harris K, Su Z, Xu Y. Operon prediction using both genome-specific and general genomic information. Nucleic Acids Res 2006; 35:288-98. [PMID: 17170009 PMCID: PMC1802555 DOI: 10.1093/nar/gkl1018] [Citation(s) in RCA: 141] [Impact Index Per Article: 7.4] [Indexed: 11/20/2022] Open
Abstract
We have carried out a systematic analysis of the contribution of a set of selected features that include three new features to the accuracy of operon prediction. Our analyses have led to a number of new insights about operon prediction, including that (i) different features have different levels of discerning power when used on adjacent gene pairs with different ranges of intergenic distance, (ii) certain features are universally useful for operon prediction while others are more genome-specific and (iii) the prediction reliability of operons is dependent on intergenic distances. Based on these new insights, our newly developed operon-prediction program achieves more accurate operon prediction than the previous ones, and it uses features that are most readily available from genomic sequences. Our prediction results indicate that our (non-linear) decision tree-based classifier can predict operons in a prokaryotic genome very accurately when a substantial number of operons in the genome are already known. For example, the prediction accuracy of our program can reach 90.2 and 93.7% on Bacillus subtilis and Escherichia coli genomes, respectively. When no such information is available, our (linear) logistic function-based classifier can reach the prediction accuracy at 84.6 and 83.3% for E. coli and B. subtilis, respectively.
Affiliation(s)
- Phuongan Dam
- Computational Systems Biology Laboratory, Department of Biochemistry and Molecular Biology, University of Georgia, Athens, GA, USA
- Victor Olman
- Computational Systems Biology Laboratory, Department of Biochemistry and Molecular Biology, University of Georgia, Athens, GA, USA
- Institute of Bioinformatics, University of Georgia, Athens, GA, USA
- Kyle Harris
- Computational Systems Biology Laboratory, Department of Biochemistry and Molecular Biology, University of Georgia, Athens, GA, USA
- Zhengchang Su
- Center for Bioinformatics Research, Department of Computer Science, University of North Carolina at Charlotte, Charlotte, NC, USA
- Ying Xu
- Computational Systems Biology Laboratory, Department of Biochemistry and Molecular Biology, University of Georgia, Athens, GA, USA
- Institute of Bioinformatics, University of Georgia, Athens, GA, USA
- To whom correspondence should be addressed. Tel: +1 706 542 9779; Fax: +1 706 542 9751
23
Abstract
Identification of operons in the hyperthermophilic archaeon Pyrococcus furiosus represents an important step to understanding the regulatory mechanisms that enable the organism to adapt and thrive in extreme environments. We have predicted operons in P. furiosus by combining the results from three existing algorithms using a neural network (NN). These algorithms use intergenic distances, phylogenetic profiles, functional categories and gene-order conservation in their operon prediction. Our method takes as inputs the confidence scores of the three programs, and outputs a prediction of whether adjacent genes on the same strand belong to the same operon. In addition, we have applied Gene Ontology (GO) and KEGG pathway information to improve the accuracy of our algorithm. The parameters of this NN predictor are trained on a subset of all experimentally verified operon gene pairs of Bacillus subtilis. It subsequently achieved 86.5% prediction accuracy when applied to a subset of gene pairs for Escherichia coli, which is substantially better than any of the three prediction programs. Using this new algorithm, we predicted 470 operons in the P. furiosus genome. Of these, 349 were validated using DNA microarray data.
Affiliation(s)
- Phuongan Dam
- Department of Biochemistry and Molecular Biology, Institute of Bioinformatics, University of Georgia, Athens, GA 30602, USA
- Zhengchang Su
- Department of Biochemistry and Molecular Biology, Institute of Bioinformatics, University of Georgia, Athens, GA 30602, USA
- Farris L. Poole
- Department of Biochemistry and Molecular Biology, Institute of Bioinformatics, University of Georgia, Athens, GA 30602, USA
- Michael W. W. Adams
- Department of Biochemistry and Molecular Biology, Institute of Bioinformatics, University of Georgia, Athens, GA 30602, USA
- Ying Xu
- Department of Biochemistry and Molecular Biology, Institute of Bioinformatics, University of Georgia, Athens, GA 30602, USA
- To whom correspondence should be addressed. Tel: +1 706 542 9779; Fax: +1 706 542 9751
24
Janga SC, Lamboy WF, Huerta AM, Moreno-Hagelsieb G. The distinctive signatures of promoter regions and operon junctions across prokaryotes. Nucleic Acids Res 2006; 34:3980-7. [PMID: 16914446 PMCID: PMC1557821 DOI: 10.1093/nar/gkl563] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.7] [Indexed: 11/29/2022] Open
Abstract
Here we show that regions upstream of first transcribed genes have oligonucleotide signatures that distinguish them from regions upstream of genes in the middle of operons. Databases of experimentally confirmed transcription units do not exist for most genomes. Thus, to expand the analyses into genomes with no experimentally confirmed data, we used genes conserved adjacent in evolutionarily distant genomes as representatives of genes inside operons. Likewise, we used divergently transcribed genes as representative examples of first transcribed genes. In model organisms, the trinucleotide signatures of regions upstream of these representative genes allow for operon predictions with accuracies close to those obtained with known operon data (0.8). Signature-based operon predictions have more similar phylogenetic profiles and higher proportions of genes in the same pathways than predicted transcription unit boundaries (TUBs). These results confirm that we are separating genes with related functions, as expected for operons, from genes not necessarily related, as expected for genes in different transcription units. We also test the quality of the predictions using microarray data in six genomes and show that the signature-predicted operons tend to have high correlations of expression. Oligonucleotide signatures should expand the number of tools available to identify operons even in poorly characterized genomes.
Affiliation(s)
- Sarath Chandra Janga
- Department of Biology, Wilfrid Laurier University, 75 University Avenue West, Waterloo, ON, Canada, N2L 3C5.
25
Zhang GQ, Cao ZW, Luo QM, Cai YD, Li YX. Operon prediction based on SVM. Comput Biol Chem 2006; 30:233-40. [PMID: 16716751 DOI: 10.1016/j.compbiolchem.2006.03.002] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.2] [Received: 12/02/2005] [Revised: 03/17/2006] [Accepted: 03/24/2006] [Indexed: 11/27/2022]
Abstract
The operon is a specific functional organization of genes found in bacterial genomes. Most genes within operons share common features. The support vector machine (SVM) approach is here used to predict operons at the genomic level. Four features were chosen as SVM input vectors: the intergenic distances, the number of common pathways, the number of conserved gene pairs and the mutual information of phylogenetic profiles. The analysis reveals that these common properties are indeed characteristic of the genes within operons and differ from those of non-operonic genes. Jackknife testing indicates that these input feature vectors, employed with an RBF-kernel SVM, achieve high accuracy. To validate the method, Escherichia coli K12 and Bacillus subtilis were taken as benchmark genomes of known operon structure, and the prediction results in both show that the SVM can detect operon genes in target genomes efficiently and offers a satisfactory balance between sensitivity and specificity.
Affiliation(s)
- Guo-qing Zhang
- Hubei Bioinformatics and Molecular Imaging Key Laboratory, Huazhong University of Science and Technology, Wuhan, Hubei 430074, China
26
Xiao G, Martinez-Vaz B, Pan W, Khodursky AB. Operon information improves gene expression estimation for cDNA microarrays. BMC Genomics 2006; 7:87. [PMID: 16630355 PMCID: PMC1513396 DOI: 10.1186/1471-2164-7-87] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.5] [Received: 09/30/2005] [Accepted: 04/21/2006] [Indexed: 11/25/2022] Open
Abstract
BACKGROUND In prokaryotic genomes, genes are organized in operons, and the genes within an operon tend to have similar levels of expression. Because of co-transcription of genes within an operon, borrowing information from other genes within the same operon can improve the estimation of relative transcript levels; the estimation of relative levels of transcript abundances is one of the most challenging tasks in experimental genomics due to the high noise level in microarray data. Therefore, techniques that can improve such estimations, and moreover are based on sound biological premises, are expected to benefit the field of microarray data analysis. RESULTS In this paper, we propose a hierarchical Bayesian model, which relies on borrowing information from other genes within the same operon, to improve the estimation of gene expression levels and, hence, the detection of differentially expressed genes. The simulation studies and the analysis of experimental data demonstrated that the proposed method outperformed other techniques that are routinely used to estimate transcript levels and detect differentially expressed genes, including the sample mean and SAM t statistics. The improvement became more pronounced as the noise level in the microarray data increased. CONCLUSION By borrowing information about transcriptional activity of genes within classified operons, we improved the estimation of gene expression levels and the detection of differentially expressed genes.
Affiliation(s)
- Guanghua Xiao
- Division of Biostatistics, School of Public Health, University of Minnesota, A460 Mayo Building, Minneapolis, MN 55455-0378, USA
- Betsy Martinez-Vaz
- Department of Biochemistry, Molecular Biology and Biophysics, University of Minnesota, Saint Paul, MN, 55108, USA
- Wei Pan
- Division of Biostatistics, School of Public Health, University of Minnesota, A460 Mayo Building, Minneapolis, MN 55455-0378, USA
- Arkady B Khodursky
- Department of Biochemistry, Molecular Biology and Biophysics, University of Minnesota, Saint Paul, MN, 55108, USA
27
Larrañaga P, Calvo B, Santana R, Bielza C, Galdiano J, Inza I, Lozano JA, Armañanzas R, Santafé G, Pérez A, Robles V. Machine learning in bioinformatics. Brief Bioinform 2006; 7:86-112. [PMID: 16761367 DOI: 10.1093/bib/bbk007] [Citation(s) in RCA: 368] [Impact Index Per Article: 19.4] [Indexed: 11/13/2022] Open
Abstract
This article reviews machine learning methods for bioinformatics. It presents modelling methods, such as supervised classification, clustering and probabilistic graphical models for knowledge discovery, as well as deterministic and stochastic heuristics for optimization. Applications in genomics, proteomics, systems biology, evolution and text mining are also shown.
Affiliation(s)
- Pedro Larrañaga
- Intelligent Systems Group, Department of Computer Science and Artificial Intelligence, University of the Basque Country, Paseo Manuel de Lardizabal, 1, 20018 San Sebastian, Spain.
28
Gertz J, Riles L, Turnbaugh P, Ho SW, Cohen BA. Discovery, validation, and genetic dissection of transcription factor binding sites by comparative and functional genomics. Genome Res 2005; 15:1145-52. [PMID: 16077013 PMCID: PMC1182227 DOI: 10.1101/gr.3859605] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.2] [Received: 02/21/2005] [Accepted: 05/03/2005] [Indexed: 11/24/2022]
Abstract
Completing the annotation of a genome sequence requires identifying the regulatory sequences that control gene expression. To identify these sequences, we developed an algorithm that searches for short, conserved sequence motifs in the genomes of related species. The method is effective in finding motifs de novo and for refining known regulatory motifs in Saccharomyces cerevisiae. We tested one novel motif prediction of the algorithm and found it to be the binding site of Stp2; it is significantly different from the previously predicted Stp2 binding site. We show that Stp2 physically interacts with this sequence motif, and that stp2 mutations affect the expression of genes associated with the motif. We demonstrate that the Stp2 binding site also interacts genetically with Stp1, a regulator of amino acid permease genes, and with Sfp1, a key regulator of cell growth. These results illuminate an important transcriptional circuit that regulates cell growth through external nutrient uptake.
Affiliation(s)
- Jason Gertz
- Department of Genetics, Washington University School of Medicine, St. Louis, Missouri 63108, USA
29
Current Awareness on Comparative and Functional Genomics. Comp Funct Genomics 2005. [PMCID: PMC2447491 DOI: 10.1002/cfg.425] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2022] Open