2
Guo FB, Lin Y, Chen LL. Recognition of Protein-coding Genes Based on Z-curve Algorithms. Curr Genomics 2014; 15:95-103. [PMID: 24822027 PMCID: PMC4009845 DOI: 10.2174/1389202915999140328162724] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 09/15/2013] [Revised: 11/19/2013] [Accepted: 11/20/2013] [Indexed: 01/18/2023] Open
Abstract
Recognition of protein-coding genes, a classical bioinformatics problem, is an essential step in annotating newly sequenced genomes. The Z-curve algorithm, one of the most effective methods for this problem, has been successfully applied in annotating or re-annotating many genomes, including those of bacteria, archaea and viruses. Two Z-curve based ab initio gene-finding programs have been developed: ZCURVE (for bacteria and archaea) and ZCURVE_V (for viruses and phages). ZCURVE_C (for 57 bacteria) and Zfisher (for any bacterium) are web servers for re-annotation of bacterial and archaeal genomes. These four tools can be used for genome annotation or re-annotation, either independently or combined with other gene-finding programs. In addition to recognizing protein-coding genes and exons, Z-curve algorithms are also effective in recognizing promoters and translation start sites. Here, we summarize the applications of Z-curve algorithms in gene finding and genome annotation.
Affiliation(s)
- Feng-Biao Guo
- Center of Bioinformatics and Key Laboratory for NeuroInformation of the Ministry of Education, University of Electronic Science and Technology of China, Chengdu, 610054, China
- Yan Lin
- Department of Physics, Tianjin University, Tianjin 300072, China
- Ling-Ling Chen
- College of Life Science and Technology, Huazhong Agricultural University, Wuhan, 430070, China
3
Guo FB, Xiong L, Teng JLL, Yuen KY, Lau SKP, Woo PCY. Re-annotation of protein-coding genes in 10 complete genomes of Neisseriaceae family by combining similarity-based and composition-based methods. DNA Res 2013; 20:273-86. [PMID: 23571676 PMCID: PMC3686433 DOI: 10.1093/dnares/dst009] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.4] [Indexed: 12/31/2022] Open
Abstract
In this paper, we performed a comprehensive re-annotation of protein-coding genes in 10 complete bacterial genomes of the family Neisseriaceae, using a systematic method that combines composition- and similarity-based approaches. First, 418 hypothetical genes were predicted as non-coding using the composition-based method and 413 were eliminated from the gene list. Both the scatter plot and cluster of orthologous groups (COG) fraction analyses supported the result. Second, from 20 to 400 hypothetical proteins were assigned functions in each of the 10 strains based on the homology search. Among the newly assigned functions, 397 are detailed enough to have definite gene names. Third, 106 genes missed by the original annotations were picked up by an ab initio gene finder combined with similarity alignment. Transcriptional experiments validated the effectiveness of this method in Laribacter hongkongensis and Chromobacterium violaceum. Among the 106 newly found genes, some deserve particular interest. For example, 27 transposases were newly found in Neisseria meningitidis alpha14. In Neisseria gonorrhoeae NCCP11945, four new genes with putative functions and definite names (nusG, rpsN, rpmD and infA) were found, and their homologues are usually essential for survival in bacteria. The updated annotations for the 10 Neisseriaceae genomes provide a more accurate prediction of protein-coding genes and more detailed functional information on hypothetical proteins. They will benefit research into the lifestyle, metabolism, environmental adaptation and pathogenicity of the Neisseriaceae species. The re-annotation procedure could be used directly, or after adaptation of the detailed methods, for checking annotations of any other bacterial or archaeal genomes.
Affiliation(s)
- Feng-Biao Guo
- Department of Microbiology, The University of Hong Kong, Special Administrative Region, Hong Kong, People's Republic of China
4
Du MZ, Guo FB, Chen YY. Gene re-annotation in genome of the extremophile Pyrobaculum aerophilum by using bioinformatics methods. J Biomol Struct Dyn 2012; 29:391-401. [PMID: 21875157 DOI: 10.1080/07391102.2011.10507393] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 01/07/2023]
Abstract
In this paper, we re-annotated the genome of Pyrobaculum aerophilum str. IM2, particularly for hypothetical ORFs. The annotation process includes three parts. Firstly and most importantly, 23 new genes, which were missed in the original annotation, are found by combining similarity search and ab initio gene-finding approaches. Among these new genes, five have significant similarity to genes of known function and the rest have significant similarity to hypothetical ORFs in other genomes. Secondly, the coding potentials of the 1645 hypothetical ORFs are re-predicted by using 33 Z curve variables combined with the Fisher linear discrimination method. With an accuracy of 99.68%, 25 originally annotated hypothetical ORFs are recognized as non-coding by our method. Thirdly, 80 hypothetical ORFs are assigned potential functions by similarity search with the BLAST program. Re-annotation of the genome will benefit related research on this hyperthermophilic crenarchaeon. Also, the re-annotation procedure could be taken as a reference for other archaeal genomes. Details of the revised annotation are freely available at http://cobi.uestc.edu.cn/resource/paero/
Affiliation(s)
- Meng-Ze Du
- Key Laboratory for NeuroInformation of Ministry of Education, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, China
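The abstract of this entry mentions Z curve variables as features for separating coding from non-coding ORFs. As a rough illustration, the sketch below computes only the nine phase-specific mononucleotide Z-curve components (x, y, z for each of the three codon positions) using the standard Z-curve transform. It is not the authors' 33-variable feature set or their Fisher discriminant; the function name `zcurve_phase_variables` is hypothetical.

```python
from collections import Counter


def zcurve_phase_variables(seq):
    """Return 9 Z-curve variables (x, y, z per codon position) for an ORF.

    Assumption: `seq` is read in frame, so index 0 is codon position 1.
    Characters other than A, C, G, T are ignored.
    """
    seq = seq.upper()
    variables = []
    for phase in range(3):
        # Bases occupying this codon position across the whole ORF.
        counts = Counter(b for b in seq[phase::3] if b in "ACGT")
        total = sum(counts.values())
        if total == 0:
            variables.extend([0.0, 0.0, 0.0])
            continue
        a, c, g, t = (counts[b] / total for b in "ACGT")
        x = (a + g) - (c + t)  # purine vs pyrimidine
        y = (a + c) - (g + t)  # amino vs keto
        z = (a + t) - (g + c)  # weak vs strong hydrogen bonding
        variables.extend([x, y, z])
    return variables


if __name__ == "__main__":
    print(zcurve_phase_variables("ATGGCTAAATAA"))
```

A feature vector of this kind could then be passed to any linear classifier; which additional variables and which discriminant threshold the paper uses is not stated in the abstract.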
5
Okamoto A, Yamada K. Proteome driven re-evaluation and functional annotation of the Streptococcus pyogenes SF370 genome. BMC Microbiol 2011; 11:249. [PMID: 22070424 PMCID: PMC3224786 DOI: 10.1186/1471-2180-11-249] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Received: 05/11/2011] [Accepted: 11/10/2011] [Indexed: 12/02/2022] Open
Abstract
Background: The genome data of Streptococcus pyogenes SF370 have been widely used by many researchers and have provided a vast array of interesting findings. Nevertheless, approximately 40% of genes remain classified as hypothetical proteins, and several coding sequences (CDSs) have gone unrecognized. In this study, we attempted a shotgun proteomic analysis with a six-frame database that was independent of genome annotation. Results: Nine proteins encoded by novel ORFs were found by shotgun proteomic analysis, and their specific mRNAs were verified by reverse transcription PCR (RT-PCR). We also provided functional annotations for hypothetical genes using proteomic analysis from three different culture conditions that were separated into three fractions: supernatant, soluble, and insoluble. Consequently, we identified 567 proteins on re-evaluation of the proteomic data using an in-house database comprising 1,697 annotated and nine non-annotated CDSs. We provided functional annotations for 126 hypothetical proteins (18.9% out of the 668 hypothetical proteins) based on their cellular fractions and expression profiles under different culture conditions. Conclusions: The list of amino acid sequences that were annotated by genome analysis contains outdated information and unrecognized protein-coding sequences. We suggest that the six-frame database derived from actual DNA sequences be used for reliable proteomic analysis. In addition, the experimental evidence from functional proteomic analysis is useful for the re-evaluation of previously sequenced genomes.
Affiliation(s)
- Akira Okamoto
- Department of Molecular Bacteriology, Nagoya University Graduate School of Medicine, 65 Tsurumai-cho, Showa-ku, Nagoya, Aichi 466-8550, Japan.
6
Chakraborty J, Dutta TK. From lipid transport to oxygenation of aromatic compounds: evolution within the Bet v1-like superfamily. J Biomol Struct Dyn 2011; 29:67-78. [PMID: 21696226 DOI: 10.1080/07391102.2011.10507375] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Indexed: 01/09/2023]
Abstract
In the absence of significant sequence similarity, remote homology between proteins can be confused with analogy; in such cases, shared ancestry can be inferred from certain unique and common features. In the present study, to understand the evolutionary origin of the catalytic domain of the large subunit of ring-hydroxylating oxygenases (RHOs), which belong to the Bet v1-like superfamily, structure-based phylogenies have been derived from structural alignment of representative proteins of the superfamily. A careful inspection of the structural relatedness of RHOs to the rest of the families showed the closest similarity between the RHO catalytic domain and PA1206-like protein. In addition, the phylogenetic relationship of the Rieske domain of the large subunit of RHOs to functionally and structurally similar proteins has been elucidated so as to postulate the most probable events leading to the genesis of the large subunit of RHOs.
Affiliation(s)
- Joydeep Chakraborty
- Department of Microbiology, Bose Institute, P-1/12 C.I.T. Scheme VII M, Kolkata 700054, India
7
Yu JF, Xiao K, Jiang DK, Guo J, Wang JH, Sun X. An integrative method for identifying the over-annotated protein-coding genes in microbial genomes. DNA Res 2011; 18:435-49. [PMID: 21903723 PMCID: PMC3223076 DOI: 10.1093/dnares/dsr030] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.1] [Indexed: 11/29/2022] Open
Abstract
Falsely annotated protein-coding genes are deemed one of the major causes of annotation errors in public databases. Although many filtering approaches have been designed for over-annotated protein-coding genes, some are questionable because they increase false negatives. Furthermore, there is no web server or software specifically devised for the problem of over-annotation. In this study, we propose an integrative algorithm for detecting over-annotated protein-coding genes in microorganisms. Overall, an average accuracy of 99.94% is achieved over 61 microbial genomes. The extremely high accuracy indicates that the presented algorithm effectively differentiates protein-coding genes from non-coding open reading frames. Extensive analyses show that the predictions are reliable and that the integrative algorithm is robust and convenient. Our analysis also indicates that over-annotated protein-coding genes can cause false positives in the detection of horizontal gene transfer. The web server of the proposed algorithm is freely accessible at www.cbi.seu.edu.cn/RPGM.
Affiliation(s)
- Jia-Feng Yu
- State Key Laboratory of Bioelectronics, School of Biological Science and Medical Engineering, Southeast University, Nanjing, China.
8
Al-Khatib RM, Rashid NAA, Abdullah R. Thermodynamic Heuristics with Case-Based Reasoning: Combined Insights for RNA Pseudoknot Secondary Structure. J Biomol Struct Dyn 2011; 29:1-26. [DOI: 10.1080/07391102.2011.10507373] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/28/2022]
9
Chang G, Wang T. Weighted relative entropy for alignment-free sequence comparison based on Markov model. J Biomol Struct Dyn 2011; 28:545-55. [PMID: 21142223 DOI: 10.1080/07391102.2011.10508594] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Indexed: 01/01/2023]
Abstract
In this paper, we introduce a probabilistic measure for computing the similarity between two biological sequences without alignment. The similarity measure is based on the Kullback-Leibler divergence between two Markov models constructed from the sequences. We first validate the method by clustering nine chromosomes from three species. Second, we report the results of a similarity search based on the new method. Finally, we apply the measure to the construction of a phylogenetic tree of 48 HEV genome sequences. Our results indicate that the weighted relative entropy is an efficient and powerful alignment-free measure for the analysis of sequences at the genomic scale.
Affiliation(s)
- Guisong Chang
- School of Mathematical Sciences, Dalian University of Technology, Dalian 116024, PR China.
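The abstract of this entry describes a similarity measure built on the Kullback-Leibler divergence between Markov models of two sequences. The sketch below shows one generic way to do this: fit a fixed-order Markov model to each sequence, then average the per-context KL divergence weighted by how often each context occurs. The weighting scheme, the pseudocount, and all function names are assumptions made for illustration; the exact definition of the weighted relative entropy in the paper is not given in the abstract.

```python
import math
from collections import defaultdict
from itertools import product

ALPHABET = "ACGT"


def markov_model(seq, order=1, pseudocount=1.0):
    """Estimate context weights and smoothed transition probabilities.

    Assumption: pseudocount > 0, so every transition probability is nonzero.
    """
    seq = seq.upper()
    counts = defaultdict(lambda: defaultdict(float))
    for i in range(len(seq) - order):
        context, nxt = seq[i:i + order], seq[i + order]
        if set(context + nxt) <= set(ALPHABET):
            counts[context][nxt] += 1.0
    contexts = ["".join(p) for p in product(ALPHABET, repeat=order)]
    total = sum(sum(counts[c].values()) for c in contexts)
    weights, trans = {}, {}
    for c in contexts:
        n_c = sum(counts[c].values())
        weights[c] = n_c / total if total else 0.0
        denom = n_c + pseudocount * len(ALPHABET)
        trans[c] = {a: (counts[c][a] + pseudocount) / denom for a in ALPHABET}
    return weights, trans


def relative_entropy(p_model, q_model):
    """Context-weighted KL divergence of model P from model Q."""
    weights, p_trans = p_model
    _, q_trans = q_model
    return sum(
        w * p_trans[c][a] * math.log(p_trans[c][a] / q_trans[c][a])
        for c, w in weights.items()
        for a in ALPHABET
    )


def symmetric_distance(seq1, seq2, order=1):
    """Symmetrized divergence, usable as an alignment-free dissimilarity."""
    m1, m2 = markov_model(seq1, order), markov_model(seq2, order)
    return 0.5 * (relative_entropy(m1, m2) + relative_entropy(m2, m1))


if __name__ == "__main__":
    print(symmetric_distance("ACGTACGTACGT", "AAAAAAAAAAAA"))
```

A pairwise matrix of such distances could feed a standard clustering or tree-building method; the symmetrization used here is one common choice, not necessarily the one in the paper.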
10
Cao J, Shi F, Liu X, Jia J, Zeng J, Huang G. Genome-wide identification and evolutionary analysis of Arabidopsis sm genes family. J Biomol Struct Dyn 2011; 28:535-44. [PMID: 21142222 DOI: 10.1080/07391102.2011.10508593] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.1] [Indexed: 10/28/2022]
Abstract
Sm proteins are members of a family of small proteins that are widespread in the biosphere and associated with RNA metabolism. To our knowledge, only the Arabidopsis SAD1 gene has been studied functionally in plants to date. In this study, 42 Sm genes are identified through comprehensive analysis in Arabidopsis, and a complete overview of this gene family is presented, including gene structures, phylogeny, chromosome locations, selection pressure and expression. The results reveal that gene duplication contributes to the expansion of the Sm gene family in the Arabidopsis genome; diverse expression patterns suggest functional differentiation, and divergence analysis indicates that purifying selection plays a key role in evolution. Our comparative genomics analysis of Sm genes provides a first step towards future experimental research on determining the functions of these genes.
Affiliation(s)
- Jun Cao
- Institute of Life Science, Jiangsu University, Xuefu Road 301, Zhenjiang 212013, Jiangsu, PR China.
11
Das S, Mitra S, Sahoo S, Chakrabarti J. Novel Hybrid Encodes both Continuous and Split tRNA Genes? J Biomol Struct Dyn 2011; 28:827-31. [DOI: 10.1080/07391102.2011.10508610] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 10/28/2022]
12
Liu X, Dai Q, Li L, He Z. An efficient binomial model-based measure for sequence comparison and its application. J Biomol Struct Dyn 2011; 28:833-43. [PMID: 21294594 DOI: 10.1080/07391102.2011.10508611] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/28/2022]
Abstract
Sequence comparison is one of the major tasks in bioinformatics and can serve as evidence of structural and functional conservation, as well as of evolutionary relations. There are several similarity/dissimilarity measures for sequence comparison, but challenges remain. This paper presents a binomial model-based measure to analyze biological sequences. With the help of a random indicator, the occurrence of a word at any position of a sequence can be regarded as a Bernoulli random variable, and the sum of the word occurrences is well known to follow a binomial distribution. Using a recursive formula, we computed the binomial probability of the word count and proposed a binomial model-based measure based on the relative entropy. The proposed measure was tested by extensive experiments, including classification of HEV genotypes and phylogenetic analysis, and further compared with alignment-based and alignment-free measures. The results demonstrate that the proposed measure based on the binomial model is more efficient.
Affiliation(s)
- Xiaoqing Liu
- School of Science, Hangzhou Dianzi University, Hangzhou 310018, People's Republic of China
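The abstract of this entry treats each position of a sequence as a Bernoulli trial for a given word, so that the word count follows a binomial distribution whose probabilities are computed recursively. The sketch below illustrates that idea with the textbook recursion P(X = i + 1) = P(X = i) · (n − i)/(i + 1) · p/(1 − p). Whether this is the same recursion used in the paper is an assumption, and the function names and the word probability `p_word` are hypothetical inputs. Treating overlapping word positions as independent trials is a simplification.

```python
def binomial_pmf_recursive(n, p):
    """Return [P(X=0), ..., P(X=n)] for X ~ Binomial(n, p), with 0 < p < 1."""
    if n < 0 or not 0.0 < p < 1.0:
        raise ValueError("need n >= 0 and 0 < p < 1")
    probs = [(1.0 - p) ** n]  # P(X = 0)
    ratio = p / (1.0 - p)
    for i in range(n):
        # Step from P(X = i) to P(X = i + 1) without computing factorials.
        probs.append(probs[-1] * (n - i) / (i + 1) * ratio)
    return probs


def word_count_probability(seq, word, p_word):
    """Count overlapping occurrences of `word` and return (count, probability).

    Assumption: each of the n = len(seq) - len(word) + 1 start positions is an
    independent Bernoulli trial with success probability `p_word`.
    """
    n = len(seq) - len(word) + 1
    if n <= 0:
        raise ValueError("word is longer than the sequence")
    count = sum(seq[i:i + len(word)] == word for i in range(n))
    return count, binomial_pmf_recursive(n, p_word)[count]


if __name__ == "__main__":
    # Under a uniform base model, a dinucleotide has probability 1/16.
    print(word_count_probability("ACGTACGT", "AC", 1 / 16))
```

Per-word probabilities of this kind could then be combined across all words of a fixed length into an entropy-based dissimilarity; how the paper does so is not specified in the abstract.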
13
Zhang Y, Chen W. A Measure of DNA Sequence Dissimilarity Based on Free Energy of Nearest-neighbor Interaction. J Biomol Struct Dyn 2011; 28:557-65. [DOI: 10.1080/07391102.2011.10508595] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 12/17/2022]