1
|
Consistent Clustering Pattern of Prokaryotic Genes Based on Base Frequency at the Second Codon Position and its Association with Functional Category Preference. Interdiscip Sci 2022; 14:349-357. [PMID: 34817803 PMCID: PMC9124167 DOI: 10.1007/s12539-021-00493-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/09/2021] [Revised: 11/02/2021] [Accepted: 11/07/2021] [Indexed: 10/26/2022]
Abstract
In 2002, our research group observed a gene clustering pattern based on the base frequency of A versus T at the second codon position in the genome of Vibrio cholerae and found that the functional category distribution of genes in the two clusters was different. With the availability of a large number of sequenced genomes, we performed a systematic investigation of A2–T2 distribution and found that 2694 out of 2764 prokaryotic genomes have an optimal clustering number of two, indicating a consistent pattern. Analysis of the functional categories of the coding genes in each cluster in 1483 prokaryotic genomes indicated that 99.33% of the genomes exhibited a significant difference (p < 0.01) in function distribution between the two clusters. Specifically, functional category P was overrepresented in the small cluster of 98.65% of genomes, whereas categories J, K, and L were overrepresented in the larger cluster of over 98.52% of genomes. Lineage analysis uncovered that these preferences appear consistently across all phyla. Overall, our work revealed an almost universal clustering pattern based on the relative frequency of A2 versus T2 and its role in functional category preference. These findings will promote the understanding of the rationality of theoretical prediction of functional classes of genes from their nucleotide sequences and how protein function is determined by DNA sequence.
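As a rough illustration of splitting genes into two clusters by A2 versus T2, the sketch below runs a plain two-means procedure on one number per gene, the difference between the A and T frequencies at the second codon position. The clustering method and the cluster-number criterion used in the paper are not described in this abstract, so the algorithm, the function name `two_means_1d`, and the sample values are all assumptions made for illustration.

```python
def two_means_1d(values, iterations=100):
    """Split 1-D values into two groups with a basic two-means loop.

    Illustrative only: not the procedure used in the cited study.
    """
    lo, hi = min(values), max(values)
    if lo == hi:
        raise ValueError("all values are identical; nothing to split")
    centers = [lo, hi]  # start from the two extremes
    groups = [[], []]
    for _ in range(iterations):
        groups = [[], []]
        for v in values:
            # assign each value to the nearer center (ties go to the lower one)
            nearer = 0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
            groups[nearer].append(v)
        new_centers = [sum(g) / len(g) for g in groups]
        if new_centers == centers:  # assignments have stabilised
            break
        centers = new_centers
    return centers, groups


# Hypothetical per-gene values of (A2 frequency - T2 frequency).
a2_minus_t2 = [0.10, 0.12, 0.08, -0.15, -0.20]
centers, groups = two_means_1d(a2_minus_t2)
```

With these made-up values the smaller group is the one with negative A2 − T2, that is, genes with more T than A at the second position.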
|
2
|
Analysis of codon usage patterns in Ginkgo biloba reveals codon usage tendency from A/U-ending to G/C-ending. Sci Rep 2016; 6:35927. [PMID: 27808241 PMCID: PMC5093902 DOI: 10.1038/srep35927] [Citation(s) in RCA: 58] [Impact Index Per Article: 7.3] [Received: 04/28/2016] [Accepted: 10/07/2016] [Indexed: 11/08/2022] Open
Abstract
Ginkgo biloba is one of the most ancient tree species, and analysis of its codon usage pattern is a useful way to understand its evolutionary and genetic mechanisms. Several studies have been conducted on angiosperms, but few on gymnosperms. Based on RNA-Seq data of the G. biloba transcriptome, a total of 17,579 unigenes longer than 300 bp were selected from 68,547 candidates and analyzed. The codon usage pattern tended towards more frequent use of A/U-ending codons, which showed an obvious gradient progressing from gymnosperms to dicots to monocots. Meanwhile, analysis of high/low-expression unigenes revealed that high-expression unigenes tended to use G/C-ending codons together with more codon usage bias. Variation of unigenes with different functions suggested that unigenes involved in environment adaptation use G/C-ending codons more frequently with more usage bias, and these results were consistent with the conclusion that the formation of G. biloba codon usage bias was dominated by natural selection.
|
3
|
Pan LL, Wang Y, Hu JH, Ding ZT, Li C. Analysis of codon use features of stearoyl-acyl carrier protein desaturase gene in Camellia sinensis. J Theor Biol 2013; 334:80-6. [PMID: 23774066 DOI: 10.1016/j.jtbi.2013.06.006] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.9] [Received: 03/27/2013] [Revised: 06/03/2013] [Accepted: 06/06/2013] [Indexed: 11/19/2022]
Abstract
The stearoyl-acyl carrier protein desaturase (SAD) gene exists widely in plants. In this paper, the Camellia sinensis SAD gene (CsSAD) sequence was first analyzed with the online Codon W, CHIPS, and CUSP programs, and then compared with genomes of the tea plant and other species and with SAD genes from 11 plant species. The results show that the CsSAD gene and the 73 selected C. sinensis genes have similar codon usage bias. The CsSAD gene is biased toward synonymous codons with A and T at the third codon position, as are the 73 C. sinensis genes. The differences in codon usage frequency between the CsSAD gene and dicotyledons such as Arabidopsis thaliana and Nicotiana tabacum are smaller than those between the CsSAD gene and monocotyledons such as Triticum aestivum and Zea mays. Therefore, A. thaliana and N. tabacum expression systems may be more suitable for the expression of the CsSAD gene. The analysis of SAD genes from 12 plant species also shows that most of the SAD genes are biased toward synonymous codons with G and C at the third codon position. We believe that the codon usage bias analysis presented in this study will be essential for providing a theoretical basis for discussing the structure and function of the CsSAD gene.
Affiliation(s)
- Lu-Lu Pan
- Tea Research Institute, Qingdao Agricultural University, Changcheng Road 700#, Chengyang District, Qingdao, Shandong 266109, China.
|
4
|
Guo XL, Wang Y, Yang LC, Ding ZT. [Analysis of codon use features of CBF gene in Camellia sinensis]. YI CHUAN = HEREDITAS 2012; 34:1614-1623. [PMID: 23262110 DOI: 10.3724/sp.j.1005.2012.01614] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 06/01/2023]
Abstract
The CBF (C-repeat-binding factor) transcription factor exists widely in plants. It is an important regulatory factor in plant resistance to adversity. In this paper, the Camellia sinensis CBF1 gene sequence was analyzed with the online Codon W, CHIPS, and CUSP programs, and then compared with C. sinensis genes, genomes of other species, and CBF genes from 39 plant species. It is important to identify the codon usage of the CsCBF1 gene and select appropriate expression systems. The results showed that the CsCBF1 gene and the 70 selected C. sinensis genes had distinct usage differences. The CsCBF1 gene was biased toward synonymous codons with G and C at the third codon position, but the 70 C. sinensis genes were biased toward synonymous codons with A and T. The differences in codon usage frequency between the CsCBF1 gene and dicotyledons such as Arabidopsis thaliana and Nicotiana tabacum were smaller than those between the CsCBF1 gene and monocotyledons such as wheat (Triticum aestivum) and corn (Zea mays). Therefore, A. thaliana and N. tabacum expression systems may be more suitable for the expression of the CsCBF1 gene. The analysis of CBF genes from 40 plant species also showed that most of the CBF genes were biased toward synonymous codons with G and C at the third codon position. This phenomenon is possibly due to the special functions of these genes.
Affiliation(s)
- Xiu-Li Guo
- Tea Research Institute, Qingdao Agricultural University, Qingdao 266109, China.
|
5
|
O'Connell MJ, Doyle AM, Juenger TE, Donoghue MTA, Keshavaiah C, Tuteja R, Spillane C. In Arabidopsis thaliana codon volatility scores reflect GC3 composition rather than selective pressure. BMC Res Notes 2012; 5:359. [PMID: 22805311 PMCID: PMC3502101 DOI: 10.1186/1756-0500-5-359] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Received: 03/13/2012] [Accepted: 07/17/2012] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Synonymous codon usage bias has typically been correlated with, and attributed to translational efficiency. However, there are other pressures on genomic sequence composition that can affect codon usage patterns such as mutational biases. This study provides an analysis of the codon usage patterns in Arabidopsis thaliana in relation to gene expression levels, codon volatility, mutational biases and selective pressures. RESULTS We have performed synonymous codon usage and codon volatility analyses for all genes in the A. thaliana genome. In contrast to reports for species from other kingdoms, we find that neither codon usage nor volatility are correlated with selection pressure (as measured by dN/dS), nor with gene expression levels on a genome wide level. Our results show that codon volatility and usage are not synonymous, rather that they are correlated with the abundance of G and C at the third codon position (GC3). CONCLUSIONS Our results indicate that while the A. thaliana genome shows evidence for synonymous codon usage bias, this is not related to the expression levels of its constituent genes. Neither codon volatility nor codon usage are correlated with expression levels or selective pressures but, because they are directly related to the composition of G and C at the third codon position, they are the result of mutational bias. Therefore, in A. thaliana codon volatility and usage do not result from selection for translation efficiency or protein functional shift as measured by positive selection.
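GC3, the quantity this abstract relates to codon usage and volatility, can be sketched as the fraction of G or C among third codon positions of a coding sequence. The function name `gc3` and the example sequence are hypothetical. This simple version counts every codon; some definitions (often written GC3s) exclude Met, Trp and stop codons, and the abstract does not say which variant the authors used.

```python
def gc3(cds: str) -> float:
    """Fraction of G or C at the third codon position of an in-frame CDS."""
    seq = cds.upper()
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    third_bases = [seq[i + 2] for i in range(0, usable, 3)]
    valid = [b for b in third_bases if b in "ACGT"]  # skip ambiguous bases
    if not valid:
        raise ValueError("sequence contains no complete unambiguous codon")
    return sum(b in "GC" for b in valid) / len(valid)


# Hypothetical sequence: codons ATG GCC AAA TTT GGG, third bases G C A T G.
example_gc3 = gc3("ATGGCCAAATTTGGG")
```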
Affiliation(s)
- Mary J O'Connell
- Bioinformatics and Molecular Evolution Group, School of Biotechnology, Dublin City University, Dublin 9, Ireland
|
6
|
Ma J, Nguyen MN, Rajapakse JC. Gene classification using codon usage and support vector machines. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2009; 6:134-143. [PMID: 19179707 DOI: 10.1109/tcbb.2007.70240] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Indexed: 05/27/2023]
Abstract
A novel approach for gene classification is proposed, which adopts codon usage bias as the input feature vector for classification by support vector machines (SVM). The DNA sequence is first converted to a 59-dimensional feature vector where each element corresponds to the relative synonymous usage frequency of a codon. As the input to the classifier is independent of sequence length and variance, our approach is useful when the sequences to be classified are of different lengths, a condition under which homology-based methods tend to fail. The method is demonstrated by using 1,841 Human Leukocyte Antigen (HLA) sequences which are classified into two major classes: HLA-I and HLA-II; each major class is further subdivided into sub-groups of HLA-I and HLA-II molecules. Using codon usage frequencies, binary SVM achieved an accuracy rate of 99.3% for HLA major class classification and multi-class SVM achieved accuracy rates of 99.73% and 98.38% for sub-class classification of HLA-I and HLA-II molecules, respectively. The results show that gene classification based on codon usage bias is consistent with the molecular structures and biological functions of HLA molecules.
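The 59-dimensional feature vector mentioned above can be sketched as follows, assuming the standard genetic code and the usual RSCU definition (observed codon count divided by the mean count of the synonymous codons for that amino acid). The 59 codons are the 64 codons minus the three stops and the single-codon amino acids Met and Trp. The function name `rscu_vector`, the choice of 0.0 for amino acids absent from the sequence, and the example sequence are assumptions; the abstract does not give the authors' exact formula, and the SVM step is not shown.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Group synonymous codons, skipping stops (*), Met (M) and Trp (W).
FAMILIES = defaultdict(list)
for codon, aa in CODON_TABLE.items():
    if aa not in "*MW":
        FAMILIES[aa].append(codon)


def rscu_vector(cds: str) -> dict:
    """Map each of the 59 degenerate sense codons to its RSCU value."""
    seq = cds.upper()
    usable = len(seq) - len(seq) % 3
    counts = Counter(seq[i:i + 3] for i in range(0, usable, 3))
    vector = {}
    for aa, synonyms in FAMILIES.items():
        total = sum(counts[c] for c in synonyms)
        for c in synonyms:
            # Assumption: amino acids absent from the sequence get RSCU 0.0.
            vector[c] = len(synonyms) * counts[c] / total if total else 0.0
    return vector


# Hypothetical sequence with four alanine codons: GCT GCT GCC GCA.
features = rscu_vector("GCTGCTGCCGCA")
```

Because each value is normalised within its amino acid family, the vector has the same length for any input sequence, which is the property the abstract relies on when classifying sequences of different lengths.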
Affiliation(s)
- Jianmin Ma
- BioInformatics Research Center, Nanyang Technological University, Singapore 637553.
|
7
|
Incorporating PCA and fuzzy-ART techniques into achieve organism classification based on codon usage consideration. Comput Biol Med 2008; 38:886-93. [DOI: 10.1016/j.compbiomed.2008.05.007] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.5] [Received: 11/06/2006] [Revised: 03/26/2008] [Accepted: 05/19/2008] [Indexed: 11/20/2022]
|
8
|
|
9
|
Hanada K, Zhang X, Borevitz JO, Li WH, Shiu SH. A large number of novel coding small open reading frames in the intergenic regions of the Arabidopsis thaliana genome are transcribed and/or under purifying selection. Genome Res 2007; 17:632-40. [PMID: 17395691 PMCID: PMC1855179 DOI: 10.1101/gr.5836207] [Citation(s) in RCA: 126] [Impact Index Per Article: 7.4] [Indexed: 11/24/2022]
Abstract
Large-scale cDNA sequencing projects and tiling array studies have revealed the presence of many unannotated genes. For protein coding genes, small coding sequences may not be identified by gene finders because of the conservative nature of prediction algorithms. In this study, we identified small open reading frames (sORFs) with high coding potential by a simple gene finding method (Coding Index, CI) based on the nucleotide composition bias found in most coding sequences. Applying this method to 18 Arabidopsis thaliana and 84 yeast sORF genes with evidence of expression at the protein level gives 100% accurate prediction. In the A. thaliana genome, we identified 7159 sORFs that are likely coding sequences (coding sORFs) with the CI measure at the 1% false-positive rate. To determine if these coding sORFs are parts of functional genes, we evaluated each coding sORF for evidence of transcription or evolutionary conservation. At the 5% false-positive rate, we found that 2996 coding sORFs are likely expressed in at least one experimental condition of the A. thaliana tiling array data. In addition, the evolutionary conservation of each A. thaliana sORF was examined within A. thaliana or between A. thaliana and five plants with complete or partial genome sequences. In 3997 coding sORFs with readily identifiable homologous sequences, 2376 are subject to purifying selection at the 1% false-positive rate. After eliminating coding sORFs with similarity to known transposable elements and those that are likely missing exons of known genes, the remaining 3241 coding sORFs with either evidence of transcription or purifying selection likely belong to novel coding genes in the A. thaliana genome.
Affiliation(s)
- Kousuke Hanada
- Department of Plant Biology, Michigan State University, East Lansing, Michigan 48824, USA
- Department of Ecology and Evolution, University of Chicago, Chicago, Illinois 60637, USA
- Xu Zhang
- Department of Ecology and Evolution, University of Chicago, Chicago, Illinois 60637, USA
- Justin O. Borevitz
- Department of Ecology and Evolution, University of Chicago, Chicago, Illinois 60637, USA
- Wen-Hsiung Li
- Department of Ecology and Evolution, University of Chicago, Chicago, Illinois 60637, USA
- Shin-Han Shiu
- Department of Plant Biology, Michigan State University, East Lansing, Michigan 48824, USA
- Corresponding author. E-mail ; fax (517) 353-7244
|
10
|
Abstract
Background Synonymous codon usage varies widely between genomes, and also between genes within genomes. Although there is now a large body of data on variations in codon usage, it is still not clear if the observed patterns reflect the effects of positive Darwinian selection acting at the level of translational efficiency or whether these patterns are due simply to the effects of mutational bias. In this study, we have included both intra-genomic and inter-genomic comparisons of codon usage. This allows us to distinguish more efficiently between the effects of nucleotide bias and translational selection. Results We show that there is an extreme degree of heterogeneity in codon usage patterns within the rice genome, and that this heterogeneity is highly correlated with differences in nucleotide content (particularly GC content) between the genes. In contrast to the situation observed within the rice genome, Arabidopsis genes show relatively little variation in both codon usage and nucleotide content. By exploiting a combination of intra-genomic and inter-genomic comparisons, we provide evidence that the differences in codon usage among the rice genes reflect a relatively rapid evolutionary increase in the GC content of some rice genes. We also noted that the degree of codon bias was negatively correlated with gene length. Conclusion Our results show that mutational bias can cause a dramatic evolutionary divergence in codon usage patterns within a period of approximately two hundred million years. The heterogeneity of codon usage patterns within the rice genome can be explained by a balance between genome-wide mutational biases and negative selection against these biased mutations. The strength of the negative selection is proportional to the length of the coding sequences. Our results indicate that the large variations in synonymous codon usage are not related to selection acting on the translational efficiency of synonymous codons.
Affiliation(s)
- Huai-Chun Wang
- Department of Mathematics and Statistics, Dalhousie University, Halifax, Nova Scotia, B3H 2G1, Canada
- Donal A Hickey
- Department of Biology, Concordia University, 7141 Sherbrooke West, Montréal, Québec, H4B 1R6, Canada
|
11
|
Mathé C, Sagot MF, Schiex T, Rouzé P. Current methods of gene prediction, their strengths and weaknesses. Nucleic Acids Res 2002; 30:4103-17. [PMID: 12364589 PMCID: PMC140543 DOI: 10.1093/nar/gkf543] [Citation(s) in RCA: 209] [Impact Index Per Article: 9.5] [Received: 05/28/2002] [Revised: 08/07/2002] [Accepted: 08/07/2002] [Indexed: 11/14/2022] Open
Abstract
While the genomes of many organisms have been sequenced over the last few years, transforming such raw sequence data into knowledge remains a hard task. A great number of prediction programs have been developed that try to address one part of this problem, which consists of locating the genes along a genome. This paper reviews the existing approaches to predicting genes in eukaryotic genomes and underlines their intrinsic advantages and limitations. The main mathematical models and computational algorithms adopted are also briefly described and the resulting software classified according to both the method and the type of evidence used. Finally, the several difficulties and pitfalls encountered by the programs are detailed, showing that improvements are needed and that new directions must be considered.
Affiliation(s)
- Catherine Mathé
- Institut de Pharmacologie et Biologie Structurale, UMR 5089, 205 route de Narbonne, F-31077 Toulouse Cedex, France.
|
12
|
Ma J, Zhou T, Gu W, Sun X, Lu Z. Cluster analysis of the codon use frequency of MHC genes from different species. Biosystems 2002; 65:199-207. [PMID: 12069729 DOI: 10.1016/s0303-2647(02)00016-3] [Citation(s) in RCA: 46] [Impact Index Per Article: 2.1] [Indexed: 10/27/2022]
Abstract
The relative synonymous codon use frequency of 135 MHC genes from four mammal species (Homo sapiens, Pan troglodytes, Macaca mulatta and Rattus norvegicus) is analyzed using a hierarchical cluster method. The result suggests that gene function is the dominant factor that determines codon usage bias, while species is a minor factor that determines further difference in codon usage bias for genes with similar functions. The conclusion may be useful in gene classification and gene function prediction.
Affiliation(s)
- Jianmin Ma
- Chien-Shiung Wu Laboratory, Southeast University, 210096 Jiangsu Province, Nanjing, People's Republic of China.
|
13
|
Wang J, Guo FB. Base frequencies at the second codon position of Vibrio cholerae genes connect with protein function. Biochem Biophys Res Commun 2002; 290:81-4. [PMID: 11779136 DOI: 10.1006/bbrc.2001.6174] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.2] [Indexed: 12/14/2022]
Abstract
In this paper, the base frequency at the second codon position of the 3839 open reading frames (ORFs) in the Vibrio cholerae genome is analyzed. It is shown that according to the base content at this codon site, the ORFs can be divided into two clusters, containing 673 and 3166 ORFs, respectively. ORFs in the smaller cluster usually have a significantly higher frequency of T than of A at the second codon position. For the two clusters of ORFs, there are significant differences in the frequencies of 18 of the 20 amino acids in the encoded proteins. The two clusters of ORFs are also significantly different in their functions. More than half of the known genes involved in transport and binding are included in the smaller cluster, while few genes involved in amino acid biosynthesis, protein synthesis, and so on are included in this cluster.
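The per-ORF quantity this abstract is built on, the base frequency at the second codon position, can be computed with a few lines. This is a minimal sketch: the function name `second_position_frequencies` and the example ORF are hypothetical, and the stop codon is counted along with the other codons for simplicity, which may differ from the authors' procedure.

```python
from collections import Counter


def second_position_frequencies(orf: str) -> dict:
    """Relative frequency of A, C, G and T at the second codon position."""
    seq = orf.upper()
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    second_bases = [seq[i + 1] for i in range(0, usable, 3)]
    counts = Counter(b for b in second_bases if b in "ACGT")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no complete unambiguous codon")
    return {base: counts[base] / total for base in "ACGT"}


# Hypothetical ORF: codons ATG CTG TTT GTT ATC TAA, second bases T T T T T A.
freqs = second_position_frequencies("ATGCTGTTTGTTATCTAA")
t_minus_a = freqs["T"] - freqs["A"]
```

In the standard genetic code, codons with T at the second position encode the hydrophobic residues Phe, Leu, Ile, Met and Val, so a high T2 frequency goes with a hydrophobic amino acid composition. That fits the abstract's observation that transport and binding genes dominate the smaller cluster, although the abstract itself does not state this explanation.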
Affiliation(s)
- Ju Wang
- Department of Physics, Tianjin University, Tianjin 300072, China.
|
14
|
Fadiel A, Lithwick S, Wanas MQ, Cuticchia AJ. Influence of intercodon and base frequencies on codon usage in filarial parasites. Genomics 2001; 74:197-210. [PMID: 11386756 DOI: 10.1006/geno.2001.6531] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.6] [Indexed: 12/16/2022]
Abstract
Base frequency, codon usage, and intercodon identity were analyzed in five filarial parasite species representing five Onchocercidae genera. Wuchereria bancrofti, Brugia malayi, Onchocerca volvulus, Acanthocheilonema viteae, and Dirofilaria immitis gene sequences were downloaded from NCBI, and analysis was performed using locally designed computer programs and other freely available applications. A clear sequence bias was observed among the nematode species examined. At the nucleotide level, AT basepairs were present in gene sequences at higher frequencies than GC. In addition, codons ending in A or T were used proportionately more than those with G or C in the third-codon position. Furthermore, the amino acids used most often corresponded to codons ending in AT basepairs. Intercodon base proportion was biased in that A was found most often at N4, second only to T in certain specific cases. Since all of these sequence biases were observed in a relatively consistent fashion among all of the organisms studied, we conclude that sequence bias is a genetic characteristic, which is associated with multiple filarial genera.
Affiliation(s)
- A Fadiel
- Bioinformatics Supercomputing Centre, The Hospital for Sick Children, Toronto, Ontario M5G 1Z8, Canada.
|
15
|
Abstract
The codon usage in the Vibrio cholerae genome is analyzed in this paper. Although there are many more genes on chromosome 1 than on chromosome 2, the codon usage patterns of genes on the two chromosomes are quite similar, indicating that the two chromosomes may have coexisted in the same cell for a very long time. Unlike the base frequency pattern observed in other genomes, the G+C content at the third codon position of the V. cholerae genome varies within a rather small interval. The most notable feature of codon usage in the V. cholerae genome is that a fraction of genes shows significant bias in base choice at the second codon position. The 2,006 known genes can be classified into two clusters according to the base frequencies at this position. The smaller cluster contains 227 genes, most of which code for proteins involved in transport and binding functions. The encoded products of these genes have a significant bias in amino acid composition compared with other genes. The codon usage patterns of the 1,836 ORFs of unknown function are also analyzed, which is useful for studying their functions.
Affiliation(s)
- J Wang
- Department of Physics, Tianjin University, China
|
16
|
Wang HC, Badger J, Kearney P, Li M. Analysis of codon usage patterns of bacterial genomes using the self-organizing map. Mol Biol Evol 2001; 18:792-800. [PMID: 11319263 DOI: 10.1093/oxfordjournals.molbev.a003861] [Citation(s) in RCA: 37] [Impact Index Per Article: 1.6] [Indexed: 11/12/2022] Open
Abstract
Codon usage varies both between organisms and between different genes in the same organism. This observation has been used as a basis for earlier work in identifying highly expressed and horizontally transferred genes in Escherichia coli. In this work, we applied Kohonen's self-organizing map to analysis of the codon usage pattern of the Escherichia coli, Aquifex aeolicus, Archaeoglobus fulgidus, Haemophilus influenzae Rd, Methanococcus jannaschii, Methanobacterium thermoautotrophicum, and Pyrococcus horikoshii genomes for evidence of highly expressed genes and horizontally transferred genes. All of the analyzed genomes had a clear category of horizontally transferred genes, and their apparent percentages ranged from 7.7% to 21.4%. The apparent percentage of highly expressed genes ranges from 0% to 11.8%. A clustering of average codon usage of main gene categories of the seven genomes showed an interesting mixing of gene classes in four thermophilic/hyperthermophilic organisms, A. aeolicus, A. fulgidus, M. thermoautotrophicum, and P. horikoshii, which suggests possible origins of their horizontally transferred genes as well as the need for adaptation to a specific environment. Further classification of the three gene categories in E. coli and H. influenzae according to gene function revealed that genes involved in communication (such as regulation and cell process) and structure (cell structure and structural proteins) are more likely to be horizontally transferred than are genes involved in information (transcription, translation, and related processes) and in some groups of energy (such as energy metabolism and carbon compound catabolism).
Affiliation(s)
- H C Wang
- Department of Computer Science, University of Waterloo, Waterloo, Ontario, Canada.
|
17
|
Biochemical Genetics. Biochemistry 2001. [DOI: 10.1016/b978-012492543-4/50029-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/18/2022]
|
18
|
Mathé C, Déhais P, Pavy N, Rombauts S, Van Montagu M, Rouzé P. Gene prediction and gene classes in Arabidopsis thaliana. J Biotechnol 2000; 78:293-9. [PMID: 10751690 DOI: 10.1016/s0168-1656(00)00196-6] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.3] [Indexed: 11/25/2022]
Abstract
Gene prediction methods for eukaryotic genomes are still not fully satisfactory. One way to improve gene prediction accuracy, proven to be relevant for prokaryotes, is to consider more than one model of genes. Thus, we used our classification of Arabidopsis thaliana genes into two classes (CU(1) and CU(2)), previously delineated according to statistical features, in the GeneMark gene identification program. For each gene class, as well as for the two classes combined, a Markov model was developed (respectively, GM-CU(1), GM-CU(2) and GM-all) and then used on a test set of 168 genes to compare their respective efficiency. We concluded from this analysis that GM-CU(1) is more sensitive than GM-CU(2), which seems to be more specific to a gene type. Moreover, GM-all does not give better results than GM-CU(1), and combining results from GM-CU(1) and GM-CU(2) greatly improves prediction efficiency in comparison with predictions made with GM-all only. Thus, this work confirms the necessity to consider more than one gene model for gene prediction in eukaryotic genomes, and to look for gene classes in order to build these models.
Affiliation(s)
- C Mathé
- Laboratorium voor Genetica, Department of Genetics, Flanders Interuniversity Institute for Biotechnology (VIB), Universiteit Gent, B-9000, Gent, Belgium
|