1
Ludwig A, Krieger MA. Genomic and phylogenetic evidence of VIPER retrotransposon domestication in trypanosomatids. Mem Inst Oswaldo Cruz 2016; 111:765-769. [PMID: 27849219 PMCID: PMC5146736 DOI: 10.1590/0074-02760160224] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.0] [Received: 05/24/2016] [Accepted: 08/25/2016] [Indexed: 12/02/2022]
Abstract
Transposable elements are important residents of eukaryotic genomes, and the host can eventually domesticate them to serve cellular functions. We report here a possible domestication event of the vestigial interposed retroelement (VIPER) in trypanosomatids. We found a large gene in a syntenic location in Leishmania braziliensis, L. panamensis, Leptomonas pyrrhocoris, and Crithidia fasciculata whose products share similarity in the C-terminal portion with the third protein of VIPER. No remnants of other VIPER regions surrounding the gene sequence were found. We hypothesise that the domestication event occurred more than 50 mya, and the conservation of this gene suggests it might perform some function in the host species.
Affiliation(s)
- Adriana Ludwig
- Fundação Oswaldo Cruz, Instituto Carlos Chagas, Laboratório de Genômica Funcional, Curitiba, PR, Brasil; Instituto de Biologia Molecular do Paraná, Curitiba, PR, Brasil
- Marco Aurelio Krieger
- Fundação Oswaldo Cruz, Instituto Carlos Chagas, Laboratório de Genômica Funcional, Curitiba, PR, Brasil; Instituto de Biologia Molecular do Paraná, Curitiba, PR, Brasil
2
Wong TY, Schwartzbach SD. Protein Mis-Termination Initiates Genetic Diseases, Cancers, and Restricts Bacterial Genome Expansion. Journal of Environmental Science and Health. Part C, Environmental Carcinogenesis & Ecotoxicology Reviews 2015; 33:255-285. [PMID: 26087060 DOI: 10.1080/10590501.2015.1053461] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/04/2023]
Abstract
Protein termination is an important cellular process. Protein termination relies on the stop-codons in the mRNA interacting properly with the release factors on the ribosome. One third of inherited diseases, including cancers, are associated with the mutation of the stop-codons. Many pathogens and viruses are able to manipulate their stop-codons to express their virulence. The influence of stop-codons is not limited to the primary reading frame of the genes. Stop-codons in the second and third reading frames are referred to as premature stop signals (PSCs). Stop-codons and PSCs together are collectively referred to as stop-signals. The ratios of the stop-signals (referred to as the translation stop-signals ratio, or TSSR) of genetically related bacteria, despite their great differences in gene contents, are much alike. This nearly identical Genomic-TSSR value of genetically related bacteria may suggest that bacterial genome expansion is limited by their unique stop-signals bias. We review the protein termination process and the different types of stop-codon mutation in plants, animals, microbes, and viruses, with special emphasis on the role of PSCs in directing bacterial evolution in their natural environments. Knowing the limit of genomic boundary could facilitate the formulation of new strategies for controlling the spread of diseases and combating antibiotic-resistant bacteria.
Affiliation(s)
- Tit-Yee Wong
- Department of Biological Sciences, University of Memphis, Memphis, Tennessee, USA
3
Predicting statistical properties of open reading frames in bacterial genomes. PLoS One 2012; 7:e45103. [PMID: 23028785 PMCID: PMC3454372 DOI: 10.1371/journal.pone.0045103] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.4] [Received: 04/02/2012] [Accepted: 08/14/2012] [Indexed: 11/26/2022]
Abstract
An analytical model based on the statistical properties of Open Reading Frames (ORFs) of eubacterial genomes, such as codon composition and sequence length of all reading frames, was developed. This new model predicts the average length, maximum length as well as the length distribution of the ORFs of 70 species with GC contents varying between 21% and 74%. Furthermore, the number of annotated genes is predicted with high accuracy. However, the ORF length distribution in the five alternative reading frames shows interesting deviations from the predicted distribution. In particular, long ORFs appear more often than expected statistically. The unexpected depletion of stop codons in these alternative open reading frames cannot completely be explained by a biased codon usage in the +1 frame. While it is unknown if the stop codon depletion has a biological function, it could be due to a protein coding capacity of alternative ORFs exerting a selection pressure which prevents the fixation of stop codon mutations. The comparison of the analytical model with bacterial genomes, therefore, leads to a hypothesis suggesting novel gene candidates which can now be investigated in subsequent wet lab experiments.
4
5
Gatherer D. Evolution of the G+C Content Frontier in the Rat Cytomegalovirus Genome. Virology (Auckl) 2008. [DOI: 10.4137/vrt.s1023] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/05/2022]
Abstract
Within the 230138 bp of the rat cytomegalovirus (RCMV) genome, the G+C content changes abruptly at position 142644, constituting a G+C content frontier. To the left of this point, overall G+C content is 69.2%, and to the right it is only 47.6%. A region of extremely low G+C content (33.8%) is found in the 5 kb immediately to the right of the frontier, in which there are no predicted coding sequences. To the right of position 147501, the G+C content rises and predicted coding sequences reappear. However, these genes are much shorter (average 848 bp, 50% G+C) than those in the left two-thirds of the genome (average 1462 bp, 70% G+C). Whole genome alignment of several viruses indicates that the initial ultra-low G+C region appeared in the common ancestor of the genera Cytomegalovirus and Muromegalovirus, and that the lowering of G+C in the right third has been a subsequent process in the lineage leading to RCMV. The left two-thirds of RCMV has stop codon occurrences at 67.5% of their expected level, based on a modified Markov chain model of stop codon distribution, and the corresponding figure for the right third is 78%. Therefore, despite heavy mutation pressure, selective constraint has operated in the right third of the RCMV genome to maintain a degree of gene length unusual for such low G+C sequences.
Affiliation(s)
- Derek Gatherer
- MRC Virology Unit, Institute of Virology, University of Glasgow, Church Street, Glasgow, G11 5JR, U.K
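The two-sided G+C comparison described in this abstract can be illustrated with a short sketch. It is not the author's code: the function names `gc_content` and `gc_either_side` and the toy sequence are assumptions made only for illustration, and the split position is treated as a 0-based index.

```python
def gc_content(seq: str) -> float:
    """Fraction of G and C bases in seq (case-insensitive)."""
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    return (seq.count("G") + seq.count("C")) / len(seq)


def gc_either_side(seq: str, position: int) -> tuple:
    """G+C fraction to the left and to the right of a 0-based split position."""
    return gc_content(seq[:position]), gc_content(seq[position:])


# Toy sequence (hypothetical): GC-rich left part, AT-rich right part.
toy = "GGCCGG" + "ATAT"
left, right = gc_either_side(toy, 6)
print(f"left G+C = {left:.1%}, right G+C = {right:.1%}")
```

Applied to a real genome, the same two calls would be run on the full sequence with the candidate frontier coordinate converted to a 0-based index.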
6
Guo FB. The distribution patterns of bases of protein-coding genes, non-coding ORFs, and intergenic sequences in Pseudomonas aeruginosa PA01 genome and its implications. J Biomol Struct Dyn 2008; 25:127-33. [PMID: 17718591 DOI: 10.1080/07391102.2007.10507161] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 10/28/2022]
Abstract
The distribution patterns of bases of DNA fragments in different regions of the P. aeruginosa genome are analyzed in this paper. It is shown that 5565 protein-coding genes, 17315 non-coding ORFs, and 1104 intergenic sequences fall into seven clusters based on their base frequencies. Almost all the protein-coding genes are contained in one of the seven clusters. The significant difference of base frequencies among the three codon positions in a high-GC genome, which gives rise to the division between the distribution patterns of bases of the six reading frames of protein-coding genes, is responsible for the appearance of the clustering phenomenon. In the light of the clustering phenomenon, the author supposes that the antisense strand ORFs, particularly those corresponding to Frame 2' and Frame 3', may not code for proteins in the P. aeruginosa genome.
Affiliation(s)
- F-B Guo
- School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, China.
7
Quantitative determination of gene strand bias in prokaryotic genomes. Genomics 2007; 90:733-40. [DOI: 10.1016/j.ygeno.2007.07.010] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Received: 04/19/2007] [Revised: 07/09/2007] [Accepted: 07/23/2007] [Indexed: 11/19/2022]
8
Xu K, Ma BG. Comparative analysis of predicted gene expression among deep-sea genomes. Gene 2007; 397:136-42. [PMID: 17544603 DOI: 10.1016/j.gene.2007.04.023] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.6] [Received: 10/31/2006] [Revised: 04/08/2007] [Accepted: 04/20/2007] [Indexed: 11/18/2022]
Abstract
Deep-sea species live in an environment that is specifically characterized by extreme temperature and hydrostatic pressure. In this work, predicted highly expressed (PHX) genes are comparatively analyzed for six deep-sea microbes, which allows us to pinpoint the common highly expressed genes shared by them. The relationships between gene expression level and some basic properties such as genomic G + C content, optimal growth temperature (OGT), and environmental hydrostatic pressure of the six deep-sea species are also investigated. We find that the percentage of PHX genes out of a whole genome positively correlates to OGT for the deep-sea genomes, whereas such positive correlation seems not to exist between environmental hydrostatic pressure and percentage of PHX genes. Moreover, there exists a negative correlation between genomic G + C content and diversity of gene expression level for the deep-sea genomes, which is in sharp contrast to land-living microbes. We report the top 20 PHX genes for the six deep-sea genomes and find no common highly expressed genes shared by them except for ribosomal proteins, transcription factors, and translation factors. Our present work proffers a paradigm for studying the relationship between environmental factors and microbes' predicted gene expression level.
Affiliation(s)
- Ke Xu
- College of Mathematics and Information Science, Shandong University of Technology, Zibo, PR China
9
Ganapathiraju M, Manoharan V, Klein-Seetharaman J. BLMT: statistical sequence analysis using N-grams. ACTA ACUST UNITED AC 2005; 3:193-200. [PMID: 15693744 DOI: 10.2165/00822942-200403020-00013] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.0] [Indexed: 11/02/2022]
Abstract
Statistical analysis of amino acid and nucleotide sequences, especially sequence alignment, is one of the most commonly performed tasks in modern molecular biology. However, for many tasks in bioinformatics, the requirement for the features in an alignment to be consecutive is restrictive and "n-grams" (aka k-tuples) have been used as features instead. N-grams are usually short nucleotide or amino acid sequences of length n, but the unit for a gram may be chosen arbitrarily. The n-gram concept is borrowed from language technologies where n-grams of words form the fundamental units in statistical language models. Despite the demonstrated utility of n-gram statistics for the biology domain, there is currently no publicly accessible generic tool for the efficient calculation of such statistics. Most sequence analysis tools will disregard matches because of the lack of statistical significance in finding short sequences. This article presents the integrated Biological Language Modeling Toolkit (BLMT) that allows efficient calculation of n-gram statistics for arbitrary sequence datasets. Availability: BLMT can be downloaded from http://www.cs.cmu.edu/~blmt/source and installed for standalone use on any Unix platform or Unix shell emulation such as Cygwin on the Windows platform. Specific tools and usage details are described in a "readme" file. The n-gram computations carried out by the BLMT are part of a broader set of tools borrowed from language technologies and modified for statistical analysis of biological sequences; these are available at http://flan.blm.cs.cmu.edu/.
Affiliation(s)
- Madhavi Ganapathiraju
- Language Technologies Institute, Carnegie Mellon University, 4400 Fifth Avenue, Pittsburgh, PA 15213, USA
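To make the n-gram idea in this abstract concrete, the sketch below counts overlapping n-grams in a single sequence using only the Python standard library. It does not use or reproduce BLMT; the function name `ngram_counts` and the example sequence are assumptions chosen for illustration.

```python
from collections import Counter


def ngram_counts(sequence: str, n: int) -> Counter:
    """Count overlapping n-grams (k-tuples) of length n in a sequence."""
    if n <= 0:
        raise ValueError("n must be a positive integer")
    # A sequence shorter than n yields an empty Counter.
    return Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))


# Hypothetical nucleotide sequence; the unit of a "gram" here is one base.
counts = ngram_counts("ATGATGCA", 3)
for gram, count in counts.most_common():
    print(gram, count)
```

The same function works unchanged on amino acid strings, since it treats each character as one unit.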
10
Ma BG, Chen LL. The Most Deviated Codon Position in AT-rich Bacterial Genomes: A Function Related Analysis. J Biomol Struct Dyn 2005; 23:143-9. [PMID: 16060688 DOI: 10.1080/07391102.2005.10507055] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 10/28/2022]
Abstract
We have performed a systematic study of more than 120 archaeal and bacterial genomes. Based on the index proposed in the current paper, clear patterns are observed showing the relation between the base compositional deviation at the three codon positions and the genomic GC content. For AT-rich genomes, the Most Deviated Codon Position (MDCP) is the 1st codon position, while for GC-rich genomes, the MDCP appears at the 2nd or 3rd codon position alternatively. According to the MDCP, the CDSs of a genome can be classified into two types: typical and atypical. In AT-rich genomes the typical CDSs represent the majority and account for about 3/4 of all the CDSs. Based on the functional classification of the COG database, the two types of CDSs are examined. An apparent bias of distribution is observed: CDSs with the function of 'information processing' are more likely to be of the typical type.
Affiliation(s)
- Bin-Guang Ma
- College of Chemistry and Chemical Engineering, Suzhou University, Suzhou 215006, PR China
11
Barral P J, Cantini L, Hasmy A, Jiménez J, Marcano A. Correlation between strand asymmetry and phylogeny in mitochondrial DNA. J Theor Biol 2005; 236:422-6. [PMID: 15927203 DOI: 10.1016/j.jtbi.2005.03.022] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Received: 05/24/2004] [Revised: 03/17/2005] [Accepted: 03/17/2005] [Indexed: 11/25/2022]
Abstract
An evolutionary distance is introduced in order to propose an efficient and feasible procedure for phylogeny studies. Our analyses are based on the strand asymmetry property of mitochondrial DNA, but can be applied to other genomes. Comparison of our results with those reported in conventional phylogenetic trees gives confidence in our approximation. Our findings support the hypotheses about the origin of the skew and its dependence upon evolutionary pressures, and improve on previous efforts to use the strand asymmetry property of genomes for phylogeny inference. For the evolutionary distance introduced here, we observe that the most adequate technique for tree reconstruction corresponds to an average link method which employs a sequential clustering algorithm.
Affiliation(s)
- J Barral P
- Centro Nacional de Secuenciación y Análisis de Acidos Nucleicos CeSAAN, IVIC, Apartado Postal 21827, Caracas 1020A, Venezuela
12
Bradshaw PC, Rathi A, Samuels DC. Mitochondrial-encoded membrane protein transcripts are pyrimidine-rich while soluble protein transcripts and ribosomal RNA are purine-rich. BMC Genomics 2005; 6:136. [PMID: 16185363 PMCID: PMC1262711 DOI: 10.1186/1471-2164-6-136] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Received: 05/12/2005] [Accepted: 09/26/2005] [Indexed: 11/10/2022]
Abstract
BACKGROUND: Eukaryotic organisms contain mitochondria, organelles capable of producing large amounts of ATP by oxidative phosphorylation. Each cell contains many mitochondria with many copies of mitochondrial DNA in each organelle. The mitochondrial DNA encodes a small but functionally critical portion of the oxidative phosphorylation machinery, a few other species-specific proteins, and the rRNA and tRNA used for the translation of these transcripts. Because the microenvironment of the mitochondrion is unique, mitochondrial genes may be subject to different selectional pressures than those affecting nuclear genes. RESULTS: From an analysis of the mitochondrial genomes of a wide range of eukaryotic species we show that there are three simple rules for the pyrimidine and purine abundances in mitochondrial DNA transcripts. Mitochondrial membrane protein transcripts are pyrimidine-rich, rRNA transcripts are purine-rich and the soluble protein transcripts are purine-rich. The transitions between pyrimidine and purine-rich regions of the genomes are rapid and are easily visible on a pyrimidine-purine walk graph. These rules are followed, with few exceptions, independent of which strand encodes the gene. Despite the robustness of these rules across a diverse set of species, the magnitude of the differences between the pyrimidine and purine content is fairly small. Typically, the mitochondrial membrane protein transcripts have a pyrimidine richness of 56%, the rRNA transcripts are 55% purine, and the soluble protein transcripts are only 53% purine. CONCLUSION: The pyrimidine richness of mitochondrial-encoded membrane protein transcripts is partly driven by U nucleotides in the second codon position in all species, which yields hydrophobic amino acids. The purine-richness of soluble protein transcripts is mainly driven by A nucleotides in the first codon position. The purine-richness of rRNA is also due to an abundance of A nucleotides. Possible mechanisms as to how these trends are maintained in mtDNA genomes of such diverse ancestry, size and variability of A-T richness are discussed.
Affiliation(s)
- Patrick C Bradshaw
- Virginia Bioinformatics Institute, Virginia Polytechnic Institute and State University, Blacksburg, VA 24061, USA
- Anand Rathi
- Virginia Bioinformatics Institute, Virginia Polytechnic Institute and State University, Blacksburg, VA 24061, USA
- David C Samuels
- Virginia Bioinformatics Institute, Virginia Polytechnic Institute and State University, Blacksburg, VA 24061, USA
13
Jin J. Identification of protein coding regions of rice genes using alternative spectral rotation measure and linear discriminant analysis. Genomics Proteomics & Bioinformatics 2005; 2:167-73. [PMID: 15862117 PMCID: PMC5172472 DOI: 10.1016/s1672-0229(04)02022-4] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.2] [Indexed: 11/24/2022]
Abstract
An improved method, called the Alternative Spectral Rotation (ASR) measure, for predicting protein coding regions in rice DNA has been developed. The method is based on the Spectral Rotation (SR) measure proposed by Kotlar and Lavner, and its accuracy is higher than that of the SR measure and the Spectral Content (SC) measure proposed by Tiwari et al. In order to increase the identification accuracy, we chose three different coding characteristics, namely the asymmetric, purine, and stop-codon variables, as parameters, and a satisfactory result was obtained by the method of Linear Discriminant Analysis (LDA).
Affiliation(s)
- Jiao Jin
- Department of Statistics and Financial Mathematics, School of Mathematical Sciences, Beijing Normal University, Beijing 100875, China.
14
Ganapathiraju M, Balakrishnan N, Reddy R, Klein-Seetharaman J. Computational Biology and Language. Ambient Intelligence for Scientific Discovery 2005. [DOI: 10.1007/978-3-540-32263-4_2] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.4] [Indexed: 12/12/2022]
15
Bernaola-Galván P, Oliver JL, Carpena P, Clay O, Bernardi G. Quantifying intrachromosomal GC heterogeneity in prokaryotic genomes. Gene 2004; 333:121-33. [PMID: 15177687 DOI: 10.1016/j.gene.2004.02.042] [Citation(s) in RCA: 26] [Impact Index Per Article: 1.3] [Received: 08/29/2003] [Revised: 11/14/2003] [Accepted: 02/10/2004] [Indexed: 11/15/2022]
Abstract
The sequencing of prokaryotic genomes covering a wide taxonomic range has sparked renewed interest in intrachromosomal compositional (GC) heterogeneity, largely in view of lateral transfers. We present here a brief overview of some methods for visualizing and quantifying GC variation in prokaryotes. We used these methods to examine heterogeneity levels in sequenced prokaryotes, for a range of scales or stringencies. Some species are consistently homogeneous, whereas others are markedly heterogeneous in comparison, in particular Aeropyrum pernix, Xylella fastidiosa, Mycoplasma genitalium, Enterococcus faecalis, Bacillus subtilis, Pyrobaculum aerophilum, Vibrio vulnificus chromosome I, Deinococcus radiodurans chromosome II and Halobacterium. As we discuss here, the wide range of heterogeneities calls for reexamination of an accepted belief, namely that the endogenous DNA of bacteria and archaea should typically exhibit low intrachromosomal GC contrasts. Supplementary results for all species analyzed are available at our website: http://bioinfo2.ugr.es/prok.
16
Krishnamachari A, moy Mandal V. Study of DNA binding sites using the Rényi parametric entropy measure. J Theor Biol 2004; 227:429-36. [PMID: 15019509 DOI: 10.1016/j.jtbi.2003.11.026] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.6] [Received: 06/27/2003] [Revised: 11/06/2003] [Accepted: 11/17/2003] [Indexed: 10/26/2022]
Abstract
Shannon's definition of uncertainty or surprisal has been applied extensively to measure the information content of aligned DNA sequences and to characterize DNA binding sites. In contrast to Shannon's uncertainty, this study investigates the applicability and suitability of a parametric uncertainty measure due to Rényi. It is observed that this measure also provides results in agreement with Shannon's measure, pointing to its utility in analysing DNA binding site regions. To facilitate the comparison between these uncertainty measures, a dimensionless quantity called "redundancy" has been employed. It is found that Rényi's measure at low parameter values possesses a better delineating feature of binding sites (of binding regions) than Shannon's measure. The critical value of the parameter is chosen with an outlier criterion.
Affiliation(s)
- A Krishnamachari
- Bioinformatics Centre, Jawaharlal Nehru University, New Delhi 110 067, India
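As an illustration of the two uncertainty measures named in this abstract, the sketch below computes Shannon entropy and Rényi entropy of order alpha for one column of an alignment, using the textbook definitions H = -sum(p_i log2 p_i) and H_alpha = log2(sum(p_i^alpha)) / (1 - alpha). It is not the authors' procedure: the function names, the example columns and the choice of base-2 logarithms are assumptions, and the paper's "redundancy" quantity and outlier criterion are not reproduced.

```python
import math
from collections import Counter


def _frequencies(column: str) -> list:
    """Relative frequencies of the symbols observed in one alignment column."""
    if not column:
        raise ValueError("column must not be empty")
    total = len(column)
    return [count / total for count in Counter(column).values()]


def shannon_entropy(column: str) -> float:
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in _frequencies(column))


def renyi_entropy(column: str, alpha: float) -> float:
    """Rényi entropy of order alpha in bits; alpha = 1 falls back to Shannon."""
    if alpha < 0:
        raise ValueError("alpha must be non-negative")
    if math.isclose(alpha, 1.0):
        return shannon_entropy(column)
    return math.log2(sum(p ** alpha for p in _frequencies(column))) / (1 - alpha)


# Hypothetical columns: one fully conserved, one partly variable.
for column in ("AAAAAAAA", "AAAACCGT"):
    print(column, shannon_entropy(column), renyi_entropy(column, 0.5))
```

For a uniform column both measures coincide at log2 of the number of symbols, and for non-uniform columns the Rényi value is non-increasing in alpha, which is why low parameter values weight rare bases more heavily.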
17
Ou HY, Guo FB, Zhang CT. Analysis of nucleotide distribution in the genome of Streptomyces coelicolor A3(2) using the Z curve method. FEBS Lett 2003; 540:188-94. [PMID: 12681506 DOI: 10.1016/s0014-5793(03)00263-1] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.6] [Indexed: 10/27/2022]
Abstract
The nucleotide distribution of all 33,527 open reading frames (ORFs) (≥300 bp) in the genome of Streptomyces coelicolor A3(2) has been analyzed using the Z curve method. Each ORF is mapped onto a point in a 9-dimensional space. To visualize the distribution of mapping points, the points are projected onto the principal plane based on principal component analysis. Consequently, the distribution pattern of the 33,527 points in the principal plane shows a flower-like shape, in which there are seven distinct regions. In addition to the central region, there are six petal-like regions around the center, one of which corresponds to 7172 coding sequences. The central region and the remaining five petal-like regions correspond to the intergenic sequences and out-of-frame non-coding ORFs, respectively. It is shown that selective pressure produces a remarkable bias of the G+C content among three codon positions, resulting in the interesting phenomenon observed. A similar phenomenon is also observed for other bacterial genomes with high genomic G+C content, such as Pseudomonas aeruginosa PA01 (G+C = 66.6%). However, for the genomes of Bacillus subtilis (G+C = 43.5%) and Clostridium perfringens (G+C = 28.6%), no similar phenomenon was observed. The finding presented here may be useful to improve the gene-finding algorithms for genomes with high G+C content. A set of supplementary materials including the plots displaying the base distribution patterns of ORFs in 12 prokaryotes is provided on the website http://tubic.tju.edu.cn/highGC/.
Affiliation(s)
- Hong-Yu Ou
- Department of Physics, Tianjin University, Tianjin 300072, PR China
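The mapping of an ORF onto a point in 9-dimensional space can be sketched as follows, assuming the commonly cited Z curve components computed separately at each of the three codon positions: x = (a + g) - (c + t), y = (a + c) - (g + t), z = (a + t) - (g + c), where a, c, g, t are base frequencies. This is an illustrative reading of the abstract, not the authors' code; the function name `z_curve_features` and the toy ORF are hypothetical, and the principal component projection is omitted.

```python
from typing import List


def z_curve_features(orf: str) -> List[float]:
    """Return 9 values: (x, y, z) for codon positions 1, 2 and 3 in turn."""
    orf = orf.upper()
    usable = len(orf) - len(orf) % 3  # ignore a trailing partial codon
    if usable == 0:
        raise ValueError("ORF must contain at least one full codon")
    features: List[float] = []
    for pos in range(3):
        bases = orf[pos:usable:3]  # every base at this codon position
        n = len(bases)
        a = bases.count("A") / n
        c = bases.count("C") / n
        g = bases.count("G") / n
        t = bases.count("T") / n
        features.extend([
            (a + g) - (c + t),  # x: purine vs pyrimidine
            (a + c) - (g + t),  # y: amino vs keto
            (a + t) - (g + c),  # z: weak vs strong hydrogen bonding
        ])
    return features


# Toy ORF (hypothetical), two codons long.
print(z_curve_features("ATGATG"))
```

Each ORF yields one such vector, and a set of vectors could then be passed to any standard principal component analysis routine to obtain the planar projection described above.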
18
Abstract
Using a measure of how differentially expressed a gene is in two biochemically/phenotypically different conditions, we can rank all genes in a microarray dataset. We have shown that the falling-off of this measure (normalized maximum likelihood in a classification model such as logistic regression) as a function of the rank is typically a power-law function. Power-law functions of this kind in other ranked plots are known as Zipf's law, which is observed in many natural and social phenomena. The presence of this power-law function prevents an intrinsic cutoff point between the "important" genes and "irrelevant" genes. We have shown that similar power-law functions are also present in permuted datasets, and provide an explanation from the well-known chi-squared (χ²) distribution of likelihood ratios. We discuss the implication of this Zipf's law for gene selection in microarray data analysis, as well as other characterizations of the ranked likelihood plots such as the rate of fall-off of the likelihood.
Affiliation(s)
- Wentian Li
- Center for Genomics and Human Genetics North Shore LIJ Research Institute, 350 Community Drive, Manhasset, NY 11030, USA.
19
Kuznetsov VA, Knott GD, Bonner RF. General statistics of stochastic process of gene expression in eukaryotic cells. Genetics 2002; 161:1321-32. [PMID: 12136033 PMCID: PMC1462190 DOI: 10.1093/genetics/161.3.1321] [Citation(s) in RCA: 113] [Impact Index Per Article: 5.1] [Indexed: 11/12/2022]
Abstract
Thousands of genes are expressed at such very low levels (≤1 copy per cell) that global gene expression analysis of rarer transcripts remains problematic. Ambiguity in identification of rarer transcripts creates considerable uncertainty in fundamental questions such as the total number of genes expressed in an organism and the biological significance of rarer transcripts. Knowing the distribution of the true number of genes expressed at each level and the corresponding gene expression level probability function (GELPF) could help resolve these uncertainties. We found that all observed large-scale gene expression data sets in yeast, mouse, and human cells follow a Pareto-like distribution model skewed by many low-abundance transcripts. A novel stochastic model of the gene expression process predicts the universality of the GELPF both across different cell types within a multicellular organism and across different organisms. This model allows us to predict the frequency distribution of all gene expression levels within a single cell and to estimate the number of expressed genes in a single cell and in a population of cells. A random "basal" transcription mechanism for protein-coding genes in all or almost all eukaryotic cell types is predicted. This fundamental mechanism might enhance the expression of rarely expressed genes and, thus, provide a basic level of phenotypic diversity, adaptability, and random monoallelic expression in cell populations.
Affiliation(s)
- V A Kuznetsov
- Laboratory of Integrative and Medical Biophysics, National Institute of Child Health and Human Development/NIH, Bldg. 13, Rm. 3W16, Bethesda, MD 20892-5772, USA.
20
Li W, Bernaola-Galván P, Haghighi F, Grosse I. Applications of recursive segmentation to the analysis of DNA sequences. Computers & Chemistry 2002; 26:491-510. [PMID: 12144178 DOI: 10.1016/s0097-8485(02)00010-4] [Citation(s) in RCA: 64] [Impact Index Per Article: 2.9] [Indexed: 11/23/2022]
Abstract
Recursive segmentation is a procedure that partitions a DNA sequence into domains with a homogeneous composition of the four nucleotides A, C, G and T. This procedure can also be applied to any sequence converted from a DNA sequence, such as to a binary strong(G + C)/weak(A + T) sequence, to a binary sequence indicating the presence or absence of the dinucleotide CpG, or to a sequence indicating both the base and the codon position information. We apply various conversion schemes in order to address the following five DNA sequence analysis problems: isochore mapping, CpG island detection, locating the origin and terminus of replication in bacterial genomes, finding complex repeats in telomere sequences, and delineating coding and noncoding regions. We find that the recursive segmentation procedure can successfully detect isochore borders, CpG islands, and the origin and terminus of replication, but it needs improvement for detecting complex repeats as well as borders between coding and noncoding regions.
Affiliation(s)
- Wentian Li
- Center for Genomics and Human Genetics, North Shore-LIJ Research Institute, Manhasset, NY 11030, USA.
21
Wang Y, Zhang CT, Dong P. Recognizing shorter coding regions of human genes based on the statistics of stop codons. Biopolymers 2002; 63:207-16. [PMID: 11787008 DOI: 10.1002/bip.10054] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.6] [Indexed: 11/09/2022]
Abstract
With the rapid progress of the Human Genome Project, a large number of uncharacterized DNA sequences need to be annotated by better algorithms. Recognizing shorter coding sequences of human genes is one of the most important problems in gene recognition, and it is not yet completely solved. This paper is devoted to solving the issue using a new method. The distributions of the three stop codons, i.e., TAA, TAG and TGA, in three phases along coding, noncoding, and intergenic sequences are studied in detail. Using the obtained distributions and other coding measures, a new algorithm for the recognition of shorter coding sequences of human genes is developed. The accuracy of the algorithm is tested on a larger database of human genes. It is found that the average accuracy achieved is as high as 92.1% for sequences with a length of 192 base pairs, which is confirmed by sixfold cross-validation tests. It is hoped that by combining the present method with some existing algorithms, the accuracy of identifying human genes from unannotated sequences will be increased.
Affiliation(s)
- Yonghong Wang
- Department of Physics, Tianjin University, Tianjin, 300072, China
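The raw statistic behind this abstract, the number of TAA, TAG and TGA triplets in each of the three phases of a sequence, can be tallied with a few lines of Python. This sketch covers only the counting step and is not the recognition algorithm of the paper; the function name `stop_codon_counts_by_phase` and the example sequence are assumptions for illustration.

```python
STOP_CODONS = ("TAA", "TAG", "TGA")


def stop_codon_counts_by_phase(seq: str) -> dict:
    """Count each stop codon in phases 0, 1 and 2 of the given strand."""
    seq = seq.upper()
    counts = {phase: {codon: 0 for codon in STOP_CODONS} for phase in range(3)}
    for phase in range(3):
        # Step through non-overlapping triplets starting at this phase.
        for i in range(phase, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon in counts[phase]:
                counts[phase][codon] += 1
    return counts


# Hypothetical 10-base sequence: TAG lies in phase 0, TGA in phase 1.
for phase, tally in stop_codon_counts_by_phase("ATGAAATAGC").items():
    print(phase, tally)
```

In a true coding phase, in-frame stop codons are absent except at the end, so a phase with markedly fewer stops than the other two is a simple hint of coding potential.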
22
Oiwa NN, Goldman C. Phylogenetic study of the spatial distribution of protein-coding and control segments in DNA chains. Physical Review Letters 2000; 85:2396-2399. [PMID: 10978019 DOI: 10.1103/physrevlett.85.2396] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/02/1999] [Indexed: 05/23/2023]
Abstract
We examine the size and spatial distributions of the protein-coding and control segments of genes in DNA nucleotide sequences from GenBank. Phylogenetic analysis of these data suggests the presence of spatial order in sequences of higher organisms, irrespective of the nature of nucleotide base content. This is characterized by defined two-point correlation functions and measured by fractal dimensions and singularity spectrum.
Affiliation(s)
- N N Oiwa
- Instituto de Física, Universidade de São Paulo, CP 66318, 05315-970, São Paulo, Brazil