1
Gupta S, Pal D. Detection of intrinsic transcription termination sites in bacteria: consensus from hairpin detection approaches. J Biomol Struct Dyn 2024:1-11. [PMID: 38605579 DOI: 10.1080/07391102.2024.2325107] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/19/2023] [Accepted: 02/23/2024] [Indexed: 04/13/2024]
Abstract
We compare the WebGeSTer and INtrinsic transcription TERmination hairPIN (INTERPIN) databases used for intrinsic transcription termination (ITT) site prediction in bacteria. The former uses inverted nucleotide repeat detection to identify RNA hairpins, while the latter uses a pair-potential function; the hairpin energy score evaluation is identical for both. We find INTERPIN more sensitive than WebGeSTer, with about 6% and 51% additional ITT predictions in chromosomal and plasmid operons, respectively. INTERPIN hairpins are relatively shorter, have an ungapped stem, and can be located even in AT-rich segments, whereas WebGeSTer hairpins are longer, GC-rich, and have a gapped stem. The GC%, length, and energy score from INTERPIN transcription units (TUs) are the best inter-correlated, while those of the lowest-energy single hairpins from WebGeSTer, considered suitable for ITT, are the worst. Around 72% of TUs from the two databases overlap, and ∼60% of all alternate ITT sites downstream of TUs overlap, of which 65% are cluster hairpins. This helps highlight hairpin features that can be used to identify termination sites in bacteria across different prediction methods. Overall, the hairpins screened with the pair-potential function appear to be more consistent with the kinetic and thermodynamic processes of ITT known to date. Communicated by Ramaswamy H. Sarma.
Affiliation(s)
- Swati Gupta
- Department of Computational and Data Sciences, Indian Institute of Science, Bengaluru, India
- Debnath Pal
- Department of Computational and Data Sciences, Indian Institute of Science, Bengaluru, India
2
Plant Virus Genome Is Shaped by Specific Dinucleotide Restrictions That Influence Viral Infection. mBio 2020; 11:mBio.02818-19. [PMID: 32071264 PMCID: PMC7029135 DOI: 10.1128/mbio.02818-19] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Indexed: 12/28/2022]
Abstract
The presence of CpG and UpA dinucleotides is restricted in the genomes of animal RNA viruses to avoid specific host defenses. We wondered whether a similar phenomenon exists in nonanimal RNA viruses. Here, we show that these two dinucleotides, especially UpA, are underrepresented in the family Potyviridae, the most important group of plant RNA viruses. Using plum pox virus (PPV; Potyviridae family) as a model, we show that an increase in UpA frequency strongly diminishes virus accumulation. Remarkably, unlike previous observations in animal viruses, PPV variants harboring CpG-rich fragments display just faint (or no) attenuation. The anticorrelation between UpA frequency and viral fitness additionally demonstrates the relevance of this particular dinucleotide: UpA-high mutants are attenuated in a dose-dependent manner, whereas a UpA-low variant displays better fitness than its parental control. Using high-throughput sequencing, we also show that UpA-rich PPV variants are genetically stable, without apparent changes in sequence that revert and/or compensate for the dinucleotide modification despite its attenuation. In addition, we also demonstrate here that the PPV restriction of UpA-rich variants works independently of the classical RNA silencing pathway. Finally, we show that the anticorrelation between UpA frequency and RNA accumulation applies to mRNA-like fragments produced by the host RNA polymerase II. Together, our results inform us about a dinucleotide-based system in plant cells that controls diverse RNAs, including RNA viruses.
IMPORTANCE: Dinucleotides (combinations of two consecutive nucleotides) are not randomly present in RNA viruses; in fact, the presence of CpG and UpA is significantly repressed in their genomes. Although the meaning of this phenomenon remains obscure, recent studies with animal-infecting viruses have revealed that their low CpG/UpA frequency prevents virus restriction via a host antiviral system that recognizes, and promotes the degradation of, CpG/UpA-rich RNAs. Whether similar systems act in organisms from other life kingdoms has been unknown. To fill this gap in our knowledge, we built several synthetic variants of a plant RNA virus with deoptimized dinucleotide frequencies and analyzed their viral fitness and genome adaptation. In brief, our results inform us for the first time about an effective dinucleotide-based system that acts in plants against viruses. Remarkably, this viral restriction in plants is reminiscent of, but not identical to, the equivalent antiviral response in animals.
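Under-representation of a dinucleotide such as UpA or CpG is commonly expressed as an observed/expected ratio. The following is a minimal Python sketch under that assumption; the helper name `dinucleotide_odds_ratio` and the choice of expected frequency (the product of the two single-base frequencies) are illustrative and are not taken from the paper.

```python
from collections import Counter


def dinucleotide_odds_ratio(seq: str, dinuc: str) -> float:
    """Observed/expected frequency of `dinuc` in `seq` (hypothetical helper).

    The expected frequency is assumed to be f(X) * f(Y), computed from the
    single-base frequencies; values below 1 suggest under-representation.
    """
    seq = seq.upper().replace("T", "U")      # treat DNA and RNA input alike
    dinuc = dinuc.upper().replace("T", "U")
    if len(dinuc) != 2 or len(seq) < 2:
        raise ValueError("need a dinucleotide and a sequence of length >= 2")
    base_counts = Counter(seq)
    # Count overlapping occurrences of the dinucleotide.
    observed = sum(1 for i in range(len(seq) - 1) if seq[i:i + 2] == dinuc)
    f_xy = observed / (len(seq) - 1)
    f_x = base_counts[dinuc[0]] / len(seq)
    f_y = base_counts[dinuc[1]] / len(seq)
    if f_x == 0 or f_y == 0:
        raise ValueError("a base of the dinucleotide is absent from the sequence")
    return f_xy / (f_x * f_y)
```

Calling `dinucleotide_odds_ratio(genome, "UA")` on a genome string gives a single number that can be compared between a wild-type sequence and a recoded variant.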
3
Abstract
Given a realisation of a Markov chain, one can count the number of state transitions of each type. One can then ask how many realisations there are with these transition counts and the same initial state. Whittle (1955) answered this question by finding an explicit, though complicated, formula, and also showed that each such realisation is equally likely. In the analysis of DNA sequences, which comprise letters from the set {A, C, G, T}, it is often useful to count the frequency of a pattern, say ACGCT, in a long sequence and compare this with the expected frequency over all sequences having the same start letter and the same transition counts (or 'dinucleotide counts', as they are called in the molecular biology literature). To date, no exact method for this comparison exists; this paper rectifies that deficiency.
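To make the question concrete, the sketch below enumerates, by brute force, every sequence that shares the start letter and transition counts of a short input and averages a pattern count over them. This is only an illustration for very short sequences; it is neither Whittle's formula nor the exact method of the paper, and all function names are hypothetical.

```python
from collections import Counter
from itertools import permutations


def transition_counts(seq):
    """Count each ordered pair of adjacent letters (the 'dinucleotide counts')."""
    return Counter(zip(seq, seq[1:]))


def equivalent_realisations(seq):
    """All sequences with the same first letter and transition counts as `seq`.

    Such sequences necessarily use the same letters as `seq`, so it is enough
    to filter its permutations. Brute force: short sequences only.
    """
    target = transition_counts(seq)
    found = {
        "".join(p)
        for p in permutations(seq)
        if p[0] == seq[0] and transition_counts(p) == target
    }
    return sorted(found)


def expected_pattern_count(seq, pattern):
    """Mean count of `pattern` over the equally likely equivalent realisations."""
    realisations = equivalent_realisations(seq)
    total = sum(
        sum(1 for i in range(len(r) - len(pattern) + 1) if r.startswith(pattern, i))
        for r in realisations
    )
    return total / len(realisations)
```

For the toy sequence `"AACA"` there are two equivalent realisations, `"AACA"` and `"ACAA"`, so the pattern `"AAC"` has an expected count of 0.5 against an observed count of 1.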
5
Metagenomics and the molecular identification of novel viruses. Vet J 2010; 190:191-198. [PMID: 21111643 PMCID: PMC7110547 DOI: 10.1016/j.tvjl.2010.10.014] [Citation(s) in RCA: 63] [Impact Index Per Article: 4.5] [Received: 06/01/2010] [Revised: 10/16/2010] [Accepted: 10/20/2010] [Indexed: 12/16/2022]
Abstract
There have been rapid recent developments in establishing methods for identifying and characterising viruses associated with animal and human diseases. These methodologies, commonly based on hybridisation or PCR techniques, are combined with advanced sequencing techniques termed ‘next generation sequencing’. Allied advances in data analysis, including the use of computational transcriptome subtraction, have also impacted the field of viral pathogen discovery. This review details these molecular detection techniques, discusses their application in viral discovery, and provides an overview of some of the novel viruses discovered. The problems encountered in attributing disease causality to a newly identified virus are also considered.
6
uShuffle: a useful tool for shuffling biological sequences while preserving the k-let counts. BMC Bioinformatics 2008; 9:192. [PMID: 18405375 PMCID: PMC2375906 DOI: 10.1186/1471-2105-9-192] [Citation(s) in RCA: 99] [Impact Index Per Article: 6.2] [Received: 02/21/2008] [Accepted: 04/11/2008] [Indexed: 12/02/2022]
Abstract
Background: Randomly shuffled sequences are routinely used in sequence analysis to evaluate the statistical significance of a biological sequence. In many cases, biologists need sophisticated shuffling tools that preserve not only the counts of distinct letters but also higher-order statistics such as doublet counts, triplet counts, and, in general, k-let counts. Results: We present a sequence analysis tool (named uShuffle) for generating uniform random permutations of biological sequences (such as DNAs, RNAs, and proteins) that preserve the exact k-let counts. The uShuffle tool implements the latest variant of the Euler algorithm and uses Wilson's algorithm in the crucial step of arborescence generation. It is carefully engineered and extremely efficient. The uShuffle tool achieves maximum flexibility by allowing arbitrary alphabet size and let size. It can be used as a command-line program, a web application, or a utility library. Source code in C, Java, and C#, and integration instructions for Perl and Python are provided. Conclusion: The uShuffle tool surpasses existing implementations of the Euler algorithm in both performance and flexibility. It is a useful tool for the bioinformatics community.
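The property that such a shuffle must satisfy can be checked independently of how the shuffle is produced. The sketch below counts k-lets and tests whether a candidate permutation preserves them; it does not implement the Euler or Wilson algorithms and is not the uShuffle API, and both function names are hypothetical.

```python
from collections import Counter


def klet_counts(seq, k):
    """Count every overlapping substring of length k in `seq`."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def preserves_klet_counts(original, candidate, k):
    """True if `candidate` has exactly the same k-let counts as `original`."""
    return klet_counts(original, k) == klet_counts(candidate, k)
```

For example, `"ACAA"` preserves the doublet counts of `"AACA"`, whereas `"AAAC"` preserves only the single-letter counts.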
7
Di YP, Harper R, Zhao Y, Pahlavan N, Finkbeiner W, Wu R. Molecular cloning and characterization of spurt, a human novel gene that is retinoic acid-inducible and encodes a secretory protein specific in upper respiratory tracts. J Biol Chem 2003; 278:1165-73. [PMID: 12409287 DOI: 10.1074/jbc.m210523200] [Citation(s) in RCA: 72] [Impact Index Per Article: 3.4] [Indexed: 11/06/2022]
Abstract
Retinoids, such as all-trans-retinoic acid, play an essential role in the regulation of airway epithelial cell growth, differentiation, and gene expression. Using cDNA microarray, we identified a clone, DD4, that contains the cDNA of a novel gene, spurt (secretory protein in upper respiratory tracts), that was significantly induced by all-trans-retinoic acid in primary cultured human tracheobronchial epithelia. Two alternatively spliced spurt transcripts of 1090 and 1035 base pairs exist that contain the same open reading frame expressing a 256-amino acid peptide. The full-length spurt cDNA sequence spans a genomic DNA fragment of 7,313 bp, and the gene is located on chromosome 20q11.21. spurt mRNA is notably expressed at high levels in human nasal, tracheal, and lung tissues. In situ hybridization demonstrated that spurt message is often present in secretory cell types. The human spurt gene product is a secretory protein that contains a distinct signal peptide sequence in its first 19 amino acids. Mono-specific antibodies were generated to characterize spurt expression. Our data demonstrate that spurt is secreted onto the apical side of primary human airway epithelial cultures and is present in clinical sputum samples. spurt gene expression is higher in sputum and tissue samples obtained from patients with chronic obstructive lung disease. Our results provide the cloning and characterization of this tissue-specific novel gene and its possible relationship with airway diseases.
Affiliation(s)
- Yuan-Pu Di
- Center for Comparative Respiratory Biology and Medicine, Division of Pulmonary & Critical Care Medicine, School of Medicine, Medical Center of the University of California, Davis, 95616, USA
8
Arquès DG, Lacan J, Michel CJ. Identification of protein coding genes in genomes with statistical functions based on the circular code. Biosystems 2002; 66:73-92. [PMID: 12204444 DOI: 10.1016/s0303-2647(02)00039-4] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.3] [Indexed: 10/27/2022]
Abstract
A new statistical approach using functions based on the circular code classifies correctly more than 93% of bases in protein (coding) genes and non-coding genes of human sequences. Based on this statistical study, a research software called 'Analysis of Coding Genes' (ACG) has been developed for identifying protein genes in the genomes and for determining their frame. Furthermore, the software ACG also allows an evaluation of the length of protein genes, their position in the genome, their relative position between themselves, and the prediction of internal frames in protein genes.
Affiliation(s)
- Didier G Arquès
- Equipe de Biologie Théorique, Institut Gaspard Monge, Université de Marne la Vallée, 2 rue de la Butte Verte, 93160 Noisy le Grand, France.
9
Elgar G, Clark MS, Meek S, Smith S, Warner S, Edwards YJ, Bouchireb N, Cottage A, Yeo GS, Umrania Y, Williams G, Brenner S. Generation and analysis of 25 Mb of genomic DNA from the pufferfish Fugu rubripes by sequence scanning. Genome Res 1999; 9:960-71. [PMID: 10523524 PMCID: PMC310822 DOI: 10.1101/gr.9.10.960] [Citation(s) in RCA: 68] [Impact Index Per Article: 2.7] [Indexed: 11/24/2022]
Abstract
We have generated and analyzed >50,000 shotgun clones from 1059 Fugu cosmid clones. All sequences have been minimally edited and searched against protein and DNA databases. These data are all displayed on a searchable, publicly available web site at. With an average of 50 reads per cosmid, this is virtually nonredundant sequence skimming, covering 30%-50% of each clone. This essentially random data set covers nearly 25 Mb (>6%) of the Fugu genome and forms the basis of a series of whole genome analyses which address questions regarding gene density and distribution in the Fugu genome and the similarity between Fugu and mammalian genes. The Fugu genome, with eight times less DNA but a similar gene repertoire, is ideally suited to this type of study because most cosmids contain more than one identifiable gene. General features of the genome are also discussed. We have made some estimation of the syntenic relationship between mammals and Fugu and looked at the efficacy of ORF prediction from short, unedited Fugu genomic sequences. Comparative DNA sequence analyses are an essential tool in the functional interpretation of complex vertebrate genomes. This project highlights the utility of using the Fugu genome in this kind of study.
Affiliation(s)
- G Elgar
- UK Human Genome Mapping Project (HGMP) Resource Centre, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SB, UK.
10
Arquès DG, Fallot JP, Marsan L, Michel CJ. An evolutionary analytical model of a complementary circular code. Biosystems 1999; 49:83-103. [PMID: 10203190 DOI: 10.1016/s0303-2647(98)00038-0] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.2] [Indexed: 11/30/2022]
Abstract
The subset X0 = [AAC, AAT, ACC, ATC, ATT, CAG, CTC, CTG, GAA, GAC, GAG, GAT, GCC, GGC, GGT, GTA, GTC, GTT, TAC, TTC] of 20 trinucleotides has a preferential occurrence in the frame 0 (reading frame established by the ATG start trinucleotide) of protein (coding) genes of both prokaryotes and eukaryotes. This subset X0 is a complementary maximal circular code with two permutated maximal circular codes X1 and X2 in the frames 1 and 2 respectively (frame 0 shifted by one and two nucleotides respectively in the 5'-3' direction). X0 is called a C3 code (Arquès and Michel, 1997, J. Biosyst 44, 107-134). A quantitative study of these three subsets X0, X1 and X2 in the three frames 0, 1 and 2 of eukaryotic protein genes shows that their occurrence frequencies are constant functions of the trinucleotide positions in the sequences. The frequencies of X0, X1 and X2 in the frame 0 of eukaryotic protein genes are 48.5%, 29% and 22.5% respectively. These properties are not observed in the 5' and 3' regions of eukaryotes, where X0, X1 and X2 occur with variable frequencies around the random value (1/3). Several frequency asymmetries unexpectedly observed, e.g. the frequency difference between X1 and X2 in the frame 0, are related to a new property of the C3 code X0 involving substitutions. An evolutionary analytical model with three parameters (p, q, t), based on an independent mixing of the 20 codons (trinucleotides in the frame 0) of X0 with equiprobability (1/20) followed by t ≈ 4 substitutions per codon according to the proportions p ≈ 0.1, q ≈ 0.1 and r = 1 - p - q ≈ 0.8 in the three codon sites respectively, retrieves the frequencies of X0, X1 and X2 observed in the three frames of protein genes and explains these asymmetries. The complex behaviour of these analytical curves is totally unexpected and a priori difficult to imagine. Finally, the evolutionary analytical method developed could be applied to phylogenetic tree reconstruction and DNA sequence alignment.
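The occurrence frequencies discussed above can be measured on any sequence with a few lines of Python. The sketch below is an illustration, not the authors' software: the set X0 is copied from the abstract, while the construction of X1 and X2 by left-rotating each X0 trinucleotide by one and two positions is an assumed convention, and the function name `code_frequencies` is hypothetical.

```python
X0 = frozenset(
    "AAC AAT ACC ATC ATT CAG CTC CTG GAA GAC "
    "GAG GAT GCC GGC GGT GTA GTC GTT TAC TTC".split()
)
# Assumed convention: rotate each trinucleotide l1l2l3 to l2l3l1 (X1)
# and to l3l1l2 (X2).
X1 = frozenset(t[1:] + t[0] for t in X0)
X2 = frozenset(t[2:] + t[:2] for t in X0)


def code_frequencies(seq, frame=0):
    """Fraction of trinucleotides read in `frame` that belong to X0, X1, X2."""
    seq = seq.upper()
    # Read complete, non-overlapping trinucleotides starting at offset `frame`.
    trinucleotides = [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]
    if not trinucleotides:
        raise ValueError("sequence too short for the requested frame")
    n = len(trinucleotides)
    codes = (("X0", X0), ("X1", X1), ("X2", X2))
    return {name: sum(t in code for t in trinucleotides) / n for name, code in codes}
```

On the toy sequence `ATGGAAGACTTC`, frame 0 reads ATG, GAA, GAC, TTC, of which three belong to X0 and one (ATG) to X1 under the assumed rotation.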
Affiliation(s)
- D G Arquès
- Equipe de Biologie Théorique, Université de Marne la Vallée, Institut Gaspard Monge, Noisy le Grand, France.
11
Abstract
Viruses are responsible for many of the diseases caused by microbial infection. During the past two decades, approximately 20 new human viruses have been discovered. Many of these new viruses were initially identified using molecular biology techniques, a major advantage of which is the ability to search rapidly for new viruses, known viruses or related, but previously unidentified, members of established virus families in disease samples.
Affiliation(s)
- P Kellam
- Dept of Virology, Chester Beatty Laboratories, London, UK
12
Hwang CF, Lin Y, D'Souza T, Cheng CL. Sequences necessary for nitrate-dependent transcription of Arabidopsis nitrate reductase genes. Plant Physiol 1997; 113:853-62. [PMID: 9085575 PMCID: PMC158205 DOI: 10.1104/pp.113.3.853] [Citation(s) in RCA: 34] [Impact Index Per Article: 1.3] [Indexed: 05/22/2023]
Abstract
Nitrate increases the transcription of the two Arabidopsis thaliana nitrate reductase genes. We demonstrated previously that 238 and 330 bp of the 5' flanking regions, designated as NP1 and NP2, of the two nitrate reductase genes NR1 and NR2, respectively, are sufficient for nitrate-dependent transcription (Y. Lin, C.-F. Hwang, J.B. Brown, C.-L. Cheng [1994] Plant Physiol 106: 477-484). Here we identify the cis-acting elements of NP1 and NP2 that are necessary for nitrate-dependent transcription by linker-scanning (LS) analysis. In transgenic plants one LS mutant of NP1 and two LS mutants of NP2 exhibited significantly lower nitrate-induced reporter gene chloramphenicol acetyltransferase activity. To distinguish which of these three mutants lost nitrate inducibility, competitive reverse-transcriptase polymerase chain reaction was used to measure the chloramphenicol acetyltransferase mRNA levels before and after nitrate induction. The single LS mutant in NP1 lost its response to nitrate, whereas the two LS mutants in NP2 partially lost their response to nitrate. A 12-bp sequence is conserved between the NP1 site and the two NP2 sites. This sequence motif is also conserved in the 5' flanking regions of other nitrate-inducible plant genes. Gel mobility shift experiments indicate that these three regions bind to similar proteins. The binding is constitutive with respect to nitrate treatment and was observed in both nonphotosynthetic suspension cells and green leaves.
Affiliation(s)
- C F Hwang
- Department of Biological Sciences, University of Iowa, Iowa City 52242, USA
13
Elgar G, Sandford R, Aparicio S, Macrae A, Venkatesh B, Brenner S. Small is beautiful: comparative genomics with the pufferfish (Fugu rubripes). Trends Genet 1996; 12:145-50. [PMID: 8901419 DOI: 10.1016/0168-9525(96)10018-4] [Citation(s) in RCA: 112] [Impact Index Per Article: 4.0] [Indexed: 02/02/2023]
Abstract
As the Human Genome Project advances, it is clear that the emphasis will switch from accumulation of data to their interpretation. Comparative genomics provides a powerful way in which to analyse sequence data. Indeed, there is already a long list of 'model' organisms, which allow comparative analyses in a variety of ways. The very small vertebrate genome of the pufferfish provides a simple and economical way of comparing sequence data from mammals and fish, representing a large evolutionary divergence and so permitting the identification of essential elements that are still present in both species. These elements include genes and the associated machinery that controls their expression; elements that, in many cases, have survived the test of time.
Affiliation(s)
- G Elgar
- Department of Medicine, University of Cambridge, UK
14
Elofsson A, Fischer D, Rice DW, Le Grand SM, Eisenberg D. A study of combined structure/sequence profiles. Fold Des 1996; 1:451-61. [PMID: 9080191 DOI: 10.1016/s1359-0278(96)00061-2] [Citation(s) in RCA: 39] [Impact Index Per Article: 1.4] [Indexed: 02/04/2023]
Abstract
BACKGROUND: For genome sequencing projects to achieve their full impact on biology and medicine, each protein sequence must be identified with its three-dimensional structure. Fold assignment methods (also called profile and threading methods) attempt to assign sequences to known protein folds by computing the compatibility of sequence to fold. RESULTS: We have extended profile methods for the detection of protein folds having structural similarity but low sequence similarity to sequence probes. Our extension combines sequence substitution tables with structural properties to form a combined profile. The structural properties used in this study include distances between residues, exposed areas, areas buried by polar atoms, and properties of the original three-dimensional profile method. We compared the performance of these combined profiles with different sequence matrices and with the original three-dimensional profile method. To determine the optimal gap penalties and weights used with these profiles, we employed a genetic algorithm. The performance of these combined profiles was tested by cross validation using independent test and training sets. CONCLUSIONS: These studies show that the combined profiles perform better than profiles based on either structural or sequence information alone.
Affiliation(s)
- A Elofsson
- UCLA-DOE Laboratory of Structural Biology and Molecular Medicine, UCLA 90095-1570, USA
15
Arquès DG, Michel CJ. Analytical expression of the purine/pyrimidine codon probability after and before random mutations. Bull Math Biol 1993; 55:1025-38. [PMID: 8281128 DOI: 10.1007/bf02460698] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.4] [Indexed: 01/29/2023]
Abstract
Recently, we proposed a new model of DNA sequence evolution (Arquès and Michel. 1990b. Bull. math. Biol. 52, 741-772) according to which actual genes on the purine/pyrimidine (R/Y) alphabet (R = purine = adenine or guanine, Y = pyrimidine = cytosine or thymine) are the result of two successive evolutionary genetic processes: (i) a mixing (independent) process of non-random oligonucleotides (words of base length less than 10: YRY(N)6, YRYRYR and YRYYRY are so far identified; N = R or Y) leading to primitive genes (words of several hundreds of base length) and followed by (ii) a random mutation process, i.e., transformations of a base R (respectively Y) into the base Y (respectively R) at random sites in these primitive genes. Following this model the problem investigated here is the study of the variation of the 8 R/Y codon probabilities RRR, ..., YYY under random mutations. Two analytical expressions solved here allow analysis of this variation in the classical evolutionary sense (from the past to the present, i.e., after random mutations), but also in the inverted evolutionary sense (from the present to the past, i.e., before random mutations). Different properties are also derived from these formulae. Finally, a few applications of these formulae are presented. They prove the proposition in Arquès and Michel (1990b. Bull. math. Biol. 52, 741-772), Section 3.3.2, with the existence of a maximal mean number of random mutations per base of the order 0.3 in the protein coding genes. They also confirm the mixing process of oligonucleotides by excluding the purine/pyrimidine contiguous and alternating tracts from the formation process of primitive genes.
Affiliation(s)
- D G Arquès
- Université de Franche-Comté, Besançon, France
16
Abstract
Analysis of an artificial neural network trained to classify DNA as coding or non-coding revealed compositional differences between sequence parts translated into protein and those that were not. The 5' end of human introns was found to have a base composition that was non-random to an extent matching the non-randomness in the 3' end that contains the polypyrimidine tract. The prevailing nucleotides in the initial 50 nucleotides of human introns are guanine and cytosine; the trinucleotide GGG was found to occur almost four times as frequently as it would in sequences with a uniform distribution of the nucleotides. The initial part of terminal exons and their associated terminal introns were shown to have a very special base composition, deviating strongly from the normal picture in other exons and introns.
Affiliation(s)
- J Engelbrecht
- Department of Physical Chemistry, Technical University of Denmark, Lyngby
17
Maier D, Stumm G, Kuhn K, Preiss A. Hairless, a Drosophila gene involved in neural development, encodes a novel, serine rich protein. Mech Dev 1992; 38:143-56. [PMID: 1419850 DOI: 10.1016/0925-4773(92)90006-6] [Citation(s) in RCA: 74] [Impact Index Per Article: 2.3] [Indexed: 12/26/2022]
Abstract
Hairless is a dominant loss of function mutation in Drosophila affecting the formation of adult sensory organs. In the mutants, neuronal precursor cells do not differentiate, suggesting that Hairless might be involved in specifying or realizing neuronal fate in the fly, similar to the 'pro-neural' genes of the achaete-scute complex. As highlighted by the manifold phenotypic interactions of Hairless with most of the neurogenic loci, the gene might play an important role in nervous system development. Therefore, we initiated a molecular analysis of the Hairless locus in order to elucidate the function of its gene product and gain insight into the biochemical nature of the observed genetic interactions in which it participates. Here, we report the molecular cloning of the Hairless locus, confirmed by breakpoint and transformation analysis. Unexpectedly, Hairless activity peaks during embryogenesis, where transcripts accumulate primarily in endo- and mesodermal cell layers, and is lowest during larval stages, the lethal phase of Hairless mutants. The putative Hairless protein deduced from DNA sequencing is extremely basic and highly enriched in serine residues. Hairless appears to encode a novel protein without compelling homology to other known proteins which function in specifying peripheral nervous system development in Drosophila.
Affiliation(s)
- D Maier
- Biozentrum, Department of Cellbiology, University of Basel, Switzerland
18
Abstract
We have developed a hierarchical rule-based system for identifying genes in DNA sequences. Atomic sites (such as initiation codons, stop codons, acceptor sites and donor sites) are identified by a number of different methods and evaluated by a set of filters and rules chosen to maximize sensitivity; these are combined into higher-order gene elements (such as exons), evaluated, filtered and combined as equivalence classes into probable genes, which are evaluated and ranked. The system has been tested on an extensive collection of vertebrate genes smaller than 15,000 bases. Results obtained show that, on average, 88% of the predicted coding region for a transcription unit is actually coding, and 80% of the actual coding region is correctly predicted. This will, in most applications, be sufficient for a search against protein sequence databases for the identification of probable gene function. In addition, the system provides a general test platform for both gene atomic site identification and the rules for their evaluation and assembly.
Affiliation(s)
- R Guigó
- Molecular Biology Computer Research Resource, Dana-Farber Cancer Institute, Boston, MA
19
Pesole G, Prunella N, Liuni S, Attimonelli M, Saccone C. WORDUP: an efficient algorithm for discovering statistically significant patterns in DNA sequences. Nucleic Acids Res 1992; 20:2871-5. [PMID: 1614873 PMCID: PMC336935 DOI: 10.1093/nar/20.11.2871] [Citation(s) in RCA: 37] [Impact Index Per Article: 1.2] [Indexed: 12/27/2022]
Abstract
We present here a fast and sensitive method designed to isolate short nucleotide sequences which have non-random statistical properties and may thus be biologically active. It is based on a first order Markov analysis and allows us to detect statistically significant sequence motifs from six to ten nucleotides long which are significantly shared (or avoided) in the sequences under investigation. This method has been tested on a set of 521 sequences extracted from the Eukaryotic Promoter Database (2). Our results demonstrate the accuracy and the efficiency of the method in that the sequence motifs which are known to act as eukaryotic promoters, such as the TATA-box and the CAAT-box, were clearly identified. In addition we have found other statistically significant motifs, the biological roles of which are yet to be clarified.
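A first-order Markov analysis of this kind usually compares the observed count of a word with the count expected from the dinucleotide and mononucleotide counts of the same sequence. The sketch below uses the standard estimator (product of the dinucleotide counts along the word divided by the product of the interior base counts); whether WORDUP uses exactly this statistic is an assumption, and the function names are hypothetical.

```python
from collections import Counter


def markov1_expected_count(seq, word):
    """Expected count of `word` in `seq` under a first-order Markov model."""
    if len(word) < 2:
        raise ValueError("word must contain at least two letters")
    mono = Counter(seq)
    di = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    expected = 1.0
    # Multiply the counts of the overlapping dinucleotides along the word.
    for i in range(len(word) - 1):
        expected *= di[word[i:i + 2]]
    if expected == 0:
        return 0.0
    # Divide by the counts of the interior bases shared by adjacent dinucleotides.
    for base in word[1:-1]:
        expected /= mono[base]
    return expected


def observed_count(seq, word):
    """Number of (possibly overlapping) occurrences of `word` in `seq`."""
    return sum(1 for i in range(len(seq) - len(word) + 1) if seq.startswith(word, i))
```

A word whose observed count is far above or below its expected count is a candidate shared or avoided motif; a significance test would still be needed on top of this ratio.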
Affiliation(s)
- G Pesole
- Dipartimento di Biochimica e Biologia Molecolare, Università di Bari, Italy
20
Derr JN, Davis SK, Woolley JB, Wharton RA. Variation and the phylogenetic utility of the large ribosomal subunit of mitochondrial DNA from the insect order Hymenoptera. Mol Phylogenet Evol 1992; 1:136-47. [PMID: 1342927 DOI: 10.1016/1055-7903(92)90025-c] [Citation(s) in RCA: 25] [Impact Index Per Article: 0.8] [Indexed: 12/26/2022]
Abstract
Nucleotide sequence variation from a 573-bp region of the mitochondrial 16S rRNA gene was determined for representative hymenopteran taxa. An overall bias in the distribution of A and T bases was observed from all taxa; however, the terebrants (parasitoids) displayed significantly lower AT ratios as well as a higher degree of strand asymmetry. Moreover, a strong positive correlation was observed between relative AT richness and sequence divergence, suggesting selection at the nucleotide level for A and T bases as well as functionality. Overall sequence difference ranged from 2.3 to 53.4%, with the maximum divergence between members of the two hymenopteran suborders. These data were used in a phylogenetic analysis to illustrate the utility and degree of resolution provided by this information at various hierarchical levels within this taxonomically diverse order. Parsimony analysis revealed strong evidence for monophyly of the aculeates and the terebrants. Most noteworthy was a strongly supported clade containing the two terebrant superfamilies Ichneumonoidea and Chalcidoidea. Conversely, high sequence divergence values resulted in instability at the base of the tree and limited resolution at the higher taxonomic levels. Nevertheless, these results do identify those taxonomic levels for which 16S rRNA sequences are phylogenetically informative.
Affiliation(s)
- J N Derr
- Department of Animal Sciences, Texas A&M University, College Station 77843
21
22
Volinia S, Scapoli C, Gambari R, Barale R, Barrai I. Enrichment of oligonucleotide sets with transcription control signals. II: Mammalian DNA. Nucleic Acids Res 1992; 20:551-6. [PMID: 1741289 PMCID: PMC310422 DOI: 10.1093/nar/20.3.551] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.1] [Indexed: 12/28/2022] Open
Abstract
We studied the frequency distribution of oligonucleotides 10 bp long in a sample of 1.6 Mb of mammalian genes, containing 579 sequences from GenBank(R) 55.0, with the aim of detecting transcription control signals. 2216 decamers had a frequency higher than 10 times the mean and were subjected to further statistical analysis. For each of the 2216 decamers (parents), we counted the individual frequencies of the 30 decamers differing from the parent by one base mutation (progeny) and then calculated two variance/mean chi squares for the progeny, with and without the parent. We then studied the distribution of the ratio between the two chi squares. Out of 2216 decamers, 346 had a chi square ratio of 1.9 or larger. In this final set, which corresponds to less than 0.033 per cent of all possible decamers, 18 were found to contain 23 eukaryotic transcription control elements 5-10 bp of length, such as Sp1 and others. Furthermore, when compared to 210 random sets containing 346 decamers, this set contains a highly significant excess of the longer signals.
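The parent/progeny comparison described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' program: the function names are hypothetical, the chi-square is taken to be the usual variance/mean (index of dispersion) statistic, and the direction of the ratio (with parent over without parent) is an assumption, since the abstract does not spell out either detail.

```python
from collections import Counter

BASES = "ACGT"

def single_mutation_neighbors(word):
    """Return every word that differs from `word` at exactly one position."""
    neighbors = []
    for i, original in enumerate(word):
        for base in BASES:
            if base != original:
                neighbors.append(word[:i] + base + word[i + 1:])
    return neighbors

def count_words(sequence, k):
    """Count all overlapping k-mers in a sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))

def dispersion_chi_square(counts):
    """Variance/mean chi-square: sum((x - mean)^2) / mean (assumed form)."""
    mean = sum(counts) / len(counts)
    if mean == 0:
        return 0.0
    return sum((x - mean) ** 2 for x in counts) / mean

def chi_square_ratio(parent, word_counts):
    """Chi-square of progeny plus parent divided by chi-square of progeny alone.

    The direction of the ratio is an assumption made for this sketch.
    """
    progeny = [word_counts.get(w, 0) for w in single_mutation_neighbors(parent)]
    without_parent = dispersion_chi_square(progeny)
    with_parent = dispersion_chi_square(progeny + [word_counts.get(parent, 0)])
    if without_parent == 0:
        return float("inf") if with_parent > 0 else 0.0
    return with_parent / without_parent
```

A decamer has 10 positions with 3 alternative bases each, which gives the 30 progeny mentioned in the abstract. The final set of 346 decamers out of 4^10 = 1,048,576 possible decamers is about 0.033%.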
Affiliation(s)
- S Volinia
- Dipartimento di Biologia Evolutiva e Istituto di Chimica Biologica-Università di Ferrara, Italy
23
Schorderet DF, Gartler SM. Analysis of CpG suppression in methylated and nonmethylated species. Proc Natl Acad Sci U S A 1992; 89:957-61. [PMID: 1736311 PMCID: PMC48364 DOI: 10.1073/pnas.89.3.957] [Citation(s) in RCA: 85] [Impact Index Per Article: 2.7] [Indexed: 12/28/2022] Open
Abstract
The development of nearest-neighbor analysis led to the finding that the frequency of the dinucleotide CpG is markedly depressed in vertebrates. One explanation of this suppression is that methylation of CpG found in vertebrates represents a mutational hot spot through deamination of methylcytidine to thymidine. We have examined the role of methylated CpG as a factor in CpG suppression by comparing CpG distributions in coding regions of 121 genes from six species, three with methylated DNA and three with nonmethylated DNA. Overall base composition shows that all species exhibit CpG suppression, with the methylated forms showing significantly greater suppression than nonmethylated forms. When the data are analyzed by CpG position, the mean values of the methylated forms exhibit greater suppression than nonmethylated forms at positions I-II and II-III, but there is considerable overlap of suppression scores for individual species. At position III-I, CpG suppression is marked in all methylated species, and it is reversed in all nonmethylated species. Our analysis supports the hypothesis that CpG patterns at positions II-III and III-I in methylated forms are affected by mutation acting through deamination of methylcytidine to thymidine. We speculate that the excess of CpGs at position III-I in nonmethylated forms may be related to a requirement for minimal thermal stability of the DNA.
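As a rough illustration of how CpG suppression and its codon-position breakdown might be quantified, the sketch below computes an observed/expected CpG ratio and tallies CpG dinucleotides at codon positions I-II, II-III and III-I. The ratio used here (CpG count times length, divided by the product of C and G counts) is a common convention and an assumption on our part; the abstract does not state which suppression score the authors used, and the function names are hypothetical.

```python
def cpg_obs_exp(sequence):
    """Observed/expected CpG ratio: (#CG * N) / (#C * #G).

    Values well below 1 indicate CpG suppression.
    """
    seq = sequence.upper()
    c_count = seq.count("C")
    g_count = seq.count("G")
    if c_count == 0 or g_count == 0:
        return 0.0
    # "CG" cannot overlap itself, so str.count finds every occurrence.
    return seq.count("CG") * len(seq) / (c_count * g_count)

def cpg_by_codon_position(coding_sequence):
    """Count CpG dinucleotides by codon position in an in-frame coding sequence.

    I-II:   C at codon position 1, G at position 2
    II-III: C at codon position 2, G at position 3
    III-I:  C at codon position 3, G at position 1 of the next codon
    """
    seq = coding_sequence.upper()
    labels = ("I-II", "II-III", "III-I")
    counts = {label: 0 for label in labels}
    for i in range(len(seq) - 1):
        if seq[i:i + 2] == "CG":
            counts[labels[i % 3]] += 1
    return counts
```

For example, in the two codons `ACG CGA` the first CpG falls at position II-III and the second at position I-II.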
Affiliation(s)
- D F Schorderet
- Department of Medicine, University of Washington, Seattle 98195
24
Volinia S, Scapoli C, Gambari R, Barale R, Barrai I. A set of viral DNA decamers enriched in transcription control signals. Nucleic Acids Res 1991; 19:3733-40. [PMID: 1906607 PMCID: PMC328405 DOI: 10.1093/nar/19.13.3733] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.2] [Indexed: 12/29/2022] Open
Abstract
We studied the frequency distribution of oligonucleotides 10 bp long in a sample of 620 Kb of viral genomes, containing 102 sequences from GenBank, with the aim of detecting transcription control signals. Two thousand three hundred decamers had a frequency 10 times higher than the mean and were subjected to further statistical analysis. For each of the 2300 decamers (parents), we counted the individual frequencies of the 30 decamers differing from the parent by one base mutation (progeny) and then calculated two variance/mean chi squares for the progeny, with and without the parent. We then studied the distribution of the ratio between the two chi squares. Out of 2300 decamers, 10 times more frequent than average, 479 decamers had a chi square ratio of 1.9 or larger. In this final set, which corresponds to less than 0.05% of all possible decamers, 58 decamers were found to contain viral and eukaryotic transcription control elements, like NF-kB, Sp1 and others. Furthermore, this set contains an excess of signals of length 5, 6, 7, 8, 9 and 10, when compared to 150 random sets, bootstrapped from the same viral genomes.
Affiliation(s)
- S Volinia
- Dipartimento di Biologia Evolutiva, Università di Ferrara, Italy
25
Harper DS, Song K, Jahn CL. Overamplification of macronuclear linear DNA molecules during prolonged vegetative growth of Oxytricha nova. Gene 1991; 99:55-61. [PMID: 2022323 DOI: 10.1016/0378-1119(91)90033-8] [Citation(s) in RCA: 20] [Impact Index Per Article: 0.6] [Indexed: 12/29/2022] Open
Abstract
During prolonged vegetative growth of a clonal line of Oxytricha nova, several macronuclear linear DNA molecules increased greatly in copy number over the rest of the approx. 24,000 kinds of molecules comprising the macronuclear genome. One of the amplified sequences was the linear DNA molecule encoding rRNA (rDNA). We have cloned and sequenced the other, smaller, amplified molecules and found that they comprise a gene family, with different allelic versions of one of the family members being amplified. Thus, increased replication is a general property of the molecules comprising this gene family. To date, no function has been assigned to these genes; thus, whether the amplification of these sequences has functional significance is unknown. The rDNA molecule and the two small amplified sequences increased 11-, 24- and 107-fold, respectively, during clonal growth of this line, eventually comprising up to 15% of the macronuclear DNA molecules. Seven other macronuclear DNA molecules did not vary substantially in copy number at different times during the clonal growth of this strain. Analysis of cell-to-cell differences in copy numbers in this clonally aged strain indicated more extensive variation than is evident when large populations from different times are compared.
Affiliation(s)
- D S Harper
- Laboratory for Molecular Biology, Department of Biological Sciences, University of Illinois, Chicago 60680
26
Abstract
The analysis of coding sequences reveals nonrandomness in the context of both sense and stop codons. Part of this is related to nucleotide doublet preference, seen also in non-coding sequences and thought to arise from the dependence of mutational events on surrounding sequence. Another nonrandom context element, relating the wobble nucleotides of successive codons, is observed even when doublet preference, codon usage and bias in amino acid doublets are all allowed for. Several phenomena related to protein synthesis have been shown in vivo to be affected by the nucleotide sequence around codons. Thus, nonsense and missense suppression, elongation rate, precision of tRNA selection and polypeptide chain termination are all affected by codon context. At present, it remains unclear how these phenomena may influence the evolution of nonrandomness in the context of codons in natural sequences.
Affiliation(s)
- R H Buckingham
- URA 1139 du CNRS, Institut de Biologie Physico-Chimique, Paris, France
27
Stückle EE, Emmrich C, Grob U, Nielsen PJ. Statistical analysis of nucleotide sequences. Nucleic Acids Res 1990; 18:6641-7. [PMID: 2251125 PMCID: PMC332623 DOI: 10.1093/nar/18.22.6641] [Citation(s) in RCA: 33] [Impact Index Per Article: 1.0] [Indexed: 12/31/2022] Open
Abstract
In order to scan nucleic acid databases for potentially relevant but as yet unknown signals, we have developed an improved statistical model for pattern analysis of nucleic acid sequences by modifying previous methods based on Markov chains. We demonstrate the importance of selecting the appropriate parameters in order for the method to function at all. The model allows the simultaneous analysis of several short sequences with unequal base frequencies and Markov order k not equal to 0 as is usually the case in databases. As a test of these modifications, we show that in E. coli sequences there is a bias against palindromic hexamers which correspond to known restriction enzyme recognition sites.
Affiliation(s)
- E E Stückle
- Max-Planck-Institut für Immunbiologie, Freiburg, FRG
28
Mott RF, Kirkwood TB, Curnow RN. An accurate approximation to the distribution of the length of the longest matching word between two random DNA sequences. Bull Math Biol 1990; 52:773-84. [PMID: 2279194 DOI: 10.1007/bf02460808] [Citation(s) in RCA: 15] [Impact Index Per Article: 0.4] [Indexed: 12/31/2022]
Abstract
An accurate approximation is derived to the distribution of the length of the longest matching word present between two random DNA sequences of finite length, using only elementary probability arguments. The distribution is shown to be consistent with previous asymptotic results for the mean and variance of longest common words. The application of the distribution to assessing the statistical significance of sequence similarities is considered. It is shown how the distribution can be modified to take account of non-independence of neighbouring bases in real sequences.
Affiliation(s)
- R F Mott
- Laboratory of Mathematical Biology, National Institute for Medical Research, Mill Hill, London, U.K
29
Hong J. Prediction of oligonucleotide frequencies based upon dinucleotide frequencies obtained from the nearest neighbor analysis. Nucleic Acids Res 1990; 18:1625-8. [PMID: 2158083 PMCID: PMC330535 DOI: 10.1093/nar/18.6.1625] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.2] [Indexed: 12/30/2022] Open
Abstract
A statistical method of predicting hexanucleotide frequencies is presented. The method requires dinucleotide frequencies which can be readily obtained by nearest neighbor analysis. The frequencies of 64 hexanucleotides of E. coli were estimated and compared well with those predicted by a third order Markov chain.
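One standard way to predict the frequency of a longer word from mono- and dinucleotide frequencies is the first-order Markov estimate: multiply the frequencies of the overlapping dinucleotides and divide by the frequencies of the interior bases. The sketch below shows that estimate only. It is an assumption that this resembles the paper's procedure, which the abstract does not detail, and the helper names are hypothetical.

```python
from collections import Counter

def frequencies(sequence, k):
    """Relative frequencies of all overlapping k-mers in a sequence."""
    total = len(sequence) - k + 1
    counts = Counter(sequence[i:i + k] for i in range(total))
    return {word: count / total for word, count in counts.items()}

def predict_word_frequency(word, mono, di):
    """First-order Markov estimate of a word's frequency.

    f(w1..wk) ~ prod f(w_i w_{i+1}) / prod f(w_j) over interior bases j.
    `mono` and `di` map bases and dinucleotides to relative frequencies.
    """
    numerator = 1.0
    for i in range(len(word) - 1):
        numerator *= di.get(word[i:i + 2], 0.0)
    denominator = 1.0
    for base in word[1:-1]:
        denominator *= mono.get(base, 0.0)
    return numerator / denominator if denominator > 0 else 0.0
```

With uniform base and dinucleotide frequencies (1/4 and 1/16), the estimate for any hexanucleotide reduces to 4^-6, as expected for a random sequence.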
Affiliation(s)
- J Hong
- Biochemical Engineering Program, School of Engineering, University of California, Irvine 92717
30
Goldstein A, Brutlag DL. Is there a relationship between DNA sequences encoding peptide ligands and their receptors? Proc Natl Acad Sci U S A 1989; 86:42-5. [PMID: 2536158 PMCID: PMC286399 DOI: 10.1073/pnas.86.1.42] [Citation(s) in RCA: 23] [Impact Index Per Article: 0.7] [Indexed: 01/01/2023] Open
Abstract
It has been suggested that the coding for a ligand and its receptor may have originated in inverse complementary strands of the same DNA. This would imply a deficiency of stop codons in the complementary strand of the ligand message sequence. We have sought evidence of such deficiencies by an analysis of the usage of selected codons in 23 human neuropeptide and hormone mRNA sequences. We have also searched directly for similarities between substance K or substance P and the substance K receptor. Although bovine proopiomelanocortin has an open reading frame for the full extent of the inverse complement of the coding region, this seems to be a unique case. The data as a whole do not support the hypothesis.
Affiliation(s)
- A Goldstein
- Department of Pharmacology, Stanford University School of Medicine, CA 94305
31
Tavaré S, Song B. Codon preference and primary sequence structure in protein-coding regions. Bull Math Biol 1989; 51:95-115. [PMID: 2706404 DOI: 10.1007/bf02458838] [Citation(s) in RCA: 21] [Impact Index Per Article: 0.6] [Indexed: 01/02/2023]
Abstract
The stochastic complexity of a data base of 365 protein-coding regions is analysed. When the primary sequence is modeled as a spatially homogeneous Markov source, the fit to observed codon preference is very poor. The situation improves substantially when a non-homogeneous model is used. Some implications for the estimation of species phylogeny and substitution rates are discussed.
32
Dubnick M, Lewis LK, Mount DW. BIGPROBE: a computer program that predicts the sequence of long oligonucleotide probes with high reliability. Nucleic Acids Res 1988; 16:1703-14. [PMID: 3353219 PMCID: PMC338165 DOI: 10.1093/nar/16.5.1703] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.0] [Indexed: 01/05/2023] Open
Abstract
We have written a computer program, BIGPROBE, which facilitates the design of long nucleic acid probes from the partial or complete amino acid sequence of a protein. BIGPROBE relies upon information on codon usage, intercodon dinucleotide frequency, and potential probe self-complementarity. We have examined the accuracy with which the program predicts coding sequences using sample human and rat genes and probe lengths of 30-60 nucleotides. Rat probe sequences selected by BIGPROBE using either codon usage or dinucleotide frequency data alone averaged 86-92% homology with the known exons of the corresponding gene sequences. Predictive accuracy with rat gene probes could be improved to 89-94%, depending upon probe length, by applying codon usage and dinucleotide frequency data in combination. Similar accuracy was achieved for human genes.
Affiliation(s)
- M Dubnick
- Department of Molecular and Cellular Biology, University of Arizona, Tucson 85721
33
Altschul SF, Erickson BW. Significance levels for biological sequence comparison using non-linear similarity functions. Bull Math Biol 1988; 50:77-92. [PMID: 3370371 DOI: 10.1007/bf02459979] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.4] [Indexed: 01/05/2023]
34
Baron MD, Davison MD, Jones P, Critchley DR. The sequence of chick alpha-actinin reveals homologies to spectrin and calmodulin. J Biol Chem 1987. [DOI: 10.1016/s0021-9258(18)45426-9] [Citation(s) in RCA: 81] [Impact Index Per Article: 2.2] [Indexed: 10/22/2022] Open
35
Abstract
Higher plant nuclear sequences reveal avoidance of CpG and TpA doublets. Chloroplast sequences avoid the TpA doublet in all codon positions. The chloroplast genome is not methylated but codon positions II-III and untranslated regions avoid CpG. The mitochondrial genome, also unmethylated, avoids CpG in all codon positions. We therefore deduce that methylation is not sufficient to explain CpG avoidance in the higher plant systems. Other factors must be taken into account such as amino acid composition, codon choices and perhaps stability of the DNA helix.
36
Abstract
Although vertebrate DNA is generally depleted in the dinucleotide CpG, it has recently been shown that some vertebrate genes contain CpG islands, regions of DNA with a high G+C content and a high frequency of CpG dinucleotides relative to the bulk genome. In this study, a large number of sequences of vertebrate genes were screened for the presence of CpG islands. Each CpG island was then analysed in terms of length, nucleotide composition, frequency of CpG dinucleotides, and location relative to the transcription unit of the associated gene. CpG islands were associated with the 5' ends of all housekeeping genes and many tissue-specific genes, and with the 3' ends of some tissue-specific genes. A few genes contained both 5' and 3' CpG islands, separated by several thousand base-pairs of CpG-depleted DNA. The 5' CpG islands extended through 5'-flanking DNA, exons and introns, whereas most of the 3' CpG islands appeared to be associated with exons. CpG islands were generally found in the same position relative to the transcription unit of equivalent genes in different species, with some notable exceptions. The locations of G/C boxes, composed of the sequence GGGCGG or its reverse complement CCGCCC, were investigated relative to the location of CpG islands. G/C boxes were found to be rare in CpG-depleted DNA and plentiful in CpG islands, where they occurred in 3' CpG islands, as well as in 5' CpG islands associated with tissue-specific and housekeeping genes. G/C boxes were located both upstream and downstream from the transcription start site of genes with 5' CpG islands. Thus, G/C boxes appeared to be a feature of CpG islands in general, rather than a feature of the promoter region of housekeeping genes. 
Two theories for the maintenance of a high frequency of CpG dinucleotides in CpG islands were tested: that CpG islands in methylated genomes are maintained, despite a tendency for 5mCpG to mutate by deamination to TpG+CpA, by the structural stability of a high G+C content alone, and that CpG islands associated with exons result from some selective importance of the arginine codon CGX. Neither of these theories could account for the distribution of CpG dinucleotides in the sequences analysed. Possible functions of CpG islands in transcriptional and post-transcriptional regulation of gene expression were discussed, and were related to theories for the maintenance of CpG islands as "methylation-free zones" in germline DNA.
Affiliation(s)
- M Gardiner-Garden
- Kanematsu Laboratories, Royal Prince Alfred Hospital, Camperdown N.S.W., Australia
37
Price GJ, Jones P, Davison MD, Patel B, Eperon IC, Critchley DR. Isolation and characterization of a vinculin cDNA from chick-embryo fibroblasts. Biochem J 1987; 245:595-603. [PMID: 3117046 PMCID: PMC1148163 DOI: 10.1042/bj2450595] [Citation(s) in RCA: 44] [Impact Index Per Article: 1.2] [Indexed: 01/04/2023]
Abstract
A chick-embryo fibroblast lambda gt11 cDNA library was screened with affinity-purified antibodies to chick gizzard vinculin. One recombinant was purified to homogeneity and the fusion protein expressed in Escherichia coli strain C600. The fusion protein was unstable, but polypeptides that reacted with vinculin antibodies, but not non-immune immunoglobulin, were detected by Western blotting. The recombinant contained a single EcoRI fragment of 2891 bp with a single open reading frame. The deduced protein sequence could be aligned with that of six CNBr-cleavage peptides and two tryptic peptides derived from chicken gizzard vinculin. AUG-247 has tentatively been identified as the initiation codon, as it is contained within the consensus sequence for initiation sites of higher eukaryotes. The cDNA lacks 3' sequence and encodes 74% of the vinculin sequence, presuming the molecular mass of vinculin to be 130,000 Da. Analysis of the deduced sequence showed no homologies with other protein sequences, but it does display a triple internal repeat of 112 amino acid residues covering residues 259-589. The sequences surrounding the seven tyrosine residues in the available sequence were aligned with the tyrosine autophosphorylation consensus sequence found in protein tyrosine kinases. Tyr-822 showed a good match to this consensus, and may represent one of the two major sites of tyrosine phosphorylation by pp60v-src. Northern blots showed that the 2.89 kb vinculin cDNA hybridized to one size of mRNA (approx. 7 kb) in chick-embryo fibroblasts, chick smooth muscle and chick skeletal muscle. Southern blots revealed multiple hybridizing bands in genomic DNA.
Affiliation(s)
- G J Price
- Department of Biochemistry, University of Leicester, U.K
38
Cooper DN, Gerber-Huber S, Nardelli D, Schubiger JL, Wahli W. The distribution of the dinucleotide CpG and cytosine methylation in the vitellogenin gene family. J Mol Evol 1987; 25:107-15. [PMID: 3116270 DOI: 10.1007/bf02101752] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.4] [Indexed: 01/04/2023]
Abstract
Sequence data from regions of five vertebrate vitellogenin genes were used to examine the frequency, distribution, and mutability of the dinucleotide CpG, the preferred modification site for eukaryotic DNA methyltransferases. The observed level of the CpG dinucleotide in all five genes was markedly lower than that expected from the known mononucleotide frequencies. CpG suppression was greater in introns than in exons. CpG-containing codons were found to be avoided in the vitellogenin genes, but not completely despite the redundancy of the genetic code. Frequency and distribution patterns of this dinucleotide varied dramatically among these otherwise closely related genes. Dense clusters of CpG dinucleotides tended to appear in regions of either functional or structural interest (e.g., in the transposon-like Vi-element of Xenopus) and these clusters contained 5-methylcytosine (5mC). 5mC is known to undergo deamination to form thymidine, but the extent to which this transition occurs in the heavily methylated genomes of vertebrates and its contribution to CpG suppression are still unclear. Sequence comparison of the methylated vitellogenin gene regions identified C→T and G→A substitutions that were found to occur at relatively high frequencies. The predicted products of CpG deamination, TpG and CpA, were elevated. These findings are consistent with the view that CpG distribution and methylation are interdependent and that deamination of 5mC plays an important role in promoting evolutionary change at the nucleotide sequence level.
Affiliation(s)
- D N Cooper
- Department of Neurochemistry, University of London, United Kingdom
39
Adams RL, Davis T, Rinaldi A, Eason R. CpG deficiency, dinucleotide distributions and nucleosome positioning. Eur J Biochem 1987; 165:107-15. [PMID: 3569286 DOI: 10.1111/j.1432-1033.1987.tb11200.x] [Citation(s) in RCA: 22] [Impact Index Per Article: 0.6] [Indexed: 01/06/2023]
Abstract
The dinucleotide CpG is deficient in (A + T)-rich regions of vertebrate DNA in both coding and non-coding sequences and there is a corresponding increase above expectation in the occurrence of TpG and CpA. By contrast in (G + C)-rich regions no deficiency of CpG is found. Such (G + C)-rich sequences, containing the expected number of CpG dinucleotides, alternate along the genome with (A + T)-rich sequences which have a lower than expected CpG content. The G + C content of vertebrate DNA can oscillate with a period of 150-200 bp and this may be a factor in positioning nucleosomes. The role of mutagenesis in loss of CpG and increase of A + T, particularly in non-coding regions, is discussed.
40
Trifonov EN. Translation framing code and frame-monitoring mechanism as suggested by the analysis of mRNA and 16 S rRNA nucleotide sequences. J Mol Biol 1987; 194:643-52. [PMID: 2443708 DOI: 10.1016/0022-2836(87)90241-5] [Citation(s) in RCA: 179] [Impact Index Per Article: 4.8] [Indexed: 01/01/2023]
Abstract
Protein coding sequences carry an additional message in the form of a universal three-base periodical pattern (G-non-G-N)n, which is expressed as a strong preference for guanines in the first positions of the codons in mRNA and lack of guanines in the second positions. This periodicity appears immediately after the initiation codon and is maintained along the mRNA as far as the termination triplet, where it disappears abruptly. Known cases of ribosome slippage during translation (leaky frameshifts, out-of-frame gene fusion) are analyzed. At the sites of the slippage the G-periodical pattern is found to be interrupted. It reappears downstream from the slippage sites, in a new frame that corresponds to the new translation frame. This suggests that the (G-non-G-N)n pattern in the mRNA may be responsible for monitoring the correct reading frame during translation. Several sites with complementary C-periodical structure are found in the Escherichia coli 16 S rRNA sequence. Only three of them are exposed to various interactions at the surface of the small ribosomal subunit: (517)gcCagCagCcgC, (1395)caCacCgcC and (1531)auCacCucC. A model of a frame-monitoring mechanism is suggested based on the weak complementarity of G-periodical mRNA to the C-periodical sites in the ribosomal RNA. The model is strongly supported by the fact that the hypothetical frame-monitoring sites in the 16 S rRNA that are derived from the nucleotide sequence analysis are also the only sites known to be actually involved or implicated in rRNA-mRNA interactions.
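The (G-non-G-N)n periodicity can be checked on a sequence by measuring how often G occupies each codon position. The sketch below is a minimal illustration under our own assumptions: the function names are hypothetical, and scoring a frame by the difference between the G fraction at position 1 and at position 2 is a simple stand-in for whatever statistic the paper used.

```python
def g_fraction_by_codon_position(coding_sequence):
    """Fraction of G at codon positions 1, 2 and 3 of an in-frame sequence."""
    seq = coding_sequence.upper().replace("U", "T")
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    totals = [0, 0, 0]
    g_counts = [0, 0, 0]
    for i in range(usable):
        position = i % 3
        totals[position] += 1
        if seq[i] == "G":
            g_counts[position] += 1
    return [g_counts[p] / totals[p] if totals[p] else 0.0 for p in range(3)]

def best_frame(sequence):
    """Return the offset (0, 1 or 2) whose frame best fits G-non-G-N.

    The score is G fraction at position 1 minus G fraction at position 2,
    an assumption chosen for simplicity.
    """
    scores = []
    for offset in range(3):
        first, second, _ = g_fraction_by_codon_position(sequence[offset:])
        scores.append(first - second)
    return max(range(3), key=lambda offset: scores[offset])
```

For the codons `GCA GAT GTC`, G occupies every first position and no second position, so the fractions are 1.0, 0.0 and 0.0. Prefixing one extra base shifts the best-scoring frame to offset 1.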
Affiliation(s)
- E N Trifonov
- Department of Polymer Research, Weizmann Institute of Science, Rehovot, Israel
41
Gruskin KD, Smith TF, Goodman M. Possible origin of a calmodulin gene that lacks intervening sequences. Proc Natl Acad Sci U S A 1987; 84:1605-8. [PMID: 3470746 PMCID: PMC304484 DOI: 10.1073/pnas.84.6.1605] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.2] [Indexed: 01/05/2023] Open
Abstract
The divergent, muscle-specific allele of the chicken calmodulin gene contains no intervening sequences and apparently was produced by a reverse transcriptase-mediated event. The nucleotide and deduced amino acid sequences of this gene were compared with nucleotide and amino acid sequence data of other known calmodulin genes in order to investigate its evolutionary history. These comparisons, as well as the CpG dinucleotide content, support the conclusion that this highly divergent chicken calmodulin gene did not exist for any significant period of time as a pseudogene and suggest plausible alternative genetic histories. The most parsimonious history involves the viral import of a very old foreign gene of high CpG content.
42
Hyde JE, Sims PF. Anomalous dinucleotide frequencies in both coding and non-coding regions from the genome of the human malaria parasite Plasmodium falciparum. Gene 1987; 61:177-87. [PMID: 3327756 DOI: 10.1016/0378-1119(87)90112-0] [Citation(s) in RCA: 38] [Impact Index Per Article: 1.0] [Indexed: 01/05/2023] Open
Abstract
We have statistically analysed the distribution of nucleotides and dinucleotides in 21 genes of the 81% A + T-rich human malaria parasite Plasmodium falciparum. The mRNA-synonymous strands of this protozoan show in general a marked excess of purines over pyrimidines, correlated with abnormally high levels of Lys and Glu. We have used the large differences in base composition between coding and non-coding regions to estimate that the parasite possesses in the range of 2700-5400 genes. The dinucleotide preference patterns are compared with consensus patterns derived from other organisms [Nussinov, Nucl. Acids Res. 12 (1984) 1749-1763]. Patterns in the coding regions surprisingly resemble those of higher, rather than lower eukaryotes, particularly with respect to TG elevation and CG suppression. The latter is correlated with an abnormally low level of Arg in these parasites. In the non-coding regions, the four dinucleotides made up of C and/or G are found with significantly higher frequencies than expected (approx. 50-150%), specifically to the 5' side of the coding regions. The possible role of these dinucleotides in control sequences is discussed.
Affiliation(s)
- J E Hyde
- Department of Biochemistry and Applied Molecular Biology, University of Manchester Institute of Science and Technology, U.K
43
Abstract
The genome of the human malaria parasite Plasmodium falciparum has an A + T content of about 82%, higher than any other organism whose DNA has been characterized. Computer analysis of 36 kb of available nucleotide sequences from this species showed that the coding regions, with an A + T content of 69.0%, are flanked by more A + T-rich regions of 86.0% A + T. Within the coding sequences, the A/T ratio was 1.68 in the mRNA sense strand, and overall A + T content in the three codon positions increased in the order 1st-2nd-3rd position. Codons with T or especially A in the third position were strongly preferred. Codon usage among individual parasite genes was very similar compared to genes from other species. Dinucleotide frequencies for the parasite DNA were close to those expected for a random sequence with the known base composition, except that the CpG frequency in the coding sequences was low.
44
Altschul SF, Erickson BW. A nonlinear measure of subalignment similarity and its significance levels. Bull Math Biol 1986; 48:617-32. [PMID: 3495309 DOI: 10.1007/bf02462327] [Citation(s) in RCA: 40] [Impact Index Per Article: 1.1] [Indexed: 01/06/2023]
45
Hall JD, Gibbs JS, Coen DM, Mount DW. Structural organization and unusual codon usage in the DNA polymerase gene from herpes simplex virus type 1. DNA 1986; 5:281-8. [PMID: 3017656 DOI: 10.1089/dna.1986.5.281] [Citation(s) in RCA: 18] [Impact Index Per Article: 0.5] [Indexed: 01/03/2023]
Abstract
We have analyzed the protein and nucleic acid sequences of the DNA polymerase from herpes simplex virus type 1 (HSV-1) to provide insight into the expression and possible structure of this enzyme. Extensive similarity between the amino acid sequence and that of the Epstein Barr virus DNA polymerase is reported. We describe probable structural similarities between these proteins and the use of these similarities to define structural and functional domains within the polymerase. Analysis of base composition and codon usage reveals that several genes from HSV-1, including DNA polymerase, exhibit a strong preference for guanine or cytosine at the third codon position. This preference may result from the high guanine + cytosine content of the virus and produces a highly restricted codon usage, different from that of the host cell. Consequences of the unusual codon usage for viral expression include the potential for extensive mRNA secondary structure.
|
46
|
Blaisdell BE. A measure of the similarity of sets of sequences not requiring sequence alignment. Proc Natl Acad Sci U S A 1986; 83:5155-9. [PMID: 3460087 PMCID: PMC323909 DOI: 10.1073/pnas.83.14.5155] [Citation(s) in RCA: 248] [Impact Index Per Article: 6.5] [Indexed: 01/05/2023]
Abstract
Determination of first- and second-order Markov chain homogeneity of sets of nuclear eukaryotic DNA sequences, both coding and noncoding, finds similarities imperceptible to the standard Needleman-Wunsch base matching or dot-matrix algorithms. These measures of the similarities of the distributions of adjacent pairs or triplets are in agreement with accepted evolutionary-tree topologies. Hierarchical clustering of the distributions of doublets of 30 miscellaneous coding sequences gives clusters in reasonable agreement with accepted biological classifications. In addition to similarity by homology, there is also observed similarity of disparate genes in the same organism--for example, all three disparate yeast genes (two enzymes and actin) form a well-distinguished cluster.
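As a rough illustration of comparing sequences by their distributions of adjacent pairs rather than by alignment, the sketch below computes doublet (dinucleotide) frequencies and a plain Euclidean distance between them. The function names and the choice of Euclidean distance are assumptions made for illustration; this is not the Markov chain homogeneity statistic used in the paper.

```python
from collections import Counter
from itertools import product
import math


def doublet_frequencies(seq, alphabet="ACGT"):
    """Return the relative frequency of each adjacent pair, in a fixed order."""
    # Count overlapping doublets: seq[0:2], seq[1:3], ...
    counts = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    keys = ["".join(p) for p in product(alphabet, repeat=2)]
    total = sum(counts[k] for k in keys)
    return [counts[k] / total if total else 0.0 for k in keys]


def doublet_distance(seq_a, seq_b):
    """Euclidean distance between doublet frequency vectors (illustrative only)."""
    fa = doublet_frequencies(seq_a)
    fb = doublet_frequencies(seq_b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(fa, fb)))
```

A matrix of such pairwise distances could then be passed to any hierarchical clustering routine; no alignment step is involved.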
|
47
|
Michel CJ. New statistical approach to discriminate between protein coding and non-coding regions in DNA sequences and its evaluation. J Theor Biol 1986; 120:223-36. [PMID: 3784581 DOI: 10.1016/s0022-5193(86)80176-x] [Citation(s) in RCA: 24] [Impact Index Per Article: 0.6] [Indexed: 01/07/2023]
Abstract
We propose a new approach to study protein coding and non-coding regions in DNA sequences, by making use of two complementary statistical methods. The principal component analysis (PCA) is a graphical method to represent DNA sequences which are characterized by some quantitative parameters: it is a help to the intuition. The discriminating analysis (DA) is a quantitative method which permits to classify the DNA sequences. It leads to an evaluation of the first method and to a decision. The value of this approach has been confirmed since we also have found some results which had been described recently in the literature. Furthermore, this general methodology has permitted us to show the existence of parameters which identify the nucleic acid sequence functional domains, without having to make use of the properties of the genetic code.
|
48
|
Fristensky B. Improving the efficiency of dot-matrix similarity searches through use of an oligomer table. Nucleic Acids Res 1986; 14:597-610. [PMID: 3753792 PMCID: PMC339447 DOI: 10.1093/nar/14.1.597] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.2] [Indexed: 01/07/2023]
Abstract
Dot-matrix sequence similarity searches can be greatly speeded up through use of a table listing all locations of short oligomers in one of the sequences to find potential similarities with a second sequence. The algorithm described finds similarities between two sequences of lengths M and N, comparing L residues at a time, with an efficiency of L × M × N/S^k, where S is the alphabet size and k is the length of the oligomer. For nucleic acids, in which S = 4, use of a tetranucleotide table results in an efficiency of L × M × N/256. The simplicity of the approach allows for a straightforward calculation of the level of similarities expected to be found for given search parameters. Furthermore, the storage required is minimal, allowing for even large sequences to be compared on small microcomputers. Theoretical considerations regarding the use of this search are discussed.
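The oligomer-table idea can be sketched as follows: index every k-mer of one sequence, then compare full windows only at positions where the second sequence shares a k-mer with the first. This is a minimal sketch under assumed names and parameters (`build_oligomer_table`, `dot_matrix_hits`, `window`, `min_matches` are all hypothetical), not the published implementation.

```python
from collections import defaultdict


def build_oligomer_table(seq, k):
    """Map each length-k oligomer in seq to the list of its start positions."""
    table = defaultdict(list)
    for i in range(len(seq) - k + 1):
        table[seq[i:i + k]].append(i)
    return table


def dot_matrix_hits(seq_a, seq_b, k=4, window=8, min_matches=6):
    """Return sorted (i, j) pairs whose windows share at least min_matches residues.

    Windows are only compared where seq_a[i:i+k] == seq_b[j:j+k], so most of
    the M x N position pairs are never examined.
    """
    table = build_oligomer_table(seq_a, k)
    hits = set()
    for j in range(len(seq_b) - window + 1):
        # Look up positions in seq_a that start with the same k-mer as seq_b[j:].
        for i in table.get(seq_b[j:j + k], ()):
            if i + window > len(seq_a):
                continue  # window would run past the end of seq_a
            matches = sum(a == b for a, b in
                          zip(seq_a[i:i + window], seq_b[j:j + window]))
            if matches >= min_matches:
                hits.add((i, j))
    return sorted(hits)
```

Because a random k-mer matches with probability of roughly 1/S^k, the seeding step is what yields the reduction in window comparisons described in the abstract; with k = 4 on nucleotides that factor is 1/256.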
|
49
|
Fitch WM, Smith T, Breslow JL. Detecting internally repeated sequences and inferring the history of duplication. Methods Enzymol 1986; 128:773-88. [PMID: 3088394 DOI: 10.1016/0076-6879(86)28105-7] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.1] [Indexed: 01/04/2023]
|
50
|
A unique element resembling a processed pseudogene. J Biol Chem 1986. [DOI: 10.1016/s0021-9258(17)42420-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/17/2022]
|