1
Bioinformatics Approaches to Gain Insights into cis-Regulatory Motifs Involved in mRNA Localization. Advances in Experimental Medicine and Biology 2019; 1203:165-194. [PMID: 31811635 DOI: 10.1007/978-3-030-31434-7_7] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 12/21/2022]
Abstract
Messenger RNA (mRNA) is a fundamental intermediate in the expression of proteins. As an integral part of this process, protein production can be localized by targeting mRNA to a specific subcellular compartment. The subcellular destination of an mRNA is thought to be governed by a region of its primary sequence or secondary structure, which dictates the recruitment of trans-acting factors, such as RNA-binding proteins or regulatory RNAs, to form a messenger ribonucleoprotein particle. This molecular ensemble is required for precise spatiotemporal control of gene expression. In the context of RNA localization, the description of the binding preferences of an RNA-binding protein defines a motif, and one or more instances of a given motif are defined as a localization element (zip code). In this chapter, we first discuss the cis-regulatory motifs previously identified as mRNA localization elements. We then describe motif representation in terms of entropy and information content and offer an overview of motif databases and search algorithms. Finally, we provide an outline of the motif topology of asymmetrically localized mRNA molecules.
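The entropy and information-content representation mentioned in the abstract can be illustrated with a minimal Python sketch. It is not taken from the chapter: the aligned sites below are hypothetical, the function names are illustrative, a uniform background over four bases is assumed, and no small-sample correction is applied.

```python
import math
from collections import Counter

def column_information(column, alphabet="ACGU"):
    """Information content in bits: log2(alphabet size) minus Shannon entropy."""
    counts = Counter(column)
    total = len(column)
    entropy = 0.0
    for base in alphabet:
        p = counts.get(base, 0) / total
        if p > 0:
            entropy -= p * math.log2(p)
    return math.log2(len(alphabet)) - entropy

def motif_information(sites, alphabet="ACGU"):
    """Per-position information content for equal-length aligned sites."""
    return [column_information("".join(col), alphabet) for col in zip(*sites)]

# Hypothetical aligned binding sites, for illustration only.
sites = ["UGUA", "UGUA", "UGCA", "UGUU"]
info = motif_information(sites)
print([round(bits, 3) for bits in info])
```

Fully conserved positions reach the maximum of 2 bits, and positions with mixed bases score lower.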
2
Zhao X, Sze SH. Motif finding in DNA sequences based on skipping nonconserved positions in background Markov chains. J Comput Biol 2011; 18:759-70. [PMID: 21554019 DOI: 10.1089/cmb.2010.0197] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/12/2022]
Abstract
One strategy to identify transcription factor binding sites is through motif finding in upstream DNA sequences of potentially co-regulated genes. Despite extensive efforts, none of the existing algorithms perform very well. We consider a string representation that allows arbitrary ignored positions within the nonconserved portion of single motifs, and use O(2^l) Markov chains to model the background distributions of motifs of length l while skipping these positions within each Markov chain. By focusing initially on positions that have fixed nucleotides to define core occurrences, we develop an algorithm to identify motifs of moderate lengths. We compare the performance of our algorithm to other motif finding algorithms on a few benchmark data sets, and show that significant improvement in accuracy can be obtained when the sites are sufficiently conserved within a given sample, while comparable performance is obtained when the site conservation rate is low. A software program (PosMotif) and detailed results are available online at http://faculty.cse.tamu.edu/shsze/posmotif.
Affiliation(s)
- Xiaoyan Zhao
- Department of Computer Science & Engineering, Texas A&M University, College Station, Texas 77843, USA
3
Dekhtyar M, Morin A, Sakanyan V. Triad pattern algorithm for predicting strong promoter candidates in bacterial genomes. BMC Bioinformatics 2008; 9:233. [PMID: 18471287 PMCID: PMC2412878 DOI: 10.1186/1471-2105-9-233] [Citation(s) in RCA: 15] [Impact Index Per Article: 0.9] [Received: 11/19/2007] [Accepted: 05/09/2008] [Indexed: 11/17/2022]
Abstract
BACKGROUND Bacterial promoters, which increase the efficiency of gene expression, differ from other promoters by several characteristics. This difference, not yet widely exploited in bioinformatics, looks promising for the development of relevant computational tools to search for strong promoters in bacterial genomes. RESULTS We describe a new triad pattern algorithm that predicts strong promoter candidates in annotated bacterial genomes by matching specific patterns for the group I sigma70 factors of Escherichia coli RNA polymerase. It detects promoter-specific motifs by consecutively matching three patterns, consisting of an UP-element, required for interaction with the alpha subunit, and then optimally-separated patterns of -35 and -10 boxes, required for interaction with the sigma70 subunit of RNA polymerase. Analysis of 43 bacterial genomes revealed that the frequency of candidate sequences depends on the A+T content of the DNA under examination. The accuracy of in silico prediction was experimentally validated for the genome of a hyperthermophilic bacterium, Thermotoga maritima, by applying a cell-free expression assay using the predicted strong promoters. In this organism, the strong promoters govern genes for translation, energy metabolism, transport, cell movement, and other as-yet unidentified functions. CONCLUSION The triad pattern algorithm developed for predicting strong bacterial promoters is well suited for analyzing bacterial genomes with an A+T content of less than 62%. This computational tool opens new prospects for investigating global gene expression, and individual strong promoters in bacteria of medical and/or economic significance.
Affiliation(s)
- Amelie Morin
- Laboratoire de Biotechnologie, UMR CNRS 6204, Université de Nantes, 2 rue de la Houssinière, 44322 Nantes, France
- Vehary Sakanyan
- Laboratoire de Biotechnologie, UMR CNRS 6204, Université de Nantes, 2 rue de la Houssinière, 44322 Nantes, France
- ProtNeteomix, 2 rue de la Houssinière, 44322 Nantes, France
4
5
Sandve GK, Drabløs F. A survey of motif discovery methods in an integrated framework. Biol Direct 2006; 1:11. [PMID: 16600018 PMCID: PMC1479319 DOI: 10.1186/1745-6150-1-11] [Citation(s) in RCA: 109] [Impact Index Per Article: 6.1] [Received: 03/29/2006] [Accepted: 04/06/2006] [Indexed: 11/10/2022]
Abstract
BACKGROUND There has been a growing interest in computational discovery of regulatory elements, and a multitude of motif discovery methods have been proposed. Computational motif discovery has been used with some success in simple organisms like yeast. However, as we move to higher organisms with more complex genomes, more sensitive methods are needed. Several recent methods try to integrate additional sources of information, including microarray experiments (gene expression and ChIP-chip). There is also a growing awareness that regulatory elements work in combination, and that this combinatorial behavior must be modeled for successful motif discovery. However, the multitude of methods and approaches makes it difficult to get a good understanding of the current status of the field. RESULTS This paper presents a survey of methods for motif discovery in DNA, based on a structured and well-defined framework that integrates all relevant elements. Existing methods are discussed according to this framework. CONCLUSION The survey shows that although no single method takes all relevant elements into consideration, a very large number of different models treating the various elements separately have been tried. Very often the choices that have been made are not explicitly stated, making it difficult to compare different implementations. Also, the tests that have been used are often not comparable. Therefore, a stringent framework and improved test methods are needed to evaluate the different approaches in order to conclude which ones are most promising. REVIEWERS This article was reviewed by Eugene V. Koonin, Philipp Bucher (nominated by Mikhail Gelfand) and Frank Eisenhaber.
Affiliation(s)
- Geir Kjetil Sandve
- Department of Computer and Information Science, NTNU – Norwegian University of Science and Technology, N-7052, Trondheim, Norway
- Finn Drabløs
- Department of Cancer Research and Molecular Medicine, NTNU – Norwegian University of Science and Technology, N-7006, Trondheim, Norway
6
Marsan L, Sagot MF. Algorithms for extracting structured motifs using a suffix tree with an application to promoter and regulatory site consensus identification. J Comput Biol 2001; 7:345-62. [PMID: 11108467 DOI: 10.1089/106652700750050826] [Citation(s) in RCA: 171] [Impact Index Per Article: 7.4] [Indexed: 11/12/2022]
Abstract
This paper introduces two exact algorithms for extracting conserved structured motifs from a set of DNA sequences. Structured motifs may be described as an ordered collection of p ≥ 1 "boxes" (each box corresponding to one part of the structured motif), p substitution rates (one for each box) and p - 1 intervals of distance (one for each pair of successive boxes in the collection). The contents of the boxes, that is, the motifs themselves, are unknown at the start of the algorithm; finding them is precisely what the algorithms are meant to do. A suffix tree is used for finding such motifs. The algorithms are efficient enough to infer site consensus sequences, such as promoter sequences or regulatory sites, from a set of unaligned sequences corresponding to the noncoding regions upstream from all genes of a genome. In particular, the time complexity of both algorithms scales linearly with N^2 n, where n is the average length of the sequences and N their number. An application to the identification of promoter and regulatory consensus sequences in bacterial genomes is shown.
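The structured-motif model described above (ordered boxes, one substitution limit per box, one distance interval between successive boxes) can be made concrete with a small Python sketch. This is deliberately not the suffix-tree inference algorithm of the paper: the boxes here are known in advance and the sequence is scanned by brute force, purely to show what an occurrence of a structured motif is. All names and the example sequence are hypothetical.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def find_structured_motif(seq, boxes, max_subs, gaps):
    """Return the start positions of every occurrence of a structured motif.

    boxes    : list of p box strings
    max_subs : list of p substitution limits, one per box
    gaps     : list of p - 1 (min, max) spacer lengths between successive boxes
    """
    hits = []

    def extend(box_idx, start, starts):
        box = boxes[box_idx]
        end = start + len(box)
        if end > len(seq) or hamming(seq[start:end], box) > max_subs[box_idx]:
            return
        starts = starts + [start]
        if box_idx == len(boxes) - 1:
            hits.append(tuple(starts))
            return
        lo, hi = gaps[box_idx]
        for gap in range(lo, hi + 1):
            extend(box_idx + 1, end + gap, starts)

    for i in range(len(seq)):
        extend(0, i, [])
    return hits

# Hypothetical sequence with two boxes separated by a 17-base spacer.
seq = "GG" + "TTGACA" + "A" * 17 + "TATAAT" + "CC"
hits = find_structured_motif(seq, ["TTGACA", "TATAAT"], [1, 1], [(15, 19)])
print(hits)
```

Each hit is a tuple holding the start position of every box, so a two-box motif yields pairs.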
Affiliation(s)
- L Marsan
- Institut Gaspard Monge, Université de Marne la Vallée 5
7
Vanet A, Marsan L, Labigne A, Sagot MF. Inferring regulatory elements from a whole genome. An analysis of Helicobacter pylori sigma(80) family of promoter signals. J Mol Biol 2000; 297:335-53. [PMID: 10715205 DOI: 10.1006/jmbi.2000.3576] [Citation(s) in RCA: 60] [Impact Index Per Article: 2.5] [Indexed: 11/22/2022]
Abstract
Helicobacter pylori is adapted to life in a unique niche, the gastric epithelium of primates. Its promoters may therefore be different from those of other bacteria. Here, we determine motifs possibly involved in the recognition of such promoter sequences by the RNA polymerase using a new motif identification method. An important feature of this method is that the motifs are sought with the fewest possible assumptions about what they may look like. The method starts by considering the whole genome of H. pylori and attempts to infer directly from it a description for a family of promoters. Thus, this approach differs from searching for such promoters with a previously established description. The two algorithms are based on the idea of inferring motifs by flexibly comparing words in the sequences with an external object, instead of between themselves. The first algorithm infers single motifs, the second a combination of two motifs separated from one another by strictly defined, sterically constrained distances. Besides independently finding motifs known to be present in other bacteria, such as the Shine-Dalgarno sequence and the TATA-box, this approach suggests the existence in H. pylori of a new, combined motif, TTAAGC, followed optimally 21 bp downstream by TATAAT. Between these two motifs, there is in some cases another, TTTTAA or, less frequently, a repetition of TTAAGC separated optimally from the TATA-box by 12 bp. The combined motif TTAAGCx(21±2)TATAAT is present with no errors immediately upstream from the only two copies of the ribosomal 23S-5S RNA genes in H. pylori, and with one error upstream from the only two copies of the ribosomal 16S RNA genes. The operons of both ribosomal RNA molecules are strongly expressed, representing an encouraging sign of the pertinence of the motifs found by the algorithms.
In 25 cases out of a possible 30, the combined motif is found with no more than three substitutions immediately upstream from ribosomal proteins, or operons containing a ribosomal protein. This is roughly the same frequency of occurrence as for TTGACAx(15-19)TATAAT (with the same maximum number of substitutions allowed) described as being the sigma(70) promoter sequence consensus in Bacillus subtilis and Escherichia coli. The frequency of occurrence of the new motif obtained, TTAAGCx(19-23)TATAAT, remains high when all protein genes in H. pylori are considered, as is the case for the TTGACAx(15-19)TATAAT motif in B. subtilis but not in E. coli.
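The spaced notation TTAAGCx(19-23)TATAAT used in the abstract maps directly onto a regular expression. The Python sketch below is only an illustration of that notation with exact matching (zero substitutions), whereas the paper allows errors and infers the motif rather than searching for a known one. The function name and the test sequence are hypothetical.

```python
import re

# TTAAGC, then 19 to 23 arbitrary bases, then TATAAT.
# The lookahead lets overlapping occurrences be reported as well.
COMBINED_MOTIF = re.compile(r"(?=(TTAAGC[ACGT]{19,23}TATAAT))")

def find_combined_motif(seq):
    """Return (start, matched_text) for every exact occurrence of the spaced motif."""
    return [(m.start(), m.group(1)) for m in COMBINED_MOTIF.finditer(seq.upper())]

# Hypothetical upstream region with a 21-base spacer.
upstream = "CC" + "TTAAGC" + "A" * 21 + "TATAAT" + "GG"
occurrences = find_combined_motif(upstream)
print(occurrences)
```

Swapping the first box for TTGACA and the spacer range for `{15,19}` gives the sigma(70) consensus form quoted in the abstract.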
Affiliation(s)
- A Vanet
- Institut de Biologie Physico-Chimique, UPR CNRS 9073, 13 rue Pierre et Marie Curie, Paris, 75005, France
8
Abstract
This paper presents a survey of currently available mathematical models and algorithmical methods for trying to identify promoter sequences. The methods concern both searching in a genome for a previously defined consensus and extracting a consensus from a set of sequences. Such methods were often tailored for either eukaryotes or prokaryotes although this does not preclude use of the same method for both types of organisms. The survey therefore covers all methods; however, emphasis is placed on prokaryotic promoter sequence identification. Illustrative applications of the main extracting algorithms are given for three bacteria.
Affiliation(s)
- A Vanet
- Institut de biologie physico-chimique, Paris, France
9
Brazma A, Jonassen I, Eidhammer I, Gilbert D. Approaches to the automatic discovery of patterns in biosequences. J Comput Biol 1998; 5:279-305. [PMID: 9672833 DOI: 10.1089/cmb.1998.5.279] [Citation(s) in RCA: 157] [Impact Index Per Article: 6.0] [Indexed: 11/12/2022]
Abstract
This paper surveys approaches to the discovery of patterns in biosequences and places these approaches within a formal framework that systematises the types of patterns and the discovery algorithms. Patterns with expressive power in the class of regular languages are considered, and a classification of pattern languages in this class is developed, covering the patterns that are the most frequently used in molecular bioinformatics. A formulation is given of the problem of the automatic discovery of such patterns from a set of sequences, and an analysis is presented of the ways in which an assessment can be made of the significance of the discovered patterns. It is shown that the problem is related to problems studied in the field of machine learning. The major part of this paper comprises a review of a number of existing methods developed to solve the problem and how these relate to each other, focusing on the algorithms underlying the approaches. A comparison is given of the algorithms, and examples are given of patterns that have been discovered using the different methods.
Affiliation(s)
- A Brazma
- EMBL Outstation-Hinxton, European Bioinformatics Institute, Cambridge, UK
10
Abstract
Important progress has been made in the past two years in the identification of Pol II promoters. For most other regulatory elements, however, current biological knowledge is still insufficient to allow the development of prediction tools. The phylogenetic-footprinting strategy, which is based on the comparative analysis of homologous sequences, is a very efficient approach to identify new unknown regulatory elements. The recent organization of large-scale sequencing projects for some model vertebrate organisms will be extremely valuable for the prediction of regulatory elements in the human genome.
Affiliation(s)
- L Duret
- Laboratoire de Biométrie, Génétique et Biologie des Populations, CNRS Université Claude Bernard, Villeurbanne, France.
11
Lawrence C, Reilly A. Likelihood Inference for Permuted Data with Application to Gene Regulation. J Am Stat Assoc 1996. [DOI: 10.1080/01621459.1996.10476665] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 10/28/2022]
12
Wang JT, Marr TG, Shasha D, Shapiro BA, Chirn GW. Discovering active motifs in sets of related protein sequences and using them for classification. Nucleic Acids Res 1994; 22:2769-75. [PMID: 8052532 PMCID: PMC308246 DOI: 10.1093/nar/22.14.2769] [Citation(s) in RCA: 28] [Impact Index Per Article: 0.9] [Indexed: 01/28/2023]
Abstract
We describe a method for discovering active motifs in a set of related protein sequences. The method is an automatic two-step process: (1) find candidate motifs in a small sample of the sequences; (2) test whether these motifs are approximately present in all the sequences. To reduce the running time, we develop two optimization heuristics based on statistical estimation and pattern matching techniques. Experimental results obtained by running these algorithms on generated data and functionally related proteins demonstrate the good performance of the presented method compared with the visual method of O'Farrell and Leopold. By combining the discovered motifs with an existing fingerprint technique, we develop a protein classifier. When we apply the classifier to the 698 groups of related proteins in the PROSITE catalog, it gives information that is complementary to the BLOCKS protein classifier of Henikoff and Henikoff. Thus, using our classifier in conjunction with theirs, one can obtain high confidence classifications (if BLOCKS and our classifier agree) or suggest a new hypothesis (if the two disagree).
Affiliation(s)
- J T Wang
- Department of Computer and Information Science, New Jersey Institute of Technology, Newark 07102
13
Lawrence CE, Altschul SF, Boguski MS, Liu JS, Neuwald AF, Wootton JC. Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science 1993; 262:208-14. [PMID: 8211139 DOI: 10.1126/science.8211139] [Citation(s) in RCA: 1214] [Impact Index Per Article: 39.2] [Indexed: 01/29/2023]
Abstract
A wealth of protein and DNA sequence data is being generated by genome projects and other sequencing efforts. A crucial barrier to deciphering these sequences and understanding the relations among them is the difficulty of detecting subtle local residue patterns common to multiple sequences. Such patterns frequently reflect similar molecular structures and biological properties. A mathematical definition of this "local multiple alignment" problem suitable for full computer automation has been used to develop a new and sensitive algorithm, based on the statistical method of iterative sampling. This algorithm finds an optimized local alignment model for N sequences in N-linear time, requiring only seconds on current workstations, and allows the simultaneous detection and optimization of multiple patterns and pattern repeats. The method is illustrated as applied to helix-turn-helix proteins, lipocalins, and prenyltransferases.
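The iterative sampling idea in this abstract can be sketched in a few lines of Python. The version below is a simplified site sampler written for illustration, not the published implementation: it assumes exactly one site of fixed width per DNA sequence, at least two sequences, a fixed pseudocount, and a fixed number of sweeps, and it omits the refinements the abstract mentions (multiple patterns, pattern repeats). All function names and the toy sequences are hypothetical.

```python
import random
from collections import Counter

ALPHABET = "ACGT"

def build_profile(sites, pseudocount=1.0):
    """Per-column base probabilities estimated from the current sites."""
    total = len(sites) + pseudocount * len(ALPHABET)
    profile = []
    for j in range(len(sites[0])):
        counts = Counter(site[j] for site in sites)
        profile.append({b: (counts.get(b, 0) + pseudocount) / total for b in ALPHABET})
    return profile

def site_weight(word, profile, background):
    """Likelihood ratio of a word under the motif profile versus the background."""
    weight = 1.0
    for j, base in enumerate(word):
        weight *= profile[j][base] / background[base]
    return weight

def gibbs_site_sampler(seqs, width, iterations=200, seed=0):
    """Sample one site start per sequence; returns the final list of positions."""
    rng = random.Random(seed)
    positions = [rng.randrange(len(s) - width + 1) for s in seqs]
    pooled = "".join(seqs)
    background = {b: (pooled.count(b) + 1) / (len(pooled) + len(ALPHABET))
                  for b in ALPHABET}
    for _ in range(iterations):
        for i, seq in enumerate(seqs):
            # Build the profile from every sequence except the one being resampled.
            others = [s[p:p + width]
                      for k, (s, p) in enumerate(zip(seqs, positions)) if k != i]
            profile = build_profile(others)
            weights = [site_weight(seq[p:p + width], profile, background)
                       for p in range(len(seq) - width + 1)]
            positions[i] = rng.choices(range(len(weights)), weights=weights)[0]
    return positions

# Toy input: each sequence contains the word TTGACA somewhere.
toy_seqs = ["ACGTTGACAGT", "TTGACAACGTA", "GGCATTGACAC"]
sampled = gibbs_site_sampler(toy_seqs, width=6, iterations=50, seed=1)
print(sampled)
```

Because the procedure is stochastic, the returned positions depend on the seed; in practice several restarts are compared by alignment score.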
Affiliation(s)
- C E Lawrence
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894
14
Leung MY, Blaisdell BE, Burge C, Karlin S. An efficient algorithm for identifying matches with errors in multiple long molecular sequences. J Mol Biol 1991; 221:1367-78. [PMID: 1942056 PMCID: PMC4076298 DOI: 10.1016/0022-2836(91)90938-3] [Citation(s) in RCA: 37] [Impact Index Per Article: 1.1] [Indexed: 12/29/2022]
Abstract
An efficient algorithm is described for finding matches, repeats and other word relations, allowing for errors, in large data sets of long molecular sequences. The algorithm entails hashing on fixed-size words in conjunction with the use of a linked list connecting all occurrences of the same word. The average memory and run time requirement both increase almost linearly with the total sequence length. Some results of the program's performance on a database of Escherichia coli DNA sequences are presented.
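Hashing on fixed-size words, with every occurrence of the same word chained together, can be illustrated with a short Python sketch. It is an assumption-laden simplification: a dictionary of Python lists stands in for the hash table and linked lists, only exact word matches are indexed, and the error-tolerant extension step described in the abstract is not shown. The function names and sequences are hypothetical.

```python
from collections import defaultdict

def index_words(seqs, k):
    """Map each k-letter word to the list of (sequence index, position) occurrences."""
    index = defaultdict(list)
    for seq_id, seq in enumerate(seqs):
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((seq_id, pos))
    return index

def shared_words(index, min_seqs=2):
    """Keep only words that occur in at least min_seqs different sequences."""
    return {word: occ for word, occ in index.items()
            if len({seq_id for seq_id, _ in occ}) >= min_seqs}

# Hypothetical input sequences.
example_seqs = ["ACGTAC", "TTACGT"]
word_index = index_words(example_seqs, k=4)
common = shared_words(word_index)
print(common)
```

A single pass over the input builds the index, which is consistent with the near-linear memory and run time reported in the abstract.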
Affiliation(s)
- M Y Leung
- Division of Mathematics, Computer Science and Statistics, University of Texas, San Antonio 78249-0664
15
Abstract
Multiple sequence alignment can be a useful technique for studying molecular evolution, as well as for analyzing relationships between structure or function and primary sequence. We have developed for this purpose an interactive program, MACAW (Multiple Alignment Construction and Analysis Workbench), that allows the user to construct multiple alignments by locating, analyzing, editing, and combining "blocks" of aligned sequence segments. MACAW incorporates several novel features. (1) Regions of local similarity are located by a new search algorithm that avoids many of the limitations of previous techniques. (2) The statistical significance of blocks of similarity is evaluated using a recently developed mathematical theory. (3) Candidate blocks may be evaluated for potential inclusion in a multiple alignment using a variety of visualization tools. (4) A user interface permits each block to be edited by moving its boundaries or by eliminating particular segments, and blocks may be linked to form a composite multiple alignment. No completely automatic program is likely to deal effectively with all the complexities of the multiple alignment problem; by combining a powerful similarity search algorithm with flexible editing, analysis and display tools, MACAW allows the alignment strategy to be tailored to the problem at hand.
Affiliation(s)
- G D Schuler
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894
16
Abstract
A method has been developed for aligning segments of several sequences at once. The number of search steps depends only polynomially on the number of sequences, instead of exponentially, because most alignments are rejected without being evaluated explicitly. A data structure herein called the "heap" facilitates this process. For a set of n sequence segments, the overall similarity is taken to be the sum of all the constituent segment pair similarities, which are in turn sums of corresponding residue similarity scores from a Table. The statistical models that test alignments for significance make it possible to group sequences objectively, even when most or all of the interrelationships are weak. These tests are very sensitive, while remaining quite conservative, and discourage the addition of "misfit" sequences to an existing set. The new techniques are applied to a set of five DNA-binding proteins, to a group of three enzymes that employ the coenzyme FAD, and to a control set. The alignment previously proposed for the DNA-binding proteins on the basis of structural comparisons and inspection of sequences is supported quite dramatically, and a highly significant alignment is found for the FAD-binding proteins.
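The overall similarity defined in this abstract (the sum of all segment-pair similarities, each a sum of residue scores from a table) is a sum-of-pairs score. The Python sketch below assumes a hypothetical identity table (1 for a match, 0 for a mismatch) in place of the residue similarity table used by the authors, and does not implement the heap-based search itself.

```python
from itertools import combinations

def pair_similarity(a, b, table):
    """Sum of residue-pair scores for two aligned, equal-length segments."""
    return sum(table[(x, y)] for x, y in zip(a, b))

def sum_of_pairs(segments, table):
    """Overall similarity: sum of the similarities of all segment pairs."""
    return sum(pair_similarity(a, b, table) for a, b in combinations(segments, 2))

# Hypothetical scoring table: 1 for identical residues, 0 otherwise.
residues = "ACGT"
identity_table = {(x, y): int(x == y) for x in residues for y in residues}

segments = ["ACG", "ACT", "AGG"]
total_score = sum_of_pairs(segments, identity_table)
print(total_score)  # 2 + 2 + 1 = 5
```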
17
Citron BA, Chaudary PV, Rao DN, Kaufman S. Evidence for transcription and potential translation of the human 1.9 kb HindIII repetitive element. Nucleic Acids Res 1986; 14:3137-42. [PMID: 3008108 PMCID: PMC339726 DOI: 10.1093/nar/14.7.3137] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 01/03/2023]
Abstract
Recombinant cDNA clones corresponding to the human 1.9kb HindIII repetitive element have been isolated from a cDNA library of liver cytoplasmic polyadenylated RNA. These cDNAs share 95% homology with the reported genomic DNA sequence and a similar amount of homology at the amino acid level with putative coding sequences (see preceding article by Mottez et al). They were isolated as two of four false positives from a human cDNA library in lambda gt11 and were selected with an antibody to an unrelated enzyme. These results provide direct evidence that this repetitive element is transcribed to form poly(A)+ RNA which could be translatable. Also, these observations may add to our understanding of the sources of false positives which are frequently observed in screens of cDNA libraries with antibodies as probes.
18
Krishnan G, Kaul RK, Jagadeeswaran P. DNA sequence analysis: a procedure to find homologies among many sequences. Nucleic Acids Res 1986; 14:543-50. [PMID: 3753788 PMCID: PMC339439 DOI: 10.1093/nar/14.1.543] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.2] [Indexed: 01/07/2023]
Abstract
SEQCMP, a program that analyzes and searches for homology among multiple nucleic acid sequences, is described. The sequences are compared by the dot matrix method and the consensus sequence is derived by superimposing all the dot matrices on one another. The program is written in MBASIC and runs on an IBM-PC microcomputer. It is interactive and can be used by investigators with no computer background or experience.
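The dot matrix comparison and the superimposition step can be sketched in Python as follows. This is an illustration of the general technique rather than a port of SEQCMP: it assumes a reference sequence compared against other sequences of the same length so that the matrices share one shape, and all function names and sequences are hypothetical.

```python
def dot_matrix(a, b, window=1):
    """Boolean matrix: cell (i, j) is True when the windows at a[i] and b[j] match."""
    rows = len(a) - window + 1
    cols = len(b) - window + 1
    return [[a[i:i + window] == b[j:j + window] for j in range(cols)]
            for i in range(rows)]

def render(matrix):
    """Draw the matrix with '*' for a match and '.' for a mismatch."""
    return "\n".join("".join("*" if cell else "." for cell in row) for row in matrix)

def superimpose(matrices):
    """Cell-wise count of matches across several matrices of identical shape."""
    rows, cols = len(matrices[0]), len(matrices[0][0])
    return [[sum(m[i][j] for m in matrices) for j in range(cols)] for i in range(rows)]

# Hypothetical reference compared against two other sequences.
reference = "ACG"
plots = [dot_matrix(reference, other) for other in ["ACG", "ACT"]]
stacked = superimpose(plots)
print(render(plots[0]))
print(stacked)
```

Cells with high counts in the stacked matrix mark positions conserved across the compared sequences, which is the basis for reading off a consensus.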
19
Galas DJ, Eggert M, Waterman MS. Rigorous pattern-recognition methods for DNA sequences. Analysis of promoter sequences from Escherichia coli. J Mol Biol 1985; 186:117-28. [PMID: 3908689 DOI: 10.1016/0022-2836(85)90262-1] [Citation(s) in RCA: 153] [Impact Index Per Article: 3.9] [Indexed: 01/07/2023]
Abstract
The basic nature of the sequence features that define a promoter sequence for Escherichia coli RNA polymerase has been established by a variety of biochemical and genetic methods. We have developed rigorous analytical methods for finding unknown patterns that occur imperfectly in a set of several sequences, and have used them to examine a set of bacterial promoters. The algorithm easily discovers the "consensus" sequences for the -10 and -35 regions, which are essentially identical to the results of previous analyses, but requires no prior assumptions about the common patterns. By explicitly specifying the nature of the search for consensus sequences, we give a rigorous definition to this concept that should be widely applicable. We also have provided estimates for the statistical significance of common patterns discovered in sets of sequences. In addition to providing a rigorous basis for defining known consensus regions, we have found additional features in these promoters that may have functional significance. These added features were located on either side of the -35 region. The pattern 5', or upstream, from the -35 region was found using the standard alphabet (A, G, C and T), but the pattern between the -10 and the -35 regions was detectable only in a sub-alphabet. Recent results relating DNA sequence to helix conformation suggest that the former (upstream) pattern may have a functional significance. Possible roles in promoter function are discussed in this light, and an observation of altered promoter function involving the upstream region is reported that appears to support the suggestion of function in at least one case.
20
21
Korn LJ, Queen C. Analysis of biological sequences on small computers. DNA (Mary Ann Liebert, Inc.) 1984; 3:421-36. [PMID: 6210184 DOI: 10.1089/dna.1.1984.3.421] [Citation(s) in RCA: 26] [Impact Index Per Article: 0.7] [Indexed: 01/19/2023]
Abstract
We review a wide range of programs for micro- and minicomputers that facilitate the collection and analysis of DNA, RNA, or protein sequences. Special-purpose programs that perform a single function and general-purpose programs that perform a set of functions are both considered. The information presented should be useful to any biologist who wishes to obtain a computer program to aid in sequence investigations.
22
23
Waterman MS, Arratia R, Galas DJ. Pattern recognition in several sequences: consensus and alignment. Bull Math Biol 1984; 46:515-27. [PMID: 6509229 DOI: 10.1007/bf02459500] [Citation(s) in RCA: 81] [Impact Index Per Article: 2.0] [Indexed: 01/20/2023]
24
Queen C, Korn LJ. A comprehensive sequence analysis program for the IBM personal computer. Nucleic Acids Res 1984; 12:581-99. [PMID: 6546431 PMCID: PMC321072 DOI: 10.1093/nar/12.1part2.581] [Citation(s) in RCA: 512] [Impact Index Per Article: 12.8] [Indexed: 11/14/2022]
Abstract
We have developed a versatile program for the analysis of nucleic acid and protein sequences on the IBM Personal Computer. The program is interactive and self-instructing. It contains all the features generally found in sequence analysis programs on large computers, including extensive homology routines, as well as new procedures for the entry of sequence data. The program contains facilities to store and utilize the entire Nucleic Acid Sequence Data Bank. We have devised a new algorithm to find restriction enzyme sites, which allows our microcomputer program to find all sites on a small plasmid for 100 different enzymes in 1 to 2 minutes.
25
Kröger M, Kröger-Block A. Simplified computer programs for search of homology within nucleotide sequences. Nucleic Acids Res 1984; 12:193-201. [PMID: 6546417 PMCID: PMC320996 DOI: 10.1093/nar/12.1part1.193] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.3] [Indexed: 11/13/2022]
Abstract
Four new computer programs for search of homology within nucleotide sequences are presented. The main scope of the program design is flexibility, independence of sequence length and the capability to be used by any molecular biologist without any prior computer experience. The programs offer a linear search, a search for maximal identity, an alignment along a given sequence and a search based on homology within the amino acid coding capacity of nucleotide sequences. The language is Fortran V. Copies are available on request.
26
Salemme A, Furano AV. A convenient method for locating sets of related short sequences in DNA sequences of any length. Nucleic Acids Res 1984; 12:257-62. [PMID: 6320091 PMCID: PMC321002 DOI: 10.1093/nar/12.1part1.257] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.0] [Indexed: 01/19/2023]
Abstract
In investigating sequence variants in a family of highly repeated rat DNA, we needed to search the consensus sequence of the repeat unit of this family for short sequences which would become, with one base change, recognition sites for various restriction endonucleases. To do this, we have designed a pair of programs to search DNA sequences of any length for sets of related short sequences, allowing user-specified mismatches in the short sequence. Since putative regulatory regions are generally short sequences, these programs are also useful for locating all possible versions of such sequences in any given DNA. We describe the programs, and present results of searches using the programs.
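The search described here, locating short sequences that would become a restriction site with one base change, amounts to scanning for windows at a fixed mismatch count from the site. The Python sketch below illustrates that idea; it is not a reconstruction of the authors' programs. The function name and the query sequence are hypothetical, and the EcoRI recognition sequence GAATTC is used as the example site.

```python
def near_sites(seq, site, mismatches=1):
    """Find windows that differ from `site` at exactly `mismatches` positions.

    Returns (start, window, mismatch_positions) for each such window.
    """
    hits = []
    for i in range(len(seq) - len(site) + 1):
        window = seq[i:i + len(site)]
        diffs = [j for j in range(len(site)) if window[j] != site[j]]
        if len(diffs) == mismatches:
            hits.append((i, window, diffs))
    return hits

# Hypothetical consensus fragment; GAATTC is the EcoRI recognition sequence.
fragment = "AAGAATTGCC"
candidates = near_sites(fragment, "GAATTC", mismatches=1)
print(candidates)
```

Each hit reports which position would need to change, so the user can see the single substitution that creates the site.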
|
27
|
Weir JP, Moss B. Nucleotide sequence of the vaccinia virus thymidine kinase gene and the nature of spontaneous frameshift mutations. J Virol 1983; 46:530-7. [PMID: 6842679 PMCID: PMC255155 DOI: 10.1128/jvi.46.2.530-537.1983] [Citation(s) in RCA: 153] [Impact Index Per Article: 3.7] [Indexed: 01/22/2023]
Abstract
Nucleotide sequencing of a 1,300-base-pair vaccinia virus DNA segment previously shown to contain a thymidine kinase (TK) gene revealed an uninterrupted reading frame of 177 codons capable of producing a polypeptide with a molecular weight of 20,102. Mapping of the TK mRNA by primer extension indicated a unique 5' end that precedes the initiation codon by only six nucleotides. Multiple 3' ends within a 10-nucleotide region, about 30 nucleotides beyond the termination codon, were located by nuclease digestion of DNA-RNA hybrids, and the length of the TK transcript, exclusive of polyadenylate, was estimated to be approximately 570 nucleotides. The region preceding the TK mRNA start site is extremely A + T rich and has sequence homologies with three other early genes. Genetic information is so compressed in this region of the DNA that the putative transcriptional regulatory sequence of the TK gene overlaps the coding sequence of a late gene. Only nine nucleotides separate the termination codon of the late gene from the initiation codon of the TK gene. Downstream, 66 nucleotides separate the TK termination codon from the apparent initiation codon of another early gene. The nature of three independent TK- mutants was revealed by nucleotide sequencing. Each has a nucleotide reiteration leading to a +1 frameshift and a nonsense codon downstream. The location of one frameshift mutation provided evidence that the first ATG is used for initiation of protein synthesis.
|
28
|
Osterburg G, Glatting KH, Buchert J, Wolters J. A fast method for arranging DNA sequence fragments. COMPUTER PROGRAMS IN BIOMEDICINE 1983; 16:61-9. [PMID: 6687857 DOI: 10.1016/0010-468x(83)90010-7] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.3] [Indexed: 01/21/2023]
Abstract
A method is described which allows efficient arrangement of DNA sequence fragments and, based on this arrangement, reconstruction of a complete DNA sequence. The concepts and algorithms used are based on the mathematical theory of graphs. The amount of human interaction required is considerably reduced compared to existing methods. An experiment with a set of 168 fragments yields a DNA sequence of about 5800 bases almost automatically.
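The abstract does not spell out the graph algorithm, so the following is only a generic sketch of graph-based fragment arrangement, not the authors' method. It assumes error-free fragments on a single strand, builds a directed overlap graph (an edge i → j means a suffix of fragment i matches a prefix of fragment j), and merges the pair with the longest overlap until no overlaps remain. All names (`overlap`, `build_overlap_graph`, `greedy_assemble`, `min_len`) are hypothetical.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`;
    returns 0 if no overlap of at least `min_len` bases exists."""
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def build_overlap_graph(fragments, min_len=3):
    """Directed overlap graph as a dict mapping (i, j) -> overlap length."""
    edges = {}
    for i, a in enumerate(fragments):
        for j, b in enumerate(fragments):
            if i != j:
                n = overlap(a, b, min_len)
                if n:
                    edges[(i, j)] = n
    return edges


def greedy_assemble(fragments, min_len=3):
    """Repeatedly merge the two fragments joined by the heaviest edge.
    Greedy merging is a simplification chosen for this sketch."""
    frags = list(fragments)
    while len(frags) > 1:
        edges = build_overlap_graph(frags, min_len)
        if not edges:
            break  # remaining fragments do not overlap; leave them separate
        (i, j), n = max(edges.items(), key=lambda kv: kv[1])
        merged = frags[i] + frags[j][n:]
        frags = [f for k, f in enumerate(frags) if k not in (i, j)] + [merged]
    return frags


# Hypothetical toy fragments that tile a 12-base sequence.
print(greedy_assemble(["ATGCGT", "CGTACC", "ACCTTA"]))
```

A practical tool would also have to handle reverse-complement fragments, sequencing errors, and repeats, none of which this sketch attempts.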
|
29
|
Fristensky B, Lis J, Wu R. Portable microcomputer software for nucleotide sequence analysis. Nucleic Acids Res 1982; 10:6451-63. [PMID: 6184674 PMCID: PMC326935 DOI: 10.1093/nar/10.20.6451] [Citation(s) in RCA: 88] [Impact Index Per Article: 2.1] [Indexed: 01/18/2023]
Abstract
The most common types of nucleotide sequence data analyses and handling can be done more conveniently and inexpensively on microcomputers than on large time-sharing systems. We present a package of computer programs for the analysis of DNA and RNA sequence data which overcomes many of the limitations imposed by microcomputers, while offering most of the features of programs commonly available on large computers, including sequence numbering and translation, restriction site and homology searches with dot-matrix plots, nucleotide distribution analysis, and graphic display of data. Most of the programs were written in Standard Pascal (on an Apple II computer) to facilitate portability to other micro-, mini-, and mainframe computers.
|