1
Tong H, Schliekelman P, Mrázek J. Unsupervised statistical discovery of spaced motifs in prokaryotic genomes. BMC Genomics 2017; 18:27. [PMID: 28056763 PMCID: PMC5217627 DOI: 10.1186/s12864-016-3400-0] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 09/02/2016] [Accepted: 12/09/2016] [Indexed: 12/23/2022] Open
Abstract
BACKGROUND DNA sequences contain repetitive motifs which have various functions in the physiology of the organism. A number of methods have been developed for discovery of such sequence motifs with a primary focus on detection of regulatory motifs and particularly transcription factor binding sites. Most motif-finding methods apply probabilistic models to detect motifs characterized by an unusually high number of copies of the motif in the analyzed sequences. RESULTS We present a novel method for detection of pairs of motifs separated by spacers of variable nucleotide sequence but conserved length. Unlike existing methods for motif discovery, the motifs themselves are not required to occur at unusually high frequency but only to exhibit a significant preference to occur at a specific distance from each other. In the present implementation of the method, motifs are represented by pentamers and all pairs of pentamers are evaluated for statistically significant preference for a specific distance. An important step of the algorithm eliminates motif pairs where the spacers separating the two motifs exhibit a high degree of sequence similarity; such motif pairs likely arise from duplications of the whole segment including the motifs and the spacer rather than due to selective constraints indicative of a functional importance of the motif pair. The method was used to scan 569 complete prokaryotic genomes for novel sequence motifs. Some motifs detected were previously known but other motifs found in the search appear to be novel. Selected motif pairs were subjected to further investigation and in some cases their possible biological functions were proposed. CONCLUSIONS We present a new motif-finding technique that is applicable to scanning complete genomes for sequence motifs. The results from analysis of 569 genomes suggest that the method detects previously known motifs that are expected to be found as well as new motifs that are unlikely to be discovered by traditional motif-finding methods. We conclude that our approach to detection of significant motif pairs can complement existing motif-finding techniques in discovery of novel functional sequence motifs in complete genomes.
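The counting step behind the pentamer-pair idea described in this abstract can be illustrated with a short sketch. This is not the authors' implementation: the function names, the `max_spacer` limit, and the plain tallying are assumptions made for illustration, and the statistical significance test and the spacer-similarity filter mentioned in the abstract are deliberately left out.

```python
from collections import Counter


def spaced_pair_counts(seq, k=5, max_spacer=10):
    """Tally (left k-mer, right k-mer, spacer length) triples in seq.

    Hypothetical helper: k=5 mirrors the pentamers named in the abstract,
    while max_spacer is an arbitrary illustrative limit.
    """
    counts = Counter()
    n = len(seq)
    for i in range(n - k + 1):
        left = seq[i:i + k]
        for spacer in range(max_spacer + 1):
            j = i + k + spacer          # start of the right-hand k-mer
            if j + k > n:
                break
            counts[(left, seq[j:j + k], spacer)] += 1
    return counts


def preferred_spacer(counts, left, right):
    """Return (spacer, count) for the most frequent spacer of a pair, or None."""
    hits = {s: c for (l, r, s), c in counts.items() if l == left and r == right}
    if not hits:
        return None
    best = max(hits, key=hits.get)
    return best, hits[best]
```

A pair whose counts concentrate on one spacer length, while the spacer sequences themselves differ, is the kind of signal the abstract describes; deciding whether that concentration is statistically significant requires a null model that this sketch does not attempt.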
Affiliation(s)
- Hao Tong
- Department of Statistics, University of Georgia, Athens, GA, 30602, USA
- Paul Schliekelman
- Department of Statistics, University of Georgia, Athens, GA, 30602, USA
- Jan Mrázek
- Department of Microbiology and Institute of Bioinformatics, University of Georgia, Athens, GA, 30602, USA.
2
Misas E, Muñoz JF, Gallo JE, McEwen JG, Clay OK. From NGS assembly challenges to instability of fungal mitochondrial genomes: A case study in genome complexity. Comput Biol Chem 2016; 61:258-69. [PMID: 26970210 DOI: 10.1016/j.compbiolchem.2016.02.016] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 07/31/2015] [Revised: 02/03/2016] [Accepted: 02/16/2016] [Indexed: 01/26/2023]
Abstract
The presence of repetitive or non-unique DNA persisting over sizable regions of a eukaryotic genome can hinder the genome's successful de novo assembly from short reads: ambiguities in assigning genome locations to the non-unique subsequences can result in premature termination of contigs and thus overfragmented assemblies. Fungal mitochondrial (mtDNA) genomes are compact (typically less than 100 kb), yet often contain short non-unique sequences that can be shown to impede their successful de novo assembly in silico. Such repeats can also confuse processes in the cell in vivo. A well-studied example is ectopic (out-of-register, illegitimate) recombination associated with repeat pairs, which can lead to deletion of functionally important genes that are located between the repeats. Repeats that remain conserved over micro- or macroevolutionary timescales despite such risks may indicate functionally or structurally (e.g., for replication) important regions. This principle could form the basis of a mining strategy for accelerating discovery of function in genome sequences. We present here our screening of a sample of 11 fully sequenced fungal mitochondrial genomes by observing where exact k-mer repeats occurred several times; initial analyses motivated us to focus on 17-mers occurring more than three times. Based on the diverse repeats we observe, we propose that such screening may serve as an efficient expedient for gaining a rapid but representative first insight into the repeat landscapes of sparsely characterized mitochondrial chromosomes. Our matching of the flagged repeats to previously reported regions of interest supports the idea that systems of persisting, non-trivial repeats in genomes can often highlight features meriting further attention.
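The exact k-mer screen described in this abstract (17-mers occurring more than three times) can be sketched as follows. The function name and its return format are assumptions for illustration only; the abstract does not specify the authors' tooling, and this sketch scans a single strand without considering reverse complements.

```python
from collections import defaultdict


def repeated_kmers(seq, k=17, min_count=4):
    """Map each exact k-mer seen at least min_count times to its start positions.

    min_count=4 encodes "more than three times" from the abstract;
    the function itself is a hypothetical illustration.
    """
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)
    return {kmer: pos for kmer, pos in positions.items() if len(pos) >= min_count}
```

Keeping the start positions, rather than only the counts, makes it straightforward to check which annotated regions lie between or around the flagged repeats.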
Affiliation(s)
- Elizabeth Misas
- Cellular & Molecular Biology Unit, Corporación para Investigaciones Biológicas, Medellín, Colombia; Institute of Biology, Universidad de Antioquia, Medellín, Colombia
- José Fernando Muñoz
- Cellular & Molecular Biology Unit, Corporación para Investigaciones Biológicas, Medellín, Colombia; Institute of Biology, Universidad de Antioquia, Medellín, Colombia
- Juan Esteban Gallo
- Cellular & Molecular Biology Unit, Corporación para Investigaciones Biológicas, Medellín, Colombia; Doctoral Program in Biomedical Sciences, Universidad del Rosario, Bogotá, Colombia
- Juan Guillermo McEwen
- Cellular & Molecular Biology Unit, Corporación para Investigaciones Biológicas, Medellín, Colombia; School of Medicine, Universidad de Antioquia, Medellín, Colombia
- Oliver Keatinge Clay
- Cellular & Molecular Biology Unit, Corporación para Investigaciones Biológicas, Medellín, Colombia; School of Medicine and Health Sciences, Universidad del Rosario, Bogotá, Colombia.
3
Bi C. SEAM: a stochastic EM-type algorithm for motif-finding in biopolymer sequences. J Bioinform Comput Biol 2011; 5:47-77. [PMID: 17477491 DOI: 10.1142/s0219720007002527] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.2] [Received: 06/09/2006] [Revised: 08/22/2006] [Accepted: 10/14/2006] [Indexed: 12/21/2022]
Abstract
Position weight matrix-based statistical modeling for the identification and characterization of motif sites in a set of unaligned biopolymer sequences is presented. This paper describes and implements a new algorithm, the Stochastic EM-type Algorithm for Motif-finding (SEAM), and redesigns and implements the EM-based motif-finding algorithm called deterministic EM (DEM) for comparison with SEAM, its stochastic counterpart. The gold standard example, cyclic adenosine monophosphate receptor protein (CRP) binding sequences, together with other biological sequences, is used to illustrate the performance of the new algorithm and compare it with other popular motif-finding programs. The convergence of the new algorithm is shown by simulation. The in silico experiments using simulated and biological examples illustrate the power and robustness of the new algorithm SEAM in de novo motif discovery.
Affiliation(s)
- Chengpeng Bi
- Children's Mercy Hospitals and Clinics, 2401 Gillham Road, Pediatrics Research Building, Third Floor, Kansas City, Missouri 64108, USA.
4
Bi C. A Monte Carlo EM algorithm for de novo motif discovery in biomolecular sequences. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2009; 6:370-386. [PMID: 19644166 DOI: 10.1109/tcbb.2008.103] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.8] [Indexed: 05/28/2023]
Abstract
Motif discovery methods play pivotal roles in deciphering the genetic regulatory codes (i.e., motifs) in genomes as well as in locating conserved domains in protein sequences. The Expectation Maximization (EM) algorithm is one of the most popular methods used in de novo motif discovery. Based on the position weight matrix (PWM) updating technique, this paper presents a Monte Carlo version of the EM motif-finding algorithm that carries out stochastic sampling in local alignment space to overcome the conventional EM's main drawback of being trapped in a local optimum. The newly implemented algorithm is named the Monte Carlo EM Motif Discovery Algorithm (MCEMDA). MCEMDA starts from an initial model, and then it iteratively performs Monte Carlo simulation and parameter update until convergence. A log-likelihood profiling technique together with the top-k strategy is introduced to cope with the phase shifts and multiple modal issues in the motif discovery problem. A novel grouping motif alignment (GMA) algorithm is designed to select motifs by clustering a population of candidate local alignments and is successfully applied to subtle motif discovery. MCEMDA compares favorably to other popular PWM-based and word enumerative motif algorithms tested using simulated (l, d)-motif cases and documented prokaryotic and eukaryotic DNA motif sequences. Finally, MCEMDA is applied to detect large blocks of conserved domains using protein benchmarks and exhibits excellent capacity when compared with other multiple sequence alignment methods.
Affiliation(s)
- Chengpeng Bi
- Bioinformatics and Intelligent Computing Laboratory, Division of Clinical Pharmacology, Children's Mercy Hospitals and Clinics, 2401 Gillham Road, Kansas City, MO 64108, USA.
5
Jiang Y, Cukic B, Adjeroh DA, Skinner HD, Lin J, Shen QJ, Jiang BH. An algorithm for identifying novel targets of transcription factor families: application to hypoxia-inducible factor 1 targets. Cancer Inform 2009; 7:75-89. [PMID: 19352460 PMCID: PMC2664698 DOI: 10.4137/cin.s1054] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Indexed: 12/27/2022] Open
Abstract
Efficient and effective analysis of the growing genomic databases requires the development of adequate computational tools. We introduce a fast method based on the suffix tree data structure for predicting novel targets of hypoxia-inducible factor 1 (HIF-1) from huge genome databases. The suffix tree data structure has two powerful applications here: one is to extract unknown patterns from multiple strings/sequences in linear time; the other is to search multiple strings/sequences using multiple patterns in linear time. Using 15 known HIF-1 target gene sequences as a training set, we extracted 105 common patterns that all occur in the 15 training genes using suffix trees. Using these 105 common patterns along with known subsequences surrounding HIF-1 binding sites from the literature, the algorithm searches a genome database that contains 2,078,786 DNA sequences. It reported 258 potentially novel HIF-1 targets including 25 known HIF-1 targets. Based on microarray studies from the literature, 17 putative genes were confirmed to be upregulated by HIF-1 or hypoxia inside these 258 genes. We further studied one of the potential targets, COX-2, in the biological lab; and showed that it was a biologically relevant HIF-1 target. These results demonstrate that our methodology is an effective computational approach for identifying novel HIF-1 targets.
Affiliation(s)
- Yue Jiang
- Lane Department of Computer Science and Electrical Engineering, West Virginia University, Morgantown, WV 26506, USA.
6
Cho YS, Lee SY, Kim KY, Bang IC, Kim DS, Nam YK. Gene structure and expression of metallothionein during metal exposures in Hemibarbus mylodon. Ecotoxicology and Environmental Safety 2008; 71:125-37. [PMID: 17889936 DOI: 10.1016/j.ecoenv.2007.08.005] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.9] [Received: 01/24/2007] [Revised: 06/28/2007] [Accepted: 08/02/2007] [Indexed: 05/17/2023]
Abstract
The metallothionein (MT) gene was characterized in Hemibarbus mylodon, an endangered fish species. H. mylodon MT shared a high homology with other vertebrate MTs, including (1) a tripartite exon/intron structure, (2) typical regulatory elements such as MREs and GC boxes in the 5'-flanking region, and (3) a high proportion of cysteines (33.3%) in its amino acid sequence. MT mRNA was ubiquitously detected in various tissues. The basal level of MT mRNA was highest in ovary and lowest in heart. Transcription of MT was highly inducible by exposures to waterborne cadmium (0.1-10 µM), copper (2-10 µM) or zinc (2-10 µM), based on real-time RT-PCR. Cadmium was more potent for the stimulation of MT transcripts than copper and zinc. Liver was more responsive to heavy metals than kidney and gill. Overall, the transcriptional activation of the MT gene by metal exposures followed a dose- and/or time-dependent fashion.
Affiliation(s)
- Young Sun Cho
- Department of Aquaculture, Pukyong National University (PKNU), Busan 608-737, Republic of Korea
7
Lascaro D, Castellana S, Gasparre G, Romeo G, Saccone C, Attimonelli M. The RHNumtS compilation: features and bioinformatics approaches to locate and quantify Human NumtS. BMC Genomics 2008; 9:267. [PMID: 18522722 PMCID: PMC2447851 DOI: 10.1186/1471-2164-9-267] [Citation(s) in RCA: 33] [Impact Index Per Article: 2.1] [Received: 09/14/2007] [Accepted: 06/03/2008] [Indexed: 11/21/2022] Open
Abstract
Background To a greater or lesser extent, eukaryotic nuclear genomes contain fragments of their mitochondrial genome counterpart, deriving from the random insertion of damaged mtDNA fragments. NumtS (Nuclear mt Sequences) are not equally abundant in all species, and are redundant and polymorphic in terms of copy number. In population and clinical genetics, it is important to have a complete overview of NumtS quantity and location. Searching PubMed for NumtS or Mitochondrial pseudo-genes yields hundreds of papers reporting Human NumtS compilations produced by in silico or wet-lab approaches. A comparison of published compilations clearly shows significant discrepancies among data, due both to unwise application of Bioinformatics methods and to a not yet correctly assembled nuclear genome. To optimize quantification and location of NumtS, we produced a consensus compilation of Human NumtS by applying various bioinformatics approaches. Results Location and quantification of NumtS may be achieved by applying database similarity searching methods: we have applied various methods such as Blastn, MegaBlast and BLAT, changing both parameters and database; the results were compared, further analysed and checked against the already published compilations, thus producing the Reference Human Numt Sequences (RHNumtS) compilation. The resulting NumtS total 190. Conclusion The RHNumtS compilation represents a highly reliable reference basis, which may allow designing a lab protocol to test the actual existence of each NumtS. Here we report preliminary results based on PCR amplification and sequencing on 41 NumtS selected from RHNumtS among those with lower score. In parallel, we are currently designing the RHNumtS database structure for implementation in the HmtDB resource. In the future, the same database will host NumtS compilations from other organisms, but these will be generated only when the nuclear genome of a specific organism has reached a high-quality level of assembly.
Affiliation(s)
- Daniela Lascaro
- Dipartimento di Biochimica e Biologia Molecolare "E. Quagliariello", Università di Bari, Via E. Orabona 4, 70126 Bari, Italy.
8
Mrázek J, Xie S, Guo X, Srivastava A. AIMIE: a web-based environment for detection and interpretation of significant sequence motifs in prokaryotic genomes. Bioinformatics 2008; 24:1041-8. [PMID: 18304933 DOI: 10.1093/bioinformatics/btn077] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.8] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION Genomes contain biologically significant information that extends beyond that encoded in genes. Some of this information relates to various short dispersed repeats distributed throughout the genome. The goal of this work was to combine tools for detection of statistically significant dispersed repeats in DNA sequences with tools to aid development of hypotheses regarding their possible physiological functions in an easy-to-use web-based environment. RESULTS Ab Initio Motif Identification Environment (AIMIE) was designed to facilitate investigations of dispersed sequence motifs in prokaryotic genomes. We used AIMIE to analyze the Escherichia coli and Haemophilus influenzae genomes in order to demonstrate the utility of the new environment. AIMIE detected repeated extragenic palindrome (REP) elements, CRISPR repeats, uptake signal sequences, intergenic dyad sequences and several other over-represented sequence motifs. Distributional patterns of these motifs were analyzed using the tools included in AIMIE. AVAILABILITY AIMIE and the related software can be accessed at our web site http://www.cmbl.uga.edu/software.html.
Affiliation(s)
- Jan Mrázek
- Department of Microbiology, University of Georgia, Athens, GA 30602-2605, USA.
9
Chin F, Leung HCM. DNA motif representation with nucleotide dependency. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2008; 5:110-119. [PMID: 18245880 DOI: 10.1109/tcbb.2007.70220] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Indexed: 05/25/2023]
Abstract
The problem of discovering novel motifs of binding sites is important to the understanding of gene regulatory networks. Motifs are generally represented by matrices (position weight matrix (PWM) or position specific scoring matrix (PSSM)) or strings. However, these representations cannot model biological binding sites well because they fail to capture nucleotide interdependence. It has been pointed out by many researchers that the nucleotides of the DNA binding site cannot be treated independently, e.g. the binding sites of zinc finger proteins. In this paper, a new representation called Scored Position Specific Pattern (SPSP), which is a generalization of the matrix and string representations, is introduced which takes into consideration the dependent occurrences of neighboring nucleotides. Even though the problem of discovering the optimal motif in SPSP representation is proved to be NP-hard, we introduce a heuristic algorithm called SPSP-Finder, which can effectively find optimal motifs in most simulated cases and some real cases for which existing popular motif finding software, such as Weeder, MEME and AlignACE, fail.
Affiliation(s)
- Francis Chin
- Department of Computer Science, The University of Hong Kong, Hong Kong.
10
Abstract
BACKGROUND Unraveling the mechanisms that regulate gene expression is a major challenge in biology. An important task in this challenge is to identify regulatory elements, especially the binding sites in deoxyribonucleic acid (DNA) for transcription factors. These binding sites are short DNA segments that are called motifs. Recent advances in genome sequence availability and in high-throughput gene expression analysis technologies have allowed for the development of computational methods for motif finding. As a result, a large number of motif finding algorithms have been implemented and applied to various motif models over the past decade. This survey reviews the latest developments in DNA motif finding algorithms. RESULTS Earlier algorithms use promoter sequences of coregulated genes from a single genome and search for statistically overrepresented motifs. Recent algorithms are designed to use phylogenetic footprinting or orthologous sequences and also an integrated approach where promoter sequences of coregulated genes and phylogenetic footprinting are used. All the algorithms studied have been reported to correctly detect the motifs that have been previously detected by laboratory experimental approaches, and some algorithms were able to find novel motifs. However, most of these motif finding algorithms have been shown to work successfully in yeast and other lower organisms, but perform significantly worse in higher organisms. CONCLUSION Despite considerable efforts to date, DNA motif finding remains a complex challenge for biologists and computer scientists. Researchers have taken many different approaches in developing motif discovery tools and the progress made in this area of research is very encouraging. Performance comparison of different motif finding tools and identification of the best tools have proven to be a difficult task because tools are designed based on algorithms and motif models that are diverse and complex and our incomplete understanding of the biology of regulatory mechanism does not always provide adequate evaluation of underlying algorithms over motif models.
Affiliation(s)
- Modan K Das
- Computer Science Department, Oklahoma State University, Stillwater, Oklahoma 74078, USA
- USDA-ARS, Department of Plant Sciences, University of Arizona, Tucson, Arizona 85721, USA
- Ho-Kwok Dai
- Computer Science Department, Oklahoma State University, Stillwater, Oklahoma 74078, USA
11
GuhaThakurta D. Computational identification of transcriptional regulatory elements in DNA sequence. Nucleic Acids Res 2006; 34:3585-98. [PMID: 16855295 PMCID: PMC1524905 DOI: 10.1093/nar/gkl372] [Citation(s) in RCA: 98] [Impact Index Per Article: 5.4] [Indexed: 01/13/2023] Open
Abstract
Identification and annotation of all the functional elements in the genome, including genes and the regulatory sequences, is a fundamental challenge in genomics and computational biology. Since regulatory elements are frequently short and variable, their identification and discovery using computational algorithms is difficult. However, significant advances have been made in the computational methods for modeling and detection of DNA regulatory elements. The availability of complete genome sequence from multiple organisms, as well as mRNA profiling and high-throughput experimental methods for mapping protein-binding sites in DNA, have contributed to the development of methods that utilize these auxiliary data to inform the detection of transcriptional regulatory elements. Progress is also being made in the identification of cis-regulatory modules and higher order structures of the regulatory sequences, which is essential to the understanding of transcription regulation in the metazoan genomes. This article reviews the computational approaches for modeling and identification of genomic regulatory elements, with an emphasis on the recent developments, and current challenges.
Affiliation(s)
- Debraj GuhaThakurta
- Research Genetics Division, Rosetta Inpharmatics LLC, Merck & Co., Inc, 401 Terry Avenue North, Seattle, WA 98109, USA.
12
Leung HCM, Chin FYL. Algorithms for challenging motif problems. J Bioinform Comput Biol 2006; 4:43-58. [PMID: 16568541 DOI: 10.1142/s0219720006001692] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.4] [Received: 04/18/2005] [Revised: 08/01/2005] [Accepted: 08/01/2005] [Indexed: 11/18/2022]
Abstract
Pevzner and Sze (19) have introduced the Planted (l,d)-Motif Problem to find similar patterns (motifs) in sequences which represent the promoter regions of co-regulated genes, where l is the length of the motif and d is the maximum Hamming distance around the similar patterns. Many algorithms have been developed to solve this motif problem. However, these algorithms either have long running times or do not guarantee the motif can be found. In this paper, we introduce new algorithms to solve this motif problem. Our algorithms can find motifs in reasonable time for not only the challenging (9, 2), (11, 3), (15, 5)-motif problems but for even longer motifs, say (20, 7), (30, 11) and (40, 15), which have never been seriously attempted by other researchers because of the large time and space required. In addition, our algorithms can be extended to find a more complicated motif structure called cis-regulatory modules (CRM).
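To make the Planted (l,d)-Motif Problem concrete, here is a brute-force baseline that tries every possible l-mer. It is emphatically not the algorithm of this paper (whose details the abstract does not give); the function names are hypothetical, and the exhaustive enumeration over 4^l candidates is only practical for very small l, which is exactly the limitation that motivates faster methods.

```python
from itertools import product


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def occurs_within(motif, seq, d):
    """True if some window of seq lies within Hamming distance d of motif."""
    l = len(motif)
    return any(hamming(motif, seq[i:i + l]) <= d
               for i in range(len(seq) - l + 1))


def planted_motifs(seqs, l, d):
    """Exhaustively list every l-mer that has a <= d mismatch occurrence in all seqs."""
    candidates = ("".join(c) for c in product("ACGT", repeat=l))
    return [m for m in candidates if all(occurs_within(m, s, d) for s in seqs)]
```

For the (15, 5) and longer instances mentioned in the abstract, this enumeration would need to examine at least 4^15 candidates, so it serves only as a definition-by-code of what counts as a solution.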
Affiliation(s)
- Henry C M Leung
- Department of Computer Science, The University of Hong Kong, Pokfulam, Hong Kong, China.
13
Fadiel A, Lithwick S, Ganji G, Scherer SW. Remarkable sequence signatures in archaeal genomes. Archaea-An International Microbiological Journal 2005; 1:185-90. [PMID: 15803664 PMCID: PMC2685567 DOI: 10.1155/2003/458235] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.5] [Indexed: 12/27/2022]
Abstract
Complete archaeal genomes were probed for the presence of long (≥ 25 bp) oligonucleotide repeats (words). We detected the presence of many words distributed in tandem with narrow ranges of periodicity (i.e., spacer length between repeats). Similar words were not identified in genomes of non-archaeal species, namely Escherichia coli, Bacillus subtilis, Haemophilus influenzae, Mycoplasma genitalium and Mycoplasma pneumoniae. BLAST similarity searches against the GenBank nucleotide sequence database revealed that these words were archaeal species-specific, indicating that they are of a signature character. Sequence analysis and genome viewing tools showed these repeats to be restricted to non-coding regions. Thus, archaea appear to possess a non-coding genomic signature that is absent in bacterial species. The identification of a species-specific genomic signature would be of great value to archaeal genome mapping, evolutionary studies and analyses of genome complexity.
Affiliation(s)
- Ahmed Fadiel
- The Center for Applied Genomics, Hospital for Sick Children, Toronto, Ontario M5G 1Z8, Canada.
14
Betel D, Hogue CWV. Kangaroo--a pattern-matching program for biological sequences. BMC Bioinformatics 2002; 3:20. [PMID: 12150718 PMCID: PMC119856 DOI: 10.1186/1471-2105-3-20] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.5] [Received: 07/05/2002] [Accepted: 07/31/2002] [Indexed: 11/16/2022] Open
Abstract
BACKGROUND Biologists are often interested in performing a simple database search to identify proteins or genes that contain a well-defined sequence pattern. Many databases do not provide straightforward or readily available query tools to perform simple searches, such as identifying transcription binding sites, protein motifs, or repetitive DNA sequences. However, in many cases simple pattern-matching searches can reveal a wealth of information. We present in this paper a regular expression pattern-matching tool that was used to identify short repetitive DNA sequences in human coding regions for the purpose of identifying potential mutation sites in mismatch repair deficient cells. RESULTS Kangaroo is a web-based regular expression pattern-matching program that can search for patterns in DNA, protein, or coding region sequences in ten different organisms. The program is implemented to facilitate a wide range of queries with no restriction on the length or complexity of the query expression. The program is accessible on the web at http://bioinfo.mshri.on.ca/kangaroo/ and the source code is freely distributed at http://sourceforge.net/projects/slritools/. CONCLUSION A low-level simple pattern-matching application can prove to be a useful tool in many research settings. For example, Kangaroo was used to identify potential genetic targets in a human colorectal cancer variant that is characterized by a high frequency of mutations in coding regions containing mononucleotide repeats.
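The kind of regular-expression query this abstract describes, locating mononucleotide repeats in a sequence, can be sketched with Python's standard `re` module. This is not Kangaroo's code or query syntax; the function name and the `min_len` threshold of 8 are assumptions chosen for illustration, since the abstract does not state the repeat length used.

```python
import re


def mononucleotide_runs(seq, min_len=8):
    """Return (start, run) for each maximal run of one base of length >= min_len.

    Hypothetical helper; expects an uppercase DNA string.
    """
    # ([ACGT]) captures one base; \1{n,} requires at least n further copies of it.
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_len - 1))
    return [(m.start(), m.group(0)) for m in pattern.finditer(seq)]
```

The backreference keeps the pattern to a single expression regardless of which base is repeated, which is the sort of compact query a general pattern-matching tool makes possible.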
Affiliation(s)
- Doron Betel
- Department of Biochemistry, University of Toronto, Toronto, Ontario, M5S 1A8, Canada
- Samuel Lunenfeld Research Institute, Mount Sinai Hospital, 600 University Ave., Toronto, Ontario, M5G 1X5, Canada
- Christopher WV Hogue
- Department of Biochemistry, University of Toronto, Toronto, Ontario, M5S 1A8, Canada
- Samuel Lunenfeld Research Institute, Mount Sinai Hospital, 600 University Ave., Toronto, Ontario, M5G 1X5, Canada
15
Papatsenko DA, Makeev VJ, Lifanov AP, Régnier M, Nazina AG, Desplan C. Extraction of functional binding sites from unique regulatory regions: the Drosophila early developmental enhancers. Genome Res 2002; 12:470-81. [PMID: 11875036 PMCID: PMC155290 DOI: 10.1101/gr.212502] [Citation(s) in RCA: 60] [Impact Index Per Article: 2.7] [Indexed: 11/25/2022]
Abstract
The early developmental enhancers of Drosophila melanogaster comprise one of the most sophisticated regulatory systems in higher eukaryotes. An elaborate code in their DNA sequence translates both maternal and early embryonic regulatory signals into spatial distribution of transcription factors. One of the most striking features of this code is the redundancy of binding sites for these transcription factors (BSTF). Using this redundancy, we explored the possibility of predicting functional binding sites in a single enhancer region without any prior consensus/matrix description or evolutionary sequence comparisons. We developed a conceptually simple algorithm, Scanseq, that employs an original statistical evaluation for identifying the most redundant motifs and locates the position of potential BSTF in a given regulatory region. To estimate the biological relevance of our predictions, we built thorough literature-based annotations for the best-known Drosophila developmental enhancers and we generated detailed distribution maps for the most robust binding sites. The high statistical correlation between the location of BSTF in these experiment-based maps and the location predicted in silico by Scanseq confirmed the relevance of our approach. We also discuss the definition of true binding sites and the possible biological principles that govern patterning of regulatory regions and the distribution of transcriptional signals.
16
Pesole G, Mignone F, Gissi C, Grillo G, Licciulli F, Liuni S. Structural and functional features of eukaryotic mRNA untranslated regions. Gene 2001; 276:73-81. [PMID: 11591473 DOI: 10.1016/s0378-1119(01)00674-6] [Citation(s) in RCA: 292] [Impact Index Per Article: 12.7] [Indexed: 10/18/2022]
Abstract
The crucial role of the non-coding portion of genomes is now widely acknowledged. In particular, mRNA untranslated regions are involved in many post-transcriptional regulatory pathways that control mRNA localization, stability and translation efficiency. We review in this paper the major structural and compositional features of eukaryotic mRNA untranslated regions and provide some examples of bioinformatic analyses for their functional characterization.
Affiliation(s)
- G Pesole
- Dipartimento di Fisiologia e Biochimica Generali, Università di Milano, via Celoria, 26, 20133, Milan, Italy.
|
17
|
Medici N, Abbondanza C, Nigro V, Rossi V, Piluso G, Belsito A, Gallo L, Roscigno A, Bontempo P, Puca AA, Molinari AM, Moncharmont B, Puca GA. Identification of a DNA binding protein cooperating with estrogen receptor as RIZ (retinoblastoma interacting zinc finger protein). Biochem Biophys Res Commun 1999; 264:983-9. [PMID: 10544042 DOI: 10.1006/bbrc.1999.1604] [Citation(s) in RCA: 30] [Impact Index Per Article: 1.2] [Indexed: 11/22/2022]
Abstract
Double-stranded DNA fragments were selected from a random pool by repeated cycles of estrogen receptor-specific immunoprecipitation in the presence of a nuclear extract and PCR amplification (cyclic amplification and selection of target, CAST, for multiple elements). Fragments were cloned and sequence analysis indicated the 5-nucleotide word TTGGC was the most recurrent sequence unrelated to the known estrogen responsive element. Screening a HeLa cell expression library with a probe designed with multiple repeats of this sequence resulted in the identification of a 1700-aa protein showing a complete homology with the product of the human retinoblastoma-interacting zinc-finger gene RIZ. In transfection experiments, RIZ protein was able to bestow estrogen inducibility to a promoter containing an incomplete estrogen responsive element and a TTGGC motif. RIZ protein present in MCF-7 cell nuclear extract retarded the TTGGC-containing probe in an EMSA. Estrogen receptor was co-immunoprecipitated from MCF-7 cell extract by antibodies to RIZ protein and vice versa, thus indicating an existing interaction between these two proteins.
Affiliation(s)
- N Medici
- Facoltà di Medicina e Chirurgia, Seconda Università degli studi di Napoli, Larghetto Sant'Aniello a Caponapoli, 2, Naples, I-80138, Italy
|
18
|
Abstract
This paper presents a survey of currently available mathematical models and algorithmical methods for trying to identify promoter sequences. The methods concern both searching in a genome for a previously defined consensus and extracting a consensus from a set of sequences. Such methods were often tailored for either eukaryotes or prokaryotes although this does not preclude use of the same method for both types of organisms. The survey therefore covers all methods; however, emphasis is placed on prokaryotic promoter sequence identification. Illustrative applications of the main extracting algorithms are given for three bacteria.
Affiliation(s)
- A Vanet
- Institut de biologie physico-chimique, Paris, France
|
19
|
Ostergaard L, Pedersen AG, Jespersen HM, Brunak S, Welinder KG. Computational analyses and annotations of the Arabidopsis peroxidase gene family. FEBS Lett 1998; 433:98-102. [PMID: 9738941 DOI: 10.1016/s0014-5793(98)00849-7] [Citation(s) in RCA: 32] [Impact Index Per Article: 1.2] [Indexed: 11/25/2022]
Abstract
Classical heme-containing plant peroxidases have been ascribed a wide variety of functional roles related to development, defense, lignification, and hormonal signaling. More than 40 peroxidase genes are now known in Arabidopsis thaliana for which functional association is complicated by a general lack of peroxidase substrate specificity. Computational analysis was performed on 30 near full-length Arabidopsis peroxidase cDNAs for annotation of start codons and signal peptide cleavage sites. A compositional analysis revealed that 23 of the 30 peroxidase cDNAs have 5' untranslated regions containing 40-71% adenine, a rare feature observed also in cDNAs which predominantly encode stress-induced proteins, and which may indicate translational regulation.
Affiliation(s)
- L Ostergaard
- Department of Protein Chemistry, Institute of Molecular Biology, University of Copenhagen, Denmark
|
20
|
Pesole G, Attimonelli M, Saccone C. Linguistic analysis of nucleotide sequences: algorithms for pattern recognition and analysis of codon strategy. Methods Enzymol 1996; 266:281-94. [PMID: 8743690 DOI: 10.1016/s0076-6879(96)66019-4] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.4] [Indexed: 02/01/2023]
Affiliation(s)
- G Pesole
- Dipartimento di Biochimica e Biologia Molecolare, Università di Bari, Italy
|
21
|
Larsen NI, Engelbrecht J, Brunak S. Analysis of eukaryotic promoter sequences reveals a systematically occurring CT-signal. Nucleic Acids Res 1995; 23:1223-30. [PMID: 7739901 PMCID: PMC306835 DOI: 10.1093/nar/23.7.1223] [Citation(s) in RCA: 26] [Impact Index Per Article: 0.9] [Indexed: 01/26/2023] Open
Abstract
A general data study of eukaryotic promoter sequences from widely different species is presented. Mammalian promoters with known transcription initiation sites represented the largest subclass of the data, and for this group neural network algorithms were trained to predict the location of the initiation site in a test set. The prediction accuracy of this local method was higher than what could be expected from the known non-local structure of eukaryotic promoters. Subsequent analysis revealed, besides the consensus of the two known important subregions: the TATA-box TATAAA and the Cap-signal CA, a CT-signal positioned on the average seven nucleotides downstream of the transcription initiation site. The consensus of the CT-signal is CTNCNG. The details of this core promoter element were disclosed using multiple alignment and have earlier only been described in a few isolated examples.
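The CT-signal consensus CTNCNG quoted in this abstract can be searched for directly. The following is a minimal sketch, not the authors' method (they used neural networks and multiple alignment): it assumes N means any of A, C, G, T, and the function names `consensus_to_regex` and `find_consensus` are hypothetical.

```python
import re

# IUPAC codes mapped to regex character classes (only the subset needed here).
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "N": "[ACGT]"}

def consensus_to_regex(consensus):
    """Translate a consensus string such as 'CTNCNG' into a compiled regex."""
    return re.compile("".join(IUPAC[base] for base in consensus.upper()))

def find_consensus(sequence, consensus="CTNCNG"):
    """Return 0-based start positions of all matches, including overlapping ones."""
    pattern = consensus_to_regex(consensus)
    seq = sequence.upper()
    hits = []
    # Test every start position explicitly so overlapping matches are not skipped.
    for start in range(len(seq) - len(consensus) + 1):
        if pattern.match(seq, start):
            hits.append(start)
    return hits
```

Applied to a promoter sequence, the returned positions could then be compared with the annotated transcription initiation site to check for the downstream offset the abstract reports.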
Affiliation(s)
- N I Larsen
- Department of Physical Chemistry, Technical University of Denmark, Lyngby
|
22
|
Pesole G, Attimonelli M, Saccone C. Linguistic approaches to the analysis of sequence information. Trends Biotechnol 1994; 12:401-8. [PMID: 7765386 DOI: 10.1016/0167-7799(94)90028-0] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.4] [Indexed: 01/27/2023]
Abstract
Biological macromolecules have many features that resemble modern languages. Thus, linguistic approaches to the analysis of sequence information are becoming powerful tools for deciphering genetic texts. The methodologies used, to date, to determine the global parameters of the genetic language and meaningful patterns within it are described.
Affiliation(s)
- G Pesole
- Dipartimento di Biochimica e Biologia Molecolare, Università di Bari, Italy
|
23
|
Scherer S, McPeek MS, Speed TP. Atypical regions in large genomic DNA sequences. Proc Natl Acad Sci U S A 1994; 91:7134-8. [PMID: 8041759 PMCID: PMC44353 DOI: 10.1073/pnas.91.15.7134] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.4] [Indexed: 01/28/2023] Open
Abstract
Large genomic DNA sequences contain regions with distinctive patterns of sequence organization. We describe a method using logarithms of probabilities based on seventh-order Markov chains to rapidly identify genomic sequences that do not resemble models of genome organization built from compilations of octanucleotide usage. Data bases have been constructed from Escherichia coli and Saccharomyces cerevisiae DNA sequences of > 1000 nt and human sequences of > 10,000 nt. Atypical genes and clusters of genes have been located in bacteriophage, yeast, and primate DNA sequences. We consider criteria for statistical significance of the results, offer possible explanations for the observed variation in genome organization, and give additional applications of these methods in DNA sequence analysis.
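The idea of scoring a sequence by log-probabilities under a Markov chain trained on oligonucleotide usage can be illustrated roughly as follows. This is a sketch under stated assumptions, not the published procedure: the pseudocount, the sliding-window averaging, and the names `train_markov` and `window_scores` are all hypothetical, and the input is assumed to contain only A, C, G, and T.

```python
import math
from collections import Counter

def train_markov(sequence, order=7, pseudocount=1.0):
    """Count (order+1)-mers and return a function giving log P(base | context)."""
    seq = sequence.upper()
    kmer_counts = Counter(seq[i:i + order + 1] for i in range(len(seq) - order))
    context_counts = Counter()
    for kmer, n in kmer_counts.items():
        context_counts[kmer[:-1]] += n

    def log_prob(context, base):
        # Pseudocounts (an assumption here) keep unseen k-mers from giving log(0).
        num = kmer_counts[context + base] + pseudocount
        den = context_counts[context] + 4 * pseudocount
        return math.log(num / den)

    return log_prob

def window_scores(sequence, log_prob, order=7, window=50, step=25):
    """Mean log-probability per base in sliding windows; low values look atypical."""
    seq = sequence.upper()
    scores = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        total = sum(log_prob(chunk[i - order:i], chunk[i])
                    for i in range(order, len(chunk)))
        scores.append((start, total / (len(chunk) - order)))
    return scores
```

With `order=7` the model conditions each base on the preceding seven, which corresponds to the octanucleotide counts mentioned in the abstract; a smaller order is easier to try on short test sequences.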
Affiliation(s)
- S Scherer
- Human Genome Center, Lawrence Berkeley Laboratory, Berkeley, CA 94720
|
24
|
Pesole G, Fiormarino G, Saccone C. Sequence analysis and compositional properties of untranslated regions of human mRNAs. Gene 1994; 140:219-25. [PMID: 8144029 DOI: 10.1016/0378-1119(94)90547-9] [Citation(s) in RCA: 19] [Impact Index Per Article: 0.6] [Indexed: 01/29/2023]
Abstract
A detailed computer analysis of the untranslated regions, 5'-UTR and 3'-UTR, of human mRNA sequences is reported. The compositional properties of these regions, compared with those of the corresponding coding regions, indicate that 5'-UTR and 3'-UTR are less affected by the isochore compartmentalization than the corresponding third codon positions of mRNAs. The presence of higher functional constraints in 5'-UTR is also reported. Dinucleotide analysis shows a depletion of CpG and TpA in both sequences. A search for significant sequence motifs using the WORDUP algorithm reveals the patterns already known to have a functional role in the mRNA UTR, and several other motifs whose functional roles remain to be demonstrated. This type of analysis may be particularly useful for guiding site-directed mutagenesis experiments. In addition, it can be used for assessing the nature of anonymous sequences now produced in large amounts in megabase sequencing projects.
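A depletion of CpG or TpA of the kind reported here is commonly expressed as an observed/expected ratio, where the expected dinucleotide frequency is the product of the two base frequencies. The abstract does not say which statistic the authors used, so the sketch below is only one plausible reading; the function name `dinucleotide_odds` is hypothetical.

```python
from collections import Counter

def dinucleotide_odds(sequence):
    """Observed/expected ratio for each dinucleotide; values below 1 suggest depletion."""
    seq = sequence.upper()
    n = len(seq)
    if n < 2:
        return {}
    base_freq = {b: seq.count(b) / n for b in "ACGT"}
    # Overlapping dinucleotides: positions (0,1), (1,2), ..., (n-2, n-1).
    pair_counts = Counter(seq[i:i + 2] for i in range(n - 1))
    ratios = {}
    for x in "ACGT":
        for y in "ACGT":
            expected = base_freq[x] * base_freq[y]
            if expected > 0:  # skip pairs involving a base that never occurs
                ratios[x + y] = (pair_counts[x + y] / (n - 1)) / expected
    return ratios
```

In this reading, the entries `ratios["CG"]` and `ratios["TA"]` for a set of UTR sequences would be the values to inspect for the depletion the abstract describes.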
Affiliation(s)
- G Pesole
- Centro Studi Mitocondri e Metabolismo Energetico, CNR, Bari, Italy
|
25
|
de Zamaroczy M, Bernardi G. The mosaic organization of the mitochondrial introns of Saccharomyces cerevisiae: features and evolutionary origins. Gene 1992; 122:91-9. [PMID: 1452043 DOI: 10.1016/0378-1119(92)90036-o] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.0] [Indexed: 12/27/2022]
Abstract
The introns of three genes (oxi3, cob and 21S) from the mitochondrial (mt) genome of Saccharomyces cerevisiae contain closed reading frames (CRFs). In the present work, we have analyzed these sequences in their oligodeoxyribonucleotide (oligo; isostich) patterns. We have shown that the relative amounts of di- to hexanucleotides, when compared to random sequences having the same sizes and compositions, exhibit the same deviations as the intergenic noncoding sequences of the mt genome (except for the CRFs from 21S intron). In contrast, intronic open reading frames (ORFs) showed oligo patterns which were generally quite distinct from those of CRFs, although some similarities could be detected in some cases (especially for aI5 alpha). The mt introns of yeast, therefore, are endowed with a mosaic structure, in which CRFs derive from mt intergenic sequences, whereas ORFs have a different origin (indicated as exogenous by other evidences) yet show, in some cases, the effects of 'sequence assimilation' with CRFs.
Affiliation(s)
- M de Zamaroczy
- Laboratoire de Génétique Moléculaire, Institut Jacques Monod, Paris, France
|