1
Tognon M, Giugno R, Pinello L. A survey on algorithms to characterize transcription factor binding sites. Brief Bioinform 2023; 24:bbad156. [PMID: 37099664 PMCID: PMC10422928 DOI: 10.1093/bib/bbad156] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Received: 01/13/2023] [Revised: 03/27/2023] [Accepted: 04/01/2023] [Indexed: 04/28/2023]
Abstract
Transcription factors (TFs) are key regulatory proteins that control the transcriptional rate of cells by binding short DNA sequences called transcription factor binding sites (TFBS) or motifs. Identifying and characterizing TFBS is fundamental to understanding the regulatory mechanisms governing the transcriptional state of cells. During the last decades, several experimental methods have been developed to recover DNA sequences containing TFBS. In parallel, computational methods have been proposed to discover and identify TFBS motifs based on these DNA sequences. This is one of the most widely investigated problems in bioinformatics and is referred to as the motif discovery problem. In this manuscript, we review classical and novel experimental and computational methods developed to discover and characterize TFBS motifs in DNA sequences, highlighting their advantages and drawbacks. We also discuss open challenges and future perspectives that could fill the remaining gaps in the field.
Affiliation(s)
- Manuel Tognon
  - Computer Science Department, University of Verona, Verona, Italy
  - Molecular Pathology Unit, Center for Computational and Integrative Biology and Center for Cancer Research, Massachusetts General Hospital, Charlestown, Massachusetts, United States of America
  - Broad Institute of MIT and Harvard, Cambridge, Massachusetts, United States of America
- Rosalba Giugno
  - Computer Science Department, University of Verona, Verona, Italy
- Luca Pinello
  - Molecular Pathology Unit, Center for Computational and Integrative Biology and Center for Cancer Research, Massachusetts General Hospital, Charlestown, Massachusetts, United States of America
  - Broad Institute of MIT and Harvard, Cambridge, Massachusetts, United States of America
  - Department of Pathology, Harvard Medical School, Boston, Massachusetts, United States of America
2
Mohanty S, Pattnaik PK, Al-Absi AA, Kang DK. A Review on Planted (l, d) Motif Discovery Algorithms for Medical Diagnose. Sensors (Basel, Switzerland) 2022; 22:1204. [PMID: 35161949 PMCID: PMC8838483 DOI: 10.3390/s22031204] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/01/2021] [Revised: 01/19/2022] [Accepted: 01/31/2022] [Indexed: 11/16/2022]
Abstract
Personalized diagnosis of chronic disease requires capturing the continual pattern across the biological sequence. This repeating pattern in medical science is called a "motif". Motifs are short, recurring patterns in biological sequences that are supposed to signify some health disorder. They identify the binding sites for transcription factors that modulate and synchronize gene expression. These motifs are important for the analysis and interpretation of various health issues such as human disease, gene function, drug design, and patients' conditions. Searching for these patterns is an important step in unraveling the mechanisms of gene expression in order to properly diagnose and treat chronic disease. Thus, motif identification has a vital role in healthcare studies and attracts many researchers. Numerous approaches have been described for the motif discovery process. This article reviews and analyzes fifty-four of the most frequently found motif discovery processes/algorithms from different approaches and summarizes the discussion with their strengths and weaknesses.
Affiliation(s)
- Satarupa Mohanty
  - School of Computer Engineering, KIIT Deemed to be University, Bhubaneswar 751024, India
- Prasant Kumar Pattnaik
  - School of Computer Engineering, KIIT Deemed to be University, Bhubaneswar 751024, India
- Dae-Ki Kang
  - Department of Computer & Information Engineering, Dongseo University, 47 Jurye-ro, Sasang-gu, Busan 47011, Korea
3
Quan L, Mei J, He R, Sun X, Nie L, Li K, Lyu Q. Quantifying Intensities of Transcription Factor-DNA Binding by Learning From an Ensemble of Protein Binding Microarrays. IEEE J Biomed Health Inform 2021; 25:2811-2819. [PMID: 33571101 DOI: 10.1109/jbhi.2021.3058518] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/05/2022]
Abstract
The control of the coordinated expression of genes is primarily regulated by the interactions between transcription factors (TFs) and their DNA binding sites, which are an integral part of transcriptional regulatory networks. There are many computational tools focused on determining TF binding or unbinding to a DNA sequence. However, other tools focused on further determining the relative preference of such binding are needed. Here, we propose a regression model with deep learning, called SemanticBI, to predict intensities of TF-DNA binding. SemanticBI is a convolutional neural network (CNN)-recurrent neural network (RNN) architecture model that was trained on an ensemble of protein binding microarray data sets that covered multiple TFs. Using this approach, SemanticBI exhibited superior accuracy in predicting binding intensities compared to other popular methods. Moreover, SemanticBI uncovered vectorized sequence-oriented features using its CNN-RNN architecture, which is an abstract representation of the original DNA sequences. Additionally, the use of SemanticBI raises the question of whether motifs are necessary for computational models of TF binding. The online SemanticBI service can be accessed at http://qianglab.scst.suda.edu.cn/semantic/.
4
RNAdemocracy: an ensemble method for RNA secondary structure prediction using consensus scoring. Comput Biol Chem 2019; 83:107151. [DOI: 10.1016/j.compbiolchem.2019.107151] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/30/2018] [Revised: 06/05/2019] [Accepted: 10/15/2019] [Indexed: 11/18/2022]
5
Rogozin IB, Pavlov YI, Goncearenco A, De S, Lada AG, Poliakov E, Panchenko AR, Cooper DN. Mutational signatures and mutable motifs in cancer genomes. Brief Bioinform 2019; 19:1085-1101. [PMID: 28498882 DOI: 10.1093/bib/bbx049] [Citation(s) in RCA: 23] [Impact Index Per Article: 4.6] [Received: 01/25/2017] [Indexed: 12/22/2022]
Abstract
Cancer is a genetic disorder, meaning that a plethora of different mutations, whether somatic or germ line, underlie the etiology of the 'Emperor of Maladies'. Point mutations, chromosomal rearrangements and copy number changes, whether they have occurred spontaneously in predisposed individuals or have been induced by intrinsic or extrinsic (environmental) mutagens, lead to the activation of oncogenes and inactivation of tumor suppressor genes, thereby promoting malignancy. This scenario has now been recognized and experimentally confirmed in a wide range of different contexts. Over the past decade, a surge in available sequencing technologies has allowed the sequencing of whole genomes from liquid malignancies and solid tumors belonging to different types and stages of cancer, giving birth to the new field of cancer genomics. One of the most striking discoveries has been that cancer genomes are highly enriched with mutations of specific kinds. It has been suggested that these mutations can be classified into 'families' based on their mutational signatures. A mutational signature may be regarded as a type of base substitution (e.g. C:G to T:A) within a particular context of neighboring nucleotide sequence (the bases upstream and/or downstream of the mutation). These mutational signatures, supplemented by mutable motifs (a wider mutational context), promise to help us to understand the nature of the mutational processes that operate during tumor evolution because they represent the footprints of interactions between DNA, mutagens and the enzymes of the repair/replication/modification pathways.
Affiliation(s)
- Igor B Rogozin
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, USA
- Youri I Pavlov
  - Eppley Institute for Cancer Research, University of Nebraska Medical Center, USA
- Artem G Lada
  - Department Microbiology and Molecular Genetics, University of California, Davis, USA
- Eugenia Poliakov
  - Laboratory of Retinal Cell and Molecular Biology, National Eye Institute, National Institutes of Health, USA
- Anna R Panchenko
  - National Center for Biotechnology Information, National Institutes of Health, USA
6
Lee NK, Li X, Wang D. A comprehensive survey on genetic algorithms for DNA motif prediction. Inf Sci (N Y) 2018. [DOI: 10.1016/j.ins.2018.07.004] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.2] [Indexed: 12/25/2022]
7
Raghunath A, Nagarajan R, Sundarraj K, Panneerselvam L, Perumal E. Genome-wide identification and analysis of Nrf2 binding sites - Antioxidant response elements in zebrafish. Toxicol Appl Pharmacol 2018; 360:236-248. [PMID: 30243843 DOI: 10.1016/j.taap.2018.09.013] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Received: 04/28/2018] [Revised: 09/08/2018] [Accepted: 09/13/2018] [Indexed: 12/30/2022]
Abstract
In the post-genomic era, deciphering the Nrf2 binding sites - antioxidant response elements (AREs) is an essential task that underlies and governs the Keap1-Nrf2-ARE pathway - a cell survival response pathway to environmental stresses in the vertebrate model system. AREs regulate the transcription of a repertoire of phase II detoxifying and/or oxidative-stress responsive genes, offering protection against toxic chemicals, carcinogens, and xenobiotics. In order to identify and analyze AREs in zebrafish, a pattern search algorithm was developed to identify AREs and computational tools available online were utilized to analyze the identified AREs in zebrafish. This study identified the AREs within 30 kb upstream from the transcription start site of antioxidant genes and mitochondrial genes. We report for the first time the AREs of all the known protein coding genes in the zebrafish genome. Western blotting, RT2 profiler array PCR, and qRT-PCR were performed to test whether AREs influence the Nrf2 target genes expression in the zebrafish larvae using sulforaphane. This study reveals unique AREs that have not been previously reported in the cytoprotective genes. Nine TGAG/CNNNTC and six TGAG/CNNNGC AREs were observed significantly. Our findings suggest that AREs drive the dynamic transcriptional events of Nrf2 target genes in the zebrafish larvae on exposure to sulforaphane. The identified abundant putative AREs will define the Keap1-Nrf2-ARE network and elucidate the precise regulation of Nrf2-ARE pathway in not only diseases but also in embryonic development, inflammation, and aerobic respiration. Our results help to understand the dynamic complexity of the Nrf2-ARE system in zebrafish.
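The pattern search mentioned in this abstract can be illustrated with a short sketch. It assumes that the notation TGAG/CNNNTC means TGA, then G or C, then any three bases, then TC (and likewise for the GC variant); that reading is an interpretation, not something the abstract states. The function name `find_are_sites` and the example sequence are hypothetical, and only the forward strand is scanned, so this is not the authors' algorithm.

```python
import re

# Assumed reading of the abstract's notation: "G/C" = G or C, "N" = any base.
# The lookahead (?=...) lets overlapping sites be reported as well.
ARE_PATTERNS = {
    "TGAG/CNNNTC": re.compile(r"(?=(TGA[GC][ACGT]{3}TC))"),
    "TGAG/CNNNGC": re.compile(r"(?=(TGA[GC][ACGT]{3}GC))"),
}

def find_are_sites(sequence):
    """Return (pattern_name, 0-based start, matched_site) tuples, forward strand only."""
    sequence = sequence.upper()
    hits = []
    for name, pattern in ARE_PATTERNS.items():
        for match in pattern.finditer(sequence):
            hits.append((name, match.start(), match.group(1)))
    return sorted(hits, key=lambda hit: hit[1])

# Hypothetical example sequence, not taken from the zebrafish genome.
for hit in find_are_sites("AATGACTCAGCAGGTGAGAAATCTT"):
    print(hit)
```

A genome-wide scan would additionally need reverse-complement matching and a window restricted to the upstream region of each gene, which this sketch leaves out.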
Affiliation(s)
- Azhwar Raghunath
  - Molecular Toxicology Laboratory, Department of Biotechnology, Bharathiar University, Coimbatore 641 046, Tamilnadu, India
- Raju Nagarajan
  - Department of Biotechnology, Indian Institute of Technology Madras, Chennai 600 036, Tamilnadu, India
- Kiruthika Sundarraj
  - Molecular Toxicology Laboratory, Department of Biotechnology, Bharathiar University, Coimbatore 641 046, Tamilnadu, India
- Lakshmikanthan Panneerselvam
  - Molecular Toxicology Laboratory, Department of Biotechnology, Bharathiar University, Coimbatore 641 046, Tamilnadu, India
- Ekambaram Perumal
  - Molecular Toxicology Laboratory, Department of Biotechnology, Bharathiar University, Coimbatore 641 046, Tamilnadu, India
8
Lee C, Moroldo M, Perdomo-Sabogal A, Mach N, Marthey S, Lecardonnel J, Wahlberg P, Chong AY, Estellé J, Ho SYW, Rogel-Gaillard C, Gongora J. Inferring the evolution of the major histocompatibility complex of wild pigs and peccaries using hybridisation DNA capture-based sequencing. Immunogenetics 2017; 70:401-417. [PMID: 29256177 DOI: 10.1007/s00251-017-1048-9] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Received: 06/19/2017] [Accepted: 11/25/2017] [Indexed: 12/20/2022]
Abstract
The major histocompatibility complex (MHC) is a key genomic model region for understanding the evolution of gene families and the co-evolution between host and pathogen. To date, MHC studies have mostly focused on species from major vertebrate lineages. The evolution of MHC classical (Ia) and non-classical (Ib) genes in pigs has attracted interest because of their antigen presentation roles as part of the adaptive immune system. The pig family Suidae comprises over 18 extant species (mostly wild), but only the domestic pig has been extensively sequenced and annotated. To address this, we used a DNA-capture approach, with probes designed from the domestic pig genome, to generate MHC data for 11 wild species of pigs and their closest living family, Tayassuidae. The approach showed good efficiency for wild pigs (~80% reads mapped, ~87× coverage), compared to tayassuids (~12% reads mapped, ~4× coverage). We retrieved 145 MHC loci across both families. Phylogenetic analyses show that the class Ia and Ib genes underwent multiple duplications and diversifications before suids and tayassuids diverged from their common ancestor. The histocompatibility genes mostly form orthologous groups and there is genetic differentiation for most of these genes between Eurasian and sub-Saharan African wild pigs. Tests of selection showed that the peptide-binding region of class Ib genes was under positive selection. These findings contribute to better understanding of the evolutionary history of the MHC, specifically, the class I genes, and provide useful data for investigating the immune response of wild populations against pathogens.
Affiliation(s)
- Carol Lee
  - Sydney School of Veterinary Science, Faculty of Science, The University of Sydney, Sydney, Australia
- Marco Moroldo
  - GABI, INRA, AgroParisTech, Université Paris-Saclay, Jouy-en-Josas, France
- Alvaro Perdomo-Sabogal
  - Sydney School of Veterinary Science, Faculty of Science, The University of Sydney, Sydney, Australia
  - Institute of Animal Science (460i), Department of Bioinformatics, University of Hohenheim, Stuttgart, Germany
- Núria Mach
  - GABI, INRA, AgroParisTech, Université Paris-Saclay, Jouy-en-Josas, France
- Sylvain Marthey
  - GABI, INRA, AgroParisTech, Université Paris-Saclay, Jouy-en-Josas, France
- Jérôme Lecardonnel
  - GABI, INRA, AgroParisTech, Université Paris-Saclay, Jouy-en-Josas, France
- Per Wahlberg
  - GABI, INRA, AgroParisTech, Université Paris-Saclay, Jouy-en-Josas, France
- Amanda Y Chong
  - Sydney School of Veterinary Science, Faculty of Science, The University of Sydney, Sydney, Australia
  - Earlham Institute, Norwich Research Park, Norwich, UK
- Jordi Estellé
  - GABI, INRA, AgroParisTech, Université Paris-Saclay, Jouy-en-Josas, France
- Simon Y W Ho
  - School of Life and Environmental Sciences, Faculty of Science, The University of Sydney, Sydney, Australia
- Jaime Gongora
  - Sydney School of Veterinary Science, Faculty of Science, The University of Sydney, Sydney, Australia
9
Wolff JG. A scaleable technique for best-match retrieval of sequential information using metrics-guided search. J Inf Sci 2016. [DOI: 10.1177/016555159402000103] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Indexed: 11/16/2022]
Abstract
A new technique is described for retrieving information by finding the best match or matches between a textual 'query' and a textual database. The technique uses principles of beam search with a measure of probability to guide the search and prune the search tree. Unlike many methods for comparing strings, the method gives a set of alternative matches, graded by the 'quality' of the matching achieved.
Affiliation(s)
- J. Gerard Wolff
  - School of Electronic Engineering and Computer Systems, University of Wales, Bangor, Gwynedd, Wales, UK
10
Hosseinpour B, Bakhtiarizadeh MR, Khosravi P, Ebrahimie E. Predicting distinct organization of transcription factor binding sites on the promoter regions: a new genome-based approach to expand human embryonic stem cell regulatory network. Gene 2013; 531:212-9. [DOI: 10.1016/j.gene.2013.09.011] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.2] [Received: 03/25/2013] [Revised: 09/01/2013] [Accepted: 09/04/2013] [Indexed: 12/23/2022]
11
Quader S, Huang CH. Effect of positional dependence and alignment strategy on modeling transcription factor binding sites. BMC Res Notes 2012; 5:340. [PMID: 22748199 PMCID: PMC3465234 DOI: 10.1186/1756-0500-5-340] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 04/14/2012] [Accepted: 06/07/2012] [Indexed: 11/29/2022]
Abstract
Background Many consensus-based and Position Weight Matrix-based methods for recognizing transcription factor binding sites (TFBS) are not well suited to the variability in the lengths of binding sites. Besides, many methods discard known binding sites while building the model. Moreover, the impact of Information Content (IC) and the positional dependence of nucleotides within an aligned set of TFBS has not been well researched for modeling variable-length binding sites. In this paper, we propose ML-Consensus (Mixed-Length Consensus): a consensus model for variable-length TFBS which does not exclude any reported binding sites. Methods We consider Pairwise Score (PS) as a measure of positional dependence of nucleotides within an alignment of TFBS. We investigate how the prediction accuracy of ML-Consensus is affected by the incorporation of IC and PS with a particular binding site alignment strategy. We perform cross-validations for datasets of six species from the TRANSFAC public database, and analyze the results using ROC curves and the Wilcoxon matched-pair signed-ranks test. Results We observe that the incorporation of IC and PS in ML-Consensus results in statistically significant improvement in the prediction accuracy of the model. Moreover, the existence of a core region among the known binding sites (of any length) is witnessed by the pairwise coexistence of nucleotides within the core length. Conclusions These observations suggest the possibility of an efficient multiple sequence alignment algorithm for aligning TFBS, accommodating known binding sites of any length, for optimal (or near-optimal) TFBS prediction. However, designing such an algorithm is a matter of further investigation.
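The Information Content (IC) referred to in this abstract can be made concrete with a small sketch. It uses the standard per-column definition IC = 2 + Σ p_b log2 p_b for DNA, assuming a uniform background, no pseudocounts, and no small-sample correction. Whether ML-Consensus uses exactly this form is not stated in the abstract, and the function name and example sites are hypothetical.

```python
import math
from collections import Counter

def column_information_content(sites):
    """Per-column information content (bits) of equal-length aligned DNA sites.

    Simplifying assumptions: uniform background (maximum of 2 bits per
    column), no pseudocounts, no small-sample correction.
    """
    if not sites or len({len(site) for site in sites}) != 1:
        raise ValueError("sites must be a non-empty list of equal-length strings")
    ic = []
    for column in zip(*sites):
        counts = Counter(base.upper() for base in column)
        total = sum(counts.values())
        entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
        ic.append(2.0 - entropy)
    return ic

# Hypothetical aligned sites: every column is conserved except the fourth.
aligned = ["TGACTCA", "TGAGTCA", "TGACTCA", "TGAGTCA"]
print(column_information_content(aligned))
```

Fully conserved columns reach the 2-bit maximum, while a column split evenly between two bases carries 1 bit.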
Affiliation(s)
- Saad Quader
  - Department of Computer Science & Engineering, University of Connecticut, Storrs, 06269-2155, USA
12
Intron identification approaches based on weighted features and fuzzy decision trees. Comput Biol Med 2011; 42:112-22. [PMID: 22099702 DOI: 10.1016/j.compbiomed.2011.10.015] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Received: 08/25/2010] [Revised: 04/11/2011] [Accepted: 10/13/2011] [Indexed: 11/22/2022]
Abstract
Current computational predictions of splice sites largely depend on the sequence patterns of known intronic sequence features (ISFs) described in the classical intron definition model (IDM). The computation-oriented IDM (CO-IDM) clearly provides more specific and concrete information for describing intron flanks of splice sites (IFSSs). In the paper, we proposed a novel approach of fuzzy decision trees (FDTs) which utilize (1) weighted ISFs of twelve uni-frame patterns (UFPs) and forty-five multi-frame patterns (MFPs) and (2) gain ratios to improve the performances in identifying an intron. First, we fuzzified extracted features from genomic sequences using membership functions with an unsupervised self-organizing map (SOM) technique. Then, we brought in different viewpoints of globally weighting and crossly referring in generating fuzzy rules, which are interpretable and useful for biologists to verify whether a sequence is an intron or not. Finally, the experimental results revealed the effectiveness of the proposed method in improving the identification accuracy. Besides, we also implemented an on-line intronic identifier to infer an unknown genomic sequence.
13
Reid JE, Evans KJ, Dyer N, Wernisch L, Ott S. Variable structure motifs for transcription factor binding sites. BMC Genomics 2010; 11:30. [PMID: 20074339 PMCID: PMC2824720 DOI: 10.1186/1471-2164-11-30] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.1] [Received: 08/11/2009] [Accepted: 01/14/2010] [Indexed: 02/06/2023]
Abstract
Background Classically, models of DNA-transcription factor binding sites (TFBSs) have been based on relatively few known instances and have treated them as sites of fixed length using position weight matrices (PWMs). Various extensions to this model have been proposed, most of which take account of dependencies between the bases in the binding sites. However, some transcription factors are known to exhibit some flexibility and bind to DNA in more than one possible physical configuration. In some cases this variation is known to affect the function of binding sites. With the increasing volume of ChIP-seq data available it is now possible to investigate models that incorporate this flexibility. Previous work on variable length models has been constrained by: a focus on specific zinc finger proteins in yeast using restrictive models; a reliance on hand-crafted models for just one transcription factor at a time; and a lack of evaluation on realistically sized data sets. Results We re-analysed binding sites from the TRANSFAC database and found motivating examples where our new variable length model provides a better fit. We analysed several ChIP-seq data sets with a novel motif search algorithm and compared the results to one of the best standard PWM finders and a recently developed alternative method for finding motifs of variable structure. All the methods performed comparably in held-out cross validation tests. Known motifs of variable structure were recovered for p53, Stat5a and Stat5b. In addition our method recovered a novel generalised version of an existing PWM for Sp1 that allows for variable length binding. This motif improved classification performance. Conclusions We have presented a new gapped PWM model for variable length DNA binding sites that is not too restrictive nor over-parameterised. Our comparison with existing tools shows that on average it does not have better predictive accuracy than existing methods. 
However, it does provide more interpretable models of motifs of variable structure that are suitable for follow-up structural studies. To our knowledge, we are the first to apply variable length motif models to eukaryotic ChIP-seq data sets and consequently the first to show their value in this domain. The results include a novel motif for the ubiquitous transcription factor Sp1.
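For orientation, the classical fixed-length PWM model that this abstract takes as its starting point can be sketched as follows. This is the standard log-odds formulation with a pseudocount and a uniform background, not the gapped variable-length model the authors propose. The function names, the pseudocount value, and the example data are illustrative assumptions, and input is assumed to be uppercase A/C/G/T on the forward strand.

```python
import math

BASES = "ACGT"

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Log-odds PWM (one dict per position) from equal-length aligned sites."""
    pwm = []
    for i in range(len(sites[0])):
        counts = {base: pseudocount for base in BASES}
        for site in sites:
            counts[site[i]] += 1
        total = sum(counts.values())
        pwm.append({b: math.log2((counts[b] / total) / background) for b in BASES})
    return pwm

def best_hit(pwm, sequence):
    """Return (start, score) of the highest-scoring window in the sequence."""
    width = len(pwm)
    scores = [
        (start, sum(pwm[i][sequence[start + i]] for i in range(width)))
        for start in range(len(sequence) - width + 1)
    ]
    return max(scores, key=lambda item: item[1])

# Hypothetical training sites and target sequence.
pwm = build_pwm(["TGACTCA", "TGAGTCA", "TGACTCA", "TGAGTCA"])
start, score = best_hit(pwm, "CCGTTGACTCAGGA")
print(start, round(score, 3))
```

Because every window has the same fixed width, a model of this kind cannot represent the optional gaps or alternative spacings that motivate variable-length models.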
Affiliation(s)
- John E Reid
  - MRC Biostatistics Unit, Institute of Public Health, Forvie Site, Cambridge, CB2 0SR, UK
14
15
van Hijum SAFT, Medema MH, Kuipers OP. Mechanisms and evolution of control logic in prokaryotic transcriptional regulation. Microbiol Mol Biol Rev 2009; 73:481-509, Table of Contents. [PMID: 19721087 PMCID: PMC2738135 DOI: 10.1128/mmbr.00037-08] [Citation(s) in RCA: 96] [Impact Index Per Article: 6.4] [Indexed: 11/20/2022]
Abstract
A major part of organismal complexity and versatility of prokaryotes resides in their ability to fine-tune gene expression to adequately respond to internal and external stimuli. Evolution has been very innovative in creating intricate mechanisms by which different regulatory signals operate and interact at promoters to drive gene expression. The regulation of target gene expression by transcription factors (TFs) is governed by control logic brought about by the interaction of regulators with TF binding sites (TFBSs) in cis-regulatory regions. A factor that in large part determines the strength of the response of a target to a given TF is motif stringency, the extent to which the TFBS fits the optimal TFBS sequence for a given TF. Advances in high-throughput technologies and computational genomics allow reconstruction of transcriptional regulatory networks in silico. To optimize the prediction of transcriptional regulatory networks, i.e., to separate direct regulation from indirect regulation, a thorough understanding of the control logic underlying the regulation of gene expression is required. This review summarizes the state of the art of the elements that determine the functionality of TFBSs by focusing on the molecular biological mechanisms and evolutionary origins of cis-regulatory regions.
Affiliation(s)
- Sacha A F T van Hijum
  - Molecular Genetics, Groningen Biomolecular Sciences and Biotechnology Institute, University of Groningen, Kerklaan 30, 9751 NN Haren, The Netherlands
16
Hou L, Qian MP, Zhu YP, Deng MH. Advances on bioinformatic research in transcription factor binding sites. Yi Chuan = Hereditas 2009; 31:365-73. [DOI: 10.3724/sp.j.1005.2009.00365] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 11/25/2022]
17
Zare-Mirakabad F, Ahrabian H, Sadeghi M, Nowzari-Dalini A, Goliaei B. New scoring schema for finding motifs in DNA Sequences. BMC Bioinformatics 2009; 10:93. [PMID: 19302709 PMCID: PMC2679735 DOI: 10.1186/1471-2105-10-93] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Received: 09/14/2008] [Accepted: 03/20/2009] [Indexed: 11/10/2022]
Abstract
BACKGROUND: Pattern discovery in DNA sequences is one of the most fundamental problems in molecular biology, with important applications in finding regulatory signals and transcription factor binding sites. An important task in this problem is to search for (or predict) known binding sites in a new DNA sequence. To this end, all subsequences of the given DNA sequence are scored with a scoring function, and the prediction is made by selecting the best score. Most of the available tools for known binding site prediction are designed under the assumption that binding site base positions are independent. Recently, Tomovic and Oakeley investigated the statistical basis for either a claim of dependence or independence, to determine whether such a claim is generally true, and they presented a scoring function for binding site prediction based on the dependency between binding site base positions. Our primary objective is to investigate the scoring functions that can be used in known binding site prediction under the assumption of dependence or independence between binding site base positions. RESULTS: We propose a new scoring function based on the dependency between all positions in the binding site. This scoring function uses joint information content and mutual information as measures of dependency between positions in a transcription factor binding site. Our method for modeling dependencies is simply an extension of position-independence methods. We evaluate our new scoring function on real data sets extracted from the JASPAR and TRANSFAC databases, and compare the results with two other well-known scoring functions. CONCLUSION: The results demonstrate that the new approach improves known binding site discovery and show that joint information content and mutual information provide a better and more general criterion for investigating the relationships between positions in the TFBS. Our scoring function is formulated by simple mathematical calculations. Applying our method to several biological data sets indicates that it performs better than methods that do not consider dependencies.
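As an illustration of the dependency measure named in this abstract, the mutual information between two columns of an aligned set of binding sites can be estimated as below. This is a minimal sketch using plain empirical frequencies with no pseudocounts; it shows only the textbook quantity, not the authors' full scoring function, and the function name and example sites are hypothetical.

```python
import math
from collections import Counter

def mutual_information(sites, i, j):
    """Empirical mutual information (bits) between alignment columns i and j."""
    n = len(sites)
    col_i = Counter(site[i] for site in sites)
    col_j = Counter(site[j] for site in sites)
    joint = Counter((site[i], site[j]) for site in sites)
    return sum(
        (count / n) * math.log2((count / n) / ((col_i[a] / n) * (col_j[b] / n)))
        for (a, b), count in joint.items()
    )

# Hypothetical sites: positions 0 and 3 always change together,
# while position 1 is constant and therefore carries no dependency.
sites = ["ACGT", "ACGT", "TCGA", "TCGA"]
print(mutual_information(sites, 0, 3))
print(mutual_information(sites, 0, 1))
```

With few sites, empirical mutual information is biased upward, so a practical scoring scheme would need some correction or significance test before treating two positions as dependent.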
Affiliation(s)
- Fatemeh Zare-Mirakabad
  - Department of Bioinformatics, Institute of Biochemistry and Biophysics, University of Tehran, Tehran, Iran
- Hayedeh Ahrabian
  - Center of Excellence in Biomathematics, School of Mathematics, Statistics, and Computer Science, University of Tehran, Tehran, Iran
- Mehdei Sadeghi
  - National Institute of Genetic Engineering and Biotechnology, Tehran, Iran
  - School of Computer Science, Institute for Studies in Theoretical Physics and Mathematics (IPM), Tehran, Iran
- Abbas Nowzari-Dalini
  - Center of Excellence in Biomathematics, School of Mathematics, Statistics, and Computer Science, University of Tehran, Tehran, Iran
- Bahram Goliaei
  - Department of Bioinformatics, Institute of Biochemistry and Biophysics, University of Tehran, Tehran, Iran
18
Liu S, Song Q, Cao A, Yang X, Wu Y. Robust mixture model clustering of DNA binding sites. CONFERENCE PROCEEDINGS : ... ANNUAL INTERNATIONAL CONFERENCE OF THE IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. ANNUAL CONFERENCE 2008; 2006:2032-5. [PMID: 17946928 DOI: 10.1109/iembs.2006.260414] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/10/2022]
Abstract
Nucleotide sequences contain motifs that are preserved through evolution because they are important to the structure or function of the molecules. DNA binding site analysis is an important issue in biology experiments as well as in computational methods. To find DNA binding sites that bind to specific transcription factors, we develop a robust mixed effect mixture model (RMEMM). The DNA sequences are represented as a mixed effect model of position-specific frequency, considering the relationship of frequency between positions. The results show that the mean effect is similar to position-specific scoring matrices (PSSM), providing a new view of the sequence. This model is robust to outliers and to data with somewhat heavy-tailed distributions.
Collapse
|
19
|
Schug J. Using TESS to Predict Transcription Factor Binding Sites in DNA Sequence. Curr Protoc Bioinformatics 2008; Chapter 2:Unit 2.6. [DOI: 10.1002/0471250953.bi0206s21] [Citation(s) in RCA: 182] [Impact Index Per Article: 11.4] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/09/2022]
|
20
|
Habib N, Kaplan T, Margalit H, Friedman N. A novel Bayesian DNA motif comparison method for clustering and retrieval. PLoS Comput Biol 2008; 4:e1000010. [PMID: 18463706 PMCID: PMC2265534 DOI: 10.1371/journal.pcbi.1000010] [Citation(s) in RCA: 39] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/19/2007] [Accepted: 01/24/2008] [Indexed: 11/17/2022] Open
Abstract
Characterizing the DNA-binding specificities of transcription factors is a key problem in computational biology that has been addressed by multiple algorithms. These usually take as input sequences that are putatively bound by the same factor and output one or more DNA motifs. A common practice is to apply several such algorithms simultaneously to improve coverage at the price of redundancy. In interpreting such results, two tasks are crucial: clustering of redundant motifs, and attributing the motifs to transcription factors by retrieval of similar motifs from previously characterized motif libraries. Both tasks inherently involve motif comparison. Here we present a novel method for comparing and merging motifs, based on Bayesian probabilistic principles. This method takes into account both the similarity in positional nucleotide distributions of the two motifs and their dissimilarity to the background distribution. We demonstrate the use of the new comparison method as a basis for motif clustering and retrieval procedures, and compare it to several commonly used alternatives. Our results show that the new method outperforms other available methods in accuracy and sensitivity. We incorporated the resulting motif clustering and retrieval procedures in a large-scale automated pipeline for analyzing DNA motifs. This pipeline integrates the results of various DNA motif discovery algorithms and automatically merges redundant motifs from multiple training sets into a coherent annotated library of motifs. Application of this pipeline to recent genome-wide transcription factor location data in S. cerevisiae successfully identified DNA motifs in a manner that is as good as semi-automated analysis reported in the literature. Moreover, we show how this analysis elucidates the mechanisms of condition-specific preferences of transcription factors. 
Regulation of gene expression plays a central role in the activity of living cells and in their response to internal (e.g., cell division) or external (e.g., stress) stimuli. Key players in determining gene-specific regulation are transcription factors that bind sequence-specific sites on the DNA, modulating the expression of nearby genes. To understand the regulatory program of the cell, we need to identify these transcription factors, when they act, and on which genes. Transcription regulatory maps can be assembled by computational analysis of experimental data, by discovering the DNA recognition sequences (motifs) of transcription factors and their occurrences along the genome. Such an analysis usually results in a large number of overlapping motifs. To reconstruct regulatory maps, it is crucial to combine similar motifs and to relate them to transcription factors. To this end we developed an accurate fully-automated method, termed BLiC, based upon an improved similarity measure for comparing DNA motifs. By applying it to genome-wide data in yeast, we identified the DNA motifs of transcription factors and their putative target genes. Finally, we analyze motifs of transcription factors that alter their target genes under different conditions, and show how cells adjust their regulatory program in response to environmental changes.
Affiliation(s)
- Naomi Habib
- School of Computer Science and Engineering, The Hebrew University, Jerusalem, Israel
|
21
|
Harris SR, Pisani D, Gower DJ, Wilkinson M. Investigating stagnation in morphological phylogenetics using consensus data. Syst Biol 2007; 56:125-9. [PMID: 17366142 DOI: 10.1080/10635150601115624] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/23/2022] Open
Affiliation(s)
- Simon R Harris
- Department of Zoology, The Natural History Museum, London, SW7 5BD, UK.
|
22
|
Abstract
MOTIVATION Most of the available tools for transcription factor binding site prediction are based on methods which assume no sequence dependence between the binding site base positions. Our primary objective was to investigate the statistical basis for either a claim of dependence or independence, to determine whether such a claim is generally true, and to use the resulting data to develop improved scoring functions for binding-site prediction. RESULTS Using three statistical tests, we analyzed the number of binding sites showing dependent positions. We analyzed transcription factor-DNA crystal structures for evidence of position dependence. Our final conclusions were that some factors show evidence of dependencies whereas others do not. We observed that the conformational energy (Z-score) of the transcription factor-DNA complexes was lower (better) for sequences that showed dependency than for those that did not (P < 0.02). We suggest that where evidence exists for dependencies, these should be modeled to improve binding-site predictions. However, when no significant dependency is found, this correction should be omitted. This may be done by converting any existing scoring function which assumes independence into a form which includes a dependency correction. We present an example of such an algorithm and its implementation as a web tool. AVAILABILITY http://promoterplot.fmi.ch/cgi-bin/dep.html
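The question of whether two positions in a binding site vary independently can be illustrated with a contingency-table test. The abstract does not name its three statistical tests, so the Pearson chi-square statistic below is an assumption chosen for illustration, and the function name and toy sites are hypothetical.

```python
from collections import Counter
from itertools import product


def chi_square_between_positions(sites, i, j, alphabet="ACGT"):
    """Pearson chi-square statistic for independence of columns i and j.

    `sites` is a list of equal-length aligned binding-site strings.
    A statistic near 0 is consistent with independent positions.
    """
    n = len(sites)
    pair_counts = Counter((s[i], s[j]) for s in sites)
    count_i = Counter(s[i] for s in sites)
    count_j = Counter(s[j] for s in sites)

    stat = 0.0
    for a, b in product(alphabet, repeat=2):
        # Expected count of the pair (a, b) if the columns were independent.
        expected = count_i[a] * count_j[b] / n
        if expected > 0:
            observed = pair_counts[(a, b)]
            stat += (observed - expected) ** 2 / expected
    return stat


# Toy alignments (illustrative only, not real binding sites).
dependent_sites = ["AAT", "AAT", "CCT", "CCT", "GGT", "TTT"]
independent_sites = ["AAT", "ACT", "CAT", "CCT"]

stat_dependent = chi_square_between_positions(dependent_sites, 0, 1)
stat_independent = chi_square_between_positions(independent_sites, 0, 1)
```

In practice the statistic would be compared against a chi-square distribution (or a permutation null) to obtain a P value, and small site collections would need an exact or resampling-based test instead.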
Affiliation(s)
- Andrija Tomovic
- Friedrich Miescher Institute for Biomedical Research, Novartis Research Foundation, Basel, Switzerland
|
23
|
Bhardwaj N, Lu H. Residue-level prediction of DNA-binding sites and its application on DNA-binding protein predictions. FEBS Lett 2007; 581:1058-66. [PMID: 17316627 PMCID: PMC1993824 DOI: 10.1016/j.febslet.2007.01.086] [Citation(s) in RCA: 52] [Impact Index Per Article: 3.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/09/2006] [Revised: 12/11/2006] [Accepted: 01/25/2007] [Indexed: 11/19/2022]
Abstract
Protein-DNA interactions are crucial to many cellular activities such as expression-control and DNA-repair. These interactions between amino acids and nucleotides are highly specific and any aberrance at the binding site can render the interaction completely incompetent. In this study, we have three aims focusing on DNA-binding residues on the protein surface: to develop an automated approach for fast and reliable recognition of DNA-binding sites; to improve the prediction by distance-dependent refinement; and to use these predictions to identify DNA-binding proteins. We use a support vector machine (SVM)-based approach to harness the features of the DNA-binding residues to distinguish them from non-binding residues. Features used for distinction include the residue's identity, charge, solvent accessibility, average potential, the secondary structure it is embedded in, neighboring residues, and location in a cationic patch. These features collected from 50 proteins are used to train the SVM. Testing is then performed on another set of 37 proteins, much larger than any testing set used in previous studies. The testing set has no more than 20% sequence identity not only among its pairs, but also with the proteins in the training set, thus removing any undesired redundancy due to homology. This set also has proteins with an unseen DNA-binding structural class not present in the training set. With the above features, an accuracy of 66% with balanced sensitivity and specificity is achieved without relying on homology or evolutionary information. We then develop a post-processing scheme to improve the prediction using the relative location of the predicted residues. Balanced success is then achieved with average sensitivity, specificity and accuracy pegged at 71.3%, 69.3% and 70.5%, respectively. Average net prediction is also around 70%.
Finally, we show that the number of predicted DNA-binding residues can be used to differentiate DNA-binding proteins from non-DNA-binding proteins with an accuracy of 78%. Results presented here demonstrate that machine-learning can be applied to automated identification of DNA-binding residues and that the success rate can be ameliorated as more features are added. Such functional site prediction protocols can be useful in guiding consequent works such as site-directed mutagenesis and macromolecular docking.
Affiliation(s)
- Nitin Bhardwaj
- Bioinformatics Program, Department of Bioengineering, University of Illinois at Chicago, Chicago, IL 60607, USA
|
24
|
|
25
|
Mahony S, Benos PV, Smith TJ, Golden A. Self-organizing neural networks to support the discovery of DNA-binding motifs. Neural Netw 2006; 19:950-62. [PMID: 16839740 DOI: 10.1016/j.neunet.2006.05.023] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/24/2022]
Abstract
Identification of the short DNA sequence motifs that serve as binding targets for transcription factors is an important challenge in bioinformatics. Unsupervised techniques from the statistical learning theory literature have often been applied to motif discovery, but effective solutions for large genomic datasets have yet to be found. We present here three self-organizing neural networks that have applicability to the motif-finding problem. The core system in this study is a previously described SOM-based motif-finder named SOMBRERO. The motif-finder is integrated in this work with a SOM-based method that automatically constructs generalized models for structurally related motifs and initializes SOMBRERO with relevant biological knowledge. A self-organizing tree method that displays the relationships between various motifs is also presented, and it is shown that such a method can act as an effective structural classifier of novel motifs. The performance of the three self-organizing neural networks is evaluated here using various datasets.
Affiliation(s)
- Shaun Mahony
- Department of Computational Biology, School of Medicine, University of Pittsburgh, Pittsburgh, PA 15213, USA.
|
26
|
Abstract
Computational methods for de novo identification of gene regulation elements, such as transcription factor binding sites, have proved to be useful for deciphering genetic regulatory networks. However, despite the availability of a large number of algorithms, their strengths and weaknesses are not sufficiently understood. Here, we designed a comprehensive set of performance measures and benchmarked five modern sequence-based motif discovery algorithms using large datasets generated from Escherichia coli RegulonDB. Factors that affect the prediction accuracy, scalability and reliability are characterized. It is revealed that the nucleotide and the binding site level accuracy are very low, while the motif level accuracy is relatively high, which indicates that the algorithms can usually capture at least one correct motif in an input sequence. To exploit diverse predictions from multiple runs of one or more algorithms, a consensus ensemble algorithm has been developed, which achieved 6-45% improvement over the base algorithms by increasing both the sensitivity and specificity. Our study illustrates limitations and potentials of existing sequence-based motif discovery algorithms. Taking advantage of the revealed potentials, several promising directions for further improvements are discussed. Since the sequence-based algorithms are the baseline of most of the modern motif discovery algorithms, this paper suggests substantial improvements would be possible for them.
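Nucleotide-level accuracy, as mentioned in this abstract, can be sketched by comparing the positions covered by known and predicted sites. The abstract does not give the benchmark's exact definitions, so the sensitivity and positive predictive value below follow common usage and are an assumption; the function names and the half-open `(start, end)` interval convention are hypothetical.

```python
def positions_covered(sites):
    """Expand half-open (start, end) intervals into a set of positions."""
    covered = set()
    for start, end in sites:
        covered.update(range(start, end))
    return covered


def nucleotide_level_scores(known_sites, predicted_sites):
    """Return (sensitivity, positive predictive value) at nucleotide level."""
    known = positions_covered(known_sites)
    predicted = positions_covered(predicted_sites)

    tp = len(known & predicted)   # positions correctly predicted
    fn = len(known - predicted)   # known positions that were missed
    fp = len(predicted - known)   # predicted positions outside known sites

    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    ppv = tp / (tp + fp) if (tp + fp) else 0.0
    return sensitivity, ppv


# Toy example: one known site, one overlapping and one spurious prediction.
known_sites = [(10, 20)]
predicted_sites = [(15, 25), (40, 45)]
sensitivity, ppv = nucleotide_level_scores(known_sites, predicted_sites)
```

Here half of the known site is recovered, while only a third of the predicted positions fall inside it, which shows how a predictor can find a motif yet still score poorly at the nucleotide level.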
Affiliation(s)
- Jianjun Hu
- Department of Biological Sciences, College of Science, Purdue University, West Lafayette, IN 47907, USA
- Department of Computer Science, College of Science, Purdue University, West Lafayette, IN 47907, USA
| | - Bin Li
- Department of Computer Science, College of Science, Purdue University, West Lafayette, IN 47907, USA
| | - Daisuke Kihara
- Department of Biological Sciences, College of Science, Purdue University, West Lafayette, IN 47907, USA
- Department of Computer Science, College of Science, Purdue University, West Lafayette, IN 47907, USA
- Markey Center for Structural Biology, College of Science, Purdue University, West Lafayette, IN 47907, USA
- The Bindley Bioscience Center, College of Science, Purdue University, West Lafayette, IN 47907, USA
- To whom correspondence should be addressed. Tel: +1 765 496 2284; Fax: +1 765 494 1189;
|
27
|
Styczynski MP, Stephanopoulos G. Overview of computational methods for the inference of gene regulatory networks. Comput Chem Eng 2005. [DOI: 10.1016/j.compchemeng.2004.08.029] [Citation(s) in RCA: 41] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/25/2022]
|
28
|
Cowell LG, Davila M, Ramsden D, Kelsoe G. Computational tools for understanding sequence variability in recombination signals. Immunol Rev 2004; 200:57-69. [PMID: 15242396 DOI: 10.1111/j.0105-2896.2004.00171.x] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/28/2022]
Abstract
The recombination signals (RSs) that guide V(D)J rearrangement are remarkably diverse. In mice, fewer than 16% of RSs carry consensus heptamers and nonamers and none also contain a consensus spacer sequence. It is increasingly clear that this variability regulates recombination: genetic variability in RSs may help enforce allelic exclusion, determine the general nature of antigen receptor repertoires, and mitigate autoreactivity in B lymphocytes. The great diversity of RSs has largely precluded, however, empiric determinations of how RS sequence affects recombination. For example, 4^39 unique 23-RSs are possible, or approximately 3 × 10^23 sequences; some 7 × 10^13 unique 23-RSs can be produced just by changes in the spacer. In contrast, the recombination activities of only 100 or so RSs have been measured, and it is unlikely that the activities of even a tiny fraction of extant RSs can be determined. We have addressed the problem of how sequence determines the efficiency of RS templates by generating computational models that describe the correlation structure of mouse RSs. These models successfully predict RS activity and identify functional, cryptic RSs (cRSs). These models permit studies to identify RSs and cRSs for empiric study and constitute a tool useful for understanding RS structure and function.
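The sequence counts quoted in this abstract follow from simple combinatorics: a 23-RS consists of a 7 bp heptamer, a 23 bp spacer and a 9 bp nonamer, 39 positions in total, each of which can hold one of four nucleotides. A short check of the arithmetic (the variable names are illustrative):

```python
HEPTAMER_LEN = 7    # conserved heptamer
SPACER_LEN = 23     # spacer of a 23-RS
NONAMER_LEN = 9     # conserved nonamer
ALPHABET_SIZE = 4   # A, C, G, T

rs_length = HEPTAMER_LEN + SPACER_LEN + NONAMER_LEN  # 39 positions

total_23rs = ALPHABET_SIZE ** rs_length        # every possible 23-RS
spacer_variants = ALPHABET_SIZE ** SPACER_LEN  # heptamer and nonamer held fixed

print(f"{total_23rs:.2e}")       # 3.02e+23
print(f"{spacer_variants:.2e}")  # 7.04e+13
```

Against roughly 100 measured RSs, either number makes exhaustive experimental characterization infeasible, which is the motivation the abstract gives for computational models.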
Affiliation(s)
- Lindsay G Cowell
- Department of Biostatistics and Bioinformatics, Center for Bioinformatics and Computational Biology, Duke University, Durham, NC, USA
|
29
|
Rogozin IB, Pavlov YI. Theoretical analysis of mutation hotspots and their DNA sequence context specificity. Mutat Res 2003; 544:65-85. [PMID: 12888108 DOI: 10.1016/s1383-5742(03)00032-2] [Citation(s) in RCA: 123] [Impact Index Per Article: 5.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/27/2022]
Abstract
Mutation frequencies vary significantly along nucleotide sequences such that mutations often concentrate at certain positions called hotspots. Mutation hotspots in DNA reflect intrinsic properties of the mutation process, such as sequence specificity, that manifests itself at the level of interaction between mutagens, DNA, and the action of the repair and replication machineries. The hotspots might also reflect structural and functional features of the respective DNA sequences. When mutations in a gene are identified using a particular experimental system, resulting hotspots could reflect the properties of the gene product and the mutant selection scheme. Analysis of the nucleotide sequence context of hotspots can provide information on the molecular mechanisms of mutagenesis. However, the determinants of mutation frequency and specificity are complex, and there are many analytical methods for their study. Here we review computational approaches for analyzing mutation spectra (distribution of mutations along the target genes) that include many mutable (detectable) positions. The following methods are reviewed: derivation of a consensus sequence, application of regression approaches to correlate nucleotide sequence features with mutation frequency, mutation hotspot prediction, analysis of oligonucleotide composition of regions containing mutations, pairwise comparison of mutation spectra, analysis of multiple spectra, and analysis of "context-free" characteristics. The advantages and pitfalls of these methods are discussed and illustrated by examples from the literature. The most reliable analyses were obtained when several methods were combined and information from theoretical analysis and experimental observations was considered simultaneously. Simple, robust approaches should be used with small samples of mutations, whereas combinations of simple and complex approaches may be required for large samples. 
We discuss several well-documented studies where analysis of mutation spectra has substantially contributed to the current understanding of molecular mechanisms of mutagenesis. The nucleotide sequence context of mutational hotspots is a fingerprint of interactions between DNA and DNA repair, replication, and modification enzymes, and the analysis of hotspot context provides evidence of such interactions.
Affiliation(s)
- Igor B Rogozin
- Institute of Cytology and Genetics, Russian Academy of Sciences, Novosibirsk, Russia
|
30
|
Sosinsky A, Bonin CP, Mann RS, Honig B. Target Explorer: An automated tool for the identification of new target genes for a specified set of transcription factors. Nucleic Acids Res 2003; 31:3589-92. [PMID: 12824372 PMCID: PMC168951 DOI: 10.1093/nar/gkg544] [Citation(s) in RCA: 82] [Impact Index Per Article: 3.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
Abstract
With the increasing number of eukaryotic genomes available, high-throughput automated tools for identification of regulatory DNA sequences are becoming increasingly feasible. Several computational approaches for the prediction of regulatory elements were recently developed. Here we combine the prediction of clusters of binding sites for transcription factors with context information taken from genome annotations. Target Explorer automates the entire process from the creation of a customized library of binding sites for known transcription factors through the prediction and annotation of putative target genes that are potentially regulated by these factors. It was specifically designed for the well-annotated Drosophila melanogaster genome, but most options can be used for sequences from other genomes as well. Target Explorer is available at http://trantor.bioc.columbia.edu/Target_Explorer/
Affiliation(s)
- Alona Sosinsky
- Department of Biochemistry and Molecular Biophysics, Columbia University College of Physicians and Surgeons, New York, USA
|
31
|
Affiliation(s)
- Jonathan Schug
- Center of Bioinformatics, University of Pennsylvania, Philadelphia, Pennsylvania
|
32
|
Cowell LG, Davila M, Kepler TB, Kelsoe G. Identification and utilization of arbitrary correlations in models of recombination signal sequences. Genome Biol 2002; 3:RESEARCH0072. [PMID: 12537561 PMCID: PMC151174 DOI: 10.1186/gb-2002-3-12-research0072] [Citation(s) in RCA: 47] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/24/2002] [Revised: 09/04/2002] [Accepted: 10/10/2002] [Indexed: 01/26/2023] Open
Abstract
BACKGROUND A significant challenge in bioinformatics is to develop methods for detecting and modeling patterns in variable DNA sequence sites, such as protein-binding sites in regulatory DNA. Current approaches sometimes perform poorly when positions in the site do not independently affect protein binding. We developed a statistical technique for modeling the correlation structure in variable DNA sequence sites. The method places no restrictions on the number of correlated positions or on their spatial relationship within the site. No prior empirical evidence for the correlation structure is necessary. RESULTS We applied our method to the recombination signal sequences (RSS) that direct assembly of B-cell and T-cell antigen-receptor genes via V(D)J recombination. The technique is based on model selection by cross-validation and produces models that allow computation of an information score for any signal-length sequence. We also modeled RSS using order zero and order one Markov chains. The scores from all models are highly correlated with measured recombination efficiencies, but the models arising from our technique are better than the Markov models at discriminating RSS from non-RSS. CONCLUSIONS Our model-development procedure produces models that estimate well the recombinogenic potential of RSS and are better at RSS recognition than the order zero and order one Markov models. Our models are, therefore, valuable for studying the regulation of both physiologic and aberrant V(D)J recombination. The approach could be equally powerful for the study of promoter and enhancer elements, splice sites, and other DNA regulatory sites that are highly variable at the level of individual nucleotide positions.
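The order zero and order one Markov chains used as baselines in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes homogeneous (position-independent) chains with add-one pseudocounts, whereas the paper's models may be position-specific, and all function names and training sequences are hypothetical.

```python
import math
from collections import Counter

ALPHABET = "ACGT"


def train_order0(seqs, pseudo=1.0):
    """Order-zero model: one nucleotide distribution, positions independent."""
    counts = Counter(ch for s in seqs for ch in s)
    total = sum(counts[a] for a in ALPHABET) + pseudo * len(ALPHABET)
    return {a: (counts[a] + pseudo) / total for a in ALPHABET}


def train_order1(seqs, pseudo=1.0):
    """Order-one model: each nucleotide depends on its predecessor."""
    pairs = Counter((s[k], s[k + 1]) for s in seqs for k in range(len(s) - 1))
    trans = {}
    for a in ALPHABET:
        row_total = sum(pairs[(a, b)] for b in ALPHABET) + pseudo * len(ALPHABET)
        trans[a] = {b: (pairs[(a, b)] + pseudo) / row_total for b in ALPHABET}
    return trans


def score_order0(seq, probs):
    """Log2-likelihood of seq under the order-zero model."""
    return sum(math.log2(probs[ch]) for ch in seq)


def score_order1(seq, start_probs, trans):
    """Log2-likelihood of seq: start distribution, then transitions."""
    score = math.log2(start_probs[seq[0]])
    for prev, cur in zip(seq, seq[1:]):
        score += math.log2(trans[prev][cur])
    return score


# Toy training set with strong adjacent-position structure (not real RSS data).
training = ["ACACAC", "ACACAC"]
p0 = train_order0(training)
p1 = train_order1(training)

alternating = score_order1("ACAC", p0, p1)
blocked = score_order1("AACC", p0, p1)
```

The two test sequences have the same composition, so the order-zero model cannot separate them, while the order-one model prefers the alternating pattern seen in training. Neither model captures correlations between non-adjacent positions, which is the limitation the abstract's method is designed to address.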
Affiliation(s)
- Lindsay G Cowell
- Department of Immunology, Duke University Medical Center, Durham, NC 27710, USA
| | - Marco Davila
- Department of Immunology, Duke University Medical Center, Durham, NC 27710, USA
| | - Thomas B Kepler
- Center for Bioinformatics and Computational Biology, Department of Biostatistics and Bioinformatics, Duke University Medical Center, Durham, NC 27710, USA
| | - Garnett Kelsoe
- Department of Immunology, Duke University Medical Center, Durham, NC 27710, USA
|
33
|
Schneider TD. Consensus sequence Zen. APPLIED BIOINFORMATICS 2002; 1:111-9. [PMID: 15130839 PMCID: PMC1852464] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Subscribe] [Scholar Register] [Indexed: 04/29/2023]
Abstract
Consensus sequences are widely used in molecular biology but they have many flaws. As a result, binding sites of proteins and other molecules are missed during studies of genetic sequences and important biological effects cannot be seen. Information theory provides a mathematically robust way to avoid consensus sequences. Instead of using consensus sequences, sequence conservation can be quantitatively presented in bits of information by using sequence logo graphics to represent the average of a set of sites, and sequence walker graphics to represent individual sites.
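Measuring conservation in bits, as this abstract proposes, can be sketched by subtracting each column's Shannon entropy from the 2-bit maximum for DNA. The sketch below is a simplified version that omits any small-sample correction, and the function name and toy sites are hypothetical.

```python
import math
from collections import Counter


def information_per_position(sites, alphabet_size=4):
    """Information content (bits) of each column in aligned sites.

    For DNA the maximum is log2(4) = 2 bits (fully conserved column);
    a column with all four bases equally frequent carries 0 bits.
    """
    n = len(sites)
    max_bits = math.log2(alphabet_size)
    info = []
    for column in zip(*sites):
        counts = Counter(column)
        # Shannon entropy of the observed base frequencies in this column.
        entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
        info.append(max_bits - entropy)
    return info


# Toy alignment: column 0 is invariant, column 1 splits evenly between
# two bases, column 2 uses all four bases equally.
sites = ["AAC", "AAG", "ACT", "ACA"]
bits = information_per_position(sites)  # [2.0, 1.0, 0.0]
```

A consensus sequence for these sites would have to pick a single letter at columns 1 and 2, discarding exactly the variability that the per-position bit values preserve; the heights of letters in a sequence logo are derived from these values.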
Affiliation(s)
- Thomas D Schneider
- Laboratory of Experimental and Computational Biology, National Cancer Institute at Frederick, National Institutes of Health, Frederick, MD 21702-1201, USA.
|
34
|
Abstract
Availability of complete bacterial genomes opens the way to the comparative approach to the recognition of transcription regulatory sites. Assumption of regulon conservation in conjunction with profile analysis provides two lines of independent evidence making it possible to make highly specific predictions. Recently this approach was used to analyze several regulons in eubacteria and archaebacteria. The present review covers recent advances in the comparative analysis of transcriptional regulation in prokaryotes and phylogenetic fingerprinting techniques in eukaryotes, and describes the emerging patterns of the evolution of regulatory systems.
Affiliation(s)
- M S Gelfand
- State Scientific Center for Biotechnology 'NIIGenetika', Moscow, Russia.
|
35
|
|
36
|
Ramsden DA, Baetz K, Wu GE. Conservation of sequence in recombination signal sequence spacers. Nucleic Acids Res 1994; 22:1785-96. [PMID: 8208601 PMCID: PMC308075 DOI: 10.1093/nar/22.10.1785] [Citation(s) in RCA: 122] [Impact Index Per Article: 4.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/29/2023] Open
Abstract
The variable domains of immunoglobulins and T cell receptors are assembled through the somatic, site-specific recombination of multiple germline segments (V, D, and J segments), or V(D)J rearrangement. The recombination signal sequence (RSS) is necessary and sufficient for cell type-specific targeting of the V(D)J rearrangement machinery to these germline segments. Previously, the RSS has been described as possessing both a conserved heptamer and a conserved nonamer motif. The heptamer and nonamer motifs are separated by a 'spacer' that was not thought to possess significant sequence conservation; however, the length of the spacer could be either 12 ± 1 bp or 23 ± 1 bp. In this report we have assembled and analyzed an extensive database of published RSS. We have derived, through extensive consensus comparison, a more detailed description of the RSS than has previously been reported. Our analysis indicates that RSS spacers possess significant conservation of sequence, and that the conserved sequence in 12 bp spacers is similar to the conserved sequence in the first half of 23 bp spacers.
Affiliation(s)
- D A Ramsden
- Department of Immunology, University of Toronto, Ontario, Canada
|
37
|
Abstract
In recent years, methods of consensus, developed for the solution of problems in the social sciences, have become widely used in molecular biology. We study a method of consensus originally due to Waterman et al. (Waterman, Galas and Arratia. 1984. Pattern recognition in several sequences: consensus and alignment. Bull. math. Biol. 46, 515-527) which is used to identify patterns or features in a molecular sequence where a pattern can vary in position within a given window. We show that some well-known consensus methods of the social sciences, the median and the mean, are special cases of this method for certain choices of the parameters used in it and give a precise account of the parameters for which these special cases arise. We also show that the specific parameters used in the method of Waterman et al. make their method equivalent to the median procedure which is widely used in the social sciences.
Affiliation(s)
- B Mirkin
- Department of Informatics and Applied Statistics, Central Economics- Mathematics Institute, Moscow, Russia
|
38
|
|
39
|
Abstract
We introduce a parameterized threshold consensus method (th chi) for molecular sequences which is based on a majority-rule voting principle. In contrast to other frequency-based methods, the th chi method uses a single criterion to return ambiguity codes of different lengths. We derive basic features of the method and establish that it returns at most two ambiguity codes at any position of the consensus sequence. We bound from below the size of the frequency gap that exists when the th chi method returns an ambiguity code. Using such properties, we compare the th chi method to other consensus methods for molecular sequences which are defined in terms of threshold or gap criteria.
Affiliation(s)
- W H Day
- Department of Computer Science, Memorial University of Newfoundland, St John's, Canada
|