1
Odronitz F, Pillmann H, Keller O, Waack S, Kollmar M. WebScipio: an online tool for the determination of gene structures using protein sequences. BMC Genomics 2008; 9:422. [PMID: 18801164 PMCID: PMC2644328 DOI: 10.1186/1471-2164-9-422] [Citation(s) in RCA: 27] [Impact Index Per Article: 1.7] [Received: 04/01/2008] [Accepted: 09/18/2008] [Indexed: 11/13/2022] Open
Abstract
Background: Obtaining the gene structure for a given protein-encoding gene is an important step in many analyses. Software suited for this task should be readily accessible, accurate, and easy to handle, and should provide the user with a coherent representation of the most probable gene structure. It should be rigorous enough to optimise features on the level of single bases and at the same time flexible enough to allow for cross-species searches.
Results: WebScipio, a web interface to the Scipio software, allows a user to obtain the coding sequence structure corresponding to a given query protein sequence that belongs to an already assembled eukaryotic genome. The resulting gene structure is presented in various human-readable formats, such as a schematic representation and a detailed alignment of the query and the target sequence highlighting any discrepancies. WebScipio can also be used to identify and characterise the gene structures of homologs in related organisms. In addition, it offers a web service for integration with other programs.
Conclusion: WebScipio is a tool that allows users to get a high-quality gene structure prediction from a protein query. It offers more than 250 eukaryotic genomes that can be searched and produces predictions that are close to what can be achieved by manual annotation, for in-species and cross-species searches alike. WebScipio is freely accessible online.
Affiliation(s)
- Florian Odronitz
- Max-Planck-Institut für Biophysikalische Chemie, Abteilung NMR-basierte Strukturbiologie, Am Fassberg 11, 37077 Göttingen, Germany.
2
Cobb J, Büsst C, Petrou S, Harrap S, Ellis J. Searching for functional genetic variants in non-coding DNA. Clin Exp Pharmacol Physiol 2008; 35:372-5. [PMID: 18307723 DOI: 10.1111/j.1440-1681.2008.04880.x] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.9] [Indexed: 11/26/2022]
Abstract
1. The search for DNA sequence variants for complex human polygenic conditions has been a strong focus of recent genetic research. While gene loci have been identified, few variants in the coding sequences of these genes have been found, suggesting that non-coding sequence variation may underlie many complex conditions. 2. Non-coding DNA harbours regulatory elements capable of making changes to gene expression. However, regulatory DNA sequences are currently difficult to recognize and their function is poorly understood, complicating the task of assigning potential functional significance to non-coding variation. 3. Comparative genomics, the study of evolutionary DNA conservation, has enabled the emergent field of non-coding DNA identification in human disease analysis. 4. This brief review will focus on the potential of a relatively high-throughput technique, based on comparative genomics, that may aid in the identification of functionally important non-coding sequence variation in complex diseases.
Affiliation(s)
- Joanna Cobb
- Department of Physiology, The University of Melbourne, Victoria, Australia
3
Freeling M, Rapaka L, Lyons E, Pedersen B, Thomas BC. G-boxes, bigfoot genes, and environmental response: characterization of intragenomic conserved noncoding sequences in Arabidopsis. Plant Cell 2007; 19:1441-57. [PMID: 17496117 PMCID: PMC1913728 DOI: 10.1105/tpc.107.050419] [Citation(s) in RCA: 27] [Impact Index Per Article: 1.6] [Received: 01/18/2007] [Revised: 03/10/2007] [Accepted: 04/19/2007] [Indexed: 05/15/2023]
Abstract
A tetraploidy left Arabidopsis thaliana with 6358 pairs of homoeologs that, when aligned, generated 14,944 intragenomic conserved noncoding sequences (CNSs). Our previous work assembled these phylogenetic footprints into a database. We show that known transcription factor (TF) binding motifs, including the G-box, are overrepresented in these CNSs. A total of 254 genes spanning long lengths of CNS-rich chromosomes (Bigfoot) dominate this database. Therefore, we made subdatabases: one containing Bigfoot genes and the other containing genes with three to five CNSs (Smallfoot). Bigfoot genes are generally TFs that respond to signals, with their modal CNS positioned 3.1 kb 5' from the ATG. Smallfoot genes encode components of signal transduction machinery, the cytoskeleton, or involve transcription. We queried each subdatabase with each possible 7-nucleotide sequence. Among hundreds of hits, most were purified from CNSs, and almost all of those significantly enriched in CNSs had no experimental history. The 7-mers in CNSs are not 5'- to 3'-oriented in Bigfoot genes but are often oriented in Smallfoot genes. CNSs with one G-box tend to have two G-boxes. CNSs were shared with the homoeolog only and with no other gene, suggesting that binding site turnover impedes detection. Bigfoot genes may function in adaptation to environmental change.
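The exhaustive 7-mer query described in this abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names and the two toy sequences are hypothetical, and the sketch only shows how every possible 7-nucleotide word can be enumerated and how overlapping occurrences can be tallied in a set of sequences.

```python
from collections import Counter
from itertools import product


def all_kmers(k=7, alphabet="ACGT"):
    """Yield every possible k-nucleotide word (4**k words in total)."""
    for letters in product(alphabet, repeat=k):
        yield "".join(letters)


def count_kmers(sequences, k=7, alphabet="ACGT"):
    """Count overlapping k-mer occurrences across a list of sequences.

    Words containing characters outside the alphabet (e.g. N) are skipped.
    """
    allowed = set(alphabet)
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            word = seq[i:i + k]
            if set(word) <= allowed:
                counts[word] += 1
    return counts


# Hypothetical toy "CNS" sequences; both contain the G-box core CACGTG.
toy_cns = ["TTCACGTGGCA", "GCCACGTGTC"]
cns_counts = count_kmers(toy_cns)

# Query the toy set with each possible 7-mer and keep the ones that hit.
hits = {word: cns_counts[word] for word in all_kmers(7) if cns_counts[word] > 0}
```

An enrichment test would additionally need counts from a background set of non-CNS sequences and a statistical criterion; neither is specified in the abstract, so they are left out here.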
Affiliation(s)
- Michael Freeling
- Department of Plant and Microbial Biology, University of California, Berkeley, California 94720, USA.
4
Abstract
DNA sequence alignment is a prerequisite to virtually all comparative genomic analyses, including the identification of conserved sequence motifs, estimation of evolutionary divergence between sequences, and inference of historical relationships among genes and species. While it is mere common sense that inaccuracies in multiple sequence alignments can have detrimental effects on downstream analyses, it is important to know the extent to which the inferences drawn from these alignments are robust to errors and biases inherent in all sequence alignments. A survey of investigations into strengths and weaknesses of sequence alignments reveals, as expected, that alignment quality is generally poor for two distantly related sequences and can often be improved by adding additional sequences as stepping stones between distantly related species. Errors in sequence alignment are also found to have a significant negative effect on subsequent inference of sequence divergence, phylogenetic trees, and conserved motifs. However, our understanding of alignment biases remains rudimentary, and sequence alignment procedures continue to be used somewhat like benign formatting operations to make sequences equal in length. Because of the central role these alignments now play in our endeavors to establish the tree of life and to identify important parts of genomes through evolutionary functional genomics, we see a need for increased community effort to investigate influences of alignment bias on the accuracy of large-scale comparative genomics.
Affiliation(s)
- Sudhir Kumar
- Center for Evolutionary Functional Genomics, Biodesign Institute and School of Life Sciences, Arizona State University, Tempe, Arizona 85287-5301, USA.
5
Thomas BC, Rapaka L, Lyons E, Pedersen B, Freeling M. Arabidopsis intragenomic conserved noncoding sequence. Proc Natl Acad Sci U S A 2007; 104:3348-53. [PMID: 17301222 PMCID: PMC1805546 DOI: 10.1073/pnas.0611574104] [Citation(s) in RCA: 52] [Impact Index Per Article: 3.1] [Received: 10/11/2006] [Indexed: 11/18/2022] Open
Abstract
After the most recent tetraploidy in the Arabidopsis lineage, most gene pairs lost one, but not both, of their duplicates. We manually inspected the 3,179 retained gene pairs and their surrounding gene space still present in the genome using a custom-made viewer application. The display of these pairs allowed us to define intragenic conserved noncoding sequences (CNSs), identify exon annotation errors, and discover potentially new genes. Using a strict algorithm to sort high-scoring pair sequences from the bl2seq data, we created a database of 14,944 intragenomic Arabidopsis CNSs. The mean CNS length is 31 bp, ranging from 15 to 285 bp. There are approximately 1.7 CNSs associated with a typical gene, and Arabidopsis CNSs are found in all areas around exons, most frequently in the 5' upstream region. Gene ontology classifications related to transcription, regulation, or "response to ..." external or endogenous stimuli, especially hormones, tend to be significantly overrepresented among genes containing a large number of CNSs, whereas protein localization, transport, and metabolism are common among genes with no CNSs. There is a 1.5% overlap between these CNSs and the 218,982 putative RNAs in the Arabidopsis Small RNA Project database, allowing for two mismatches. These CNSs provide a unique set of noncoding sequences enriched for function. CNS function is implied by evolutionary conservation and independently supported because CNS-richness predicts regulatory gene ontology categories.
Affiliation(s)
- Lakshmi Rapaka
- Department of Plant and Microbial Biology, University of California, Berkeley, CA 94720
- Eric Lyons
- Department of Plant and Microbial Biology, University of California, Berkeley, CA 94720
- Michael Freeling
- Department of Plant and Microbial Biology, University of California, Berkeley, CA 94720
6
Pavesi G, Zambelli F, Pesole G. WeederH: an algorithm for finding conserved regulatory motifs and regions in homologous sequences. BMC Bioinformatics 2007; 8:46. [PMID: 17286865 PMCID: PMC1803799 DOI: 10.1186/1471-2105-8-46] [Citation(s) in RCA: 36] [Impact Index Per Article: 2.1] [Received: 08/29/2006] [Accepted: 02/07/2007] [Indexed: 02/08/2023] Open
Abstract
Background: This work addresses the problem of detecting conserved transcription factor binding sites, and regulatory regions in general, through the analysis of sequences from homologous genes, an approach that is becoming more and more widely used given the ever-increasing amount of genomic data available.
Results: We present an algorithm that identifies conserved transcription factor binding sites in a given sequence by comparing it to one or more homologs, adapting a framework we previously introduced for the discovery of sites in sequences from co-regulated genes. Unlike the most commonly used methods, the approach we present neither needs nor computes an alignment of the sequences investigated, nor does it resort to descriptors of the binding specificity of known transcription factors. The main novel idea we introduce is a relative measure of conservation, assuming that true functional elements should present a higher level of conservation than the rest of the sequence surrounding them. We present tests where we applied the algorithm to the identification of conserved annotated sites in homologous promoters, as well as in distal regions like enhancers.
Conclusion: Results of the tests show how the algorithm can provide fast and reliable predictions of conserved transcription factor binding sites regulating the transcription of a gene, with better performance than other available methods for the same task. We also show examples of how the algorithm can be successfully employed when promoter annotations of the genes investigated are missing, or when regulatory sites and regions are located far away from the genes.
Affiliation(s)
- Giulio Pavesi
- Dipartimento di Scienze Biomolecolari e Biotecnologie, University of Milan, Milan, Italy
- Federico Zambelli
- Dipartimento di Scienze Biomolecolari e Biotecnologie, University of Milan, Milan, Italy
- Graziano Pesole
- Dipartimento di Biochimica e Biologia Molecolare "E. Quagliariello", University of Bari, Bari, Italy
- Istituto Tecnologie Biomediche – Consiglio Nazionale delle Ricerche, Bari, Italy
7
Corbo RM, Prévost M, Raussens V, Gambina G, Moretto G, Scacchi R. Structural and phylogenetic approaches to assess the significance of human Apolipoprotein E variation. Mol Genet Metab 2006; 89:261-9. [PMID: 16621646 DOI: 10.1016/j.ymgme.2006.02.015] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.2] [Received: 02/27/2006] [Accepted: 02/27/2006] [Indexed: 11/26/2022]
Abstract
Apolipoprotein E (APOE) is an important gene whose common polymorphism, specifically the ε*4 allele, has reportedly been associated with several disorders, including Alzheimer's disease (AD) and coronary artery disease. In the course of previous surveys of AD patients and healthy individuals, some rare variants were detected by means of isoelectric focusing and denaturing high-performance liquid chromatography techniques. After a mutation in a gene is identified, the problem arises of understanding its actual significance. Structure modelling and phylogenetic analysis methods are widely used to establish the possible deleterious effect of mutations. In this study, their usefulness in the analysis of APOE variants was evaluated. The two combined methods provided helpful indications for distinguishing between mutations possibly involved in AD susceptibility and non-deleterious mutations.
Affiliation(s)
- Rosa Maria Corbo
- Department of Genetics and Molecular Biology, University La Sapienza, Rome, Italy
8
Elnitski L, Jin VX, Farnham PJ, Jones SJM. Locating mammalian transcription factor binding sites: a survey of computational and experimental techniques. Genome Res 2006; 16:1455-64. [PMID: 17053094 DOI: 10.1101/gr.4140006] [Citation(s) in RCA: 168] [Impact Index Per Article: 9.3] [Indexed: 11/24/2022]
Abstract
Fields such as genomics and systems biology are built on the synergism between computational and experimental techniques. This type of synergism is especially important in accomplishing goals like identifying all functional transcription factor binding sites in vertebrate genomes. Precise detection of these elements is a prerequisite to deciphering the complex regulatory networks that direct tissue-specific and lineage-specific patterns of gene expression. This review summarizes approaches for in silico, in vitro, and in vivo identification of transcription factor binding sites. A variety of techniques useful for localized and high-throughput analyses are discussed here, with emphasis on aspects of data generation and verification.
Affiliation(s)
- Laura Elnitski
- Genomic Functional Analysis Section, National Human Genome Research Institute, National Institutes of Health, Rockville, Maryland 20878, USA.
9
Shih ACC, Lee DT, Lin L, Peng CL, Chen SH, Wu YW, Wong CY, Chou MY, Shiao TC, Hsieh MF. SinicView: a visualization environment for comparisons of multiple nucleotide sequence alignment tools. BMC Bioinformatics 2006; 7:103. [PMID: 16509994 PMCID: PMC1434773 DOI: 10.1186/1471-2105-7-103] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.8] [Received: 09/25/2005] [Accepted: 03/02/2006] [Indexed: 01/22/2023] Open
Abstract
Background: Deluged by the rate and complexity of completed genomic sequences, the need to align longer sequences becomes more urgent, and many more tools have thus been developed. In the initial stage of genomic sequence analysis, a biologist is usually faced with the questions of how to choose the best tool to align sequences of interest and how to analyze and visualize the alignment results, and then with the question of whether poorly aligned regions produced by the tool are indeed not homologous or are merely the result of an inappropriate alignment tool or scoring system. Although several systematic evaluations of multiple sequence alignment (MSA) programs have been proposed, they may not provide a standard-bearer for most biologists because the poorly aligned regions in these evaluations are never discussed. Thus, a tool that allows cross comparison of the alignment results obtained by different tools simultaneously could help a biologist evaluate their correctness and accuracy.
Results: In this paper, we present a versatile alignment visualization system called SinicView (for Sequence-aligning INnovative and Interactive Comparison VIEWer), which allows the user to efficiently compare and evaluate assorted nucleotide alignment results obtained by different tools. SinicView calculates similarity of the alignment outputs under a fixed window using the sum-of-pairs method and provides scoring profiles of each set of aligned sequences. The user can visually compare alignment results either in graphic scoring profiles or in plain text format of the aligned nucleotides along with the annotation information. We illustrate the capabilities of our visualization system by comparing alignment results obtained by MLAGAN, MAVID, and MULTIZ.
Conclusion: With SinicView, users can use their own data sequences to compare various alignment tools or scoring systems and select the most suitable one to perform alignment in the initial stage of sequence analysis.
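The fixed-window sum-of-pairs profile mentioned in this abstract can be illustrated with a small sketch. It is not SinicView's implementation: the scoring values (match 1, mismatch 0, gap 0), the non-overlapping window layout, the function names, and the toy alignment are all assumptions chosen for illustration.

```python
from itertools import combinations


def column_sp_score(column, match=1.0, mismatch=0.0, gap=0.0):
    """Sum-of-pairs score of one alignment column (assumed scoring values)."""
    total = 0.0
    for a, b in combinations(column, 2):
        if a == "-" or b == "-":
            total += gap        # any pair involving a gap character
        elif a == b:
            total += match      # identical residues
        else:
            total += mismatch   # differing residues
    return total


def windowed_sp_profile(alignment, window=3):
    """Mean per-column sum-of-pairs score for each non-overlapping window."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    col_scores = [
        column_sp_score([row[i] for row in alignment]) for i in range(length)
    ]
    return [
        sum(col_scores[start:start + window]) / window
        for start in range(0, length - window + 1, window)
    ]


# Hypothetical three-sequence alignment with one gap and one mismatch.
toy_alignment = ["ACGTAC", "ACGAAC", "AC-TAC"]
profile = windowed_sp_profile(toy_alignment, window=3)
```

Computing such a profile for the output of each alignment tool on the same input would give curves that can be compared window by window, which is the kind of comparison the abstract describes.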
Affiliation(s)
- DT Lee
- Institute of Information Science, Academia Sinica, Taipei, 115, Taiwan
- Genomics Research Center, Academia Sinica, Taipei, 115, Taiwan
- Laurent Lin
- Institute of Information Science, Academia Sinica, Taipei, 115, Taiwan
- Chin-Lin Peng
- Genomics Research Center, Academia Sinica, Taipei, 115, Taiwan
- Shiang-Heng Chen
- Institute of Information Science, Academia Sinica, Taipei, 115, Taiwan
- Yu-Wei Wu
- Institute of Information Science, Academia Sinica, Taipei, 115, Taiwan
- Chun-Yi Wong
- Institute of Information Science, Academia Sinica, Taipei, 115, Taiwan
- Meng-Yuan Chou
- Institute of Information Science, Academia Sinica, Taipei, 115, Taiwan
- Tze-Chang Shiao
- Institute of Information Science, Academia Sinica, Taipei, 115, Taiwan
- Mu-Fen Hsieh
- Institute of Information Science, Academia Sinica, Taipei, 115, Taiwan
10
Kleinjan DA, van Heyningen V. Long-range control of gene expression: emerging mechanisms and disruption in disease. Am J Hum Genet 2005; 76:8-32. [PMID: 15549674 PMCID: PMC1196435 DOI: 10.1086/426833] [Citation(s) in RCA: 648] [Impact Index Per Article: 34.1] [Received: 08/10/2004] [Accepted: 10/08/2004] [Indexed: 02/04/2023] Open
Abstract
Transcriptional control is a major mechanism for regulating gene expression. The complex machinery required to effect this control is still emerging from functional and evolutionary analysis of genomic architecture. In addition to the promoter, many other regulatory elements are required for spatiotemporally and quantitatively correct gene expression. Enhancer and repressor elements may reside in introns or up- and downstream of the transcription unit. For some genes with highly complex expression patterns--often those that function as key developmental control genes--the cis-regulatory domain can extend long distances outside the transcription unit. Some of the earliest hints of this came from disease-associated chromosomal breaks positioned well outside the relevant gene. With the availability of wide-ranging genome sequence comparisons, strong conservation of many noncoding regions became obvious. Functional studies have shown many of these conserved sites to be transcriptional regulatory elements that sometimes reside inside unrelated neighboring genes. Such sequence-conserved elements generally harbor sites for tissue-specific DNA-binding proteins. Developmentally variable chromatin conformation can control protein access to these sites and can regulate transcription. Disruption of these finely tuned mechanisms can cause disease. Some regulatory element mutations will be associated with phenotypes distinct from any identified for coding-region mutations.
Affiliation(s)
- Dirk A Kleinjan
- MRC Human Genetics Unit, Western General Hospital, Crewe Road, Edinburgh EH4 2XU, Scotland, United Kingdom
11
Glöckner G, Lehmann R, Romualdi A, Pradella S, Schulte-Spechtel U, Schilhabel M, Wilske B, Sühnel J, Platzer M. Comparative analysis of the Borrelia garinii genome. Nucleic Acids Res 2004; 32:6038-46. [PMID: 15547252 PMCID: PMC534632 DOI: 10.1093/nar/gkh953] [Citation(s) in RCA: 104] [Impact Index Per Article: 5.2] [Indexed: 11/12/2022] Open
Abstract
Three members of the genus Borrelia (B.burgdorferi, B.garinii, B.afzelii) cause tick-borne borreliosis. Depending on the Borrelia species involved, the borreliosis differs in its clinical symptoms. Comparative genomics opens up a way to elucidate the underlying differences in Borrelia species. We analysed a low redundancy whole-genome shotgun (WGS) assembly of a B.garinii strain isolated from a patient with neuroborreliosis in comparison to the B.burgdorferi genome. This analysis reveals that most of the chromosome is conserved (92.7% identity on DNA as well as on amino acid level) in the two species, and no chromosomal rearrangement or larger insertions/deletions could be observed. Furthermore, two collinear plasmids (lp54 and cp26) seem to belong to the basic genome inventory of Borrelia species. These three collinear parts of the Borrelia genome encode 861 genes, which are orthologous in the two species examined. The majority of the genetic information of the other plasmids of B.burgdorferi is also present in B.garinii, although orthology is not easy to define due to a high redundancy of the plasmid fraction. Yet, we did not find counterparts of the B.burgdorferi plasmids lp36 and lp38 or their respective gene repertoire in the B.garinii genome. Thus, phenotypic differences between the two species could be attributable to the presence or absence of these two plasmids as well as to the potentially positively selected genes.
Affiliation(s)
- G Glöckner
- Genome Analysis, Institute for Molecular Biotechnology, Beutenbergstr. 11, 07745 Jena, Germany.
12
Abstract
Identifying genomic locations of transcription-factor binding sites, particularly in higher eukaryotic genomes, has been an enormous challenge. Various experimental and computational approaches have been used to detect these sites; methods involving computational comparisons of related genomes have been particularly successful.
Affiliation(s)
- Martha L Bulyk
- Division of Genetics, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, New Research Building, 77 Avenue Louis Pasteur, Boston, MA 02115, USA.