1
Stoeckert C, Pizarro A, Manduchi E, Gibson M, Brunk B, Crabtree J, Schug J, Shen-Orr S, Overton GC. A relational schema for both array-based and SAGE gene expression experiments. Bioinformatics 2001; 17:300-8. [PMID: 11301298 DOI: 10.1093/bioinformatics/17.4.300]
Abstract
MOTIVATION AND RESULTS A relational schema is described for capturing highly parallel gene expression experiments using different technologies. This schema grew out of efforts to build a database for collaborators working on different biological systems and using different types of platforms in their gene expression experiments as well as different types of image quantification software. The tables are conceptually organized into three categories of information: Platform, Experiment (which includes image scanning and quantification), and Data. The strengths of the schema are: (i) integrating information on array elements using a gene index; (ii) describing samples using ontologies; (iii) reducing an experiment to a single RNA source for precise descriptions yet not losing the relationships between experiments done at the same time or for the same project; and (iv) maintaining both raw and processed (e.g. cleansed and normalized) data and recording how the data is processed. The result is a novel schema, which can hold both array and non-array data, is extensible for detailed experimental descriptions that are precise and consistent, and allows for meaningful comparisons of genes between experiments.
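The three categories named in the abstract (Platform, Experiment, Data) can be illustrated with a very small relational sketch. The table and column names below are hypothetical and heavily simplified; they are not the published schema. The sketch only shows how a shared gene index lets measurements from different platforms be compared while raw and processed values are kept side by side.

```python
import sqlite3

# Hypothetical, simplified tables; all names are illustrative assumptions,
# not the schema described in the paper.
SCHEMA = """
CREATE TABLE platform_element (
    element_id INTEGER PRIMARY KEY,
    platform   TEXT NOT NULL,   -- e.g. an array type or SAGE
    gene_index TEXT NOT NULL    -- shared gene identifier across platforms
);
CREATE TABLE experiment (
    experiment_id INTEGER PRIMARY KEY,
    rna_source    TEXT NOT NULL, -- a single RNA source per experiment
    project       TEXT           -- keeps related experiments linked
);
CREATE TABLE measurement (
    experiment_id   INTEGER REFERENCES experiment(experiment_id),
    element_id      INTEGER REFERENCES platform_element(element_id),
    raw_value       REAL NOT NULL,
    processed_value REAL,        -- stored alongside the raw value
    processing      TEXT         -- how the processed value was derived
);
"""


def build_demo():
    """Create an in-memory database with a few made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO platform_element VALUES (?, ?, ?)",
        [(1, "array", "GENE_A"), (2, "SAGE", "GENE_A")],
    )
    conn.executemany(
        "INSERT INTO experiment VALUES (?, ?, ?)",
        [(10, "sample 1", "demo"), (11, "sample 2", "demo")],
    )
    conn.executemany(
        "INSERT INTO measurement VALUES (?, ?, ?, ?, ?)",
        [(10, 1, 120.0, 1.2, "normalized"), (11, 2, 8.0, 0.9, "normalized")],
    )
    return conn


def values_for_gene(conn, gene_index):
    """Return measurements for one gene across platforms and experiments."""
    query = """
        SELECT m.experiment_id, p.platform, m.raw_value, m.processed_value
        FROM measurement AS m
        JOIN platform_element AS p ON p.element_id = m.element_id
        WHERE p.gene_index = ?
        ORDER BY m.experiment_id
    """
    return conn.execute(query, (gene_index,)).fetchall()
```

Calling `values_for_gene(build_demo(), "GENE_A")` returns one row per experiment, regardless of whether the value came from an array element or a SAGE tag.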
Affiliation(s)
- C Stoeckert
- Computational Biology and Informatics Laboratory, Center for Bioinformatics, University of Pennsylvania, 1313 Blockley Hall, 418 Guardian Drive, Philadelphia, PA 19104-6021, USA
2
Davidson SB, Crabtree J, Brunk BP, Schug J, Tannen V, Overton GC, Stoeckert CJ. K2/Kleisli and GUS: Experiments in integrated access to genomic data sources. IBM Systems Journal 2001; 40(2):512. [DOI: 10.1147/sj.402.0512]
3
Manduchi E, Grant GR, McKenzie SE, Overton GC, Surrey S, Stoeckert CJ. Generation of patterns from gene expression data by assigning confidence to differentially expressed genes. Bioinformatics 2000; 16:685-98. [PMID: 11099255 DOI: 10.1093/bioinformatics/16.8.685]
Abstract
MOTIVATION A protocol is described to attach expression patterns to genes represented in a collection of hybridization array experiments. Discrete values are used to provide an easily interpretable description of differential expression. Binning cutoffs for each sample type are chosen automatically, depending on the desired false-positive rate for the predictions of differential expression. Confidence levels are derived for the statement that changes in observed levels represent true changes in expression. We have a novel method for calculating this confidence, which gives better results than the standard methods. Our method reflects the broader change of focus in the field from studying a few genes with many replicates to studying many (possibly thousands of) genes simultaneously, but with relatively few replicates. Our approach differs from standard methods in that it exploits the fact that there are many genes on the arrays. These are used to estimate for each sample type an appropriate distribution that is employed to control the false-positive rate of the predictions made. Satisfactory results can be obtained using this method with as few as two replicates. RESULTS The method is illustrated through applications to macroarray and microarray datasets. The first is an erythroid development dataset that we have generated using nylon filter arrays. Clones for genes whose expression is known in these cells were assigned expression patterns which are in accordance with what was expected and which are not picked up by the standard methods. Moreover, genes differentially expressed between normal and leukemic cells were identified. These included genes whose expression was altered upon induction of the leukemic cells to differentiate. The second application is to the microarray data by Alizadeh et al. (2000). Our results are in accordance with their major findings and offer confidence measures for the predictions made. They also provide new insights for further analysis.
Affiliation(s)
- E Manduchi
- Center for Bioinformatics, University of Pennsylvania, Philadelphia, PA 19104, USA.
4
Abstract
Blood cell production originates from a rare population of multipotent, self-renewing stem cells. A genome-wide gene expression analysis was performed in order to define regulatory pathways in stem cells as well as their global genetic program. Subtracted complementary DNA libraries from highly purified murine fetal liver stem cells were analyzed with bioinformatic and array hybridization strategies. A large percentage of the several thousand gene products that have been characterized correspond to previously undescribed molecules with properties suggestive of regulatory functions. The complete data, available in a biological process-oriented database, represent the molecular phenotype of the hematopoietic stem cell.
Affiliation(s)
- R L Phillips
- Department of Molecular Biology, Princeton University, Princeton, NJ 08544, USA
5
Kolchanov NA, Podkolodnaya OA, Ananko EA, Ignatieva EV, Stepanenko IL, Kel-Margoulis OV, Kel AE, Merkulova TI, Goryachkovskaya TN, Busygina TV, Kolpakov FA, Podkolodny NL, Naumochkin AN, Korostishevskaya IM, Romashchenko AG, Overton GC. Transcription regulatory regions database (TRRD): its status in 2000. Nucleic Acids Res 2000; 28:298-301. [PMID: 10592253 PMCID: PMC102412 DOI: 10.1093/nar/28.1.298] [Received: 09/29/1999] [Accepted: 10/04/1999]
Abstract
Transcription Regulatory Regions Database (TRRD) has been developed for accumulation of experimental information on the structure-function features of regulatory regions of eukaryotic genes. Each entry in TRRD corresponds to a particular gene and contains a description of structure-function features of its regulatory regions (transcription factor binding sites, promoters, enhancers, silencers, etc.) and gene expression regulation patterns. The current release, TRRD 4.2.5, comprises the description of 760 genes, 3403 expression patterns, and >4600 regulatory elements including 3604 transcription factor binding sites, 600 promoters and 152 enhancers. This information was obtained through annotation of 2537 scientific publications. TRRD 4.2.5 is available through the WWW at http://wwwmgs.bionet.nsc.ru/mgs/dbases/trrd4/
Affiliation(s)
- N A Kolchanov
- Institute of Cytology and Genetics (Siberian Branch of the Russian Academy of Sciences), Lavrentieva 10, Novosibirsk 630090, Russia.
6
Abstract
MOTIVATION The presentation of genomics data in a perspicuous visual format is critical for its rapid interpretation and validation. Relatively few public database developers have the resources to implement sophisticated front-end user interfaces themselves. Accordingly, these developers would benefit from a reusable toolkit of user interface and data visualization components. RESULTS We have designed the bioWidget toolkit as a set of JavaBean components. It includes a wide array of user interface components and defines an architecture for assembling applications. The toolkit is founded on established software engineering design patterns and principles, including componentry, Model-View-Controller, factored models and schema neutrality. As a proof of concept, we have used the bioWidget toolkit to create three extendible applications: AnnotView, BlastView and AlignView.
Affiliation(s)
- S Fischer
- Center for Bioinformatics, University of Pennsylvania, Philadelphia 19104, USA.
7
Ponomarenko MP, Ponomarenko IV, Podkolodnaia OA, Frolov AS, Vorob'ev DV, Kolchanov NA, Overton GC. [Averaging results of site recognition can increase the accuracy of annotating the human genome]. Biofizika 1999; 44:649-54. [PMID: 10544815]
Abstract
A systemic approach is proposed, which makes it possible to increase the accuracy of recognition of functional sites in arbitrary DNA sequences. The approach is based on the Central limit theorem and consists in the averaging of a large number of recognitions of a particular site. To obtain a rather large number of recognitions within the framework of conventional methods of recognition, consensus, and frequency matrix, 20 novel oligonucleotide alphabets were used. The approach was used to study the binding sites of GATA-1 and C/EBP transcription factors. It was found that the averaged recognition of these sites is more precise than each of specific recognitions, which just follows from the Central limit theorem.
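The averaging argument can be illustrated with a small simulation. This is only a sketch under stated assumptions: each of the 20 recognitions is modeled as the true site score plus independent Gaussian noise, which is an idealization and not the authors' recognition procedure. The function name and parameters are hypothetical.

```python
import random
import statistics


def simulate(n_recognizers=20, n_sites=500, noise_sd=1.0, seed=0):
    """Compare the error of one noisy recognition with the averaged one.

    Assumption: every recognizer reports the true score plus independent,
    zero-mean Gaussian noise of the same standard deviation.
    """
    rng = random.Random(seed)
    single_errors = []
    averaged_errors = []
    for _ in range(n_sites):
        true_score = rng.uniform(-1.0, 1.0)
        scores = [
            true_score + rng.gauss(0.0, noise_sd)
            for _ in range(n_recognizers)
        ]
        # Error of a single recognition versus the mean of all recognitions.
        single_errors.append(abs(scores[0] - true_score))
        averaged_errors.append(abs(statistics.fmean(scores) - true_score))
    return statistics.fmean(single_errors), statistics.fmean(averaged_errors)
```

Under these assumptions the standard deviation of the averaged score shrinks by roughly the square root of the number of recognitions, so the second returned value is smaller than the first. If the recognitions were strongly correlated, the gain would be smaller.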
Affiliation(s)
- M P Ponomarenko
- Institute of Cytology and Genetics, Siberian Division, Russian Academy of Sciences, Novosibirsk, Russia
8
Kolchanov NA, Ponomarenko MP, Frolov AS, Ananko EA, Kolpakov FA, Ignatieva EV, Podkolodnaya OA, Goryachkovskaya TN, Stepanenko IL, Merkulova TI, Babenko VV, Ponomarenko YV, Kochetov AV, Podkolodny NL, Vorobiev DV, Lavryushev SV, Grigorovich DA, Kondrakhin YV, Milanesi L, Wingender E, Solovyev V, Overton GC. Integrated databases and computer systems for studying eukaryotic gene expression. Bioinformatics 1999; 15:669-86. [PMID: 10487874 DOI: 10.1093/bioinformatics/15.7.669]
Abstract
MOTIVATION The goal of the work was to develop a WWW-oriented computer system providing a maximal integration of informational and software resources on the regulation of gene expression and navigation through them. Rapid growth of the variety and volume of information accumulated in the databases on regulation of gene expression necessarily requires the development of computer systems for automated discovery of the knowledge that can be further used for analysis of regulatory genomic sequences. RESULTS The GeneExpress system developed includes the following major informational and software modules: (1) Transcription Regulation (TRRD) module, which contains the databases on transcription regulatory regions of eukaryotic genes and TRRD Viewer for data visualization; (2) Site Activity Prediction (ACTIVITY), the module for analysis of functional site activity and its prediction; (3) Site Recognition module, which comprises (a) B-DNA-VIDEO system for detecting the conformational and physicochemical properties of DNA sites significant for their recognition, (b) Consensus and Weight Matrices (ConsFrec) and (c) Transcription Factor Binding Sites Recognition (TFBSR) systems for detecting conservative contextual regions of functional sites and their recognition; (4) Gene Networks (GeneNet), which contains an object-oriented database accumulating the data on gene networks and signal transduction pathways, and the Java-based Viewer for exploration and visualization of the GeneNet information; (5) mRNA Translation (Leader mRNA), designed to analyze structural and contextual properties of mRNA 5'-untranslated regions (5'-UTRs) and predict their translation efficiency; (6) other program modules designed to study the structure-function organization of regulatory genomic sequences and regulatory proteins. AVAILABILITY GeneExpress is available at http://wwwmgs.bionet.nsc.ru/systems/GeneExpress/ and the links to the mirror site(s) can be found at http://wwwmgs.bionet.nsc.ru/mgs/links/mirrors.html.
Affiliation(s)
- N A Kolchanov
- Institute of Cytology & Genetics, Siberian Branch of the Russian Academy of Sciences, Prosp. Lavrentieva 10, Novosibirsk 630090, Russia.
9
Ponomarenko MP, Ponomarenko JV, Frolov AS, Podkolodny NL, Savinkova LK, Kolchanov NA, Overton GC. Identification of sequence-dependent DNA features correlating to activity of DNA sites interacting with proteins. Bioinformatics 1999; 15:687-703. [PMID: 10487875 DOI: 10.1093/bioinformatics/15.7.687]
Abstract
MOTIVATION The commonly accepted statistical mechanical theory is now multiply confirmed by using the weight matrix methods successfully recognizing DNA sites binding regulatory proteins in prokaryotes. Nevertheless, the recent evaluation of weight matrix methods application for transcription factor binding site recognition in eukaryotes has unexpectedly revealed that the matrix scores correlate better to each other than to the activity of DNA sites interacting with proteins. This observation points out that molecular mechanisms of DNA/protein recognition are more complicated in eukaryotes than in prokaryotes. As the extra events in eukaryotes, the following processes may be considered: (i) competition between the proteins and nucleosome core particle for DNA sites binding these proteins and (ii) interaction between two synergetic/antagonist proteins recognizing a composed element compiled from two DNA sites binding these proteins. That is why identification of the sequence-dependent DNA features correlating with affinity magnitudes of DNA sites interacting with a protein can pinpoint the molecular event limiting this protein/DNA recognition machinery. RESULTS An approach for predicting site activity based on its primary nucleotide sequence has been developed. The approach is realized in the computer system ACTIVITY, containing the databases on site activity and on conformational and physicochemical DNA/RNA parameters. By using the system ACTIVITY, an analysis of some sites was provided and the methods for predicting site activity were constructed. The methods developed are in good agreement with the experimental data. AVAILABILITY The database ACTIVITY is available at http://wwwmgs.bionet.nsc.ru/systems/Activity/ and the mirror site, http://www.cbil.upenn.edu/mgs/systems/activity/.
Affiliation(s)
- M P Ponomarenko
- Laboratory of Theoretical Genetics, Institute of Cytology & Genetics, 10 Lavrentyev Avenue, Novosibirsk, 630090, Russia.
10
Ponomarenko JV, Ponomarenko MP, Frolov AS, Vorobyev DG, Overton GC, Kolchanov NA. Conformational and physicochemical DNA features specific for transcription factor binding sites. Bioinformatics 1999; 15:654-68. [PMID: 10487873 DOI: 10.1093/bioinformatics/15.7.654]
Abstract
MOTIVATION A reliable recognition of transcription factor binding sites is essential for analysis of regulatory genomic sequences. The experimental data make evident an important role of DNA conformational features for site functioning. However, Internet-available tools for revealing conformational and physicochemical DNA features significant for the site functioning and subsequent use of these features for site recognition have not been developed up to now. RESULTS We suggest an approach for revealing significant conformational and physicochemical properties of functional sites implemented in the database B-DNA-VIDEO. This database is designed to study the sets of various transcription factor binding sites, providing evidence that transcription factor binding sites are characterized by specific sets of significant conformational and physicochemical DNA properties. For a fixed site, by using the B-DNA features selected for this site recognition, the C-program recognizing this site may be generated, control tested and stored in the database B-DNA-VIDEO. Each B-DNA-VIDEO entry links to the Web-applet recognizing the site, whose significant B-DNA features are stored in this entry as the 'site recognition programs'. The pairwise linked entry-applet pairs are compiled within the B-DNA-VIDEO system, which is simultaneously the database and the program tools package applicable immediately for recognizing the sites stored in the database. Indeed, this is the novelty. Hence, B-DNA-VIDEO is the Web resource of both 'searching for static data' and 'active computation' type, that is why it was called an 'activated database'. AVAILABILITY B-DNA-VIDEO is available at http://wwwmgs.bionet.nsc.ru/systems/BDNAVideo/ and the mirror site at http://www.cbil.upenn.edu/mgs/systems/consfreq/.
Affiliation(s)
- J V Ponomarenko
- Laboratory of Theoretical Genetics, Institute of Cytology & Genetics, 10 Lavrentyev Avenue, Novosibirsk, 630090, Russia.
11
Ponomarenko MP, Ponomarenko JV, Frolov AS, Podkolodnaya OA, Vorobyev DG, Kolchanov NA, Overton GC. Oligonucleotide frequency matrices addressed to recognizing functional DNA sites. Bioinformatics 1999; 15:631-43. [PMID: 10487871 DOI: 10.1093/bioinformatics/15.7.631]
Abstract
MOTIVATION Recognition of functional sites remains a key event in the course of genomic DNA annotation. It is well known that a number of sites have their own specific oligonucleotide content. This pinpoints the fact that the preference of the site-specific nucleotide combinations at adjacent positions within an analyzed functional site could be informative for this site recognition. Hence, Web-available resources describing the site-specific oligonucleotide content of the functional DNA sites and applying the above approach for site recognition are needed. However, they have been poorly developed up to now. RESULTS To describe the specific oligonucleotide content of the functional DNA sites, we introduce the oligonucleotide alphabets, out of which the frequency matrix for a given site could be constructed in addition to a traditional nucleotide frequency matrix. Thus, site recognition accuracy increases. This approach was implemented in the activated MATRIX database accumulating oligonucleotide frequency matrices of the functional DNA sites. We have demonstrated that the false-positive error of the functional site recognition decreases if the oligonucleotide frequency matrices are added to the nucleotide frequency matrices commonly used. AVAILABILITY The MATRIX database is available on the Web, http://wwwmgs.bionet.nsc.ru/Dbases/MATRIX/ and the mirror site, http://www.cbil.upenn.edu/mgs/systems/consfreq/.
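As a rough illustration of a frequency matrix built over adjacent-position oligonucleotides rather than single nucleotides, the sketch below counts plain overlapping k-mers (dinucleotides by default) at each position of a set of aligned sites. Plain k-mers are a stand-in chosen here for simplicity; the oligonucleotide alphabets of the paper are not reproduced, and the function names are hypothetical.

```python
from collections import Counter


def oligo_frequency_matrix(sites, k=2):
    """Per-position frequencies of overlapping k-mers in aligned sites.

    Returns a list with one dict per start position, mapping each
    observed k-mer to its relative frequency at that position.
    """
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("sites must be aligned to equal length")
    matrix = []
    for pos in range(length - k + 1):
        counts = Counter(site[pos:pos + k] for site in sites)
        total = sum(counts.values())
        matrix.append({oligo: n / total for oligo, n in counts.items()})
    return matrix


def score(matrix, sequence, k=2):
    """Sum the matrix frequencies of the k-mers found in `sequence`.

    Unseen k-mers contribute zero; `sequence` is assumed to have the
    same length as the aligned sites used to build the matrix.
    """
    return sum(
        column.get(sequence[pos:pos + k], 0.0)
        for pos, column in enumerate(matrix)
    )
```

A sequence matching the training sites at every adjacent pair scores close to the number of matrix columns, while a sequence with unseen pairs scores lower. Such a score could be used alongside a conventional nucleotide matrix score.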
Affiliation(s)
- M P Ponomarenko
- Laboratory of Theoretical Genetics, Institute of Cytology & Genetics, 10 Lavrentyeva Avenue, Novosibirsk, 630090, Russia.
12
Stoeckert CJ, Salas F, Brunk B, Overton GC. EpoDB: a prototype database for the analysis of genes expressed during vertebrate erythropoiesis. Nucleic Acids Res 1999; 27:200-3. [PMID: 9847180 PMCID: PMC148135 DOI: 10.1093/nar/27.1.200]
Abstract
EpoDB is a database of genes expressed in vertebrate red blood cells. It is also a prototype for the creation of cell and tissue-specific databases from multiple external sources. The information in EpoDB obtained from GenBank, SWISS-PROT, Transfac, TRRD and GERD is curated to provide high quality data for sequence analysis aimed at understanding gene regulation during erythropoiesis. New protocols have been developed for data integration and updating entries. Using a BLAST-based algorithm, we have grouped GenBank entries representing the same gene together. This sequence similarity protocol was also used to identify new entries to be included in EpoDB. We have recently implemented our database in Sybase (relational tables) in addition to SICStus Prolog to provide us with greater flexibility in asking complex queries that utilize information from multiple sources. New additions to the public web site (http://www.cbil.upenn.edu/epodb) for accessing EpoDB are the ability to retrieve groups of entries representing different variants of the same gene and to retrieve gene expression data. The BLAST query has been enhanced by incorporating BLASTView, an interactive and graphical display of BLAST results. We have also enhanced the queries for retrieving sequence from specified genes by the addition of MEME, a motif discovery tool, to the integrated analysis tools which include CLUSTALW and TESS.
Affiliation(s)
- C J Stoeckert
- Division of Hematology, The Children's Hospital of Philadelphia, 316E Abramson Research Center, 34th and Civic Center Boulevard, Philadelphia, PA 19104, USA.
13
Overton GC, Bailey C, Crabtree J, Gibson M, Fischer S, Schug J. The GAIA software framework for genome annotation. Pac Symp Biocomput 1998:291-302. [PMID: 9697190]
Abstract
We describe a software framework, GAIA, that supports semi-automated annotation of uncharacterized sequence data. The annotation framework incorporates annotation by data source integration, data analysis, and manual data entry. Components of the system include a configurable, open data analysis pipeline, a relational information storage manager, and Java-based graphical user interfaces. We discuss design decisions and tradeoffs in building such a system, and policies and strategies for producing consistent, uniform, high quality annotation.
Affiliation(s)
- G C Overton
- Center for Bioinformatics, University of Pennsylvania, Philadelphia 19104, USA
14
Abstract
We have performed a systematic analysis of gene identification in genomic sequence by similarity search against expressed sequence tags (ESTs) to assess the suitability of this method for automated annotation of the human genome. A BLAST-based strategy was constructed to examine the potential of this approach, and was applied to test sets containing all human genomic sequences longer than 5 kb in public databases, plus 300 kb of exhaustively characterized benchmark sequence. At high stringency, 70%-90% of all annotated genes are detected by near-identity to EST sequence; >95% of ESTs aligning with well-annotated sequences overlap a gene. These ESTs provide immediate access to the corresponding cDNA clones for follow-up laboratory verification and subsequent biologic analysis. At lower stringency, up to 97% of annotated genes were identified by similarity to ESTs. The apparent false-positive rate rose to 55% of ESTs among all sequences and 20% among benchmark sequences at the lowest stringency, indicating that many genes in public database entries are unannotated. Approximately half of the alignments span multiple exons, and thus aid in the construction of gene predictions and elucidation of alternative splicing. In addition, ESTs from multiple cDNA libraries frequently cluster over genes, providing a starting point for crude expression profiles. Clone IDs may be used to form EST pairs, and particularly to extend models by associating alignments of lower stringency with high-quality alignments. These results demonstrate that EST similarity search is a practical general-purpose annotation technique that complements pattern recognition methods as a tool for gene characterization.
Affiliation(s)
- L C Bailey
- Computational Biology and Informatics Laboratory, Department of Genetics, University of Pennsylvania School of Medicine, Philadelphia, Pennsylvania 19104, USA.
15
Abstract
As increasing amounts of genomic sequence from many organisms become available, and as DNA sequences become a primary reagent in biologic investigations, the role of annotation as a prospective guide for laboratory experiments will expand rapidly. Here we describe a process of high-throughput, reliable annotation, called framework annotation, which is designed to provide a foundation for initial biologic characterization of previously unexamined sequence. To examine this concept in practice, we have constructed Genome Annotation and Information Analysis (GAIA), a prototype software architecture that implements several elements important for framework annotation. The center of GAIA consists of an annotation database and the associated data management subsystem that forms the software bus along which other components communicate. The schema for this database defines three principal concepts: (1) Entries, consisting of sequence and associated historical data; (2) Features, comprising information of biologic interest; and (3) Experiments, describing the evidence that supports Features. The database permits tracking of annotation results over time, as well as assessment of the reliability of particular results. New framework annotation is produced by CARTA, a set of autonomous sensors that perform automatic analyses and assert results into the annotation database. These results are available via a Web-based query interface that uses graphical Java applets as well as text-based HTML pages to display data at different levels of resolution and permit interactive exploration of annotation. We present results for initial application of framework annotation to a set of test sequences, demonstrating its effectiveness in providing a starting point for biologic investigation, and discuss ways in which the current prototype can be improved. The prototype is available for public use and comment at http://www.cbil.upenn.edu/gaia.
Affiliation(s)
- L C Bailey
- Computational Biology and Informatics Laboratory, Department of Genetics, University of Pennsylvania School of Medicine, Philadelphia, Pennsylvania 19104-6021, USA.
16
Abstract
EpoDB is a database designed for the study of gene regulation during differentiation and development of vertebrate red blood cells. In building EpoDB, we have taken the in advance approach to the data integration problem: we have extracted data relevant to red blood cells from GenBank, SWISS-PROT, TRRD (transcriptional regulation data) and GERD (expression levels data) to create a single integrated, highly curated view. Tools have been developed to automate data extraction from online resources, cleanse data of errors, enter information manually from the primary literature, generate a uniform, canonical representation of information and maintain data currency. The database is organized around biological features, e.g., genes, rather than sequences, which are supported by a controlled and consistent vocabulary for gene names and gene family names. Beyond the standard database queries, the functionality of EpoDB includes the ability to extract features and subsequences, display sequences and features graphically using bioWidget viewers and integrated analysis tools. EpoDB may be accessed at: http://cbil.humgen.upenn.edu/epodb/
Affiliation(s)
- F Salas
- Department of Genetics, University of Pennsylvania School of Medicine, Room 475, Clinical Research Building, 422 Curie Boulevard, Philadelphia, PA 19104-6145, USA
17
Ajioka JW, Boothroyd JC, Brunk BP, Hehl A, Hillier L, Manger ID, Marra M, Overton GC, Roos DS, Wan KL, Waterston R, Sibley LD. Gene discovery by EST sequencing in Toxoplasma gondii reveals sequences restricted to the Apicomplexa. Genome Res 1998; 8:18-28. [PMID: 9445484 DOI: 10.1101/gr.8.1.18]
Abstract
To accelerate gene discovery and facilitate genetic mapping in the protozoan parasite Toxoplasma gondii, we have generated >7000 new ESTs from the 5' ends of randomly selected tachyzoite cDNAs. Comparison of the ESTs with the existing gene databases identified possible functions for more than 500 new T. gondii genes by virtue of sequence motifs shared with conserved protein families, including factors involved in transcription, translation, protein secretion, signal transduction, cytoskeleton organization, and metabolism. Despite this success in identifying new genes, more than 50% of the ESTs correspond to genes of unknown function, reflecting the divergent evolutionary status of this parasite. A newly recognized class of genes was identified based on its similarity to sequences known only from other members of the same phylum, therefore identifying sequences that are apparently restricted to the Apicomplexa. Such genes may underlie pathways common to this group of medically important parasites, therefore identifying potential targets for intervention.
Affiliation(s)
- J W Ajioka
- Department of Pathology, Cambridge University, UK
18
Schug J, Overton GC. Modeling transcription factor binding sites with Gibbs Sampling and Minimum Description Length encoding. Proc Int Conf Intell Syst Mol Biol 1997; 5:268-71. [PMID: 9322048]
Abstract
Transcription factors, proteins required for the regulation of gene expression, recognize and bind short stretches of DNA on the order of 4 to 10 bases in length. In general, each factor recognizes a family of "similar" sequences rather than a single unique sequence. Ultimately, the transcriptional state of a gene is determined by the cooperative interaction of several bound factors. We have developed a method using Gibbs Sampling and the Minimum Description Length principle for automatically and reliably creating weight matrix models of binding sites from a database (TRANSFAC) of known binding site sequences. Determining the relationship between sequence and binding affinity for a particular factor is an important first step in predicting whether a given uncharacterized sequence is part of a promoter site or other control region. Here we describe the foundation for the methods we will use to develop weight matrix models for transcription factor binding sites.
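To make the notion of a weight matrix model concrete, here is a minimal sketch that builds a log-odds matrix from already aligned binding sites and scans a sequence with it. It deliberately does not implement Gibbs Sampling or Minimum Description Length encoding; the pseudocount, the uniform background, and the example sites are assumptions made for illustration only.

```python
import math

BASES = "ACGT"


def weight_matrix(sites, pseudocount=1.0, background=0.25):
    """Log-odds weight matrix from aligned, equal-length binding sites.

    Assumptions: a uniform background frequency and a simple additive
    pseudocount; neither is taken from the paper.
    """
    length = len(sites[0])
    denom = len(sites) + pseudocount * len(BASES)
    matrix = []
    for pos in range(length):
        column = [site[pos] for site in sites]
        matrix.append({
            base: math.log2(
                ((column.count(base) + pseudocount) / denom) / background
            )
            for base in BASES
        })
    return matrix


def best_match(matrix, sequence):
    """Return (score, start) of the highest-scoring window in `sequence`.

    `sequence` is assumed to contain only the letters A, C, G and T and
    to be at least as long as the matrix.
    """
    width = len(matrix)
    scored = [
        (sum(matrix[i][sequence[start + i]] for i in range(width)), start)
        for start in range(len(sequence) - width + 1)
    ]
    return max(scored)
```

For example, a matrix built from the made-up sites `TGAC`, `TGAC`, `TGAT`, `TGAC` places its best match in `AATGACGG` at offset 2, where the window reads `TGAC`.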
19
|
Abstract
Factor VII is a vitamin K-dependent coagulation protein essential for proper hemostasis. The human Factor VII gene spans 13 kilobase pairs and is located on chromosome 13 just 2.8 kilobase pairs 5' to the Factor X gene. In this report, we show that Factor VII transcripts are restricted to the liver and that steady state levels of mRNA are much lower than those of Factor X. The major transcription start site is mapped at -51 by RNase protection assay and primer extension experiments. The first 185 base pairs 5' of the translation start site are sufficient to confer maximal promoter activity in HepG2 cells. Protein binding sites are identified at nucleotides -51 to -32, -63 to -58, -108 to -84, and -233 to -215 by DNase I footprint analysis and gel mobility shift assays. A liver-enriched transcription factor, hepatocyte nuclear factor-4 (HNF-4), and a ubiquitous transcription factor, Sp1, are shown to bind within the first 108 base pairs of the promoter region at nucleotide sequences ACTTTG and CCCCTCCCCC, respectively. The importance of these binding sites in promoter activity is demonstrated through independent functional mutagenesis experiments, which show dramatically reduced promoter activity. Transactivation studies with an HNF-4 expression plasmid in HeLa cells also demonstrate the importance of HNF-4 in promoting transcription in non-hepatocyte derived cells. Additionally, the sequence of a naturally occurring allele containing a previously described decanucleotide insert polymorphism at -323 is shown to reduce promoter activity by 33% compared with the more common allelic sequence.
Affiliation(s)
- E S Pollak
- Department of Pediatrics, University of Pennsylvania, Children's Hospital of Philadelphia 19104, USA
20
|
Abstract
The National Center for Biotechnology Information (NCBI) has created a database collection that includes several protein and nucleic acid sequence databases, a biosequence-specific subset of MEDLINE, as well as value-added information such as links between similar sequences. Information in the NCBI database is modeled in Abstract Syntax Notation 1 (ASN.1), an Open Systems Interconnection protocol designed for the purpose of exchanging structured data between software applications rather than as a data model for database systems. While the NCBI database is distributed with an easy-to-use information retrieval system, ENTREZ, the ASN.1 data model currently lacks an ad hoc query language for general-purpose data access. For that reason, we have developed a software package, SORTEZ, that transforms the ASN.1 database (or other databases with nested data structures) to a relational data model and subsequently to a relational database management system (Sybase) where information can be accessed through the relational query language, SQL. Because the need to transform data from one data model and schema to another arises naturally in several important contexts, including efficient execution of specific applications, access to multiple databases, and adaptation to database evolution, this work also serves as a practical study of the issues involved in the various stages of database transformation. We show that transformation from the ASN.1 data model to a relational data model can be largely automated, but that schema transformation and data conversion require considerable domain expertise and would greatly benefit from additional support tools.
Affiliation(s)
- K W Hart
- Department of Genetics, University of Pennsylvania School of Medicine, Philadelphia 19104-6145
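The general idea of moving nested records into relational tables, as described in the abstract above, can be sketched as follows. This is not SORTEZ and does not parse ASN.1: it assumes the nested record has already been loaded as Python dictionaries and lists, and the table names, the `id` and `<parent>_id` key columns, and the sample record are all hypothetical.

```python
from collections import defaultdict
from itertools import count

def to_relational(record, table, tables=None, ids=None, parent=None):
    """Flatten one nested record into {table_name: [row, ...]}.

    Scalar fields become columns of the row. Nested dicts, and lists of
    dicts, become rows in a child table named after the field, linked to
    the parent row through a '<parent_table>_id' column.
    """
    tables = defaultdict(list) if tables is None else tables
    ids = count(1) if ids is None else ids
    row = {"id": next(ids)}
    if parent is not None:
        parent_table, parent_id = parent
        row[parent_table + "_id"] = parent_id
    tables[table].append(row)
    for key, value in record.items():
        children = value if isinstance(value, list) else [value]
        if children and all(isinstance(child, dict) for child in children):
            for child in children:
                to_relational(child, key, tables, ids, (table, row["id"]))
        else:
            # Scalars (and, in this sketch, lists of scalars) stay on the row.
            row[key] = value
    return tables

# Made-up nested record for illustration only.
entry = {
    "accession": "EXAMPLE1",
    "title": "example entry",
    "features": [
        {"kind": "CDS", "start": 10, "end": 90},
        {"kind": "exon", "start": 10, "end": 50},
    ],
    "organism": {"name": "Mus musculus"},
}
tables = to_relational(entry, "entry")
```

Each resulting list of rows could then be loaded into a relational database and joined on the key columns with SQL. Deciding how fields should map to tables and columns for a real schema is the part the abstract identifies as requiring domain expertise, and this sketch does not attempt it.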
21
|
Abstract
We have developed a general system, QGB, for performing complex queries on the information in the DDBJ/EMBL/GenBank databases, including queries over the structural features of sequences implied in the FEATURE TABLE. Queries are formed in a Structured Query Language (SQL)-like syntax with language extensions to support complex types (e.g., sets, ordered sets, and records) appropriate for representing and querying sequence data. A novel aspect of QGB is its ability to deduce missing features and infer relationships among features as a consequence of constructing a parse tree of sequence structure from information described in the FEATURE TABLE. The grammar for the parse tree is implemented in a customized form of the Definite Clause Grammar syntax of the logic programming language Prolog. The logic grammar formalism was chosen because it provides a perspicuous representation for features and constraints, and Prolog provides an execution model for the grammar rules. Construction of the parse tree also identifies inconsistencies and errors in the FEATURE TABLE that can in some cases be corrected automatically and used to generate an augmented version of the table.
Affiliation(s)
- G C Overton
- Department of Genetics, University of Pennsylvania School of Medicine, Philadelphia 19104-6145, USA
22
|
Emanuel BS, Buetow K, Nussbaum R, Scambler P, Lipinski M, Overton GC. Report of the third international workshop on human chromosome 22 mapping. Cytogenet Cell Genet 1993; 63:206-211. [PMID: 8099004] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/22/2023]
Affiliation(s)
- B S Emanuel
- Children's Hospital of Philadelphia, PA 19104
23
|
Abstract
The detection of similarities between DNA sequences can be accomplished using the signal-processing technique of cross-correlation. An early method used the fast Fourier transform (FFT) to perform correlations on DNA sequences in O(n log n) time for any length sequence. However, this method requires many FFTs (nine), runs no faster if one sequence is much shorter than the other, and measures only global similarity, so that significant short local matches may be missed. We report that, through the use of alternative encodings of the DNA sequence in the complex plane, the number of FFTs performed can be traded off against (i) signal-to-noise ratio, and (ii) a certain degree of filtering for local similarity via k-tuple correlation. Also, when comparing probe sequences against much longer targets, the algorithm can be sped up by decomposing the target and performing multiple small FFTs in an overlap-save arrangement. Finally, by decomposing the probe sequence as well, the detection of local similarities can be further enhanced. With current advances in extremely fast hardware implementations of signal-processing operations, this approach may prove more practical than heretofore.
Affiliation(s)
- E A Cheever
- Department of Engineering, Swarthmore College, PA 19081
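The FFT-based cross-correlation described in the abstract above can be illustrated with a small, self-contained sketch. The complex-plane encoding used here (A = 1, T = -1, C = i, G = -i) is an assumption chosen for illustration and is not necessarily one of the encodings studied in the paper; the radix-2 FFT is a plain textbook version, and the overlap-save decomposition and k-tuple filtering mentioned in the abstract are not implemented.

```python
import cmath

# Assumed encoding: each base is a unit-magnitude point in the complex plane.
ENCODING = {"A": 1, "T": -1, "C": 1j, "G": -1j}

def fft(values, inverse=False):
    """Recursive radix-2 FFT; len(values) must be a power of two.

    The inverse transform is returned unscaled (the caller divides by n).
    """
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2], inverse)
    odd = fft(values[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out

def correlate(target, probe):
    """Return the real part of the cross-correlation at every full-overlap offset."""
    size = 1
    while size < len(target) + len(probe):
        size *= 2  # zero-pad so circular correlation does not wrap around
    x = [complex(ENCODING[b]) for b in target] + [0j] * (size - len(target))
    y = [complex(ENCODING[b]) for b in probe] + [0j] * (size - len(probe))
    spectrum = [a * b.conjugate() for a, b in zip(fft(x), fft(y))]
    corr = fft(spectrum, inverse=True)
    return [corr[k].real / size for k in range(len(target) - len(probe) + 1)]

# Made-up sequences for illustration only.
scores = correlate("GGGACTTTGCC", "ACTTTG")
best_offset = max(range(len(scores)), key=scores.__getitem__)
```

With this encoding, an identical base pair contributes +1 to the real part of the correlation, an A/T or C/G pairing contributes -1, and every other pairing contributes 0, so the score at an offset can only reach the probe length where the probe matches exactly. The sketch uses two forward transforms and one inverse transform per comparison.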
24
|
Abstract
A cDNA clone, p2-4, was isolated from mouse teratocarcinoma-derived parietal endoderm-like cells and used to analyze expression of the corresponding transcript during mouse embryogenesis. Nucleotide-sequence analysis revealed extensive homology between this clone and SPARC/osteonectin cDNA cloned from mouse parietal endoderm and bovine bone cells. The SPARC/osteonectin transcript became more abundant when embryonal carcinoma (EC) cells differentiated into parietal endoderm-like cells. In embryos, the transcript began to appear in the embryo proper on day 11 and continued to be expressed throughout the gestation period. The transcript was also present in extraembryonic membranes and placenta from days 9 and 11 onward, respectively. Thus, expression of the transcript was regulated during differentiation of EC cells and during embryogenesis. In adult mice, several non-bone tissues, including testis, also expressed the transcript. Analysis of germ-cell-deficient mice indicated that non-germ-cell components of the testis expressed the transcript. Analysis of mouse testicular cell lines further suggested that the transcript was abundant in Sertoli cells and Leydig cells. Cumulus oophorus cells that envelop the ovulated egg also expressed high levels of the transcript.
Affiliation(s)
- C C Howe
- Wistar Institute, Philadelphia, Pennsylvania 19104
25
|
Abstract
Three cDNA clones coding for the 3' region of the intracisternal A-particle (IAP), a mouse endogenous retrovirus, were isolated during screening of a library for genes whose expression was modulated during the retinoic acid-induced differentiation of the embryonal carcinoma cell line F9 into parietal endoderm-like (PE-like) cells. In contrast to previously reported results, no IAP transcripts were detected in either F9 cells or two pluripotent cell lines tested. Instead, IAP transcripts as well as IAPs were abundant in the PE-like cells PYS-2 and F9AcCl 9 and in retinoic acid-induced F9 cells but not in the other differentiated cell types of teratocarcinoma origin which were examined. A comparison of the nucleotide sequences of the three IAP cDNA clones with a genomically integrated proviral sequence (MIA14) demonstrated heterogeneity in both length and sequence among the clones. The position of the poly(A) addition site was determined to be 15 nucleotides from the proposed poly(A) addition signal and to occur after the sequence CAGA, not CA, as previously proposed. Length heterogeneity was greatest in a region of TC repeats 80 base pairs 5' to the poly(A) addition site. Additionally, the putative TATAA box found in MIA14 was deleted in the cDNA clones and in the long terminal repeat regions from two other genomic clones examined. The heterogeneity evident among the cDNA clones further demonstrated that at least two distinct IAP genes are activated during differentiation. An analysis of the rate of transcription in isolated nuclei indicated that the activation of expression of IAP genes in PE-like cells is the result of transcriptional regulation. Together, these observations suggest that the modulation of IAP transcription is regulated autonomously rather than by the fortuitous integration of an IAP sequence adjacent to a developmentally regulated cellular gene.
26
|
Howe CC, Lugg DK, Overton GC. Post-transcriptional regulation of the abundance of mRNAs encoding alpha-tubulin and a 94,000-dalton protein in teratocarcinoma-derived stem cells versus differentiated cells. Mol Cell Biol 1984; 4:2428-36. [PMID: 6513923 PMCID: PMC369074 DOI: 10.1128/mcb.4.11.2428-2436.1984] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.3] [Indexed: 01/20/2023]
Abstract
Changes in the expression of the genes encoding alpha-tubulin and a 94,000-dalton protein (p94) specified by a cDNA clone, p4-30, were examined in a differentiated teratocarcinoma-derived parietal endoderm cell line, PYS-2, and an undifferentiated teratocarcinoma stem cell line, F9. Relative to other proteins or mRNA species, the synthesis rate of the alpha-tubulins and of p94, as well as the levels of their corresponding cytoplasmic mRNAs, were lower in PYS-2 than in F9 cells. The decrease was greater for the relative abundance of cytoplasmic alpha-tubulin mRNA than for p94 mRNA. Similarly, induction of differentiation of F9 cells by simultaneous exposure to retinoic acid (RA) and dibutyryl cyclic AMP resulted in reduced relative levels of the cytoplasmic mRNAs for these proteins. The reduction in abundance of the two RNA species was not due to a decrease in growth rate since the differentiated cells, PYS-2, RA-treated F9, and RA plus dibutyryl cyclic AMP-treated F9 cells, grew at a rate similar to that of undifferentiated F9 cells. However, induction of differentiation of F9 cells by treatment with RA alone did not cause down-regulation of the two RNA species. The relative levels of total cellular RNA encoding alpha-tubulin and p94 in PYS-2 cells were also lower than those in F9 cells to an extent comparable to the decrease in the cytoplasmic RNAs. Since the apparent relative rates of RNA transcription were similar in both cell types, we conclude that the reduction in relative levels of the alpha-tubulin and p94 RNAs in the cell depends largely on the relative stability of the two RNAs and not on the relative rates of transcription. The faster disappearance of the two RNA species relative to other cellular RNAs from actinomycin D-treated PYS-2 compared with F9 cells is consistent with this interpretation.
27
|
Abstract
Histone gene repeats in S. purpuratus are shown to be of variable length and sequence. Two recombinant plasmids containing the full-length 6.3 kb histone repeat unit are found to differ in length at two sites in the repeating structure and in the occurrence of two restriction enzyme recognition sites. Variation in repeat length is also demonstrated in the unfractionated DNA of five sea urchins and in a sample of DNA enriched for histone gene sequences by density gradient methods. The repeats in each individual are of a very limited number of major classes, which may differ from one another in overall length or in distribution and presence of particular restriction enzyme sites. Variations are found to occur at many regions of the repeat; some have been mapped specifically to spacer regions. Repeats may differ dramatically from individual to individual since there is no one type of repeat class common to all, although the absolute length differences of the repeats that are found are small.
28
|
Weinberg ES, Overton GC, Hendricks MB, Newrock KM, Cohen LH. Histone gene heterogeneity in the sea urchin Strongylocentrotus purpuratus. Cold Spring Harb Symp Quant Biol 1978; 42 Pt 2:1093-100. [PMID: 277304 DOI: 10.1101/sqb.1978.042.01.110] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.2] [Indexed: 12/14/2022]
29
|
30
|
Weinberg ES, Overton GC, Shutt RH, Reeder RH. Histone gene arrangement in the sea urchin, Strongylocentrotus purpuratus. Proc Natl Acad Sci U S A 1975; 72:4815-9. [PMID: 1108003 PMCID: PMC388822 DOI: 10.1073/pnas.72.12.4815] [Citation(s) in RCA: 26] [Impact Index Per Article: 0.5] [Indexed: 12/25/2022]
Abstract
The DNA coding for histones from Strongylocentrotus purpuratus, purified up to 100-fold with the use of Hg2+-Cs2SO4 and actinomycin-CsCl equilibrium density gradients, has been used to study the clustering of genes coding for different histones and the size of the repeating multigene cluster. When digested with EcoRI restriction endonuclease, the histone DNA is identified in two classes of fragments with molecular weights of 1.15 × 10^6 and 2.8 × 10^6, whereas after treatment of the DNA with HindIII restriction endonuclease, histone gene sequences can be identified only in a fragment of 3.95 × 10^6. Treatment of the DNA with both enzymes simultaneously shows that there is a HindIII site within the smaller EcoRI fragment. Partial digests with HindIII give fragment sizes that appear to be simple multiples of a 3.95 × 10^6 repeat. Individual histone mRNAs all hybridize to the 3.95 × 10^6 fragment but only to one or the other EcoRI fragments. The evidence strongly suggests a repeating unit of 3.95 × 10^6 containing the genes for most, if not all, the histones.