Reference Citation Analysis: Find an Article, Find a Category, Find a Journal, Find a Scholar

Total Articles

98
(from Reference Citation Analysis)

Article PDFs (38)

Cited by > 0 (91)

Searched Name

W James Kent

Number	Citation Analysis
51	28-way vertebrate alignment and conservation track in the UCSC Genome Browser. Genome Res 2007;17:1797-808. [PMID: 17984227 PMCID: PMC2099589 DOI: 10.1101/gr.6761107] [Citation(s) in RCA: 207] [Impact Index Per Article: 12.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/04/2007] [Accepted: 08/30/2007] [Indexed: 01/17/2023] Abstract This article describes a set of alignments of 28 vertebrate genome sequences that is provided by the UCSC Genome Browser. The alignments can be viewed on the Human Genome Browser (March 2006 assembly) at http://genome.ucsc.edu, downloaded in bulk by anonymous FTP from http://hgdownload.cse.ucsc.edu/goldenPath/hg18/multiz28way, or analyzed with the Galaxy server at http://g2.bx.psu.edu. This article illustrates the power of this resource for exploring vertebrate and mammalian evolution, using three examples. First, we present several vignettes involving insertions and deletions within protein-coding regions, including a look at some human-specific indels. Then we study the extent to which start codons and stop codons in the human sequence are conserved in other species, showing that start codons are in general more poorly conserved than stop codons. Finally, an investigation of the phylogenetic depth of conservation for several classes of functional elements in the human genome reveals striking differences in the rates and modes of decay in alignability. Each functional class has a distinctive period of stringent constraint, followed by decays that allow (for the case of regulatory regions) or reject (for coding regions and ultraconserved elements) insertions and deletions. 
Collapse Key Words Collapse MESH Headings Animals Base Sequence Cats Cattle Codon, Initiator/genetics Codon, Terminator/genetics Conserved Sequence Databases, Genetic Dogs Genome, Human Guinea Pigs Humans Mice Molecular Sequence Data Mutagenesis, Insertional Rabbits Rats Sequence Alignment/methods Sequence Deletion Collapse Grants R56 DK065806 NIDDK NIH HHS P41 HG002371 NHGRI NIH HHS 1P41HG02371 NHGRI NIH HHS R01 DK065806 NIDDK NIH HHS DK65806 NIDDK NIH HHS N01CO12400 NCI NIH HHS N01-CO-12400 NCI NIH HHS R01 HG002238 NHGRI NIH HHS HG002238 NHGRI NIH HHS Collapse
52	Analyses of deep mammalian sequence alignments and constraint predictions for 1% of the human genome. Genome Res 2007;17:760-74. [PMID: 17567995 PMCID: PMC1891336 DOI: 10.1101/gr.6034307] [Citation(s) in RCA: 170] [Impact Index Per Article: 10.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/10/2023] Abstract A key component of the ongoing ENCODE project involves rigorous comparative sequence analyses for the initially targeted 1% of the human genome. Here, we present orthologous sequence generation, alignment, and evolutionary constraint analyses of 23 mammalian species for all ENCODE targets. Alignments were generated using four different methods; comparisons of these methods reveal large-scale consistency but substantial differences in terms of small genomic rearrangements, sensitivity (sequence coverage), and specificity (alignment accuracy). We describe the quantitative and qualitative trade-offs concomitant with alignment method choice and the levels of technical error that need to be accounted for in applications that require multisequence alignments. Using the generated alignments, we identified constrained regions using three different methods. While the different constraint-detecting methods are in general agreement, there are important discrepancies relating to both the underlying alignments and the specific algorithms. However, by integrating the results across the alignments and constraint-detecting methods, we produced constraint annotations that were found to be robust based on multiple independent measures. Analyses of these annotations illustrate that most classes of experimentally annotated functional elements are enriched for constrained sequences; however, large portions of each class (with the exception of protein-coding sequences) do not overlap constrained regions. 
The latter elements might not be under primary sequence constraint, might not be constrained across all mammals, or might have expendable molecular functions. Conversely, 40% of the constrained sequences do not overlap any of the functional elements that have been experimentally identified. Together, these findings demonstrate and quantify how many genomic functional elements await basic molecular characterization. Collapse Key Words Collapse MESH Headings Animals Evolution, Molecular Genome, Human Human Genome Project Humans Mammals/genetics Open Reading Frames Phylogeny Sequence Alignment Collapse Grants P41 HG002371 NHGRI NIH HHS R01 GM076705 NIGMS NIH HHS Intramural NIH HHS R43 HG002632 NHGRI NIH HHS U01 HG003150 NHGRI NIH HHS R01 HG002238 NHGRI NIH HHS Collapse
53	Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature 2007;447:799-816. [PMID: 17571346 PMCID: PMC2212820 DOI: 10.1038/nature05874] [Citation(s) in RCA: 3782] [Impact Index Per Article: 222.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/17/2023] Abstract We report the generation and analysis of functional data from multiple, diverse experiments performed on a targeted 1% of the human genome as part of the pilot phase of the ENCODE Project. These data have been further integrated and augmented by a number of evolutionary and computational analyses. Together, our results advance the collective knowledge about human genome function in several major areas. First, our studies provide convincing evidence that the genome is pervasively transcribed, such that the majority of its bases can be found in primary transcripts, including non-protein-coding transcripts, and those that extensively overlap one another. Second, systematic examination of transcriptional regulation has yielded new understanding about transcription start sites, including their relationship to specific regulatory sequences and features of chromatin accessibility and histone modification. Third, a more sophisticated view of chromatin structure has emerged, including its inter-relationship with DNA replication and transcriptional regulation. Finally, integration of these new sources of information, in particular with respect to mammalian evolution based on inter- and intra-species sequence comparisons, has yielded new mechanistic and evolutionary insights concerning the functional landscape of the human genome. Together, these studies are defining a path for pursuit of a more comprehensive characterization of human genome function. 
Collapse Key Words Collapse MESH Headings Chromatin/genetics Chromatin/metabolism Chromatin Immunoprecipitation Conserved Sequence/genetics DNA Replication Evolution, Molecular Exons/genetics Genetic Variation/genetics Genome, Human/genetics Genomics Heterozygote Histones/metabolism Humans Pilot Projects Protein Binding RNA, Messenger/genetics RNA, Untranslated/genetics Regulatory Sequences, Nucleic Acid/genetics Transcription Factors/metabolism Transcription Initiation Site Transcription, Genetic/genetics Collapse Grants R01 HG003541 NHGRI NIH HHS U01 HG003162-03 NHGRI NIH HHS U01 HG002523-01 NHGRI NIH HHS U54 HG003079-01 NHGRI NIH HHS R01 HG003521-01 NHGRI NIH HHS R01 HG003143 NHGRI NIH HHS P41 HG002371 NHGRI NIH HHS U01 HG003161-03 NHGRI NIH HHS U01 HG003156 NHGRI NIH HHS R01 HG003541-03 NHGRI NIH HHS 077198 Wellcome Trust U01 HG003157 NHGRI NIH HHS R01 HG003110 NHGRI NIH HHS U01 HG003161 NHGRI NIH HHS U54 HG003067-01 NHGRI NIH HHS P41 HG002371-03S1 NHGRI NIH HHS U01 HG003157-03 NHGRI NIH HHS U01 HG003147 NHGRI NIH HHS U01 HG003168-02 NHGRI NIH HHS U54 HG003067 NHGRI NIH HHS R01 HG003110-03 NHGRI NIH HHS U01 HG003156-03 NHGRI NIH HHS R01 HG003143-04 NHGRI NIH HHS U01 HG003150-03 NHGRI NIH HHS U01 HG003147-02 NHGRI NIH HHS R01 HG003532-01 NHGRI NIH HHS R01 HG003521 NHGRI NIH HHS U54 HG003273 NHGRI NIH HHS R01 HG003532 NHGRI NIH HHS R01 HG002238-15 NHGRI NIH HHS U01 HG003162 NHGRI NIH HHS K22 HG003169 NHGRI NIH HHS K22 HG003169-01A1 NHGRI NIH HHS F32 CA108313 NCI NIH HHS U54 HG003079 NHGRI NIH HHS U54 HG003273-01 NHGRI NIH HHS U01 HG003151 NHGRI NIH HHS Wellcome Trust 062023 Wellcome Trust U01 HG003151-03 NHGRI NIH HHS U01 HG002523 NHGRI NIH HHS R01 HG003129-03 NHGRI NIH HHS U01 HG003150 NHGRI NIH HHS R01 HG002238 NHGRI NIH HHS Collapse
54	Evolutionary and biomedical insights from the rhesus macaque genome. Science 2007;316:222-34. [PMID: 17431167 DOI: 10.1126/science.1139247] [Citation(s) in RCA: 989] [Impact Index Per Article: 58.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/12/2022] Abstract The rhesus macaque (Macaca mulatta) is an abundant primate species that diverged from the ancestors of Homo sapiens about 25 million years ago. Because they are genetically and physiologically similar to humans, rhesus monkeys are the most widely used nonhuman primate in basic and applied biomedical research. We determined the genome sequence of an Indian-origin Macaca mulatta female and compared the data with chimpanzees and humans to reveal the structure of ancestral primate genomes and to identify evidence for positive selection and lineage-specific expansions and contractions of gene families. A comparison of sequences from individual animals was used to investigate their underlying genetic diversity. The complete description of the macaque genome blueprint enhances the utility of this animal model for biomedical research and improves our understanding of the basic biology of the species. Collapse Key Words Collapse MESH Headings Animals Biomedical Research Evolution, Molecular Female Gene Duplication Gene Rearrangement Genetic Diseases, Inborn Genetic Variation Genome Humans Macaca mulatta/genetics Male Multigene Family Mutation Pan troglodytes/genetics Sequence Analysis, DNA Species Specificity Collapse Grants 062023 Wellcome Trust R01 HG002939 NHGRI NIH HHS U54 HG003068 NHGRI NIH HHS U54 HG003079 NHGRI NIH HHS Collapse
55	The ENCODE Project at UC Santa Cruz. Nucleic Acids Res 2007;35:D663-7. [PMID: 17166863 PMCID: PMC1781110 DOI: 10.1093/nar/gkl1017] [Citation(s) in RCA: 82] [Impact Index Per Article: 4.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/15/2006] [Revised: 11/01/2006] [Accepted: 11/02/2006] [Indexed: 12/02/2022] Open Abstract The goal of the Encyclopedia Of DNA Elements (ENCODE) Project is to identify all functional elements in the human genome. The pilot phase is for comparison of existing methods and for the development of new methods to rigorously analyze a defined 1% of the human genome sequence. Experimental datasets are focused on the origin of replication, DNase I hypersensitivity, chromatin immunoprecipitation, promoter function, gene structure, pseudogenes, non-protein-coding RNAs, transcribed RNAs, multiple sequence alignment and evolutionarily constrained elements. The ENCODE project at UCSC website (http://genome.ucsc.edu/ENCODE) is the primary portal for the sequence-based data produced as part of the ENCODE project. In the pilot phase of the project, over 30 labs provided experimental results for a total of 56 browser tracks supported by 385 database tables. The site provides researchers with a number of tools that allow them to visualize and analyze the data as well as download data for local analyses. This paper describes the portal to the data, highlights the data that has been made available, and presents the tools that have been developed within the ENCODE project. Access to the data and types of interactive analysis that are possible are illustrated through supplemental examples. Collapse Key Words Collapse MESH Headings Base Sequence Databases, Nucleic Acid Genome, Human Genomics Humans Internet Sequence Alignment Software User-Computer Interface Collapse Grants Collapse
56	Comparative genomic analysis using the UCSC genome browser. Methods Mol Biol 2007;395:17-34. [PMID: 17993665 DOI: 10.1007/978-1-59745-514-5_2] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/06/2023] Abstract Comparative analysis of DNA sequence from multiple species can provide insights into the function and evolutionary processes that shape genomes. The University of California Santa Cruz (UCSC) Genome Bioinformatics group has developed several tools and methodologies in its study of comparative genomics, many of which have been incorporated into the UCSC Genome Browser (http://genome.ucsc.edu), an easy-to-use online tool for browsing genomic data and aligned annotation "tracks" in a single window. The comparative genomics annotations in the browser include pairwise alignments, which aid in the identification of orthologous regions between species, and conservation tracks that show measures of evolutionary conservation among sets of multiply aligned species, highlighting regions of the genome that may be functionally important. A related tool, the UCSC Table Browser, provides a simple interface for querying, analyzing, and downloading the data underlying the Genome Browser annotation tracks. Here, we describe a procedure for examining a genomic region of interest in the Genome Browser, analyzing characteristics of the region, filtering the data, and downloading data sets for further study. Collapse Key Words Collapse MESH Headings Collapse Grants Collapse
57	Variation resources at UC Santa Cruz. Nucleic Acids Res 2007;35:D716-20. [PMID: 17151077 PMCID: PMC1781230 DOI: 10.1093/nar/gkl953] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/15/2006] [Revised: 10/21/2006] [Accepted: 10/23/2006] [Indexed: 12/15/2022] Open Abstract The variation resources within the University of California Santa Cruz Genome Browser include polymorphism data drawn from public collections and analyses of these data, along with their display in the context of other genomic annotations. Primary data from dbSNP is included for many organisms, with added information including genomic alleles and orthologous alleles for closely related organisms. Display filtering and coloring is available by variant type, functional class or other annotations. Annotation of potential errors is highlighted and a genomic alignment of the variant's flanking sequence is displayed. HapMap allele frequencies and linkage disequilibrium (LD) are available for each HapMap population, along with non-human primate alleles. The browsing and analysis tools, downloadable data files and links to documentation and other information can be found at http://genome.ucsc.edu/. Collapse Key Words Collapse MESH Headings Alleles Animals Databases, Nucleic Acid Gene Frequency Genomics Genotype Humans Internet Linkage Disequilibrium Mice Polymorphism, Single Nucleotide Rats Recombination, Genetic Sequence Alignment User-Computer Interface Collapse Grants Collapse
58	The UCSC genome browser database: update 2007. Nucleic Acids Res 2006;35:D668-73. [PMID: 17142222 PMCID: PMC1669757 DOI: 10.1093/nar/gkl928] [Citation(s) in RCA: 226] [Impact Index Per Article: 12.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/12/2023] Open Abstract The University of California, Santa Cruz Genome Browser Database contains, as of September 2006, sequence and annotation data for the genomes of 13 vertebrate and 19 invertebrate species. The Genome Browser displays a wide variety of annotations at all scales from the single nucleotide level up to a full chromosome and includes assembly data, genes and gene predictions, mRNA and EST alignments, and comparative genomics, regulation, expression and variation data. The database is optimized for fast interactive performance with web tools that provide powerful visualization and querying capabilities for mining the data. In the past year, 22 new assemblies and several new sets of human variation annotation have been released. New features include VisiGene, a fully integrated in situ hybridization image browser; phyloGif, for drawing evolutionary tree diagrams; a redesigned Custom Track feature; an expanded SNP annotation track; and many new display options. The Genome Browser, other tools, downloadable data files and links to documentation and other information can be found at . Collapse Key Words Collapse MESH Headings Animals Base Sequence Cattle Computer Graphics Conserved Sequence Databases, Genetic Genome, Human Genomics Humans Internet Linkage Disequilibrium Mice Open Reading Frames Polymorphism, Single Nucleotide Rats Regulatory Sequences, Nucleic Acid User-Computer Interface Collapse Grants Collapse
59	Reconstructing contiguous regions of an ancestral genome. Genome Res 2006;16:1557-65. [PMID: 16983148 PMCID: PMC1665639 DOI: 10.1101/gr.5383506] [Citation(s) in RCA: 225] [Impact Index Per Article: 12.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/24/2022] Abstract This article analyzes mammalian genome rearrangements at higher resolution than has been published to date. We identify 3171 intervals, covering approximately 92% of the human genome, within which we find no rearrangements larger than 50 kilobases (kb) in the lineages leading to human, mouse, rat, and dog from their most recent common ancestor. Combining intervals that are adjacent in all contemporary species produces 1338 segments that may contain large insertions or deletions but that are free of chromosome fissions or fusions as well as inversions or translocations >50 kb in length. We describe a new method for predicting the ancestral order and orientation of those intervals from their observed adjacencies in modern species. We combine the results from this method with data from chromosome painting experiments to produce a map of an early mammalian genome that accounts for 96.8% of the available human genome sequence data. The precision is further increased by mapping inversions as small as 31 bp. Analysis of the predicted evolutionary breakpoints in the human lineage confirms certain published observations but disagrees with others. Although only a few mammalian genomes are currently sequenced to high precision, our theoretical analyses and computer simulations indicate that our results are reasonably accurate and that they will become highly accurate in the foreseeable future. Our methods were developed as part of a project to reconstruct the genome sequence of the last ancestor of human, dogs, and most other placental mammals. 
Collapse Key Words Collapse MESH Headings Algorithms Animals Base Composition Base Pairing Chromosome Breakage Chromosome Inversion Chromosome Mapping Chromosome Painting Chromosomes Computer Simulation Dogs Evolution, Molecular Gene Deletion Gene Rearrangement Genome Genome, Human Humans Mice Models, Genetic Rats Sequence Alignment/methods Sequence Homology, Nucleic Acid Collapse Grants HG02238 NHGRI NIH HHS P41 HG002371 NHGRI NIH HHS 22XS013A PHS HHS P41HG02371 NHGRI NIH HHS R01 HG002238 NHGRI NIH HHS Collapse
60	A distal enhancer and an ultraconserved exon are derived from a novel retroposon. Nature 2006;441:87-90. [PMID: 16625209 DOI: 10.1038/nature04696] [Citation(s) in RCA: 368] [Impact Index Per Article: 20.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/27/2005] [Accepted: 03/02/2006] [Indexed: 01/15/2023] Abstract Hundreds of highly conserved distal cis-regulatory elements have been characterized so far in vertebrate genomes. Many thousands more are predicted on the basis of comparative genomics. However, in stark contrast to the genes that they regulate, in invertebrates virtually none of these regions can be traced by using sequence similarity, leaving their evolutionary origins obscure. Here we show that a class of conserved, primarily non-coding regions in tetrapods originated from a previously unknown short interspersed repetitive element (SINE) retroposon family that was active in the Sarcopterygii (lobe-finned fishes and terrestrial vertebrates) in the Silurian period at least 410 million years ago (ref. 4), and seems to be recently active in the 'living fossil' Indonesian coelacanth, Latimeria menadoensis. Using a mouse enhancer assay we show that one copy, 0.5 million bases from the neuro-developmental gene ISL1, is an enhancer that recapitulates multiple aspects of Isl1 expression patterns. Several other copies represent new, possibly regulatory, alternatively spliced exons in the middle of pre-existing Sarcopterygian genes. One of these, a more than 200-base-pair ultraconserved region, 100% identical in mammals, and 80% identical to the coelacanth SINE, contains a 31-amino-acid-residue alternatively spliced exon of the messenger RNA processing gene PCBP2 (ref. 6). 
These add to a growing list of examples in which relics of transposable elements have acquired a function that serves their host, a process termed 'exaptation', and provide an origin for at least some of the many highly conserved vertebrate-specific genomic sequences. Collapse Key Words Collapse MESH Headings Animals Base Sequence Conserved Sequence/genetics Enhancer Elements, Genetic/genetics Exons/genetics Fossils Gene Expression Regulation, Developmental Humans Organ Specificity Phylogeny Retroelements/genetics Short Interspersed Nucleotide Elements/genetics Vertebrates/genetics Collapse Grants Collapse
61	The UCSC Genome Browser Database: update 2006. Nucleic Acids Res 2006;34:D590-8. [PMID: 16381938 PMCID: PMC1347506 DOI: 10.1093/nar/gkj144] [Citation(s) in RCA: 847] [Impact Index Per Article: 47.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open Abstract The University of California Santa Cruz Genome Browser Database (GBD) contains sequence and annotation data for the genomes of about a dozen vertebrate species and several major model organisms. Genome annotations typically include assembly data, sequence composition, genes and gene predictions, mRNA and expressed sequence tag evidence, comparative genomics, regulation, expression and variation data. The database is optimized to support fast interactive performance with web tools that provide powerful visualization and querying capabilities for mining the data. The Genome Browser displays a wide variety of annotations at all scales from single nucleotide level up to a full chromosome. The Table Browser provides direct access to the database tables and sequence data, enabling complex queries on genome-wide datasets. The Proteome Browser graphically displays protein properties. The Gene Sorter allows filtering and comparison of genes by several metrics including expression data and several gene properties. BLAT and In Silico PCR search for sequences in entire genomes in seconds. These tools are highly integrated and provide many hyperlinks to other databases and websites. The GBD, browsing tools, downloadable data files and links to documentation and other information can be found at . 
Collapse Key Words Collapse MESH Headings Amino Acid Sequence Animals California Computer Graphics Databases, Genetic Dogs Gene Expression Genes Genomics Humans Internet Mice Polymorphism, Single Nucleotide Proteins/chemistry Proteins/genetics Proteins/metabolism Proteomics Rats Sequence Alignment Software User-Computer Interface Collapse Grants Collapse
62	The UCSC Known Genes. Bioinformatics 2006;22:1036-46. [PMID: 16500937 DOI: 10.1093/bioinformatics/btl048] [Citation(s) in RCA: 400] [Impact Index Per Article: 22.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open Abstract The University of California Santa Cruz (UCSC) Known Genes dataset is constructed by a fully automated process, based on protein data from Swiss-Prot/TrEMBL (UniProt) and the associated mRNA data from Genbank. The detailed steps of this process are described. Extensive cross-references from this dataset to other genomic and proteomic data were constructed. For each known gene, a details page is provided containing rich information about the gene, together with extensive links to other relevant genomic, proteomic and pathway data. As of July 2005, the UCSC Known Genes are available for human, mouse and rat genomes. The Known Genes serves as a foundation to support several key programs: the Genome Browser, Proteome Browser, Gene Sorter and Table Browser offered at the UCSC website. All the associated data files and program source code are also available. They can be accessed at http://genome.ucsc.edu. The genomic coverage of UCSC Known Genes, RefSeq, Ensembl Genes, H-Invitational and CCDS is analyzed. Although UCSC Known Genes offers the highest genomic and CDS coverage among major human and mouse gene sets, more detailed analysis suggests all of them could be further improved. Collapse Key Words Collapse MESH Headings Base Sequence California Chromosome Mapping/methods Database Management Systems Databases, Protein Information Storage and Retrieval/methods Molecular Sequence Data Proteome/chemistry Proteome/genetics Proteome/metabolism RNA, Messenger/genetics Universities User-Computer Interface Collapse Grants Collapse
63	Piloting the zebrafish genome browser. Dev Dyn 2005;235:747-53. [PMID: 16372332 DOI: 10.1002/dvdy.20661] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/07/2022] Open Abstract This correspondence is a primer for the zebrafish research community on zebrafish tracks available in the UCSC Genome Browser at http://genome.ucsc.edu based on Sanger's Zv4 assembly. A primary capability of this facility is comparative informatics between humans (as well as many other model organisms) and zebrafish. The zebrafish genome sequencing project has played important roles in mutant mapping and cloning, and comparative genomic research projects. This easy-to-use genome browser aims to display and download useful genome sequence information for zebrafish mutant mapping and cloning projects. Its user-friendly interface expedites annotation of the zebrafish genome sequence. Collapse Key Words Collapse MESH Headings Animals Computational Biology Databases, Genetic Genome Genomics Humans Mice Sequence Analysis, DNA Sequence Analysis, Protein Software Zebrafish/genetics Collapse Grants R01 DK05538 NIDDK NIH HHS Collapse
64	Computational screening of conserved genomic DNA in search of functional noncoding elements. Nat Methods 2005;2:535-45. [PMID: 16170870 DOI: 10.1038/nmeth0705-535] [Citation(s) in RCA: 48] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/29/2023] Abstract Collapse Key Words Collapse MESH Headings Animals Computational Biology/methods Conserved Sequence/genetics Databases, Nucleic Acid Humans Transcription Factors/genetics Collapse Grants P41 HG002371 NHGRI NIH HHS Collapse
65	Galaxy: a platform for interactive large-scale genome analysis. Genome Res 2005;15:1451-5. [PMID: 16169926 PMCID: PMC1240089 DOI: 10.1101/gr.4086505] [Citation(s) in RCA: 1395] [Impact Index Per Article: 73.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/25/2022] Abstract Accessing and analyzing the exponentially expanding genomic sequence and functional data pose a challenge for biomedical researchers. Here we describe an interactive system, Galaxy, that combines the power of existing genome annotation databases with a simple Web portal to enable users to search remote resources, combine data from independent queries, and visualize the results. The heart of Galaxy is a flexible history system that stores the queries from each user; performs operations such as intersections, unions, and subtractions; and links to other computational tools. Galaxy can be accessed at http://g2.bx.psu.edu. Collapse Key Words Collapse MESH Headings Biological Evolution Databases, Genetic Genome Internet Promoter Regions, Genetic Collapse Grants HG02238 NHGRI NIH HHS R56 DK065806 NIDDK NIH HHS R01 DK065806 NIDDK NIH HHS DK65806 NIDDK NIH HHS R01 HG002238 NHGRI NIH HHS Collapse
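The history operations named in the Galaxy abstract above (intersections, unions, and subtractions) can be illustrated on sets of genomic intervals. What follows is a minimal Python sketch under stated assumptions, not Galaxy's implementation: the half-open (start, end) tuples on a single chromosome and the function names merge, intersect, and subtract are hypothetical choices made for this example.

```python
from typing import List, Tuple

# Assumed convention: half-open [start, end) intervals on one chromosome.
Interval = Tuple[int, int]


def merge(intervals: List[Interval]) -> List[Interval]:
    """Union: sort, then merge overlapping or touching intervals."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def intersect(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Intersection: regions covered by both interval sets."""
    a, b = merge(a), merge(b)
    out: List[Interval] = []
    i = j = 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo < hi:
            out.append((lo, hi))
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


def subtract(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Subtraction: regions of a that are not covered by b."""
    b = merge(b)
    out: List[Interval] = []
    for start, end in merge(a):
        cur = start
        for b_start, b_end in b:
            if b_end <= cur:
                continue  # b interval lies entirely before the cursor
            if b_start >= end:
                break  # b interval lies entirely after this a interval
            if b_start > cur:
                out.append((cur, b_start))
            cur = max(cur, b_end)
            if cur >= end:
                break
        if cur < end:
            out.append((cur, end))
    return out


if __name__ == "__main__":
    genes = [(100, 200), (300, 400)]      # made-up coordinates
    conserved = [(150, 350)]              # made-up coordinates
    print(merge(genes + conserved))
    print(intersect(genes, conserved))
    print(subtract(genes, conserved))
```

Merging both inputs first keeps the intersection and subtraction loops simple, at the cost of discarding per-interval identity; a tool that must retain annotation names would need to carry them alongside the coordinates.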
66	Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res 2005;15:1034-50. [PMID: 16024819 PMCID: PMC1182216 DOI: 10.1101/gr.3715005] [Citation(s) in RCA: 2776] [Impact Index Per Article: 146.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/19/2005] [Accepted: 06/02/2005] [Indexed: 11/24/2022] Abstract We have conducted a comprehensive search for conserved elements in vertebrate genomes, using genome-wide multiple alignments of five vertebrate species (human, mouse, rat, chicken, and Fugu rubripes). Parallel searches have been performed with multiple alignments of four insect species (three species of Drosophila and Anopheles gambiae), two species of Caenorhabditis, and seven species of Saccharomyces. Conserved elements were identified with a computer program called phastCons, which is based on a two-state phylogenetic hidden Markov model (phylo-HMM). PhastCons works by fitting a phylo-HMM to the data by maximum likelihood, subject to constraints designed to calibrate the model across species groups, and then predicting conserved elements based on this model. The predicted elements cover roughly 3%-8% of the human genome (depending on the details of the calibration procedure) and substantially higher fractions of the more compact Drosophila melanogaster (37%-53%), Caenorhabditis elegans (18%-37%), and Saccharomyces cerevisiae (47%-68%) genomes. From yeasts to vertebrates, in order of increasing genome size and general biological complexity, increasing fractions of conserved bases are found to lie outside of the exons of known protein-coding genes. In all groups, the most highly conserved elements (HCEs), by log-odds score, are hundreds or thousands of bases long. 
These elements share certain properties with ultraconserved elements, but they tend to be longer and less perfectly conserved, and they overlap genes of somewhat different functional categories. In vertebrates, HCEs are associated with the 3' UTRs of regulatory genes, stable gene deserts, and megabase-sized regions rich in moderately conserved noncoding sequences. Noncoding HCEs also show strong statistical evidence of an enrichment for RNA secondary structure. Collapse Key Words Collapse MESH Headings 3' Untranslated Regions Animals Base Pairing/genetics Base Sequence Caenorhabditis elegans/genetics Conserved Sequence DNA, Intergenic Evolution, Molecular Genome Humans Insecta/genetics Molecular Sequence Data Saccharomyces/genetics Vertebrates/genetics Yeasts/genetics Collapse Grants P41 HG002371 NHGRI NIH HHS R01 HG002238 NHGRI NIH HHS IP41HG02371 NHGRI NIH HHS HG02238 NHGRI NIH HHS Collapse
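The two-state hidden Markov model idea mentioned in the phastCons abstract above can be illustrated with a much simpler toy. The Python sketch below is not phastCons and is not a phylo-HMM: it replaces the phylogenetic emission model with an assumed binary observation per alignment column (1 if all species agree, 0 otherwise), uses made-up probabilities rather than maximum-likelihood estimates, and labels columns by Viterbi decoding. All names and parameter values are hypothetical.

```python
import math
from typing import Dict, List

STATES = ("conserved", "neutral")

# Toy parameters for illustration only (assumptions, not phastCons values).
START = {"conserved": 0.5, "neutral": 0.5}
TRANS = {
    "conserved": {"conserved": 0.9, "neutral": 0.1},
    "neutral": {"conserved": 0.1, "neutral": 0.9},
}
# Observation per alignment column: 1 if all species agree, else 0.
EMIT = {
    "conserved": {1: 0.9, 0: 0.1},
    "neutral": {1: 0.4, 0: 0.6},
}


def viterbi(columns: List[int]) -> List[str]:
    """Return the most likely state path for a sequence of 0/1 columns."""
    if not columns:
        return []
    # score[s] is the best log-probability of any path ending in state s.
    score = {
        s: math.log(START[s]) + math.log(EMIT[s][columns[0]]) for s in STATES
    }
    back: List[Dict[str, str]] = []
    for obs in columns[1:]:
        new_score: Dict[str, float] = {}
        pointers: Dict[str, str] = {}
        for s in STATES:
            # Best predecessor state for a path that is in state s now.
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            pointers[s] = prev
            new_score[s] = (
                score[prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][obs])
            )
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


if __name__ == "__main__":
    # Five agreeing columns followed by five disagreeing columns (made-up data).
    print(viterbi([1, 1, 1, 1, 1, 0, 0, 0, 0, 0]))
```

Working in log space avoids numerical underflow on long alignments. Runs of columns labeled "conserved" play the role of predicted conserved elements in this toy setting.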
67	Exploring relationships and mining data with the UCSC Gene Sorter. Genome Res 2005;15:737-41. [PMID: 15867434 PMCID: PMC1088302 DOI: 10.1101/gr.3694705] [Citation(s) in RCA: 70] [Impact Index Per Article: 3.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/25/2022] Abstract In parallel with the human genome sequencing and assembly effort, many tools have been developed to examine the structure and function of the human gene set. The University of California Santa Cruz (UCSC) Gene Sorter has been created as a gene-based counterpart to the chromosome-oriented UCSC Genome Browser to facilitate the study of gene function and evolution. This simple but powerful tool provides a graphical display of related genes that can be sorted and filtered based on a variety of criteria. Genes may be ordered based on such characteristics as expression profiles, proximity in genome, shared Gene Ontology (GO) terms, and protein similarity. The display can be restricted to a gene set meeting a specific set of constraints by filtering on expression levels, gene name or ID, chromosomal position, and so on. The default set of information for each gene entry (gene name, selected expression data, a BLASTP E-value, genomic position, and a description) can be configured to include many other types of data, including expanded expression data, related accession numbers and IDs, orthologs in other species, GO terms, and much more. The Gene Sorter, a CGI-based Web application written in C with a MySQL database, is tightly integrated with the other applications in the UCSC Genome Browser suite. Available on a selected subset of the genome assemblies found in the Genome Browser, it further enhances the usefulness of the UCSC tool set in interactive genomic exploration and analysis.
MESH Headings: Computational Biology/methods; Database Management Systems; Databases, Genetic; Genome, Human; Genomics/methods; Humans; Software
Grants: P41 HG002371 NHGRI NIH HHS; 1P41 HG 02371 NHGRI NIH HHS
68	Assessing computational tools for the discovery of transcription factor binding sites. Nat Biotechnol 2005;23:137-44. [PMID: 15637633 DOI: 10.1038/nbt1053] [Citation(s) in RCA: 691] [Impact Index Per Article: 36.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/09/2022] Abstract The prediction of regulatory elements is a problem where computational methods offer great hope. Over the past few years, numerous tools have become available for this task. The purpose of the current assessment is twofold: to provide some guidance to users regarding the accuracy of currently available tools in various settings, and to provide a benchmark of data sets for assessing future tools.
MESH Headings: Amino Acid Motifs; Animals; Binding Sites; Computational Biology/methods; Databases, Protein; Drosophila; Fungal Proteins/chemistry; Gene Expression; Humans; Internet; Mice; Reproducibility of Results; Software; Transcription, Genetic
Grants: GP0101Y01 Telethon; R01 HG02602 NHGRI NIH HHS; 1R01HG03110 NHGRI NIH HHS
69	The UCSC Proteome Browser. Nucleic Acids Res 2005;33:D454-8. [PMID: 15608236 PMCID: PMC540054 DOI: 10.1093/nar/gki100] [Citation(s) in RCA: 38] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022] Open Abstract The University of California Santa Cruz (UCSC) Proteome Browser provides a wealth of protein information presented in graphical images and with links to other protein-related Internet sites. The Proteome Browser is tightly integrated with the UCSC Genome Browser. For the first time, Genome Browser users have both the genome and proteome worlds at their fingertips simultaneously. The Proteome Browser displays tracks of protein and genomic sequences, exon structure, polarity, hydrophobicity, locations of cysteine and glycosylation potential, Superfamily domains and amino acids that deviate from normal abundance. Histograms show genome-wide distribution of protein properties, including isoelectric point, molecular weight, number of exons, InterPro domains and cysteine locations, together with specific property values of the selected protein. The Proteome Browser also provides links to gene annotations in the Genome Browser, the Known Genes details page and the Gene Sorter; domain information from Superfamily, InterPro and Pfam; three-dimensional structures at the Protein Data Bank and ModBase; and pathway data at KEGG, BioCarta/CGAP and BioCyc. As of August 2004, the Proteome Browser is available for human, mouse and rat proteomes. The browser may be accessed from any Known Genes details page of the Genome Browser at http://genome.ucsc.edu. A user's guide is also available on this website.
MESH Headings: California; Databases, Protein; Genomics; Humans; Proteins/chemistry; Proteins/genetics; Proteomics; Systems Integration; User-Computer Interface
70	The share of human genomic DNA under selection estimated from human-mouse genomic alignments. COLD SPRING HARBOR SYMPOSIA ON QUANTITATIVE BIOLOGY 2004;68:245-54. [PMID: 15338624 DOI: 10.1101/sqb.2003.68.245] [Citation(s) in RCA: 45] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [MESH Headings] [Grants] [Subscribe] [Scholar Register] [Indexed: 11/25/2022]
MESH Headings: Animals; Conserved Sequence; DNA/genetics; Evolution, Molecular; Genome, Human; Humans; Mice; Repetitive Sequences, Nucleic Acid; Selection, Genetic; Sequence Alignment/statistics & numerical data; Species Specificity
Grants: 1P41HG-02371 NHGRI NIH HHS; HG-02238 NHGRI NIH HHS
71	Environmentally Induced Foregut Remodeling by PHA-4/FoxA and DAF-12/NHR. Science 2004;305:1743-6. [PMID: 15375261 DOI: 10.1126/science.1102216] [Citation(s) in RCA: 141] [Impact Index Per Article: 7.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/02/2022] Abstract Growth and development of the Caenorhabditis elegans foregut (pharynx) depends on coordinated gene expression, mediated by pharynx defective (PHA)-4/FoxA in combination with additional, largely unidentified transcription factors. Here, we used whole genome analysis to establish clusters of genes expressed in different pharyngeal cell types. We created an expectation maximization algorithm to identify cis-regulatory elements that activate expression within the pharyngeal gene clusters. One of these elements mediates the response to environmental conditions within pharyngeal muscles and is recognized by the nuclear hormone receptor (NHR) DAF-12. Our data suggest that PHA-4 and DAF-12 endow the pharynx with transcriptional plasticity to respond to diverse developmental and physiological cues. Our combination of bioinformatics and in vivo analysis has provided a powerful means for genome-wide investigation of transcriptional control. 
MESH Headings: Animals; Caenorhabditis elegans/genetics; Caenorhabditis elegans/growth & development; Caenorhabditis elegans Proteins/genetics; Caenorhabditis elegans Proteins/physiology; Computational Biology; Enhancer Elements, Genetic; Food; Gene Expression Profiling; Gene Expression Regulation, Developmental; Genes, Helminth; Genes, Regulator; Larva/genetics; Larva/growth & development; Multigene Family; Muscle Development; Muscles/physiology; Oligonucleotide Array Sequence Analysis; Pharynx/cytology; Pharynx/growth & development; Pharynx/physiology; Receptors, Cytoplasmic and Nuclear/genetics; Receptors, Cytoplasmic and Nuclear/physiology; Regulatory Sequences, Nucleic Acid; Trans-Activators/genetics; Trans-Activators/physiology
Grants: 2P303A42014 PHS HHS; R01 GM056264 NIGMS NIH HHS; R01 GM056264-08S1 NIGMS NIH HHS; R01-GM56264 NIGMS NIH HHS; R01 GM056264-08 NIGMS NIH HHS; R01 GM056264-07 NIGMS NIH HHS
72	Over 20% of human transcripts might form sense-antisense pairs. Nucleic Acids Res 2004;32:4812-20. [PMID: 15356298 PMCID: PMC519112 DOI: 10.1093/nar/gkh818] [Citation(s) in RCA: 250] [Impact Index Per Article: 12.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/29/2004] [Revised: 08/23/2004] [Accepted: 08/23/2004] [Indexed: 01/21/2023] Open Abstract The major challenge to identifying natural sense-antisense (SA) transcripts from public databases is how to determine the correct orientation for an expressed sequence, especially an expressed sequence tag sequence. In this study, we established a set of very stringent criteria to identify the correct orientation of each human transcript. We used these orientation-reliable transcripts to create 26 741 transcription clusters in the human genome. Our analysis shows that 22% (5880) of the human transcription clusters form SA pairs, higher than any previous estimates. Our orientation-specific RT-PCR results along with the comparison of experimental data from previous studies confirm that our SA data set is reliable. This study not only demonstrates that our criteria for the prediction of SA transcripts are efficient, but also provides additional convincing data to support the view that antisense transcription is quite pervasive in the human genome. In-depth analyses show that SA transcripts have some significant differences compared with other types of transcripts, with regard to chromosomal distribution and Gene Ontology-annotated categories of physiological roles, functions and spatial localizations of gene products.
MESH Headings: Base Pairing; Chromosomes, Human; Genome, Human; Humans; RNA, Antisense/analysis; RNA, Antisense/chemistry; RNA, Antisense/genetics; RNA, Messenger/chemistry; Reverse Transcriptase Polymerase Chain Reaction; Transcription, Genetic
Grants: R01 CA084405 NCI NIH HHS; CA84405 NCI NIH HHS
73	Transcriptome and genome conservation of alternative splicing events in humans and mice. PACIFIC SYMPOSIUM ON BIOCOMPUTING. PACIFIC SYMPOSIUM ON BIOCOMPUTING 2004:66-77. [PMID: 14992493 DOI: 10.1142/9789812704856_0007] [Citation(s) in RCA: 117] [Impact Index Per Article: 5.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/18/2022] Abstract Combining mRNA and EST data in splicing graphs with whole genome alignments, we discover alternative splicing events that are conserved in both human and mouse transcriptomes. 1,964 of 19,156 (10%) loci examined contain one or more such alternative splicing events, with 2,698 total events. These events represent a lower bound on the amount of alternative splicing in the human genome. Also, as these alternative splicing events are conserved between the human and mouse transcriptomes they should be enriched for functionally significant alternative splicing events, free from much of the noise found in the EST libraries. Further classification of these alternative splicing events reveals that 1,037 (38.4%) are due to exon skipping, 497 (18.4%) are due to alternative 3' splice sites, 214 (7.9%) are due to alternative 5' splice sites, 75 (2.8%) are due to intron retention and the other 875 (32.4%) are due to other, more complicated, alternative splicing events. In addition, genomic sequences nearby these alternative splicing events display increased sequence conservation. Both the alternatively spliced exons and the proximal intron show increased levels of genomic conservation relative to constitutively spliced exons. For exon skipping events both intron regions flanking the exon are conserved while for alternative 5' and 3' splicing events the conservation is greater near the alternative splice site. 
MESH Headings: Algorithms; Alternative Splicing; Animals; Computational Biology; Conserved Sequence; Databases, Nucleic Acid; Expressed Sequence Tags; Genome; Genome, Human; Humans; Mice; RNA, Messenger/genetics; Sequence Alignment/statistics & numerical data; Species Specificity
74	Ultraconserved elements in the human genome. Science 2004;304:1321-5. [PMID: 15131266 DOI: 10.1126/science.1098119] [Citation(s) in RCA: 1172] [Impact Index Per Article: 58.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/15/2023] Abstract There are 481 segments longer than 200 base pairs (bp) that are absolutely conserved (100% identity with no insertions or deletions) between orthologous regions of the human, rat, and mouse genomes. Nearly all of these segments are also conserved in the chicken and dog genomes, with an average of 95 and 99% identity, respectively. Many are also significantly conserved in fish. These ultraconserved elements of the human genome are most often located either overlapping exons in genes involved in RNA processing or in introns or nearby genes involved in the regulation of transcription and development. Along with more than 5000 sequences of over 100 bp that are absolutely conserved among the three sequenced mammals, these represent a class of genetic elements whose functions and evolutionary origins are yet to be determined, but which are more highly conserved between these species than are proteins and appear to be essential for the ontogeny of mammals and other vertebrates.
MESH Headings: Alternative Splicing; Animals; Base Sequence; Chickens/genetics; Computational Biology; Conserved Sequence; DNA, Intergenic; Dogs/genetics; Evolution, Molecular; Exons; Gene Expression Regulation; Genes; Genome; Genome, Human; Humans; Introns; Mice/genetics; Molecular Sequence Data; Mutation; Nucleic Acid Conformation; RNA/chemistry; RNA/genetics; RNA/metabolism; Rats/genetics; Takifugu/genetics
Grants: 1P41HG02371 NHGRI NIH HHS
75	Aligning multiple genomic sequences with the threaded blockset aligner. Genome Res 2004;14:708-15. [PMID: 15060014 PMCID: PMC383317 DOI: 10.1101/gr.1933104] [Citation(s) in RCA: 1037] [Impact Index Per Article: 51.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/24/2022] Abstract We define a "threaded blockset," which is a novel generalization of the classic notion of a multiple alignment. A new computer program called TBA (for "threaded blockset aligner") builds a threaded blockset under the assumption that all matching segments occur in the same order and orientation in the given sequences; inversions and duplications are not addressed. TBA is designed to be appropriate for aligning many, but by no means all, megabase-sized regions of multiple mammalian genomes. The output of TBA can be projected onto any genome chosen as a reference, thus guaranteeing that different projections present consistent predictions of which genomic positions are orthologous. This capability is illustrated using a new visualization tool to view TBA-generated alignments of vertebrate Hox clusters from both the mammalian and fish perspectives. Experimental evaluation of alignment quality, using a program that simulates evolutionary change in genomic sequences, indicates that TBA is more accurate than earlier programs. To perform the dynamic-programming alignment step, TBA runs a stand-alone program called MULTIZ, which can be used to align highly rearranged or incompletely sequenced genomes. We describe our use of MULTIZ to produce the whole-genome multiple alignments at the Santa Cruz Genome Browser. 
MESH Headings: Animals; Base Sequence; Cats; Cattle; Computational Biology/methods; Computational Biology/standards; Computational Biology/trends; Computer Simulation; Dogs; Evaluation Studies as Topic; Evolution, Molecular; Genes, Homeobox/genetics; Genes, fos/genetics; Genome; Genome, Human; Humans; Mice; Molecular Sequence Data; Multigene Family/genetics; Rats; Ribosomal Proteins/genetics; Sequence Alignment/methods; Sequence Alignment/standards; Sequence Alignment/trends; Software/trends
Grants: HG-02238 NHGRI NIH HHS; P41 HG002371 NHGRI NIH HHS; 1P41HG02371 NHGRI NIH HHS; F32 HG002325 NHGRI NIH HHS; HG02325 NHGRI NIH HHS; R01 HG002238 NHGRI NIH HHS
76	Hotspots of mammalian chromosomal evolution. Genome Biol 2004;5:R23. [PMID: 15059256 PMCID: PMC395782 DOI: 10.1186/gb-2004-5-4-r23] [Citation(s) in RCA: 182] [Impact Index Per Article: 9.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/24/2003] [Revised: 02/20/2004] [Accepted: 02/23/2004] [Indexed: 01/22/2023] Open Abstract BACKGROUND: Chromosomal evolution is thought to occur through a random process of breakage and rearrangement that leads to karyotype differences and disruption of gene order. With the availability of both the human and mouse genomic sequences, detailed analysis of the sequence properties underlying these breakpoints is now possible. RESULTS: We report an abundance of primate-specific segmental duplications at the breakpoints of syntenic blocks in the human genome. Using conservative criteria, we find that 25% (122/461) of all breakpoints contain ≥10 kb of duplicated sequence. This association is highly significant (p < 0.0001) when compared to a simulated random-breakage model. The significance is robust under a variety of parameters, multiple sets of conserved synteny data, and for orthologous breakpoints between and within chromosomes. A comparison of mouse lineage-specific breakpoints since the divergence of rat and mouse showed a similar association with regions associated with segmental duplications in the primate genome. CONCLUSION: These results indicate that segmental duplications are associated with syntenic rearrangements, even when pericentromeric and subtelomeric regions are excluded. However, segmental duplications are not necessarily the cause of the rearrangements.
Rather, our analysis supports a nonrandom model of chromosomal evolution that implicates specific regions within the mammalian genome as having been predisposed to both recurrent small-scale duplication and large-scale evolutionary rearrangements.
MESH Headings: Animals; Chromosome Breakage/genetics; Chromosome Mapping/methods; Chromosomes/genetics; Chromosomes, Human/genetics; Evolution, Molecular; Gene Duplication; Genome; Genome, Human; Gorilla gorilla/genetics; Humans; Mice; Pan troglodytes/genetics; Rats; Synteny/genetics
Grants: R01 GM058815 NIGMS NIH HHS; GM58815 NIGMS NIH HHS; P41 HG002371 NHGRI NIH HHS; 1P41HG02371 NHGRI NIH HHS; HG002385 NHGRI NIH HHS; CA094816 NCI NIH HHS; R01 HG002385 NHGRI NIH HHS; T32 GM007250 NIGMS NIH HHS
78	The UCSC Table Browser data retrieval tool. Nucleic Acids Res 2004;32:D493-6. [PMID: 14681465 PMCID: PMC308837 DOI: 10.1093/nar/gkh103] [Citation(s) in RCA: 1623] [Impact Index Per Article: 81.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/15/2003] [Revised: 09/30/2003] [Accepted: 10/13/2003] [Indexed: 11/14/2022] Open Abstract The University of California Santa Cruz (UCSC) Table Browser (http://genome.ucsc.edu/cgi-bin/hgText) provides text-based access to a large collection of genome assemblies and annotation data stored in the Genome Browser Database. A flexible alternative to the graphical-based Genome Browser, this tool offers an enhanced level of query support that includes restrictions based on field values, free-form SQL queries and combined queries on multiple tables. Output can be filtered to restrict the fields and lines returned, and may be organized into one of several formats, including a simple tab-delimited file that can be loaded into a spreadsheet or database as well as advanced formats that may be uploaded into the Genome Browser as custom annotation tracks. The Table Browser User's Guide located on the UCSC website provides instructions and detailed examples for constructing queries and configuring output.
MESH Headings: Animals; Computational Biology; Databases, Genetic; Genome; Genomics; Humans; Information Storage and Retrieval; Internet; Software; User-Computer Interface
Grants: P41 HG002371 NHGRI NIH HHS; 1P41HG02371 NHGRI NIH HHS
79	Evolution's cauldron: duplication, deletion, and rearrangement in the mouse and human genomes. Proc Natl Acad Sci U S A 2003;100:11484-9. [PMID: 14500911 PMCID: PMC208784 DOI: 10.1073/pnas.1932072100] [Citation(s) in RCA: 593] [Impact Index Per Article: 28.2] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/09/2003] [Indexed: 11/18/2022] Open Abstract This study examines genomic duplications, deletions, and rearrangements that have happened at scales ranging from a single base to complete chromosomes by comparing the mouse and human genomes. From whole-genome sequence alignments, 344 large (>100-kb) blocks of conserved synteny are evident, but these are further fragmented by smaller-scale evolutionary events. Excluding transposon insertions, on average in each megabase of genomic alignment we observe two inversions, 17 duplications (five tandem or nearly tandem), seven transpositions, and 200 deletions of 100 bases or more. This includes 160 inversions and 75 duplications or transpositions of length >100 kb. The frequencies of these smaller events are not substantially higher in finished portions in the assembly. Many of the smaller transpositions are processed pseudogenes; we define a "syntenic" subset of the alignments that excludes these and other small-scale transpositions. These alignments provide evidence that approximately 2% of the genes in the human/mouse common ancestor have been deleted or partially deleted in the mouse. There also appears to be slightly less nontransposon-induced genome duplication in the mouse than in the human lineage. Although some of the events we detect are possibly due to misassemblies or missing data in the current genome sequence or to the limitations of our methods, most are likely to represent genuine evolutionary events. 
To make these observations, we developed new alignment techniques that can handle large gaps in a robust fashion and discriminate between orthologous and paralogous alignments.
Key Words: comparative genomics; cross-species alignments; synteny; chromosomal inversion; breakpoints
MESH Headings: Animals; Evolution, Molecular; Gene Deletion; Gene Duplication; Genome; Mice
Grants: P41 HG002371 NHGRI NIH HHS; R01 HG002238 NHGRI NIH HHS; 1P41HG-02371 NHGRI NIH HHS; HG-02238 NHGRI NIH HHS
80	Comparative analyses of multi-species sequences from targeted genomic regions. Nature 2003;424:788-93. [PMID: 12917688 DOI: 10.1038/nature01858] [Citation(s) in RCA: 482] [Impact Index Per Article: 23.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/11/2003] [Accepted: 06/16/2003] [Indexed: 11/08/2022] Abstract The systematic comparison of genomic sequences from different organisms represents a central focus of contemporary genome analysis. Comparative analyses of vertebrate sequences can identify coding and conserved non-coding regions, including regulatory elements, and provide insight into the forces that have rendered modern-day genomes. As a complement to whole-genome sequencing efforts, we are sequencing and comparing targeted genomic regions in multiple, evolutionarily diverse vertebrates. Here we report the generation and analysis of over 12 megabases (Mb) of sequence from 12 species, all derived from the genomic region orthologous to a segment of about 1.8 Mb on human chromosome 7 containing ten genes, including the gene mutated in cystic fibrosis. These sequences show conservation reflecting both functional constraints and the neutral mutational events that shaped this genomic region. In particular, we identify substantial numbers of conserved non-coding segments beyond those previously identified experimentally, most of which are not detectable by pair-wise sequence comparisons alone. Analysis of transposable element insertions highlights the variation in genome dynamics among these species and confirms the placement of rodents as a sister group to the primates. 
MESH Headings: Animals; Chromosomes, Human, Pair 7/genetics; Conserved Sequence/genetics; Cystic Fibrosis Transmembrane Conductance Regulator/genetics; DNA Transposable Elements/genetics; Evolution, Molecular; Genome; Genomics; Humans; Mammals/genetics; Mutagenesis/genetics; Phylogeny; Sequence Alignment; Sequence Homology, Nucleic Acid; Species Specificity; Vertebrates/genetics
81	The DNA sequence of human chromosome 7. Nature 2003;424:157-64. [PMID: 12853948 DOI: 10.1038/nature01782] [Citation(s) in RCA: 198] [Impact Index Per Article: 9.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/25/2003] [Accepted: 04/23/2003] [Indexed: 11/09/2022] Abstract Human chromosome 7 has historically received prominent attention in the human genetics community, primarily related to the search for the cystic fibrosis gene and the frequent cytogenetic changes associated with various forms of cancer. Here we present more than 153 million base pairs representing 99.4% of the euchromatic sequence of chromosome 7, the first metacentric chromosome completed so far. The sequence has excellent concordance with previously established physical and genetic maps, and it exhibits an unusual amount of segmentally duplicated sequence (8.2%), with marked differences between the two arms. Our initial analyses have identified 1,150 protein-coding genes, 605 of which have been confirmed by complementary DNA sequences, and an additional 941 pseudogenes. Of genes confirmed by transcript sequences, some are polymorphic for mutations that disrupt the reading frame.
MESH Headings: Animals; Base Sequence; Chromosomes, Human, Pair 7; Gene Duplication; Humans; Mice; Molecular Sequence Data; Physical Chromosome Mapping; Proteins/genetics; Pseudogenes; RNA, Untranslated; Sequence Analysis, DNA; Species Specificity; Williams Syndrome/genetics
82	The UCSC Genome Browser. ACTA ACUST UNITED AC 2003. [DOI: 10.1002/0471250953.bi0104s00] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/06/2022]
83	Global predictions and tests of erythroid regulatory regions. COLD SPRING HARBOR SYMPOSIA ON QUANTITATIVE BIOLOGY 2003;68:335-44. [PMID: 15338635 DOI: 10.1101/sqb.2003.68.335] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 04/30/2023]
MESH Headings: Animals; Conserved Sequence; DNA/genetics; Erythropoiesis/genetics; Evolution, Molecular; Genes, Regulator; Genome; Genomics/methods; Humans; Mice; Rats; Selection, Genetic; Sequence Alignment
Grants: 1P41HG-02371 NHGRI NIH HHS; HG-02238 NHGRI NIH HHS; HG-02325 NHGRI NIH HHS; R01 DK-27635 NIDDK NIH HHS
84	Covariation in frequencies of substitution, deletion, transposition, and recombination during eutherian evolution. Genome Res 2003;13:13-26. [PMID: 12529302 PMCID: PMC430971 DOI: 10.1101/gr.844103] [Citation(s) in RCA: 239] [Impact Index Per Article: 11.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/26/2002] [Accepted: 11/14/2002] [Indexed: 11/24/2022] Abstract Six measures of evolutionary change in the human genome were studied, three derived from the aligned human and mouse genomes in conjunction with the Mouse Genome Sequencing Consortium, consisting of (1) nucleotide substitution per fourfold degenerate site in coding regions, (2) nucleotide substitution per site in relics of transposable elements active only before the human-mouse speciation, and (3) the nonaligning fraction of human DNA that is nonrepetitive or in ancestral repeats; and three derived from human genome data alone, consisting of (4) SNP density, (5) frequency of insertion of transposable elements, and (6) rate of recombination. Features 1 and 2 are measures of nucleotide substitutions at two classes of "neutral" sites, whereas 4 is a measure of recent mutations. Feature 3 is a measure dominated by deletions in mouse, whereas 5 represents insertions in human. It was found that all six vary significantly in megabase-sized regions genome-wide, and many vary together. This indicates that some regions of a genome change slowly by all processes that alter DNA, and others change faster. Regional variation in all processes is correlated with, but not completely accounted for, by GC content in human and the difference between GC content in human and mouse. 
MESH Headings: Animals; Chromosome Deletion; Chromosomes/genetics; Chromosomes, Human/genetics; DNA Transposable Elements/genetics; Evolution, Molecular; GC Rich Sequence/genetics; Genetic Linkage/genetics; Genetic Variation/genetics; Genetics, Population/methods; Genome, Human; Humans; Mice; Mutagenesis, Insertional/genetics; Polymorphism, Genetic/genetics; Recombination, Genetic/genetics
Grants: HG02238 NHGRI NIH HHS; R01 DK27635 NIDDK NIH HHS; P41 HG002371 NHGRI NIH HHS; 1P41HG02371 NHGRI NIH HHS; F32 HG002325 NHGRI NIH HHS; HG02325 NHGRI NIH HHS; Wellcome Trust; R01 HG002238 NHGRI NIH HHS
85	The UCSC Genome Browser Database. Nucleic Acids Res 2003;31:51-4. [PMID: 12519945 PMCID: PMC165576 DOI: 10.1093/nar/gkg129] [Citation(s) in RCA: 1155] [Impact Index Per Article: 55.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/22/2023] Open Abstract The University of California Santa Cruz (UCSC) Genome Browser Database is an up-to-date source for genome sequence data integrated with a large collection of related annotations. The database is optimized to support fast interactive performance with the web-based UCSC Genome Browser, a tool built on top of the database for rapid visualization and querying of the data at many levels. The annotations for a given genome are displayed in the browser as a series of tracks aligned with the genomic sequence. Sequence data and annotations may also be viewed in a text-based tabular format or downloaded as tab-delimited flat files. The Genome Browser Database, browsing tools and downloadable data files can all be found on the UCSC Genome Bioinformatics website (http://genome.ucsc.edu), which also contains links to documentation and related technical information.
MESH Headings: Animals; California; Database Management Systems; Databases, Genetic; Genome, Human; Genomics; Humans; Information Storage and Retrieval; Mice
86	Human-mouse alignments with BLASTZ. Genome Res 2003;13:103-7. [PMID: 12529312 PMCID: PMC430961 DOI: 10.1101/gr.809403] [Citation(s) in RCA: 851] [Impact Index Per Article: 40.5] [Indexed: 11/25/2022] Abstract: The Mouse Genome Analysis Consortium aligned the human and mouse genome sequences for a variety of purposes, using alignment programs that suited the various needs. For investigating issues regarding genome evolution, a particularly sensitive method was needed to permit alignment of a large proportion of the neutrally evolving regions. We selected a program called BLASTZ, an independent implementation of the Gapped BLAST algorithm specifically designed for aligning two long genomic sequences. BLASTZ was subsequently modified, both to attain efficiency adequate for aligning entire mammalian genomes and to increase its sensitivity. This work describes BLASTZ, its modifications, the hardware environment on which we run it, and several empirical studies to validate its results. MeSH Headings: Animals; Database Management Systems/instrumentation; Genome; Genome, Human; Humans; Mice; Sequence Alignment/instrumentation; Sequence Alignment/methods; Software Design; Software Validation. Grants: R01 DK27635 NIDDK NIH HHS; HG-02238 NHGRI NIH HHS; P41 HG002371 NHGRI NIH HHS; 1P41HG02371 NHGRI NIH HHS; R01 HG002238 NHGRI NIH HHS.
87	Initial sequencing and comparative analysis of the mouse genome. Nature 2002;420:520-62. [PMID: 12466850 DOI: 10.1038/nature01262] [Citation(s) in RCA: 4791] [Impact Index Per Article: 217.8] [Received: 09/18/2002] [Accepted: 10/31/2002] [Indexed: 12/18/2022] Abstract: The sequence of the mouse genome is a key informational tool for understanding the contents of the human genome and a key experimental tool for biomedical research. Here, we report the results of an international collaboration to produce a high-quality draft sequence of the mouse genome. We also present an initial comparative analysis of the mouse and human genomes, describing some of the insights that can be gleaned from the two sequences. We discuss topics including the analysis of the evolutionary forces shaping the size, structure and sequence of the genomes; the conservation of large-scale synteny across most of the genomes; the much lower extent of sequence orthology covering less than half of the genomes; the proportions of the genomes under selection; the number of protein-coding genes; the expansion of gene families related to reproduction and immunity; the evolution of proteins; and the identification of intraspecies polymorphism.
MeSH Headings: Animals; Base Composition; Chromosomes, Mammalian/genetics; Conserved Sequence/genetics; CpG Islands/genetics; Evolution, Molecular; Gene Expression Regulation; Genes/genetics; Genetic Variation/genetics; Genome; Genome, Human; Genomics; Humans; Mice/classification; Mice/genetics; Mice, Knockout; Mice, Transgenic; Models, Animal; Multigene Family/genetics; Mutagenesis; Neoplasms/genetics; Physical Chromosome Mapping; Proteome/genetics; Pseudogenes/genetics; Quantitative Trait Loci/genetics; RNA, Untranslated/genetics; Repetitive Sequences, Nucleic Acid/genetics; Selection, Genetic; Sequence Analysis, DNA; Sex Chromosomes/genetics; Species Specificity; Synteny.
88	Analysis of the role of Caenorhabditis elegans GC-AG introns in regulated splicing. Nucleic Acids Res 2002;30:3360-7. [PMID: 12140320 PMCID: PMC137088 DOI: 10.1093/nar/gkf465] [Citation(s) in RCA: 40] [Impact Index Per Article: 1.8] [Indexed: 11/14/2022] Abstract: GC-AG introns represent 0.7% of total human pre-mRNA introns. To study the function of GC-AG introns in splicing regulation, 196 cDNA-confirmed GC-AG introns were identified in Caenorhabditis elegans. These represent 0.6% of the cDNA-confirmed intron data set for this organism. Eleven of these GC-AG introns are involved in alternative splicing. In a comparison of the genomic sequences of homologous genes between C. elegans and Caenorhabditis briggsae for 26 GC-AG introns, the C at the +2 position is conserved in only five of these introns. A system to experimentally test the function of GC-AG introns in alternative splicing was developed. Results from these experiments indicate that the conserved C at the +2 position of the tenth intron of the let-2 gene is essential for developmentally regulated alternative splicing. This C allows the splice donor to function as a very weak splice site that works in balance with an alternative GT splice donor. A weak GT splice donor can functionally replace the GC splice donor and allow for splicing regulation. These results indicate that while the majority of GC-AG introns appear to be constitutively spliced and have no evolutionary constraints to prevent them from being GT-AG introns, a subset of GC-AG introns is involved in alternative splicing and the C at the +2 position of these introns can have an important role in splicing regulation.
MeSH Headings: Alternative Splicing; Animals; Base Sequence; Caenorhabditis elegans/genetics; Conserved Sequence; Genes, Helminth; Introns/physiology; Models, Genetic; RNA, Helminth/chemistry; RNA, Helminth/metabolism. Grants: R01 GM061646 NIGMS NIH HHS; 1R01GM61646 NIGMS NIH HHS.
89	The human genome browser at UCSC. Genome Res 2002;12:996-1006. [PMID: 12045153 PMCID: PMC186604 DOI: 10.1101/gr.229102] [Citation(s) in RCA: 6660] [Impact Index Per Article: 302.7] [Indexed: 02/06/2023] Abstract: As vertebrate genome sequences near completion and research refocuses to their analysis, the issue of effective genome annotation display becomes critical. A mature web tool for rapid and reliable display of any requested portion of the genome at any scale, together with several dozen aligned annotation tracks, is provided at http://genome.ucsc.edu. This browser displays assembly contigs and gaps, mRNA and expressed sequence tag alignments, multiple gene predictions, cross-species homologies, single nucleotide polymorphisms, sequence-tagged sites, radiation hybrid data, transposon repeats, and more as a stack of coregistered tracks. Text and sequence-based searches provide quick and precise access to any region of specific interest. Secondary links from individual features lead to sequence details and supplementary off-site databases. One-half of the annotation tracks are computed at the University of California, Santa Cruz from publicly available sequence data; collaborators worldwide provide the rest. Users can stably add their own custom tracks to the browser for educational or research purposes. The conceptual and technical framework of the browser, its underlying MYSQL database, and overall use are described. The web site currently serves over 50,000 pages per day to over 3000 different users. MeSH Headings: California; Database Management Systems; Databases, Genetic; Gene Expression; Genes; Genome, Human; Humans; RNA, Messenger; Sequence Homology, Nucleic Acid; Software; Universities/trends. Grants: P41 HG002371 NHGRI NIH HHS; 1P41 HG 02371-01 NHGRI NIH HHS.
90	The human genome browser at UCSC. Genome Res 2002. [PMID: 12045153 DOI: 10.1101/gr.229102.articlepublishedonlinebeforeprintinmay2002] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.2] [Indexed: 05/15/2023]
91	BLAT--the BLAST-like alignment tool. Genome Res 2002. [PMID: 11932250 DOI: 10.1101/gr.229202.articlepublishedonlinebeforemarch2002] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.4] [Indexed: 04/15/2023]
92	BLAT--the BLAST-like alignment tool. Genome Res 2002. [PMID: 11932250 DOI: 10.1101/gr.229102.article published online before print in may 2002] [Citation(s) in RCA: 1936] [Impact Index Per Article: 88.0] [Indexed: 04/26/2023]
93	BLAT--the BLAST-like alignment tool. Genome Res 2002;12:656-64. [PMID: 11932250 PMCID: PMC187518 DOI: 10.1101/gr.229202] [Citation(s) in RCA: 5150] [Impact Index Per Article: 234.1] [Indexed: 11/25/2022] Abstract: Analyzing vertebrate genomes requires rapid mRNA/DNA and cross-species protein alignments. A new tool, BLAT, is more accurate and 500 times faster than popular existing tools for mRNA/DNA alignments and 50 times faster for protein alignments at sensitivity settings typically used when comparing vertebrate sequences. BLAT's speed stems from an index of all nonoverlapping K-mers in the genome. This index fits inside the RAM of inexpensive computers, and need only be computed once for each genome assembly. BLAT has several major stages. It uses the index to find regions in the genome likely to be homologous to the query sequence. It performs an alignment between homologous regions. It stitches together these aligned regions (often exons) into larger alignments (typically genes). Finally, BLAT revisits small internal exons possibly missed at the first stage and adjusts large gap boundaries that have canonical splice sites where feasible. This paper describes how BLAT was optimized. Effects on speed and sensitivity are explored for various K-mer sizes, mismatch schemes, and number of required index matches. BLAT is compared with other alignment programs on various test sets and then used in several genome-wide applications. http://genome.ucsc.edu hosts a web-based BLAT server for the human genome. MeSH Headings: Animals; Computational Biology/methods; Computational Biology/statistics & numerical data; DNA/genetics; Humans; Mice; Protein Biosynthesis; Proteins/chemistry; RNA, Messenger/genetics; Sequence Alignment/methods; Sequence Alignment/statistics & numerical data; Software.
94	Assembly of the working draft of the human genome with GigAssembler. Genome Res 2001;11:1541-8. [PMID: 11544197 PMCID: PMC311095 DOI: 10.1101/gr.183201] [Citation(s) in RCA: 105] [Impact Index Per Article: 4.6] [Indexed: 12/30/2022] Abstract: The data for the public working draft of the human genome contains roughly 400,000 initial sequence contigs in approximately 30,000 large insert clones. Many of these initial sequence contigs overlap. A program, GigAssembler, was built to merge them and to order and orient the resulting larger sequence contigs based on mRNA, paired plasmid ends, EST, BAC end pairs, and other information. This program produced the first publicly available assembly of the human genome, a working draft containing roughly 2.7 billion base pairs and covering an estimated 88% of the genome that has been used for several recent studies of the genome. Here we describe the algorithm used by GigAssembler. MeSH Headings: Algorithms; Chromosomes, Artificial, Bacterial/genetics; Computational Biology/methods; Contig Mapping/methods; Expressed Sequence Tags; Genome, Human; Human Genome Project; Humans; RNA, Messenger/genetics; Repetitive Sequences, Nucleic Acid; Sequence Alignment/methods; Software.
95	Initial sequencing and analysis of the human genome. Nature 2001;409:860-921. [PMID: 11237011 DOI: 10.1038/35057062] [Citation(s) in RCA: 14518] [Impact Index Per Article: 631.2] [Indexed: 12/11/2022] Abstract: The human genome holds an extraordinary trove of information about human development, physiology, medicine and evolution. Here we report the results of an international collaboration to produce and make freely available a draft sequence of the human genome. We also present an initial analysis of the data, describing some of the insights that can be gleaned from the sequence. MeSH Headings: Animals; Chromosome Mapping; Conserved Sequence; CpG Islands; DNA Transposable Elements; Databases, Factual; Drug Industry; Evolution, Molecular; Forecasting; GC Rich Sequence; Gene Duplication; Genes; Genetic Diseases, Inborn; Genetics, Medical; Genome, Human; Human Genome Project; Humans; Mutation; Private Sector; Proteins/genetics; Proteome; Public Sector; RNA/genetics; Repetitive Sequences, Nucleic Acid; Sequence Analysis, DNA/methods; Species Specificity.
96	A physical map of the human genome. Nature 2001;409:934-41. [PMID: 11237014 DOI: 10.1038/35057157] [Citation(s) in RCA: 549] [Impact Index Per Article: 23.9] [Indexed: 12/13/2022] Abstract: The human genome is by far the largest genome to be sequenced, and its size and complexity present many challenges for sequence assembly. The International Human Genome Sequencing Consortium constructed a map of the whole genome to enable the selection of clones for sequencing and for the accurate assembly of the genome sequence. Here we report the construction of the whole-genome bacterial artificial chromosome (BAC) map and its integration with previous landmark maps and information from mapping efforts focused on specific chromosomal regions. We also describe the integration of sequence data with the map. MeSH Headings: Chromosomes, Artificial, Bacterial; Cloning, Molecular; Contig Mapping; DNA Fingerprinting; Gene Duplication; Genome, Human; Humans; In Situ Hybridization, Fluorescence; Repetitive Sequences, Nucleic Acid.
97	Conservation, regulation, synteny, and introns in a large-scale C. briggsae-C. elegans genomic alignment. Genome Res 2000;10:1115-25. [PMID: 10958630 DOI: 10.1101/gr.10.8.1115] [Citation(s) in RCA: 191] [Impact Index Per Article: 8.0] [Indexed: 11/24/2022] Abstract: A new algorithm, WABA, was developed for doing large-scale alignments between genomic DNA of different species. WABA was used to align 8 million bases of Caenorhabditis briggsae genomic DNA against the entire 97-million-base Caenorhabditis elegans genome. The alignment, including C. briggsae homologs of 154 genetically characterized C. elegans genes and many times this number of largely uncharacterized ORFs, can be browsed and searched on the Web (http://www.cse.ucsc.edu/~kent/intronerator). The alignment confirms that patterns of conservation can be useful in identifying regulatory regions and rarely expressed coding regions. Conserved regulatory elements can be identified inside coding exons by examining the level of divergence at the wobble position of codons. The alignment reveals a bimodal size distribution of syntenic regions. Over 250 introns are present in one species but not the other. The 3' and 5' intron splice sites have more similarity to each other in introns unique to one species than in C. elegans introns as a whole, suggesting a possible mechanism for intron removal. MeSH Headings: Algorithms; Alternative Splicing/genetics; Animals; Caenorhabditis elegans/genetics; Chromosome Mapping/methods; Conserved Sequence/genetics; Exons; Gene Expression Regulation; Genome; Internet; Introns; Molecular Sequence Data; Promoter Regions, Genetic; RNA Splicing; Sequence Alignment/methods; Species Specificity. Grants: 1R01GM52848 NIGMS NIH HHS.
98	The intronerator: exploring introns and alternative splicing in Caenorhabditis elegans. Nucleic Acids Res 2000;28:91-3. [PMID: 10592190 PMCID: PMC102389 DOI: 10.1093/nar/28.1.91] [Citation(s) in RCA: 65] [Impact Index Per Article: 2.7] [Received: 08/31/1999] [Accepted: 09/21/1999] [Indexed: 11/13/2022] Abstract: The Intronerator (http://www.cse.ucsc.edu/~kent/intronerator/) is a set of web-based tools for exploring RNA splicing and gene structure in Caenorhabditis elegans. It includes a display of cDNA alignments with the genomic sequence, a catalog of alternatively spliced genes and a database of introns. The cDNA alignments include >100 000 ESTs and almost 1000 full-length cDNAs. ESTs from embryos and mixed stage animals as well as full-length cDNAs can be compared in the alignment display with each other and with predicted genes. The alt-splicing catalog includes 844 open reading frames for which there is evidence of alternative splicing of pre-mRNA. The intron database includes 28 478 introns, and can be searched for patterns near the splice junctions. MeSH Headings: Alternative Splicing; Animals; Base Sequence; Caenorhabditis elegans/genetics; DNA Primers; Database Management Systems; Databases, Factual; Internet; Introns; Sequence Alignment. Grants: 1RO1GM52848 NIGMS NIH HHS.