51
Johnson LS, Eddy SR, Portugaly E. Hidden Markov model speed heuristic and iterative HMM search procedure. BMC Bioinformatics 2010. [PMID: 20718988 DOI: 10.1186/1471-2105-11-431]
Abstract
BACKGROUND Profile hidden Markov models (profile-HMMs) are sensitive tools for remote protein homology detection, but the main scoring algorithms, Viterbi or Forward, require considerable time to search large sequence databases. RESULTS We have designed a series of database filtering steps, HMMERHEAD, that are applied prior to the scoring algorithms, as implemented in the HMMER package, in an effort to reduce search time. Using this heuristic, we obtain a 20-fold decrease in Forward and a 6-fold decrease in Viterbi search time with a minimal loss in sensitivity relative to the unfiltered approaches. We then implemented an iterative profile-HMM search method, JackHMMER, which employs the HMMERHEAD heuristic. Due to our search heuristic, we eliminated the subdatabase creation that is common in current iterative profile-HMM approaches. On our benchmark, JackHMMER detects 14% more remote protein homologs than SAM's iterative method T2K. CONCLUSIONS Our search heuristic, HMMERHEAD, significantly reduces the time needed to score a profile-HMM against large sequence databases. This search heuristic allowed us to implement an iterative profile-HMM search method, JackHMMER, which detects significantly more remote protein homologs than SAM's T2K and NCBI's PSI-BLAST.

52
Finn RD, Mistry J, Tate J, Coggill P, Heger A, Pollington JE, Gavin OL, Gunasekaran P, Ceric G, Forslund K, Holm L, Sonnhammer ELL, Eddy SR, Bateman A. The Pfam protein families database. Nucleic Acids Res 2009; 38:D211-22. [PMID: 19920124 PMCID: PMC2808889 DOI: 10.1093/nar/gkp985]
Abstract
Pfam is a widely used database of protein families and domains. This article describes a set of major updates that we have implemented in the latest release (version 24.0). The most important change is that we now use HMMER3, the latest version of the popular profile hidden Markov model package. This software is ∼100 times faster than HMMER2 and is more sensitive due to the routine use of the forward algorithm. The move to HMMER3 has necessitated numerous changes to Pfam that are described in detail. Pfam release 24.0 contains 11 912 families, of which a large number have been significantly updated during the past two years. Pfam is available via servers in the UK (http://pfam.sanger.ac.uk/), the USA (http://pfam.janelia.org/) and Sweden (http://pfam.sbc.su.se/).

53
Eddy SR. A new generation of homology search tools based on probabilistic inference. Genome Informatics. International Conference on Genome Informatics 2009. [PMID: 20180275 DOI: 10.1142/9781848165632_0019]
Abstract
Many theoretical advances have been made in applying probabilistic inference methods to improve the power of sequence homology searches, yet the BLAST suite of programs is still the workhorse for most of the field. The main reason for this is practical: BLAST's programs are about 100-fold faster than the fastest competing implementations of probabilistic inference methods. I describe recent work on the HMMER software suite for protein sequence analysis, which implements probabilistic inference using profile hidden Markov models. Our aim in HMMER3 is to achieve BLAST's speed while further improving the power of probabilistic inference based methods. HMMER3 implements a new probabilistic model of local sequence alignment and a new heuristic acceleration algorithm. Combined with efficient vector-parallel implementations on modern processors, these improvements synergize. HMMER3 uses more powerful log-odds likelihood scores (scores summed over alignment uncertainty, rather than scoring a single optimal alignment); it calculates accurate expectation values (E-values) for those scores without simulation using a generalization of Karlin/Altschul theory; it computes posterior distributions over the ensemble of possible alignments and returns posterior probabilities (confidences) in each aligned residue; and it does all this at an overall speed comparable to BLAST. The HMMER project aims to usher in a new generation of more powerful homology search tools based on probabilistic inference methods.
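The distinction drawn in the abstract above, between scoring a single optimal alignment and summing over alignment uncertainty, can be illustrated on a toy hidden Markov model. The sketch below is not HMMER code: the two states, the probabilities, and the function names are hypothetical, and it computes raw log probabilities rather than the log-odds scores against a null model that a homology search tool would report. It only shows that the two scores share one recursion and differ in whether paths are combined by maximum or by sum.

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in values))

def score(obs, start, trans, emit, combine):
    """Log-space HMM recursion.

    combine=max gives the Viterbi score (best single state path);
    combine=logsumexp gives the Forward score (sum over all paths).
    """
    states = list(start)
    prev = {s: math.log(start[s]) + math.log(emit[s][obs[0]]) for s in states}
    for symbol in obs[1:]:
        prev = {
            s: combine([prev[r] + math.log(trans[r][s]) for r in states])
               + math.log(emit[s][symbol])
            for s in states
        }
    return combine(list(prev.values()))

# Hypothetical two-state toy model (not a profile HMM).
START = {"M": 0.6, "I": 0.4}
TRANS = {"M": {"M": 0.7, "I": 0.3}, "I": {"M": 0.4, "I": 0.6}}
EMIT = {"M": {"A": 0.8, "C": 0.2}, "I": {"A": 0.5, "C": 0.5}}

viterbi_lp = score("AAC", START, TRANS, EMIT, max)
forward_lp = score("AAC", START, TRANS, EMIT, logsumexp)
print(f"Viterbi log P = {viterbi_lp:.4f}, Forward log P = {forward_lp:.4f}")
```

Because the Forward score adds the probability of every path, including the best one, it can never be lower than the Viterbi score for the same sequence.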

54
Davis FP, Eddy SR. A tool for identification of genes expressed in patterns of interest using the Allen Brain Atlas. Bioinformatics 2009; 25:1647-54. [PMID: 19414530 DOI: 10.1093/bioinformatics/btp288]
Abstract
MOTIVATION Gene expression patterns can be useful in understanding the structural organization of the brain and the regulatory logic that governs its myriad cell types. A particularly rich source of spatial expression data is the Allen Brain Atlas (ABA), a comprehensive genome-wide in situ hybridization study of the adult mouse brain. Here, we present an open-source program, ALLENMINER, that searches the ABA for genes that are expressed, enriched, patterned or graded in a user-specified region of interest. RESULTS Regionally enriched genes identified by ALLENMINER accurately reflect the in situ data (95-99% concordance with manual curation) and compare with regional microarray studies as expected from previous comparisons (61-80% concordance). We demonstrate the utility of ALLENMINER by identifying genes that exhibit patterned expression in the caudoputamen and neocortex. We discuss general characteristics of gene expression in the mouse brain and the potential application of ALLENMINER to design strategies for specific genetic access to brain regions and cell types. AVAILABILITY ALLENMINER is freely available on the Internet at http://research.janelia.org/davis/allenminer.

55
Abstract
SUMMARY INFERNAL builds consensus RNA secondary structure profiles called covariance models (CMs), and uses them to search nucleic acid sequence databases for homologous RNAs, or to create new sequence- and structure-based multiple sequence alignments. AVAILABILITY Source code, documentation and benchmark downloadable from http://infernal.janelia.org. INFERNAL is freely licensed under the GNU GPLv3 and should be portable to any POSIX-compliant operating system, including Linux and Mac OS/X.

56
Abstract
MOTIVATION Accuracy of automated structural RNA alignment is improved by using models that consider not only primary sequence but also secondary structure information. However, current RNA structural alignment approaches tend to perform poorly on incomplete sequence fragments, such as single reads from metagenomic environmental surveys, because nucleotides that are expected to be base paired are missing. RESULTS We present a local RNA structural alignment algorithm, trCYK, for aligning and scoring incomplete sequences under a model using primary sequence conservation and secondary structure information when possible. The trCYK algorithm improves alignment accuracy and coverage of sequence fragments of structural RNAs in simulated metagenomic shotgun datasets. AVAILABILITY The source code for Infernal 1.0, which includes trCYK, is available at http://infernal.janelia.org.

57
Gardner PP, Daub J, Tate JG, Nawrocki EP, Kolbe DL, Lindgreen S, Wilkinson AC, Finn RD, Griffiths-Jones S, Eddy SR, Bateman A. Rfam: updates to the RNA families database. Nucleic Acids Res 2008; 37:D136-40. [PMID: 18953034 PMCID: PMC2686503 DOI: 10.1093/nar/gkn766]
Abstract
Rfam is a collection of RNA sequence families, represented by multiple sequence alignments and covariance models (CMs). The primary aim of Rfam is to annotate new members of known RNA families on nucleotide sequences, particularly complete genomes, using sensitive BLAST filters in combination with CMs. A minority of families with a very broad taxonomic range (e.g. tRNA and rRNA) provide the majority of the sequence annotations, whilst the majority of Rfam families (e.g. snoRNAs and miRNAs) have a limited taxonomic range and provide a limited number of annotations. Recent improvements to the website, methodologies and data used by Rfam are discussed. Rfam is freely available on the Web at http://rfam.sanger.ac.uk/ and http://rfam.janelia.org/.

58
Eddy SR. A probabilistic model of local sequence alignment that simplifies statistical significance estimation. PLoS Comput Biol 2008; 4:e1000069. [PMID: 18516236 PMCID: PMC2396288 DOI: 10.1371/journal.pcbi.1000069] [Received: 12/05/2007] [Accepted: 03/26/2008]
Abstract
Sequence database searches require accurate estimation of the statistical significance of scores. Optimal local sequence alignment scores follow Gumbel distributions, but determining an important parameter of the distribution (λ) requires time-consuming computational simulation. Moreover, optimal alignment scores are less powerful than probabilistic scores that integrate over alignment uncertainty (“Forward” scores), but the expected distribution of Forward scores remains unknown. Here, I conjecture that both expected score distributions have simple, predictable forms when full probabilistic modeling methods are used. For a probabilistic model of local sequence alignment, optimal alignment bit scores (“Viterbi” scores) are Gumbel-distributed with constant λ = log 2, and the high scoring tail of Forward scores is exponential with the same constant λ. Simulation studies support these conjectures over a wide range of profile/sequence comparisons, using 9,318 profile-hidden Markov models from the Pfam database. This enables efficient and accurate determination of expectation values (E-values) for both Viterbi and Forward scores for probabilistic local alignments.

Sequence database searches are a fundamental tool of molecular biology, enabling researchers to identify related sequences in other organisms, which often provides invaluable clues to the function and evolutionary history of genes. The power of database searches to detect more and more remote evolutionary relationships – essentially, to look back deeper in time – has improved steadily, with the adoption of more complex and realistic models. However, database searches require not just a realistic scoring model, but also the ability to distinguish good scores from bad ones – the ability to calculate the statistical significance of scores. For many models and scoring schemes, accurate statistical significance calculations have either involved expensive computational simulations, or not been feasible at all. Here, I introduce a probabilistic model of local sequence alignment that has readily predictable score statistics for position-specific profile scoring systems, and not just for traditional optimal alignment scores, but also for more powerful log-likelihood ratio scores derived in a full probabilistic inference framework. These results remove one of the main obstacles that have impeded the use of more powerful and biologically realistic statistical inference methods in sequence homology searches.
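The practical consequence of a fixed λ = log 2 can be sketched in a few lines: once a location parameter is known for a model, a bit score converts to a P-value and then to an E-value in closed form. The sketch below is an illustration only. The location parameters `mu` and `tau`, the function names, and the database size are assumptions not given in the abstract, and the exponential form applies only to the high-scoring tail of Forward scores.

```python
import math

LAMBDA = math.log(2)  # constant slope for bit scores, as stated in the abstract

def viterbi_pvalue(score_bits, mu):
    """Gumbel tail: P(S >= s) = 1 - exp(-exp(-lambda * (s - mu))).

    mu is a model-specific location parameter (assumed to be known here).
    """
    return -math.expm1(-math.exp(-LAMBDA * (score_bits - mu)))

def forward_pvalue(score_bits, tau):
    """Exponential tail: P(S >= s) = exp(-lambda * (s - tau)).

    tau is a model-specific location parameter (assumed to be known here);
    the result is capped at 1 for scores below tau.
    """
    return min(1.0, math.exp(-LAMBDA * (score_bits - tau)))

def evalue(pvalue, n_targets):
    """Expected number of chance hits at this score in n_targets comparisons."""
    return pvalue * n_targets

# Hypothetical numbers: a Forward score 20 bits above tau, searched
# against one million target sequences.
p = forward_pvalue(20.0, tau=0.0)
print(f"P = {p:.3g}, E = {evalue(p, 1_000_000):.3g}")
```

With λ = log 2, each additional bit of score halves the tail probability, which is why a score ten bits above the location parameter corresponds to a P-value of about 2^-10.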

59
Finn RD, Tate J, Mistry J, Coggill PC, Sammut SJ, Hotz HR, Ceric G, Forslund K, Eddy SR, Sonnhammer ELL, Bateman A. The Pfam protein families database. Nucleic Acids Res 2007; 36:D281-8. [PMID: 18039703 PMCID: PMC2238907 DOI: 10.1093/nar/gkm960]
Abstract
Pfam is a comprehensive collection of protein domains and families, represented as multiple sequence alignments and as profile hidden Markov models. The current release of Pfam (22.0) contains 9318 protein families. Pfam is now based not only on the UniProtKB sequence database, but also on NCBI GenPept and on sequences from selected metagenomics projects. Pfam is available on the web from the consortium members using a new, consistent and improved website design in the UK (http://pfam.sanger.ac.uk/), the USA (http://pfam.janelia.org/) and Sweden (http://pfam.sbc.su.se/), as well as from mirror sites in France (http://pfam.jouy.inra.fr/) and South Korea (http://pfam.ccbb.re.kr/).

60
Nawrocki EP, Eddy SR. Query-dependent banding (QDB) for faster RNA similarity searches. PLoS Comput Biol 2007; 3:e56. [PMID: 17397253 PMCID: PMC1847999 DOI: 10.1371/journal.pcbi.0030056] [Received: 12/01/2006] [Accepted: 02/06/2007]
Abstract
When searching sequence databases for RNAs, it is desirable to score both primary sequence and RNA secondary structure similarity. Covariance models (CMs) are probabilistic models well-suited for RNA similarity search applications. However, the computational complexity of CM dynamic programming alignment algorithms has limited their practical application. Here we describe an acceleration method called query-dependent banding (QDB), which uses the probabilistic query CM to precalculate regions of the dynamic programming lattice that have negligible probability, independently of the target database. We have implemented QDB in the freely available Infernal software package. QDB reduces the average case time complexity of CM alignment from LN^2.4 to LN^1.3 for a query RNA of N residues and a target database of L residues, resulting in a 4-fold speedup for typical RNA queries. Combined with other improvements to Infernal, including informative mixture Dirichlet priors on model parameters, benchmarks also show increased sensitivity and specificity resulting from improved parameterization.
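The two exponents quoted above describe how search cost grows with query length, which a short calculation makes concrete. The sketch below uses only the exponents from the abstract; the function name is hypothetical, and constant factors are ignored, so it describes growth rates and does not reproduce the reported 4-fold speedup.

```python
def growth_factor(exponent, length_ratio):
    """Factor by which a cost proportional to N**exponent grows
    when the query length N is multiplied by length_ratio."""
    return length_ratio ** exponent

# Effect of doubling the query length N at a fixed database size L.
for label, exponent in (("without QDB", 2.4), ("with QDB", 1.3)):
    factor = growth_factor(exponent, 2)
    print(f"{label}: doubling N multiplies cost by about {factor:.2f}")
```

The gap between the two growth rates widens with query length, so under these exponents longer queries would be expected to benefit more from banding than shorter ones.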

61
Dowell RD, Eddy SR. Efficient pairwise RNA structure prediction and alignment using sequence alignment constraints. BMC Bioinformatics 2006; 7:400. [PMID: 16952317 PMCID: PMC1579236 DOI: 10.1186/1471-2105-7-400] [Received: 03/04/2006] [Accepted: 09/04/2006]
Abstract
BACKGROUND We are interested in the problem of predicting secondary structure for small sets of homologous RNAs, by incorporating limited comparative sequence information into an RNA folding model. The Sankoff algorithm for simultaneous RNA folding and alignment is a basis for approaches to this problem. There are two open problems in applying a Sankoff algorithm: development of a good unified scoring system for alignment and folding and development of practical heuristics for dealing with the computational complexity of the algorithm. RESULTS We use probabilistic models (pair stochastic context-free grammars, pairSCFGs) as a unifying framework for scoring pairwise alignment and folding. A constrained version of the pairSCFG structural alignment algorithm was developed which assumes knowledge of a few confidently aligned positions (pins). These pins are selected based on the posterior probabilities of a probabilistic pairwise sequence alignment. CONCLUSION Pairwise RNA structural alignment improves on structure prediction accuracy relative to single sequence folding. Constraining on alignment is a straightforward method of reducing the runtime and memory requirements of the algorithm. Five practical implementations of the pairwise Sankoff algorithm - this work (Consan), David Mathews' Dynalign, Ian Holmes' Stemloc, Ivo Hofacker's PMcomp, and Jan Gorodkin's FOLDALIGN - have comparable overall performance with different strengths and weaknesses.
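The idea of choosing pins from alignment posterior probabilities can be sketched as a simple filter. This is not the Consan procedure: the threshold value, the greedy consistency rule, and the function name are assumptions made for illustration, and the posterior values are hypothetical.

```python
def select_pins(posteriors, threshold=0.95):
    """Pick confidently aligned position pairs (i, j) to use as pins.

    posteriors maps (i, j) to the posterior probability that position i
    of one sequence aligns to position j of the other. Pairs are taken
    in order of decreasing confidence, and a pair is kept only if it is
    collinear with every pin already chosen (no crossings, no reuse).
    """
    candidates = sorted(
        ((p, i, j) for (i, j), p in posteriors.items() if p >= threshold),
        reverse=True,
    )
    pins = []
    for _, i, j in candidates:
        if all((i - a) * (j - b) > 0 for a, b in pins):
            pins.append((i, j))
    return sorted(pins)

# Hypothetical posteriors: (3, 2) crosses the more confident (2, 3),
# and (5, 5) falls below the threshold.
example = {(1, 1): 0.99, (2, 3): 0.97, (3, 2): 0.96, (5, 5): 0.50}
print(select_pins(example))
```

Each pin fixes one aligned pair, so the structural alignment only needs to consider the regions between consecutive pins, which is the source of the runtime and memory savings the abstract describes.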

62

63
Abstract
Genome sequence analysis of RNAs presents special challenges to computational biology, because conserved RNA secondary structure plays a large part in RNA analysis. Algorithms well suited for RNA secondary structure and sequence analysis have been borrowed from computational linguistics. These "stochastic context-free grammar" (SCFG) algorithms have enabled the development of new RNA gene-finding and RNA homology search software. The aim of this paper is to provide an accessible introduction to the strengths and weaknesses of SCFG methods and to describe the state of the art in one particular kind of application: SCFG-based RNA similarity searching. The INFERNAL and RSEARCH programs are capable of identifying distant RNA homologs in a database search by looking for both sequence and secondary structure conservation.

64
Finn RD, Mistry J, Schuster-Böckler B, Griffiths-Jones S, Hollich V, Lassmann T, Moxon S, Marshall M, Khanna A, Durbin R, Eddy SR, Sonnhammer ELL, Bateman A. Pfam: clans, web tools and services. Nucleic Acids Res 2006; 34:D247-51. [PMID: 16381856 PMCID: PMC1347511 DOI: 10.1093/nar/gkj149] [Received: 09/15/2005] [Revised: 10/19/2005] [Accepted: 10/28/2005]
Abstract
Pfam is a database of protein families that currently contains 7973 entries (release 18.0). A recent development in Pfam has enabled the grouping of related families into clans. Pfam clans are described in detail, together with the new associated web pages. Improvements to the range of Pfam web tools and the first set of Pfam web services that allow programmatic access to the database and associated tools are also presented. Pfam is available on the web in the UK (http://www.sanger.ac.uk/Software/Pfam/), the USA (http://pfam.wustl.edu/), France (http://pfam.jouy.inra.fr/) and Sweden (http://pfam.cgb.ki.se/).

65

66
Abstract
The C. elegans genome contains approximately 1300 genes that produce functional noncoding RNA (ncRNA) transcripts. Here we describe what is currently known about these ncRNA genes, from the perspective of the annotation of the finished genome sequence. We have collated a reference set of C. elegans ncRNA gene annotation relative to the WS130 version of the genome assembly, and made these data available in several formats.

67
Hillier LW, Graves TA, Fulton RS, Fulton LA, Pepin KH, Minx P, Wagner-McPherson C, Layman D, Wylie K, Sekhon M, Becker MC, Fewell GA, Delehaunty KD, Miner TL, Nash WE, Kremitzki C, Oddy L, Du H, Sun H, Bradshaw-Cordum H, Ali J, Carter J, Cordes M, Harris A, Isak A, van Brunt A, Nguyen C, Du F, Courtney L, Kalicki J, Ozersky P, Abbott S, Armstrong J, Belter EA, Caruso L, Cedroni M, Cotton M, Davidson T, Desai A, Elliott G, Erb T, Fronick C, Gaige T, Haakenson W, Haglund K, Holmes A, Harkins R, Kim K, Kruchowski SS, Strong CM, Grewal N, Goyea E, Hou S, Levy A, Martinka S, Mead K, McLellan MD, Meyer R, Randall-Maher J, Tomlinson C, Dauphin-Kohlberg S, Kozlowicz-Reilly A, Shah N, Swearengen-Shahid S, Snider J, Strong JT, Thompson J, Yoakum M, Leonard S, Pearman C, Trani L, Radionenko M, Waligorski JE, Wang C, Rock SM, Tin-Wollam AM, Maupin R, Latreille P, Wendl MC, Yang SP, Pohl C, Wallis JW, Spieth J, Bieri TA, Berkowicz N, Nelson JO, Osborne J, Ding L, Meyer R, Sabo A, Shotland Y, Sinha P, Wohldmann PE, Cook LL, Hickenbotham MT, Eldred J, Williams D, Jones TA, She X, Ciccarelli FD, Izaurralde E, Taylor J, Schmutz J, Myers RM, Cox DR, Huang X, McPherson JD, Mardis ER, Clifton SW, Warren WC, Chinwalla AT, Eddy SR, Marra MA, Ovcharenko I, Furey TS, Miller W, Eichler EE, Bork P, Suyama M, Torrents D, Waterston RH, Wilson RK. Generation and annotation of the DNA sequences of human chromosomes 2 and 4. Nature 2005; 434:724-31. [PMID: 15815621 DOI: 10.1038/nature03466] [Received: 10/25/2004] [Accepted: 02/11/2005]
Abstract
Human chromosome 2 is unique to the human lineage in being the product of a head-to-head fusion of two intermediate-sized ancestral chromosomes. Chromosome 4 has received attention primarily related to the search for the Huntington's disease gene, but also for genes associated with Wolf-Hirschhorn syndrome, polycystic kidney disease and a form of muscular dystrophy. Here we present approximately 237 million base pairs of sequence for chromosome 2, and 186 million base pairs for chromosome 4, representing more than 99.6% of their euchromatic sequences. Our initial analyses have identified 1,346 protein-coding genes and 1,239 pseudogenes on chromosome 2, and 796 protein-coding genes and 778 pseudogenes on chromosome 4. Extensive analyses confirm the underlying construction of the sequence, and expand our understanding of the structure and evolution of mammalian chromosomes, including gene deserts, segmental duplications and highly variant regions.
MESH Headings
- Animals
- Base Composition
- Base Sequence
- Centromere/genetics
- Chromosomes, Human, Pair 2/genetics
- Chromosomes, Human, Pair 4/genetics
- Conserved Sequence/genetics
- CpG Islands/genetics
- Euchromatin/genetics
- Expressed Sequence Tags
- Gene Duplication
- Genetic Variation/genetics
- Genomics
- Humans
- Molecular Sequence Data
- Physical Chromosome Mapping
- Polymorphism, Genetic/genetics
- Primates/genetics
- Proteins/genetics
- Pseudogenes/genetics
- RNA, Messenger/analysis
- RNA, Messenger/genetics
- RNA, Untranslated/analysis
- RNA, Untranslated/genetics
- Recombination, Genetic/genetics
- Sequence Analysis, DNA

68
Griffiths-Jones S, Moxon S, Marshall M, Khanna A, Eddy SR, Bateman A. Rfam: annotating non-coding RNAs in complete genomes. Nucleic Acids Res 2005; 33:D121-4. [PMID: 15608160 PMCID: PMC540035 DOI: 10.1093/nar/gki081]
Abstract
Rfam is a comprehensive collection of non-coding RNA (ncRNA) families, represented by multiple sequence alignments and profile stochastic context-free grammars. Rfam aims to facilitate the identification and classification of new members of known sequence families, and distributes annotation of ncRNAs in over 200 complete genome sequences. The data provide the first glimpses of conservation of multiple ncRNA families across a wide taxonomic range. A small number of large families are essential in all three kingdoms of life, with large numbers of smaller families specific to certain taxa. Recent improvements in the database are discussed, together with challenges for the future. Rfam is available on the Web at http://www.sanger.ac.uk/Software/Rfam/ and http://rfam.wustl.edu/.

69
Darnell JC, Fraser CE, Mostovetsky O, Stefani G, Jones TA, Eddy SR, Darnell RB. Kissing complex RNAs mediate interaction between the Fragile-X mental retardation protein KH2 domain and brain polyribosomes. Genes Dev 2005; 19:903-18. [PMID: 15805463 PMCID: PMC1080130 DOI: 10.1101/gad.1276805]
Abstract
Fragile-X mental retardation is caused by loss of function of a single gene encoding the Fragile-X mental retardation protein, FMRP, an RNA-binding protein that harbors two KH-type and one RGG-type RNA-binding domains. Previous studies identified intramolecular G-quartet RNAs as high-affinity targets for the RGG box, but the relationship of RNA binding to FMRP function and mental retardation remains unclear. One severely affected patient harbors a missense mutation (I304N) within the second KH domain (KH2), and some evidence suggests this domain may be involved in the proposed role of FMRP in translational regulation. We now identify the RNA target for the KH2 domain as a sequence-specific element within a complex tertiary structure termed the FMRP kissing complex. We demonstrate that the association of FMRP with brain polyribosomes is abrogated by competition with the FMRP kissing complex RNA, but not by high-affinity G-quartet RNAs. We conclude that mental retardation associated with the I304N mutation, and likely the Fragile-X syndrome more generally, may relate to a crucial role for RNAs harboring the kissing complex motif as targets for FMRP translational regulation.

70

71
Eddy SR. A model of the statistical power of comparative genome sequence analysis. PLoS Biol 2005; 3:e10. [PMID: 15660152 PMCID: PMC539325 DOI: 10.1371/journal.pbio.0030010] [Received: 06/09/2004] [Accepted: 11/02/2004]
Abstract
Comparative genome sequence analysis is powerful, but sequencing genomes is expensive. It is desirable to be able to predict how many genomes are needed for comparative genomics, and at what evolutionary distances. Here I describe a simple mathematical model for the common problem of identifying conserved sequences. The model leads to some useful rules of thumb. For a given evolutionary distance, the number of comparative genomes needed for a constant level of statistical stringency in identifying conserved regions scales inversely with the size of the conserved feature to be detected. At short evolutionary distances, the number of comparative genomes required also scales inversely with distance. These scaling behaviors provide some intuition for future comparative genome sequencing needs, such as the proposed use of “phylogenetic shadowing” methods using closely related comparative genomes, and the feasibility of high-resolution detection of small conserved features. The mathematical model presented in this work will help to inform comparative genomics strategies for identifying conserved DNA sequences.
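The two rules of thumb stated above can be written as a proportionality. The sketch below encodes only the inverse scaling described in the abstract and is not the paper's statistical model: the function name, the reference point, and the numbers are hypothetical, and the distance term is meaningful only at short evolutionary distances.

```python
def genomes_needed(reference_genomes, reference_size, feature_size,
                   reference_distance=1.0, distance=1.0):
    """Rescale a known requirement using the inverse scaling rules.

    reference_genomes is the number of genomes assumed sufficient for a
    feature of reference_size at reference_distance. The requirement
    scales inversely with feature size and, at short evolutionary
    distances only, inversely with distance.
    """
    size_factor = reference_size / feature_size
    distance_factor = reference_distance / distance
    return reference_genomes * size_factor * distance_factor

# Hypothetical reference point: 4 genomes suffice for a 50-nt feature.
print(genomes_needed(4, reference_size=50, feature_size=25))
print(genomes_needed(4, 50, 50, reference_distance=0.1, distance=0.05))
```

Under these rules, halving the feature size or halving a short evolutionary distance each doubles the number of genomes required for the same stringency.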

72
Abstract
Programs such as MFOLD and ViennaRNA are widely used to predict RNA secondary structures. How do these algorithms work? Why can't they predict RNA pseudoknots? How accurate are they, and will they get better?

73
Starostina NG, Marshburn S, Johnson LS, Eddy SR, Terns RM, Terns MP. Circular box C/D RNAs in Pyrococcus furiosus. Proc Natl Acad Sci U S A 2004; 101:14097-101. [PMID: 15375211 PMCID: PMC521125 DOI: 10.1073/pnas.0403520101]
Abstract
Box C/D RNAs are small, noncoding RNAs that function in RNA modification in eukaryotes and archaea. Here, we report that box C/D RNAs exist in the rare biological form of RNA circles in the hyperthermophilic archaeon Pyrococcus furiosus. Northern analysis of box C/D RNAs reveals two prominent RNA species of different electrophoretic mobilities in total P. furiosus RNA preparations. Together, the results of Northern, ribozyme, RT-PCR, and lariat debranching analyses indicate that the two species are circular and linear RNAs of similar length and abundance. It seems that most, if not all, species of box C/D RNAs exist as circles in P. furiosus. In addition, the circular RNAs are found in complexes with proteins required for box C/D RNA function. Our finding places box C/D RNAs among the extremely few circular RNAs known to exist in nature. Moreover, the unexpected discovery of circular box C/D RNAs points to the existence of a previously unrecognized biogenesis pathway for box C/D RNAs in archaea.
MESH Headings
- Animals
- Base Sequence
- Blotting, Northern
- Blotting, Western
- Conserved Sequence
- Immunoprecipitation/methods
- Molecular Sequence Data
- Nucleic Acid Conformation
- Pyrococcus furiosus/chemistry
- Pyrococcus furiosus/genetics
- RNA/genetics
- RNA, Archaeal/chemistry
- RNA, Archaeal/genetics
- RNA, Catalytic/analysis
- RNA, Catalytic/genetics
- RNA, Circular
- RNA, Small Nucleolar/chemistry
- RNA, Small Nucleolar/genetics
- Rabbits
- Recombinant Proteins/genetics
- Reverse Transcriptase Polymerase Chain Reaction
- Ribonucleoproteins, Small Nucleolar/genetics

74
Jiang N, Bao Z, Zhang X, Eddy SR, Wessler SR. Pack-MULE transposable elements mediate gene evolution in plants. Nature 2004; 431:569-73. [PMID: 15457261 DOI: 10.1038/nature02953] [Received: 06/09/2004] [Accepted: 08/13/2004]
Abstract
Mutator-like transposable elements (MULEs) are found in many eukaryotic genomes and are especially prevalent in higher plants. In maize, rice and Arabidopsis a few MULEs were shown to carry fragments of cellular genes. These chimaeric elements are called Pack-MULEs in this study. The abundance of MULEs in rice and the availability of most of the genome sequence permitted a systematic analysis of the prevalence and nature of Pack-MULEs in an entire genome. Here we report that there are over 3,000 Pack-MULEs in rice containing fragments derived from more than 1,000 cellular genes. Pack-MULEs frequently contain fragments from multiple chromosomal loci that are fused to form new open reading frames, some of which are expressed as chimaeric transcripts. About 5% of the Pack-MULEs are represented in collections of complementary DNA. Functional analysis of amino acid sequences and proteomic data indicate that some captured gene fragments might be functional. Comparison of the cellular genes and Pack-MULE counterparts indicates that fragments of genomic DNA have been captured, rearranged and amplified over millions of years. Given the abundance of Pack-MULEs in rice and the widespread occurrence of MULEs in all characterized plant genomes, gene fragment acquisition by Pack-MULEs might represent an important new mechanism for the evolution of genes in higher plants.

75