51
Abstract
Background As orthologous proteins are expected to retain function more often than other homologs, they are often used for functional annotation transfer between species. However, ortholog identification methods do not take into account changes in domain architecture, which are likely to modify a protein's function. By domain architecture we refer to the sequential arrangement of domains along a protein sequence. To assess the level of domain architecture conservation among orthologs, we carried out a large-scale study of such events between human and 40 other species spanning the entire evolutionary range. We designed a score to measure domain architecture similarity and used it to analyze differences in domain architecture conservation between orthologs and paralogs relative to the conservation of primary sequence. We also statistically characterized the extents of different types of domain swapping events across pairs of orthologs and paralogs. Results The analysis shows that orthologs exhibit greater domain architecture conservation than paralogous homologs, even when differences in average sequence divergence are compensated for, for homologs that have diverged beyond a certain threshold. We interpret this as an indication of a stronger selective pressure on orthologs than paralogs to retain the domain architecture required for the proteins to perform a specific function. In general, orthologs as well as the closest paralogous homologs have very similar domain architectures, even at large evolutionary separation. The most common domain architecture changes observed in both ortholog and paralog pairs involved insertion/deletion of new domains, while domain shuffling and segment duplication/deletion were very infrequent. Conclusions On the whole, our results support the hypothesis that function conservation between orthologs demands higher domain architecture conservation than other types of homologs, relative to primary sequence conservation. 
This supports the notion that orthologs are functionally more similar than other types of homologs at the same evolutionary distance.
Affiliation(s)
- Kristoffer Forslund
- Stockholm Bioinformatics Centre, Science for Life Laboratory, Box 1031, Solna, 17121 Sweden
52
Schmitt T, Messina DN, Schreiber F, Sonnhammer ELL. Letter to the editor: SeqXML and OrthoXML: standards for sequence and orthology information. Brief Bioinform 2011; 12:485-8. [PMID: 21666252 DOI: 10.1093/bib/bbr025]
Abstract
There is a great need for standards in the orthology field. Users must contend with different ortholog data representations from each provider, and the providers themselves must independently gather and parse the input sequence data. These burdensome and redundant procedures make data comparison and integration difficult. We have designed two XML-based formats, SeqXML and OrthoXML, to solve these problems. SeqXML is a lightweight format for sequence records-the input for orthology prediction. It stores the same sequence and metadata as typical FASTA format records, but overcomes common problems such as unstructured metadata in the header and erroneous sequence content. XML provides validation to prevent data integrity problems that are frequent in FASTA files. The range of applications for SeqXML is broad and not limited to ortholog prediction. We provide read/write functions for BioJava, BioPerl, and Biopython. OrthoXML was designed to represent ortholog assignments from any source in a consistent and structured way, yet cater to specific needs such as scoring schemes or meta-information. A unified format is particularly valuable for ortholog consumers that want to integrate data from numerous resources, e.g. for gene annotation projects. Reference proteomes for 61 organisms are already available in SeqXML, and 10 orthology databases have signed on to OrthoXML. Adoption by the entire field would substantially facilitate exchange and quality control of sequence and orthology information.
53
Abstract
Orthology is one of the most important tools available to modern biology, as it allows making inferences from easily studied model systems to much less tractable systems of interest, such as ourselves. This becomes important not least in the study of genetic diseases. We here review work on the orthology of disease-associated genes and also present an updated version of the InParanoid-based disease orthology database and web site OrthoDisease, with 14-fold increased species coverage since the previous version. Using this resource, we survey the taxonomic distribution of orthologs of human genes involved in different disease categories. The hypothesis that paralogs can mask the effect of deleterious mutations predicts that known heritable disease genes should have fewer close paralogs. We found large-scale support for this hypothesis as significantly fewer duplications were observed for disease genes in the OrthoDisease ortholog groups.
Affiliation(s)
- Kristoffer Forslund
- Stockholm Bioinformatics Center, Department of Biochemistry and Biophysics, Stockholm University, Albanova, 10691 Stockholm, Sweden
54
Abstract
Background With the wealth of genomic data available it has become increasingly important to assign putative protein function through functional transfer between orthologs. Therefore, correct elucidation of the evolutionary relationships among genes is a critical task, and attempts should be made to further improve the phylogenetic inference by adding relevant discriminating features. It has been shown that introns can maintain their position over long evolutionary timescales. For this reason, it could be possible to use conservation of intron positions as a discriminating factor when assigning orthology. Therefore, we wanted to investigate whether orthologs have a higher degree of intron position conservation (IPC) compared to non-orthologous sequences that are equally similar in sequence. Results To this end, we developed a new score for IPC and applied it to ortholog groups between human and six other species. For comparison, we also gathered the closest non-orthologs, meaning sequences close in sequence space, yet falling just outside the ortholog cluster. We found that ortholog-ortholog gene pairs on average have a significantly higher degree of IPC compared to ortholog-closest non-ortholog pairs. Also pairs of inparalogs were found to have a higher IPC score than inparalog-closest non-inparalog pairs. We verified that these differences can not simply be attributed to the generally higher sequence identity of the ortholog-ortholog and the inparalog-inparalog pairs. Furthermore, we analyzed the agreement between IPC score and the ortholog score assigned by the InParanoid algorithm, and found that it was consistently high for all species comparisons. In a minority of cases, the IPC and InParanoid score ranked inparalogs differently. These represent cases where sequence and intron position divergence are discordant. 
We further analyzed the discordant clusters to identify any possible preference for protein functions by looking for enriched GO terms and Pfam protein domains. They were enriched for functions important for multicellularity, which implies a connection between shifts in intronic structure and the origin of multicellularity. Conclusions We conclude that orthologous genes tend to have more conserved intron positions compared to non-orthologous genes. As a consequence, our IPC score is useful as an additional discriminating factor when assigning orthology.
Affiliation(s)
- Anna Henricson
- Department of Cell and Molecular Biology, Karolinska Institutet, SE-17177 Stockholm, Sweden
55
Kemmer D, Faxén M, Hodges E, Lim J, Herzog E, Ljungström E, Lundmark A, Olsen MK, Podowski R, Sonnhammer ELL, Nilsson P, Reimers M, Lenhard B, Roberds SL, Wahlestedt C, Höög C, Agarwal P, Wasserman WW. Exploring the foundation of genomics: a northern blot reference set for the comparative analysis of transcript profiling technologies. Comp Funct Genomics 2010; 5:584-95. [PMID: 18629180 PMCID: PMC2447472 DOI: 10.1002/cfg.443] [Accepted: 11/19/2004]
Abstract
In this paper we aim to create a reference data collection of Northern blot results
and demonstrate how such a collection can enable a quantitative comparison of
modern expression profiling techniques, a central component of functional genomics
studies. Historically, Northern blots were the de facto standard for determining RNA
transcript levels. However, driven by the demand for analysis of large sets of genes in
parallel, high-throughput methods, such as microarrays, dominate modern profiling
efforts. To facilitate assessment of these methods, in comparison to Northern blots,
we created a database of published Northern results obtained with a standardized
commercial multiple tissue blot (dbMTN). In order to demonstrate the utility of the
dbMTN collection for technology comparison, we also generated expression profiles
for genes across a set of human tissues, using multiple profiling techniques. No method
produced profiles that were strongly correlated with the Northern blot data. The
highest correlations to the Northern blot data were determined with microarrays
for the subset of genes observed to be specifically expressed in a single tissue in
the Northern analyses. The database and expression profiling data are available
via the project website (http://www.cisreg.ca). We believe that emphasis on multitechnique
validation of expression profiles is justified, as the correlation results
between platforms are not encouraging on the whole. Supplementary material for this
article can be found at: http://www.interscience.wiley.com/jpages/1531-6912/suppmat
Affiliation(s)
- Danielle Kemmer
- Center for Genomics and Bioinformatics, Karolinska Institutet, Stockholm, Sweden
56
Alexeyenko A, Wassenberg DM, Lobenhofer EK, Yen J, Linney E, Sonnhammer ELL, Meyer JN. Dynamic zebrafish interactome reveals transcriptional mechanisms of dioxin toxicity. PLoS One 2010; 5:e10465. [PMID: 20463971 PMCID: PMC2864754 DOI: 10.1371/journal.pone.0010465] [Received: 04/09/2009] [Accepted: 03/17/2010]
Abstract
Background In order to generate hypotheses regarding the mechanisms by which 2,3,7,8-tetrachlorodibenzo-p-dioxin (dioxin) causes toxicity, we analyzed global gene expression changes in developing zebrafish embryos exposed to this potent toxicant in the context of a dynamic gene network. For this purpose, we also computationally inferred a zebrafish (Danio rerio) interactome based on orthologs and interaction data from other eukaryotes. Methodology/Principal Findings Using novel computational tools to analyze this interactome, we distinguished between dioxin-dependent and dioxin-independent interactions between proteins, and tracked the temporal propagation of dioxin-dependent transcriptional changes from a few genes that were altered initially, to large groups of biologically coherent genes at later times. The most notable processes altered at later developmental stages were calcium and iron metabolism, embryonic morphogenesis including neuronal and retinal development, a variety of mitochondria-related functions, and generalized stress response (not including induction of antioxidant genes). Within the interactome, many of these responses were connected to cytochrome P4501A (cyp1a) as well as other genes that were dioxin-regulated one day after exposure. This suggests that cyp1a may play a key role initiating the toxic dysregulation of those processes, rather than serving simply as a passive marker of dioxin exposure, as suggested by earlier research. Conclusions/Significance Thus, a powerful microarray experiment coupled with a flexible interactome and multi-pronged interactome tools (which are now made publicly available for microarray analysis and related work) suggest the hypothesis that dioxin, best known in fish as a potent cardioteratogen, has many other targets. Many of these types of toxicity have been observed in mammalian species and are potentially caused by alterations to cyp1a.
Affiliation(s)
- Andrey Alexeyenko
- Stockholm Bioinformatics Centre, Stockholm University, Stockholm, Sweden
- Deena M. Wassenberg
- Department of Molecular Genetics and Microbiology, Duke University Medical Center, Durham, North Carolina, United States of America
- Jerry Yen
- Department of Molecular Genetics and Microbiology, Duke University Medical Center, Durham, North Carolina, United States of America
- Elwood Linney
- Department of Molecular Genetics and Microbiology, Duke University Medical Center, Durham, North Carolina, United States of America
- Joel N. Meyer
- Nicholas School of the Environment, Duke University, Durham, North Carolina, United States of America
57
Abstract
Genes involved in cancer susceptibility and progression can serve as templates for searching protein networks for novel cancer genes. To this end, we introduce a general network searching method, MaxLink, and apply it to find and rank cancer gene candidates by their connectivity to known cancer genes. Using a comprehensive protein interaction network, we searched for genes connected to known cancer genes. First, we compiled a new set of 812 genes involved in cancer, more than twice the number in the Cancer Gene Census. Their network neighbors were then extracted. This candidate list was refined by selecting genes with unexpectedly high levels of connectivity to cancer genes and without previous association to cancer. This produced a list of 1891 new cancer candidates with up to 55 connections to known cancer genes. We validated our method by cross-validation, Gene Ontology term bias, and differential expression in cancer versus normal tissue. An example novel cancer gene candidate is presented with detailed analysis of the local network and neighbor annotation. Our study provides a ranked list of high priority targets for further studies in cancer research. Supplemental material is included.
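The core ranking idea in this abstract, counting how many network neighbors of a candidate gene are known cancer genes, can be sketched as follows. This is a minimal illustration and not the published MaxLink implementation: the toy network, the gene names, and the `min_links` cutoff are hypothetical, and the sketch omits the selection for unexpectedly high connectivity that the abstract mentions.

```python
from collections import defaultdict


def rank_candidates(edges, seeds, min_links=2):
    """Rank non-seed genes by their number of direct links to seed genes.

    edges: iterable of (gene_a, gene_b) pairs from an undirected network.
    seeds: known cancer genes (hypothetical names in the example below).
    min_links: illustrative cutoff; candidates with fewer links are dropped.
    """
    seeds = set(seeds)
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    # Count seed neighbors for every gene that is not itself a seed.
    counts = {
        gene: len(nbrs & seeds)
        for gene, nbrs in neighbors.items()
        if gene not in seeds
    }
    ranked = [(gene, n) for gene, n in counts.items() if n >= min_links]
    # Most seed links first; ties broken alphabetically for determinism.
    ranked.sort(key=lambda item: (-item[1], item[0]))
    return ranked


# Toy network: S1..S3 are seeds, X/Y/Z are candidates.
edges = [
    ("S1", "X"), ("S2", "X"), ("S3", "X"),
    ("S1", "Y"), ("S2", "Y"),
    ("S3", "Z"), ("Y", "Z"),
]
ranking = rank_candidates(edges, seeds=["S1", "S2", "S3"])
print(ranking)
```

In this toy network, X is linked to three seeds and Y to two, while Z falls below the cutoff with a single seed link.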
Affiliation(s)
- Gabriel Ostlund
- Stockholm Bioinformatics Centre, Stockholm University, Stockholm, Sweden.
58
Finn RD, Mistry J, Tate J, Coggill P, Heger A, Pollington JE, Gavin OL, Gunasekaran P, Ceric G, Forslund K, Holm L, Sonnhammer ELL, Eddy SR, Bateman A. The Pfam protein families database. Nucleic Acids Res 2009; 38:D211-22. [PMID: 19920124 PMCID: PMC2808889 DOI: 10.1093/nar/gkp985]
Abstract
Pfam is a widely used database of protein families and domains. This article describes a set of major updates that we have implemented in the latest release (version 24.0). The most important change is that we now use HMMER3, the latest version of the popular profile hidden Markov model package. This software is ∼100 times faster than HMMER2 and is more sensitive due to the routine use of the forward algorithm. The move to HMMER3 has necessitated numerous changes to Pfam that are described in detail. Pfam release 24.0 contains 11 912 families, of which a large number have been significantly updated during the past two years. Pfam is available via servers in the UK (http://pfam.sanger.ac.uk/), the USA (http://pfam.janelia.org/) and Sweden (http://pfam.sbc.su.se/).
Affiliation(s)
- Robert D Finn
- Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire CB10 1SA, UK.
59
Ostlund G, Schmitt T, Forslund K, Köstler T, Messina DN, Roopra S, Frings O, Sonnhammer ELL. InParanoid 7: new algorithms and tools for eukaryotic orthology analysis. Nucleic Acids Res 2009; 38:D196-203. [PMID: 19892828 PMCID: PMC2808972 DOI: 10.1093/nar/gkp931]
Abstract
The InParanoid project gathers proteomes of completely sequenced eukaryotic species plus Escherichia coli and calculates pairwise ortholog relationships among them. The new release 7.0 of the database has grown by an order of magnitude over the previous version and now includes 100 species and their collective 1.3 million proteins organized into 42.7 million pairwise ortholog groups. The InParanoid algorithm itself has been revised and is now both more specific and sensitive. Based on results from our recent benchmarking of low-complexity filters in homology assignment, a two-pass BLAST approach was developed that makes use of high-precision compositional score matrix adjustment, but avoids the alignment truncation that sometimes follows. We have also updated the InParanoid web site (http://InParanoid.sbc.su.se). Several features have been added, the response times have been improved and the site now sports a new, clearer look. As the number of ortholog databases has grown, it has become difficult to compare among these resources due to a lack of standardized source data and incompatible representations of ortholog relationships. To facilitate data exchange and comparisons among ortholog databases, we have developed and are making available two XML schemas: SeqXML for the input sequences and OrthoXML for the output ortholog clusters.
Affiliation(s)
- Gabriel Ostlund
- Department of Biochemistry and Biophysics, Stockholm Bioinformatics Centre, AlbaNova University Centre, Stockholm University, SE-10691 Stockholm, Sweden.
60
Klammer M, Messina DN, Schmitt T, Sonnhammer ELL. MetaTM - a consensus method for transmembrane protein topology prediction. BMC Bioinformatics 2009; 10:314. [PMID: 19785723 PMCID: PMC2761906 DOI: 10.1186/1471-2105-10-314] [Received: 01/13/2009] [Accepted: 09/28/2009]
Abstract
Background Transmembrane (TM) proteins are proteins that span a biological membrane one or more times. As their 3-D structures are hard to determine, experiments focus on identifying their topology (i.e., which parts of the amino acid sequence are buried in the membrane and which are located on either side of the membrane), but only a few topologies are known. Consequently, various computational TM topology predictors have been developed, but their accuracies are far from perfect. The prediction quality can be improved by applying a consensus approach, which combines results of several predictors to yield a more reliable result. Results A novel TM consensus method, named MetaTM, is proposed in this work. MetaTM is based on support vector machine models and combines the results of six TM topology predictors and two signal peptide predictors. On a large data set comprising 1460 sequences of TM proteins with known topologies and 2362 globular protein sequences it correctly predicts 86.7% of all topologies. Conclusion Combining several TM predictors in a consensus prediction framework improves overall accuracy compared to any of the individual methods. Our proposed SVM-based system also has higher accuracy than a previous consensus predictor. MetaTM is made available both as downloadable source code and as DAS server at
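To make the general consensus idea concrete, the sketch below combines per-residue topology strings by simple majority vote. This is a deliberately simpler stand-in and not MetaTM's method, which the abstract describes as based on support vector machine models. The label alphabet (`i` inside, `M` membrane, `o` outside) and the example predictions are assumptions made for illustration.

```python
from collections import Counter


def majority_consensus(predictions):
    """Combine equal-length topology strings by per-residue majority vote.

    Ties go to the label that appears first among the tied labels,
    i.e. the one proposed by the earliest predictor in the list.
    """
    if not predictions:
        raise ValueError("at least one prediction is required")
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all predictions must have the same length")

    consensus = []
    for column in zip(*predictions):
        # most_common keeps first-seen order among equally frequent labels.
        label, _count = Counter(column).most_common(1)[0]
        consensus.append(label)
    return "".join(consensus)


# Three hypothetical predictors that disagree on the helix boundaries.
predictions = [
    "iiMMMMoo",
    "iiiMMMoo",
    "iiMMMooo",
]
consensus = majority_consensus(predictions)
print(consensus)
```

Here two of the three predictors agree at each disputed boundary residue, so the vote settles on a four-residue membrane segment.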
Affiliation(s)
- Martin Klammer
- Stockholm Bioinformatics Centre, Albanova, Stockholm University, 10691 Stockholm, Sweden.
62
Abstract
BACKGROUND Low-complexity sequence regions present a common problem in finding true homologs to a protein query sequence. Several solutions to this have been suggested, but a detailed comparison between these on challenging data has so far been lacking. A common benchmark for homology detection procedures is to use SCOP/ASTRAL domain sequences belonging to the same or different superfamilies, but these contain almost no low complexity sequences. RESULTS We here introduce an alternative benchmarking strategy based around Pfam domains and clans on whole-proteome data sets. This gives a realistic level of low complexity sequences. We used it to evaluate all six built-in BLAST low complexity filter settings as well as a range of settings in the MSPcrunch post-processing filter. The effect on alignment length was also assessed. CONCLUSION Score matrix adjustment methods provide a low false positive rate at a relatively small loss in sensitivity relative to no filtering, across the range of test conditions we apply. MSPcrunch achieved even less loss in sensitivity, but at a higher false positive rate. A drawback of the score matrix adjustment methods is however that the alignments often become truncated. AVAILABILITY Perl scripts for MSPcrunch BLAST filtering and for generating the benchmark dataset are available at http://sonnhammer.sbc.su.se/download/software/MSPcrunch+Blixem/benchmark.tar.gz
Affiliation(s)
- Kristoffer Forslund
- Stockholm Bioinformatics Center, Stockholm University, SE-10691 Stockholm, Sweden.
63
Abstract
SUMMARY The rise in biological sequence data has led to a proliferation of separate, specialized databases. While there is great value in having many independent annotations, it is critical that there be a way to integrate them in one combined view. The Distributed Annotation System (DAS) was developed for that very purpose. There are currently no DAS clients that are open source, specialized for aggregating and comparing protein sequence annotation, and that can run as a self-contained application outside of a web browser. The speed, flexibility and extensibility that come with a stand-alone application motivated us to create DASher, an open-source Java DAS client. Given a UniProt sequence identifier, DASher automatically queries DAS-supporting servers worldwide for any information on that sequence and then displays the annotations in an interactive viewer for easy comparison. DASher is a fast, Java-based DAS client optimized for viewing protein sequence annotation and compliant with the latest DAS protocol specification 1.53E. AVAILABILITY DASher is available for direct use and download at http://dasher.sbc.su.se including examples and source code under the GPLv3 licence. Java version 6 or higher is required.
Affiliation(s)
- David N Messina
- Stockholm Bioinformatics Centre, Stockholm University, 10691 Stockholm, Sweden
64
Lassmann T, Frings O, Sonnhammer ELL. Kalign2: high-performance multiple alignment of protein and nucleotide sequences allowing external features. Nucleic Acids Res 2008; 37:858-65. [PMID: 19103665 PMCID: PMC2647288 DOI: 10.1093/nar/gkn1006]
Abstract
In the growing field of genomics, multiple alignment programs are confronted with ever increasing amounts of data. To address this growing issue we have dramatically improved the running time and memory requirement of Kalign, while maintaining its high alignment accuracy. Kalign version 2 also supports nucleotide alignment, and a newly introduced extension allows for external sequence annotation to be included into the alignment procedure. We demonstrate that Kalign2 is exceptionally fast and memory-efficient, permitting accurate alignment of very large numbers of sequences. The accuracy of Kalign2 compares well to the best methods in the case of protein alignments while its accuracy on nucleotide alignments is generally superior. In addition, we demonstrate the potential of using known or predicted sequence annotation to improve the alignment accuracy. Kalign2 is freely available for download from the Kalign web site (http://msa.sbc.su.se/).
Affiliation(s)
- Timo Lassmann
- Department of Cell and Molecular Biology, Karolinska Institutet, SE-17177 Stockholm, Sweden
65
Abstract
MOTIVATION Computational assignment of protein function may be the single most vital application of bioinformatics in the post-genome era. These assignments are made based on various protein features, where one is the presence of identifiable domains. The relationship between protein domain content and function is important to investigate, to understand how domain combinations encode complex functions. RESULTS Two different models are presented on how protein domain combinations yield specific functions: one rule-based and one probabilistic. We demonstrate how these are useful for Gene Ontology annotation transfer. The first is an intuitive generalization of the Pfam2GO mapping, and detects cases of strict functional implications of sets of domains. The second uses a probabilistic model to represent the relationship between domain content and annotation terms, and was found to be better suited for incomplete training sets. We implemented these models as predictors of Gene Ontology functional annotation terms. Both predictors were more accurate than conventional best BLAST-hit annotation transfer and more sensitive than a single-domain model on a large-scale dataset. We present a number of cases where combinations of Pfam-A protein domains predict functional terms that do not follow from the individual domains. AVAILABILITY Scripts and documentation are available for download at http://sonnhammer.sbc.su.se/multipfam2go_source_docs.tar
Affiliation(s)
- Kristoffer Forslund
- Stockholm Bioinformatics Centre, Stockholm University, 10691 Stockholm, Sweden.
66
Abstract
jSquid is a graph visualization tool for exploring graphs from protein-protein interaction or functional coupling networks. The tool was designed for the FunCoup web site, but can be used for any similar network exploring purpose. The program offers various visualization and graph manipulation techniques to increase the utility for the user. AVAILABILITY jSquid is available for direct usage and download at http://jSquid.sbc.su.se including source code under the GPLv3 license, and input examples. It requires Java version 5 or higher to run properly. CONTACT erik.sonnhammer@sbc.su.se SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Martin Klammer
- Stockholm Bioinformatics Centre, Stockholm University, 10691 Stockholm, Sweden
68
Hong J, Wei N, Chalk A, Wang J, Song Y, Yi F, Qiao RP, Sonnhammer ELL, Wahlestedt C, Liang Z, Du Q. Focusing on RISC assembly in mammalian cells. Biochem Biophys Res Commun 2008; 368:703-8. [PMID: 18252196 DOI: 10.1016/j.bbrc.2008.01.116] [Received: 01/11/2008] [Accepted: 01/26/2008]
Abstract
RISC (RNA-induced silencing complex) is a central protein complex in RNAi, into which a siRNA strand is assembled to become effective in gene silencing. By using an in vitro RNAi reaction based on Drosophila embryo extract, an asymmetric model was recently proposed for RISC assembly of siRNA strands, suggesting that the strand that is more loosely paired at its 5' end is selectively assembled into RISC and results in target gene silencing. However, in the present study, we were unable to establish such a correlation in cell-based RNAi assays, as well as in large-scale RNAi data analyses. This suggests that the thermodynamic stability of siRNA is not a major determinant of gene silencing in mammalian cells. Further studies on fork siRNAs showed that mismatch at the 5' end of the siRNA sense strand decreased RISC assembly of the antisense strand, but surprisingly did not increase RISC assembly of the sense strand. More interestingly, measurements of melting temperature showed that the terminal stability of fork siRNAs correlated with the positions of the mismatches, but not gene silencing efficacy. In summary, our data demonstrate that there is no definite correlation between siRNA stability and gene silencing in mammalian cells, which suggests that instead of thermodynamic stability, other features of the siRNA duplex contribute to RISC assembly in RNAi.
Affiliation(s)
- Junmei Hong
- Institute of Molecular Medicine, Peking University, 100871 Beijing, PR China
69
Abstract
The InParanoid eukaryotic ortholog database (http://InParanoid.sbc.su.se/) has been updated to version 6 and is now based on 35 species. We collected all available ‘complete’ eukaryotic proteomes and Escherichia coli, and calculated ortholog groups for all 595 species pairs using the InParanoid program. This resulted in 2 642 187 pairwise ortholog groups in total. The orthology-based species relations are presented in an orthophylogram. InParanoid clusters contain one or more orthologs from each of the two species. Multiple orthologs in the same species, i.e. inparalogs, result from gene duplications after the species divergence. A new InParanoid website has been developed which is optimized for speed both for users and for updating the system. The XML output format has been improved for efficient processing of the InParanoid ortholog clusters.
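The figure of 595 species pairs follows directly from choosing 2 of the 35 species. The short check below illustrates that arithmetic and derives the mean number of ortholog groups per species pair from the totals quoted in the abstract; the variable names are illustrative only.

```python
from itertools import combinations
from math import comb

n_species = 35
total_groups = 2_642_187  # pairwise ortholog groups reported in the abstract

# Number of unordered species pairs: 35 * 34 / 2.
n_pairs = comb(n_species, 2)

# The same count obtained by explicitly enumerating the pairs.
pairs = list(combinations(range(n_species), 2))

# Mean number of ortholog groups per species pair.
mean_groups_per_pair = total_groups / n_pairs

print(n_pairs, len(pairs), round(mean_groups_per_pair, 1))
```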
Affiliation(s)
- Ann-Charlotte Berglund
- Linnaeus Centre for Bioinformatics, Uppsala University, BMC Box 598, 75124, Uppsala and Stockholm Bioinformatics Center, Albanova, Stockholm University, SE-10691 Stockholm, Sweden
| | - Erik Sjölund
- Linnaeus Centre for Bioinformatics, Uppsala University, BMC Box 598, 75124, Uppsala and Stockholm Bioinformatics Center, Albanova, Stockholm University, SE-10691 Stockholm, Sweden
| | - Gabriel Östlund
- Linnaeus Centre for Bioinformatics, Uppsala University, BMC Box 598, 75124, Uppsala and Stockholm Bioinformatics Center, Albanova, Stockholm University, SE-10691 Stockholm, Sweden
| | - Erik L. L. Sonnhammer
- Linnaeus Centre for Bioinformatics, Uppsala University, BMC Box 598, 75124, Uppsala and Stockholm Bioinformatics Center, Albanova, Stockholm University, SE-10691 Stockholm, Sweden
- *To whom correspondence should be addressed.+46 8 55378567+46 8 55378214
70
Finn RD, Tate J, Mistry J, Coggill PC, Sammut SJ, Hotz HR, Ceric G, Forslund K, Eddy SR, Sonnhammer ELL, Bateman A. The Pfam protein families database. Nucleic Acids Res 2007; 36:D281-8. [PMID: 18039703 PMCID: PMC2238907 DOI: 10.1093/nar/gkm960] [Citation(s) in RCA: 1671] [Impact Index Per Article: 98.3] [Indexed: 12/14/2022]
Abstract
Pfam is a comprehensive collection of protein domains and families, represented as multiple sequence alignments and as profile hidden Markov models. The current release of Pfam (22.0) contains 9318 protein families. Pfam is now based not only on the UniProtKB sequence database, but also on NCBI GenPept and on sequences from selected metagenomics projects. Pfam is available on the web from the consortium members using a new, consistent and improved website design in the UK (http://pfam.sanger.ac.uk/), the USA (http://pfam.janelia.org/) and Sweden (http://pfam.sbc.su.se/), as well as from mirror sites in France (http://pfam.jouy.inra.fr/) and South Korea (http://pfam.ccbb.re.kr/).
Affiliation(s)
- Robert D Finn
- Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton Hall, Hinxton, Cambridgeshire, CB10 1SA, UK
71
Abstract
Understanding the dynamics behind domain architecture evolution is of great importance to unravel the functions of proteins. Complex architectures have been created throughout evolution by rearrangement and duplication events. An interesting question is how many times a particular architecture has been created, a form of convergent evolution or domain architecture reinvention. Previous studies have approached this issue by comparing architectures found in different species. We wanted to achieve a finer-grained analysis by reconstructing protein architectures on complete domain trees. The prevalence of domain architecture reinvention in 96 genomes was investigated with a novel domain tree-based method that uses maximum parsimony for inferring ancestral protein architectures. Domain architectures were taken from Pfam. To ensure robustness, we applied the method to bootstrap trees and only considered results with strong statistical support. We detected multiple origins for 12.4% of the scored architectures. In a much smaller data set, the subset of completely domain-assigned proteins, the figure was 5.6%. These results indicate that domain architecture reinvention is a much more common phenomenon than previously thought. We also determined which domains are most frequent in multiply created architectures and assessed whether specific functions could be attributed to them. However, no strong functional bias was found in architectures with multiple origins.
Affiliation(s)
- Kristoffer Forslund
- Stockholm Bioinformatics Centre, Albanova, Stockholm University, Stockholm, Sweden
72
Abstract
PfamAlyzer is a Java applet that enables exploration of Pfam domain architectures using a user-friendly graphical interface. It can search the UniProt protein database for a domain pattern. Domain patterns similar to the query are presented graphically by PfamAlyzer either in a ranked list or pinned to the tree of life. Such domain-centric homology search can assist identification of distant homologs with shared domain architecture. AVAILABILITY PfamAlyzer has been integrated with the Pfam database and can be accessed at http://pfam.cgb.ki.se/pfamalyzer.
Affiliation(s)
- Volker Hollich
- Department of Cell and Molecular Biology, Karolinska Institutet, S-171 77 Stockholm, Sweden
73
Abstract
When using conventional transmembrane topology and signal peptide predictors, such as TMHMM and SignalP, there is a substantial overlap between these two types of predictions. Applying these methods to five complete proteomes, we found that 30–65% of all predicted signal peptides and 25–35% of all predicted transmembrane topologies overlap. This impairs predictions of 5–10% of the proteome, hence this is an important issue in protein annotation. To address this problem, we previously designed a hidden Markov model, Phobius, that combines transmembrane topology and signal peptide predictions. The method makes an optimal choice between transmembrane segments and signal peptides, and also allows constrained and homology-enriched predictions. We here present a web interface (http://phobius.cgb.ki.se and http://phobius.binf.ku.dk) to access Phobius.
Affiliation(s)
- Lukas Käll
- Center for Genomics and Bioinformatics, Karolinska Institutet, S-17177 Stockholm, Sweden.
74
Abstract
MOTIVATION The complete sequencing of many genomes has made it possible to identify orthologous genes descending from a common ancestor. However, reconstruction of evolutionary history over long time periods faces many challenges due to gene duplications and losses. Identification of orthologous groups shared by multiple proteomes therefore becomes a clustering problem in which an optimal compromise between conflicting evidence needs to be found. RESULTS Here we present a new proteome-scale analysis program called MultiParanoid that can automatically find orthology relationships between proteins in multiple proteomes. The software is an extension of the InParanoid program that identifies orthologs and inparalogs in pairwise proteome comparisons. MultiParanoid applies a clustering algorithm to merge multiple pairwise ortholog groups from InParanoid into multi-species ortholog groups. To avoid outparalogs in the same cluster, MultiParanoid only combines species that share the same last ancestor. To validate the clustering technique, we compared the results to a reference set obtained by manual phylogenetic analysis. We further compared the results to ortholog groups in KOGs and OrthoMCL, which revealed that MultiParanoid produces substantially fewer outparalogs than these resources. AVAILABILITY MultiParanoid is a freely available standalone program that enables efficient orthology analysis much needed in the post-genomic era. A web-based service providing access to the original datasets, the resulting groups of orthologs, and the source code of the program can be found at http://multiparanoid.cgb.ki.se.
Affiliation(s)
- Andrey Alexeyenko
- Center for Genomics and Bioinformatics, Karolinska Institutet, S-17177 Stockholm, Sweden
75
Alexeyenko A, Millar AH, Whelan J, Sonnhammer ELL. Chromosomal clustering of nuclear genes encoding mitochondrial and chloroplast proteins in Arabidopsis. Trends Genet 2006; 22:589-93. [PMID: 16979780 DOI: 10.1016/j.tig.2006.09.002] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.3] [Received: 01/12/2006] [Revised: 07/19/2006] [Accepted: 09/04/2006] [Indexed: 11/18/2022]
Abstract
We present a statistical analysis of chromosomal clustering among nuclear genes encoding mitochondrial or chloroplast proteins in Arabidopsis. For both organelles, the clustering was significantly increased above the expectation, but the clustering effect was weak, and most clusters were small and dispersed. Clustered genes showed coexpression but not more than expected, and no substantial synteny was detected in other eukaryotic genomes. We propose that the unexpected clustering results from continuous selection favoring chromosomal proximity of genes acting in the same organelle.
Affiliation(s)
- Andrey Alexeyenko
- Center for Genomics and Bioinformatics, Karolinska Institutet, S-17177 Stockholm, Sweden
76
Abstract
Obtaining high quality multiple alignments is crucial for a range of sequence analysis tasks. A common strategy is to align the sequences several times, varying the program or parameters until the best alignment according to manual inspection by human experts is found. Ideally, this should be assisted by an automatic assessment of the alignment quality. Our web-site allows users to perform all these steps: Kalign to align sequences, Kalignvu to view and verify the resulting alignments and Mumsa to assess the quality. Due to the computational efficiency of Kalign we can allow users to submit hundreds of sequences to be aligned and still guarantee fast response times. All servers are freely accessible and the underlying software can be freely downloaded for local use.
Affiliation(s)
- Timo Lassmann
- Center for Genomics and Bioinformatics, Karolinska Institutet, S-17177 Stockholm, Sweden.
77
Abhiman S, Daub CO, Sonnhammer ELL. Prediction of function divergence in protein families using the substitution rate variation parameter alpha. Mol Biol Evol 2006; 23:1406-13. [PMID: 16672285 DOI: 10.1093/molbev/msl002] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.6] [Indexed: 11/14/2022]
Abstract
Protein families typically embody a range of related functions and may thus be decomposed into subfamilies with, for example, distinct substrate specificities. Detection of functionally divergent subfamilies is possible by methods for recognizing branches of adaptive evolution in a gene tree. As the number of genome sequences is growing rapidly, it is highly desirable to automatically detect subfamily function divergence. To this end, we here introduce a method for large-scale prediction of function divergence within protein families. It is called the alpha shift measure (ASM) as it is based on detecting a shift in the shape parameter alpha (α) of the substitution rate gamma distribution. Four different methods for estimating alpha were investigated. We benchmarked the accuracy of ASM using function annotation from Enzyme Commission numbers within Pfam protein families divided into subfamilies by the automatic tree-based method BETE. In a test using 563 subfamily pairs in 162 families, ASM outperformed functional site-based methods using rate or conservation shifting (rate shift measure [RSM] and conservation shift measure [CSM]). The best results were obtained using the "GZ-Gamma" method for estimating alpha. By combining ASM with RSM and CSM using linear discriminant analysis, the prediction accuracy was further improved.
Affiliation(s)
- Saraswathi Abhiman
- Center for Genomics and Bioinformatics, Karolinska Institutet, Stockholm, Sweden.
78
Finn RD, Mistry J, Schuster-Böckler B, Griffiths-Jones S, Hollich V, Lassmann T, Moxon S, Marshall M, Khanna A, Durbin R, Eddy SR, Sonnhammer ELL, Bateman A. Pfam: clans, web tools and services. Nucleic Acids Res 2006; 34:D247-51. [PMID: 16381856 PMCID: PMC1347511 DOI: 10.1093/nar/gkj149] [Citation(s) in RCA: 1671] [Impact Index Per Article: 92.8] [Indexed: 11/13/2022]
Abstract
Pfam is a database of protein families that currently contains 7973 entries (release 18.0). A recent development in Pfam has enabled the grouping of related families into clans. Pfam clans are described in detail, together with the new associated web pages. Improvements to the range of Pfam web tools and the first set of Pfam web services that allow programmatic access to the database and associated tools are also presented. Pfam is available on the web in the UK (), the USA (), France () and Sweden ().
Affiliation(s)
- Robert D Finn
- Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK.
79
Abstract
G protein-coupled receptors (GPCRs) constitute a large superfamily involved in various types of signal transduction pathways triggered by hormones, odorants, peptides, proteins, and other types of ligands. The superfamily is so diverse that many members lack sequence similarity, although they all span the cell membrane seven times with an extracellular N and a cytosolic C terminus. We analyzed a divergent set of GPCRs and found distinct loop length patterns and differences in amino acid composition between cytosolic loops, extracellular loops, and membrane regions. We configured GPCRHMM, a hidden Markov model, to fit those features and trained it on a large dataset representing the entire superfamily. GPCRHMM was benchmarked against profile HMMs and generic transmembrane detectors on sets of known GPCRs and non-GPCRs. In a cross-validation procedure, profile HMMs produced an error rate nearly twice as high as GPCRHMM. In a sensitivity-selectivity test, GPCRHMM's sensitivity was about 15% higher than that of the best transmembrane predictors, at comparable false positive rates. We used GPCRHMM to search for novel members of the GPCR superfamily in five proteomes. In total, we detected 120 sequences that lacked annotation and are potentially novel GPCRs. Of those, 102 were found in Caenorhabditis elegans, four in human, and seven in mouse. Many predictions (65) belonged to Pfam domains of unknown function. GPCRHMM strongly rejected a family of arthropod-specific odorant receptors believed to be GPCRs. A detailed analysis showed that these sequences are indeed very different from other GPCRs. GPCRHMM is available at http://gpcrhmm.cgb.ki.se.
Affiliation(s)
- Markus Wistrand
- Center for Genomics and Bioinformatics, Karolinska Institutet, S-17177 Stockholm, Sweden
80
Abstract
Multiple sequence alignments play a central role in the annotation of novel genomes. Given the biological and computational complexity of this task, the automatic generation of high-quality alignments remains challenging. Since multiple alignments are usually employed at the very start of data analysis pipelines, it is crucial to ensure high alignment quality. We describe a simple, yet elegant, solution to assess the biological accuracy of alignments automatically. Our approach is based on the comparison of several alignments of the same sequences. We introduce two functions to compare alignments: the average overlap score and the multiple overlap score. The former identifies difficult alignment cases by expressing the similarity among several alignments, while the latter estimates the biological correctness of individual alignments. We implemented both functions in the MUMSA program and demonstrate the overall robustness and accuracy of both functions on three large benchmark sets.
Affiliation(s)
- Timo Lassmann
- Center for Genomics and Bioinformatics, Karolinska Institutet, S-17177 Stockholm, Sweden.
81
Hollich V, Milchert L, Arvestad L, Sonnhammer ELL. Assessment of protein distance measures and tree-building methods for phylogenetic tree reconstruction. Mol Biol Evol 2005; 22:2257-64. [PMID: 16049194 DOI: 10.1093/molbev/msi224] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.7] [Indexed: 11/14/2022]
Abstract
Distance-based methods are popular for reconstructing evolutionary trees of protein sequences, mainly because of their speed and generality. A number of variants of the classical neighbor-joining (NJ) algorithm have been proposed, as well as a number of methods to estimate protein distances. We here present a large-scale assessment of performance in reconstructing the correct tree topology for the most popular algorithms. The programs BIONJ, FastME, Weighbor, and standard NJ were run using 12 distance estimators, producing 48 tree-building/distance estimation method combinations. These were evaluated on a test set based on real trees taken from 100 Pfam families. Each tree was used to generate multiple sequence alignments with the ROSE program using three evolutionary models. The accuracy of each method was analyzed as a function of both sequence divergence and location in the tree. We found that BIONJ produced the overall best results, although the average accuracy differed little between the tree-building methods (normally less than 1%). A noticeable trend was that FastME performed worse than the rest on long branches. Weighbor was several orders of magnitude slower than the other programs. Larger differences were observed when using different distance estimators. Protein-adapted Jukes-Cantor and Kimura distance correction produced clearly poorer results than the other methods, even worse than uncorrected distances. We also assessed the recently developed Scoredist measure, which performed as well as more complex methods.
Affiliation(s)
- Volker Hollich
- Center for Genomics and Bioinformatics, Karolinska Institutet, Stockholm, Sweden.
82
Abstract
Protein function shift can be predicted from sequence comparisons, either using positive selection signals or evolutionary rate estimation. None of the methods have been validated on large datasets, however. Here we investigate existing and novel methods for protein function shift prediction, and benchmark the accuracy against a large dataset of proteins with known enzymatic functions. Function change was predicted between subfamilies by identifying two kinds of sites in a multiple sequence alignment: Conservation-Shifting Sites (CSS), which are conserved in two subfamilies using two different amino acid types, and Rate-Shifting Sites (RSS), which have different evolutionary rates in two subfamilies. CSS were predicted by a new entropy-based method, and RSS using the Rate-Shift program. In principle, the more CSS and RSS between two subfamilies, the more likely a function shift between them. A test dataset was built by extracting subfamilies from Pfam with different EC numbers that belong to the same domain family. Subfamilies were generated automatically using a phylogenetic tree-based program, BETE. The dataset comprised 997 subfamily pairs with four or more members per subfamily. We observed a significant increase in CSS and RSS for subfamily comparisons with different EC numbers compared to cases with same EC numbers. The discrimination was better using RSS than CSS, and was more pronounced for larger families. Combining RSS and CSS by discriminant analysis improved classification accuracy to 71%. The method was applied to the Pfam database and the results are available at http://FunShift.cgb.ki.se. A closer examination of some superfamily comparisons showed that single EC numbers sometimes embody distinct functional classes. Hence, the measured accuracy of function shift is underestimated.
Affiliation(s)
- Saraswathi Abhiman
- Center for Genomics and Bioinformatics, Karolinska Institutet, Stockholm, Sweden
83
Abstract
MOTIVATION When predicting sequence features like transmembrane topology, signal peptides, coiled-coil structures, protein secondary structure or genes, extra support can be gained from homologs. RESULTS We present here a general hidden Markov model (HMM) decoding algorithm that combines probabilities for sequence features of homologs by considering the average of the posterior label probability of each position in a global sequence alignment. The algorithm is an extension of the previously described 'optimal accuracy' decoder, allowing homology information to be used. It was benchmarked using an HMM for transmembrane topology and signal peptide prediction, Phobius. We found that the performance was substantially increased when incorporating information from homologs. AVAILABILITY A prediction server for transmembrane topology and signal peptides that uses the algorithm is available at http://phobius.cgb.ki.se/poly.html. An implementation of the algorithm is available on request from the authors.
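The averaging step described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `average_label_posteriors`, the gap handling (gapped sequences are skipped in a column), and the toy labels `S` and `M` are all assumptions made for illustration, and the subsequent 'optimal accuracy' decoding over the averaged probabilities is omitted.

```python
from typing import Dict, List

GAP = "-"  # assumed gap character in the alignment rows


def average_label_posteriors(
    alignment: List[str],
    posteriors: List[List[Dict[str, float]]],
) -> List[Dict[str, float]]:
    """Average per-residue posterior label probabilities column by column.

    alignment[i] is the gapped row of sequence i; posteriors[i][k] maps each
    label to its posterior probability at ungapped residue k of sequence i.
    Sequences with a gap in a column do not contribute to it (an assumption;
    the abstract does not say how gaps are treated).
    """
    n_cols = len(alignment[0])
    residue_index = [0] * len(alignment)  # next ungapped residue per sequence
    averaged: List[Dict[str, float]] = []
    for col in range(n_cols):
        totals: Dict[str, float] = {}
        contributors = 0
        for i, row in enumerate(alignment):
            if row[col] == GAP:
                continue
            for label, prob in posteriors[i][residue_index[i]].items():
                totals[label] = totals.get(label, 0.0) + prob
            residue_index[i] += 1
            contributors += 1
        if contributors:
            averaged.append({lab: t / contributors for lab, t in totals.items()})
        else:
            averaged.append({})  # all-gap column
    return averaged


# Toy example: two aligned homologs, hypothetical labels S (signal) and M (membrane).
alignment = ["MK-L", "MKAL"]
posteriors = [
    [{"S": 0.75, "M": 0.25}, {"S": 0.5, "M": 0.5}, {"S": 0.25, "M": 0.75}],
    [{"S": 0.25, "M": 0.75}, {"S": 0.5, "M": 0.5},
     {"S": 1.0, "M": 0.0}, {"S": 0.25, "M": 0.75}],
]
avg = average_label_posteriors(alignment, posteriors)
```

A decoder that respects the HMM's label grammar would then be run on `avg`; taking a per-column argmax instead would be a cruder simplification.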
Affiliation(s)
- Lukas Käll
- Center for Genomics and Bioinformatics, Karolinska Institutet SE-17 177 Stockholm, Sweden
84
Abstract
The transmembrane topology of presenilins is still the subject of debate despite many experimental topology studies using antibodies or gene fusions. The results from these studies are partly contradictory and consequently several topology models have been proposed. Studies of presenilin-interacting proteins have produced further contradiction, primarily regarding the location of the C-terminus. It is thus impossible to produce a topology model that agrees with all published data on presenilin. We have analyzed the presenilin topology through computational sequence analysis of the presenilin family and the homologous presenilin-like protein family. Members of these families are intramembrane-cleaving aspartyl proteases. Although the overall sequence homology between the two families is low, they share the conserved putative active site residues and the conserved 'PAL' motif. Therefore, the topology model for the presenilin-like proteins can give some clues about the presenilin topology. Here we propose a novel nine-transmembrane topology with the C-terminus in the extracytosolic space. This model has strong support from published data on gamma-secretase function and presenilin topology. Contrary to most presenilin topology models, we show that hydrophobic region X is probably a transmembrane segment. Consequently, the C-terminus would be located in the extracytosolic space. However, the last C-terminal amino acids are relatively hydrophobic and in conjunction with existing experimental data we cannot exclude the possibility that the extreme C-terminus could be buried within the gamma-secretase complex. This might explain the difficulties in obtaining consistent experimental evidence regarding the location of the C-terminal region of presenilin.
Affiliation(s)
- Anna Henricson
- Center for Genomics and Bioinformatics, Karolinska Institutet, Stockholm, Sweden
85
Abstract
Short interfering RNAs (siRNAs) are a popular method for gene-knockdown, acting by degrading the target mRNA. Before performing experiments it is invaluable to locate and evaluate previous knockdown experiments for the gene of interest. The siRNA database provides a gene-centric view of siRNA experimental data, including siRNAs of known efficacy and siRNAs predicted to be of high efficacy by a combination of methods. Linked to these sequences is information such as siRNA thermodynamic properties and the potential for sequence-specific off-target effects. The database enables the user to evaluate an siRNA's potential for inhibition and non-specific effects. The database is available at http://siRNA.cgb.ki.se.
Affiliation(s)
- Alistair M Chalk
- Center for Genomics and Bioinformatics, Karolinska Institutet, Berzelius väg 35, S-171 77 Stockholm, Sweden.
86
Abstract
The Inparanoid eukaryotic ortholog database (http://inparanoid.cgb.ki.se/) is a collection of pairwise ortholog groups between 17 whole genomes; Anopheles gambiae, Caenorhabditis briggsae, Caenorhabditis elegans, Drosophila melanogaster, Danio rerio, Takifugu rubripes, Gallus gallus, Homo sapiens, Mus musculus, Pan troglodytes, Rattus norvegicus, Oryza sativa, Plasmodium falciparum, Arabidopsis thaliana, Escherichia coli, Saccharomyces cerevisiae and Schizosaccharomyces pombe. Complete proteomes for these genomes were derived from Ensembl and UniProt and compared pairwise using Blast, followed by a clustering step using the Inparanoid program. An Inparanoid cluster is seeded by a reciprocally best-matching ortholog pair, around which inparalogs (should they exist) are gathered independently, while outparalogs are excluded. The ortholog clusters can be searched on the website using Ensembl gene/protein or UniProt identifiers, annotation text or by Blast alignment against our protein datasets. The entire dataset can be downloaded, as can the Inparanoid program itself.
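The seeding step mentioned in this abstract, a reciprocally best-matching pair, can be sketched in Python as follows. This is an illustrative sketch only, not the Inparanoid program: the function names, the score dictionaries, and the toy protein identifiers are hypothetical, and the later steps (gathering inparalogs, excluding outparalogs, score cut-offs) are not shown.

```python
from typing import Dict, List, Tuple

Scores = Dict[Tuple[str, str], float]  # (query, target) -> similarity score


def best_hits(scores: Scores) -> Dict[str, str]:
    """Return, for each query, the target with the highest score."""
    best: Dict[str, Tuple[str, float]] = {}
    for (query, target), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (target, score)
    return {query: target for query, (target, _) in best.items()}


def reciprocal_best_pairs(a_vs_b: Scores, b_vs_a: Scores) -> List[Tuple[str, str]]:
    """Pairs (a, b) where b is a's best hit and a is b's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Toy scores between hypothetical proteins of species A and species B.
a_vs_b = {("a1", "b1"): 250.0, ("a1", "b2"): 90.0,
          ("a2", "b2"): 180.0, ("a3", "b1"): 120.0}
b_vs_a = {("b1", "a1"): 245.0, ("b1", "a3"): 118.0,
          ("b2", "a2"): 175.0, ("b2", "a1"): 88.0}
seeds = reciprocal_best_pairs(a_vs_b, b_vs_a)
```

In this toy data `a3` also hits `b1` best, but `b1` prefers `a1`, so `a3` does not seed a cluster of its own.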
Affiliation(s)
- Kevin P O'Brien
- Center for Genomics and Bioinformatics, Karolinska Institutet, S-171 77 Stockholm, Sweden
87
Abstract
Members of a protein family normally have a general biochemical function in common, but frequently one or more subgroups have evolved a slightly different function, such as different substrate specificity. It is important to detect such function shifts for a more accurate functional annotation. The FunShift database described here is a compilation of function shift analysis performed between subfamilies in protein families. It consists of two main components: (i) subfamilies derived from protein domain families and (ii) pairwise subfamily comparisons analyzed for function shift. The present release, FunShift 12, was derived from Pfam 12 and consists of 151 934 subfamilies derived from 7300 families. We carried out function shift analysis by two complementary methods on families with up to 500 members. From a total of 179 210 subfamily pairs, 62 384 were predicted to be functionally shifted in 2881 families. Each subfamily pair is provided with a markup of probable functional specificity-determining sites. Tools for searching and exploring the data are provided to make this database a valuable resource for protein function annotation. Knowledge of these functionally important sites will be useful for experimental biologists performing functional mutation studies. FunShift is available at http://FunShift.cgb.ki.se.
Affiliation(s)
- Saraswathi Abhiman
- Center for Genomics and Bioinformatics, Karolinska Institutet, S-17177 Stockholm, Sweden
88
Abstract
A report on the fourth Cold Spring Harbor Laboratory/Wellcome Trust Conference on Genome Informatics, Hinxton, UK, 22-26 September 2004.
Affiliation(s)
- Erik L L Sonnhammer
- Center for Genomics and Bioinformatics, Karolinska Institutet, 171 77 Stockholm, Sweden.
89
Abstract
One of the greatest promises of genome sequencing projects is to further the understanding of human diseases and to develop new therapies. Model organism genomes have been sequenced in parallel to human genomes to provide effective tools for the investigation of human gene function. Many of their genes share a common ancestry and function with human genes, and this is particularly true for orthologous genes. Here we present OrthoDisease, a comprehensive database of model organism genes that are orthologous to human disease genes. OrthoDisease was constructed by applying the Inparanoid ortholog detection algorithm to disease genes derived from the Online Mendelian Inheritance in Man database (OMIM). Pairwise whole genome/proteome comparisons between Homo sapiens and six other organisms were performed to identify ortholog clusters. OMIM numbers were extracted from the OMIM Morbid Map and were converted to gene sequences using the Locuslink mim2loc and loc2acc tables. These were mapped to Inparanoid ortholog clusters using Blast. The number of ortholog clusters in OrthoDisease with each respective species is currently: M. musculus, 1,354; D. melanogaster, 724; C. elegans, 533; A. thaliana, 398; S. cerevisiae, 290; and E. coli, 153. The database is accessible online at http://orthodisease.cgb.ki.se, and can be searched with disease or protein names. The web interface presents all ortholog clusters that include a selected disease gene. A capability to download the entire dataset is also provided.
Affiliation(s)
- Kevin P O'Brien
- Center for Genomics and Bioinformatics, Karolinska Institutet, Stockholm, Sweden.
90
Abstract
Insertions and deletions in a profile hidden Markov model (HMM) are modeled by transition probabilities between insert, delete and match states. These are estimated by combining observed data and prior probabilities. The transition prior probabilities can be defined either ad hoc or by maximum likelihood (ML) estimation. We show that the choice of transition prior greatly affects the HMM's ability to discriminate between true and false hits. HMM discrimination was measured using the HMMER 2.2 package applied to 373 families from Pfam. We measured the discrimination between true members and noise sequences employing various ML transition priors and also systematically scanned the parameter space of ad hoc transition priors. Our results indicate that ML priors produce far from optimal discrimination, and we present an empirically derived prior that considerably decreases the number of misclassifications compared to ML. Most of the difference stems from the probabilities for exiting a delete state. The ML prior, which is unaware of noise sequences, estimates a delete-to-delete probability that is relatively high and does not penalize noise sequences enough for optimal discrimination.
Affiliation(s)
- Markus Wistrand
- Center for Genomics and Bioinformatics, Karolinska Institutet, S-17177 Stockholm, Sweden
91
Abstract
Short interfering RNAs are used in functional genomics studies to knock down a single gene in a reversible manner. The results of siRNA experiments are highly dependent on the choice of siRNA sequence. In order to evaluate siRNA design rules, we collected a database of 398 siRNAs of known efficacy from 92 genes. We used this database to evaluate previously proposed rules from smaller datasets, and to find a new set of rules that are optimal for the entire database. We also trained a regression tree with full cross-validation. It was, however, difficult to obtain the same precision as methods previously tested on small datasets from one or two genes. We show that those methods are overfitting as they work poorly on independent validation datasets from multiple genes. Our new design rules can predict siRNAs with efficacy ≥50% in 91% of cases, and with efficacy ≥90% in 52% of cases, which is more than a twofold improvement over random selection. Software for designing siRNAs is available online via a web server at or as a standalone version for high-throughput applications.
Affiliation(s)
- Alistair M Chalk
- Center for Genomics and Bioinformatics, Karolinska Institutet, Berzelius väg 35, S-171 77 Stockholm, Sweden.
92
Käll L, Krogh A, Sonnhammer ELL. A combined transmembrane topology and signal peptide prediction method. J Mol Biol 2004; 338:1027-36. [PMID: 15111065 DOI: 10.1016/j.jmb.2004.03.016] [Citation(s) in RCA: 1666] [Impact Index Per Article: 83.3] [Received: 10/30/2003] [Revised: 02/25/2004] [Accepted: 03/09/2004] [Indexed: 01/09/2023]
Abstract
An inherent problem in transmembrane protein topology prediction and signal peptide prediction is the high similarity between the hydrophobic regions of a transmembrane helix and that of a signal peptide, leading to cross-reaction between the two types of predictions. To improve predictions further, it is therefore important to make a predictor that aims to discriminate between the two classes. In addition, topology information can be gained when successfully predicting a signal peptide leading a transmembrane protein since it dictates that the N terminus of the mature protein must be on the non-cytoplasmic side of the membrane. Here, we present Phobius, a combined transmembrane protein topology and signal peptide predictor. The predictor is based on a hidden Markov model (HMM) that models the different sequence regions of a signal peptide and the different regions of a transmembrane protein in a series of interconnected states. Training was done on a newly assembled and curated dataset. Compared to TMHMM and SignalP, errors coming from cross-prediction between transmembrane segments and signal peptides were reduced substantially by Phobius. False classifications of signal peptides were reduced from 26.1% to 3.9% and false classifications of transmembrane helices were reduced from 19.0% to 7.7%. Phobius was applied to the proteomes of Homo sapiens and Escherichia coli. Here we also noted a drastic reduction of false classifications compared to TMHMM/SignalP, suggesting that Phobius is well suited for whole-genome annotation of signal peptides and transmembrane regions. The method is available at as well as at
Affiliation(s)
- Lukas Käll
- Center for Genomics and Bioinformatics, Karolinska Institutet, SE-17 177 Stockholm, Sweden
|
93
|
Wistrand M, Sonnhammer ELL. Improving Profile HMM Discrimination by Adapting Transition Probabilities. J Mol Biol 2004; 338:847-54. [PMID: 15099750 DOI: 10.1016/j.jmb.2004.03.023] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.2] [Received: 11/20/2003] [Revised: 02/25/2004] [Accepted: 03/04/2004] [Indexed: 12/21/2022]
Abstract
Profile hidden Markov models (HMMs) are used to model protein families and for detecting evolutionary relationships between proteins. Such a profile HMM is typically constructed from a multiple alignment of a set of related sequences. Transition probability parameters in an HMM are used to model insertions and deletions in the alignment. We show here that taking into account unrelated sequences when estimating the transition probability parameters helps to construct more discriminative models for the global/local alignment mode. After normal HMM training, a simple heuristic is employed that adjusts the transition probabilities between match and delete states according to observed transitions in the training set relative to the unrelated (noise) set. The method is called adaptive transition probabilities (ATP) and is based on the HMMER package implementation. It was benchmarked in two remote homology tests based on the Pfam and the SCOP classifications. Compared to the HMMER default procedure, the rate of misclassification was reduced significantly in both tests and across all levels of error rate.
Affiliation(s)
- Markus Wistrand
- Center for Genomics and Bioinformatics, Karolinska Institutet, S-17177 Stockholm, Sweden
|
94
|
Abstract
Sfixem is a sequence feature series (SFS) visualization tool implemented in Java. It is designed to visualize data from sequence analysis programs, allowing the user to view multiple sets of computationally generated analyses to assist the analysis process. SFS is used as the data exchange format. Availability: Sfixem is available for direct usage or download for local usage at http://sfixem.cgb.ki.se. A protein sequence analysis workbench using Sfixem is available at http://sfinx.cgb.ki.se.
Affiliation(s)
- Alistair M Chalk
- Center for Genomics and Bioinformatics, Karolinska Institutet, 17177 Stockholm, Sweden.
|
95
|
Abstract
ChromoWheel is an Internet browser application for generating whole-genome illustrations. It can be used to depict chromosomes, genes and relations between chromosomal loci. The circular layout of chromosomes is advantageous for showing relationships between different chromosomes, as the connecting line never crosses over a chromosome. All graphical image components are in the vector-based Scalable Vector Graphics format, which is highly scalable and permits user interaction. ChromoWheel can be run either with user-provided data in the generic SFS format or as a browser front-end for precompiled genomic data.
Affiliation(s)
- Sven Ekdahl
- Center for Genomics and Bioinformatics, Karolinska Institutet, S-17177 Stockholm, Sweden
|
96
|
Bateman A, Coin L, Durbin R, Finn RD, Hollich V, Griffiths-Jones S, Khanna A, Marshall M, Moxon S, Sonnhammer ELL, Studholme DJ, Yeats C, Eddy SR. The Pfam protein families database. Nucleic Acids Res 2004; 32:D138-41. [PMID: 14681378 PMCID: PMC308855 DOI: 10.1093/nar/gkh121] [Citation(s) in RCA: 2595] [Impact Index Per Article: 129.8] [Indexed: 02/02/2023]
Abstract
Pfam is a large collection of protein families and domains. Over the past 2 years the number of families in Pfam has doubled and now stands at 6190 (version 10.0). Methodology improvements for searching the Pfam collection locally as well as via the web are described. Other recent innovations include modelling of discontinuous domains allowing Pfam domain definitions to be closer to those found in structure databases. Pfam is available on the web in the UK (http://www.sanger.ac.uk/Software/Pfam/), the USA (http://pfam.wustl.edu/), France (http://pfam.jouy.inra.fr/) and Sweden (http://Pfam.cgb.ki.se/).
Affiliation(s)
- Alex Bateman
- Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK.
|
97
|
Abstract
One of the most reliable methods for protein function annotation is to transfer experimentally known functions from orthologous proteins in other organisms. Most methods for identifying orthologs operate on a subset of organisms with a completely sequenced genome, and treat proteins as single-domain units. However, it is well known that proteins are often made up of several independent domains, and there is a wealth of protein sequences from genomes that are not completely sequenced. A comprehensive set of protein domain families is found in the Pfam database. We wanted to apply orthology detection to Pfam families, but first some issues needed to be addressed. First, orthology detection becomes impractical and unreliable when too many species are included. Second, shorter domains contain less information. It is therefore important to assess the quality of the orthology assignment and avoid very short domains altogether. We present a database of orthologous protein domains in Pfam called HOPS: Hierarchical grouping of Orthologous and Paralogous Sequences. Orthology is inferred in a hierarchic system of phylogenetic subgroups using ortholog bootstrapping. To avoid the frequent errors stemming from horizontally transferred genes in bacteria, the analysis is presently limited to eukaryotic genes. The results are accessible in the graphical browser NIFAS, a Java tool originally developed for analyzing phylogenetic relations within Pfam families. The method was tested on a set of curated orthologs with experimentally verified function. In comparison to tree reconciliation with a complete species tree, our approach finds significantly more orthologs in the test set. Examples for investigating gene fusions and domain recombination using HOPS are given.
Affiliation(s)
- Christian E V Storm
- Center for Genomics and Bioinformatics, Karolinska Institutet, S-17177 Stockholm, Sweden
|
98
|
Abstract
Genomic clustering of genes in a pathway is commonly found in prokaryotes due to transcriptional operons, but these are not present in most eukaryotes. Yet pathway members might still cluster to a lesser extent in eukaryotic genomes, which would assist coregulation of a set of functionally cooperating genes. We analyzed five sequenced eukaryotic genomes for clustering of genes assigned to the same pathway in the KEGG database. Between 98% and 30% of the analyzed pathways in a genome were found to exhibit significantly higher clustering levels than expected by chance. In descending order by the level of clustering, the genomes studied were Saccharomyces cerevisiae, Homo sapiens, Caenorhabditis elegans, Arabidopsis thaliana, and Drosophila melanogaster. Surprisingly, there is not much agreement between genomes in terms of which pathways are most clustered. Only seven of 69 pathways found in all species were significantly clustered in all five of them. This species-specific pattern of pathway clustering may reflect adaptations or evolutionary events unique to a particular lineage. We note that although operons are common in C. elegans, only 58% of the pathways showed significant clustering, which is less than in human. Virtually all pathways in S. cerevisiae showed significant clustering.
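The phrase "significantly higher clustering levels than expected by chance" can be illustrated with a small permutation test. The abstract does not state which statistic or null model was used, so the sketch below is an assumption throughout: it treats the genome as a single chromosome of ordered gene indices, scores a pathway by the number of gene pairs lying within a hypothetical `window` of each other, and compares that score with randomly drawn gene sets of the same size.

```python
import random
from itertools import combinations

def close_pairs(positions, window):
    """Count gene pairs whose index distance is at most `window`."""
    return sum(1 for a, b in combinations(positions, 2) if abs(a - b) <= window)

def clustering_p_value(all_positions, pathway_positions, window=5, n_perm=999, seed=0):
    """Empirical p-value for a pathway being at least as clustered as observed.

    Assumption: the null model draws the same number of genes uniformly
    at random from one chromosome; this is not necessarily the paper's model.
    """
    rng = random.Random(seed)
    observed = close_pairs(pathway_positions, window)
    k = len(pathway_positions)
    hits = 0
    for _ in range(n_perm):
        sample = rng.sample(all_positions, k)
        if close_pairs(sample, window) >= observed:
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (hits + 1) / (n_perm + 1)

# Toy data (hypothetical): 200 genes in order, two six-gene "pathways".
genome = list(range(200))
tight_pathway = [0, 1, 2, 3, 4, 5]            # adjacent genes
spread_pathway = [0, 40, 80, 120, 160, 199]   # widely separated genes

p_tight = clustering_p_value(genome, tight_pathway)
p_spread = clustering_p_value(genome, spread_pathway)
```

A real analysis would have to handle multiple chromosomes, physical distances, and multiple-testing correction across pathways, none of which this toy example attempts.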
Affiliation(s)
- Jennifer M Lee
- Center for Genomics and Bioinformatics, Karolinska Institutet, S171 77 Stockholm, Sweden
|
99
|
Abstract
Transmembrane prediction methods are generally benchmarked on a set of proteins with experimentally verified topology. We have investigated whether the accuracy measured on such datasets can be expected in an unbiased genomic analysis, or whether the benchmark datasets are biased towards 'easily predictable' proteins. As a measurement of accuracy, the concordance of the results from five different prediction methods was used (TMHMM, PHD, HMMTOP, MEMSAT, and TOPPRED). The benchmark dataset showed significantly higher levels (up to five times) of agreement between different methods than in 10 tested genomes. We have also analyzed which programs are most prone to make mispredictions by measuring the frequency of one-out-of-five disagreeing predictions.
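The two measurements described above, overall concordance and one-out-of-five disagreements, can be sketched as follows. This is a minimal illustration and not the authors' code: the input layout (a dictionary mapping each protein to one label per method) and the use of a predicted helix count as the label are assumptions, as is the toy data.

```python
from collections import Counter

METHODS = ["TMHMM", "PHD", "HMMTOP", "MEMSAT", "TOPPRED"]

def concordance(predictions):
    """Return (fraction of proteins where all five methods agree,
    Counter of how often each method is the single dissenter).

    `predictions` maps protein id -> {method name -> predicted label}.
    The label is hypothetical here: a predicted helix count.
    """
    all_agree = 0
    odd_one_out = Counter()
    for per_method in predictions.values():
        counts = Counter(per_method[m] for m in METHODS)
        if len(counts) == 1:
            all_agree += 1
        elif sorted(counts.values()) == [1, 4]:
            # Exactly one method disagrees with the other four.
            minority_label = min(counts, key=counts.get)
            for m in METHODS:
                if per_method[m] == minority_label:
                    odd_one_out[m] += 1
    return all_agree / len(predictions), odd_one_out

# Toy predictions (hypothetical helix counts per method).
toy = {
    "p1": {"TMHMM": 7, "PHD": 7, "HMMTOP": 7, "MEMSAT": 7, "TOPPRED": 7},
    "p2": {"TMHMM": 7, "PHD": 7, "HMMTOP": 7, "MEMSAT": 7, "TOPPRED": 6},
    "p3": {"TMHMM": 2, "PHD": 2, "HMMTOP": 3, "MEMSAT": 3, "TOPPRED": 4},
    "p4": {"TMHMM": 2, "PHD": 1, "HMMTOP": 2, "MEMSAT": 2, "TOPPRED": 2},
}

fraction, dissenters = concordance(toy)
```

Comparing `fraction` between a benchmark set and whole-genome sets would mirror the comparison reported in the abstract, while `dissenters` gives a rough indication of which method most often stands alone.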
Affiliation(s)
- Lukas Käll
- Center for Genomics and Bioinformatics, Karolinska Institutet, 17177, Stockholm, Sweden
|
100
|
|