1
Buzzao D, Castresana-Aguirre M, Guala D, Sonnhammer ELL. Benchmarking enrichment analysis methods with the disease pathway network. Brief Bioinform 2024; 25:bbae069. [PMID: 38436561 PMCID: PMC10939300 DOI: 10.1093/bib/bbae069] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/29/2023] [Revised: 01/10/2024] [Accepted: 02/03/2024] [Indexed: 03/05/2024] Open
Abstract
Enrichment analysis (EA) is a common approach to gain functional insights from genome-scale experiments. As a consequence, a large number of EA methods have been developed, yet it is unclear from previous studies which method is the best for a given dataset. The main issues with previous benchmarks include the complexity of correctly assigning true pathways to a test dataset, and lack of generality of the evaluation metrics, for which the rank of a single target pathway is commonly used. We here provide a generalized EA benchmark and apply it to the most widely used EA methods, representing all four categories of current approaches. The benchmark employs a new set of 82 curated gene expression datasets from DNA microarray and RNA-Seq experiments for 26 diseases, of which only 13 are cancers. In order to address the shortcomings of the single target pathway approach and to enhance the sensitivity evaluation, we present the Disease Pathway Network, in which related Kyoto Encyclopedia of Genes and Genomes pathways are linked. We introduce a novel approach to evaluate pathway EA by combining sensitivity and specificity to provide a balanced evaluation of EA methods. This approach identifies Network Enrichment Analysis methods as the overall top performers compared with overlap-based methods. By using randomized gene expression datasets, we explore the null hypothesis bias of each method, revealing that most of them produce skewed P-values.
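As an illustrative aside that is not taken from the paper: overlap-based EA methods, the category this abstract contrasts with network-based ones, commonly score a pathway with a one-sided hypergeometric test on the overlap between a query gene list and the pathway. The sketch below assumes that formulation; the function name and the gene counts are hypothetical and do not come from the benchmark.

```python
from math import comb


def hypergeom_enrichment_p(n_universe, n_pathway, n_query, n_overlap):
    """Upper-tail hypergeometric P-value, P(X >= n_overlap).

    n_universe: number of genes in the background (assumed known)
    n_pathway:  number of background genes annotated to the pathway
    n_query:    number of genes in the query list (e.g. differentially expressed)
    n_overlap:  number of query genes that are also in the pathway
    """
    max_overlap = min(n_pathway, n_query)
    total = comb(n_universe, n_query)
    # Sum the probabilities of every overlap at least as large as the observed one.
    tail = sum(
        comb(n_pathway, k) * comb(n_universe - n_pathway, n_query - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail / total


# Hypothetical numbers: 20000 background genes, a 100-gene pathway,
# a 200-gene query list, 5 genes shared.
p_value = hypergeom_enrichment_p(20000, 100, 200, 5)
print(p_value)
```

A test of this kind looks only at shared genes, which is the limitation that network-based methods try to address by also counting links between the query and the pathway.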
Affiliation(s)
- Davide Buzzao
- Department of Biochemistry and Biophysics, Stockholm University, Science for Life Laboratory, Box 1031, 171 21 Solna, Sweden
- Dimitri Guala
- Department of Biochemistry and Biophysics, Stockholm University, Science for Life Laboratory, Box 1031, 171 21 Solna, Sweden
- Erik L L Sonnhammer
- Department of Biochemistry and Biophysics, Stockholm University, Science for Life Laboratory, Box 1031, 171 21 Solna, Sweden
2
Persson E, Sonnhammer ELL. InParanoiDB 9: Ortholog Groups for Protein Domains and Full-Length Proteins. J Mol Biol 2023:168001. [PMID: 36764355 DOI: 10.1016/j.jmb.2023.168001] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Received: 11/30/2022] [Revised: 01/20/2023] [Accepted: 02/01/2023] [Indexed: 02/11/2023]
Abstract
Prediction of orthologs is an important bioinformatics pursuit that is frequently used for inferring protein function and evolutionary analyses. The InParanoid database is a well known resource of ortholog predictions between a wide variety of organisms. Although orthologs have historically been inferred at the level of full-length protein sequences, many proteins consist of several independent protein domains that may be orthologous to domains in other proteins in a way that differs from the full-length protein case. To be able to capture all types of orthologous relations, conventional full-length protein orthologs can be complemented with orthologs inferred at the domain level. We here present InParanoiDB 9, covering 640 species and providing orthologs for both protein domains and full-length proteins. InParanoiDB 9 was built using the faster InParanoid-DIAMOND algorithm for orthology analysis, as well as Domainoid and Pfam to infer orthologous domains. InParanoiDB 9 is based on proteomes from 447 eukaryotes, 158 bacteria and 35 archaea, and includes over one billion predicted ortholog groups. A new website has been built for the database, providing multiple search options as well as visualization of groups of orthologs and orthologous domains. This release constitutes a major upgrade of the InParanoid database in terms of the number of species as well as the new capability to operate on the domain level. InParanoiDB 9 is available at https://inparanoidb.sbc.su.se/.
Affiliation(s)
- Emma Persson
- Department of Biochemistry and Biophysics, Stockholm University, Science for Life Laboratory, Box 1031, 17121 Solna, Sweden. https://twitter.com/eriksonnhammer
- Erik L L Sonnhammer
- Department of Biochemistry and Biophysics, Stockholm University, Science for Life Laboratory, Box 1031, 17121 Solna, Sweden.
3
Buzzao D, Castresana-Aguirre M, Guala D, Sonnhammer ELL. TOPAS, a network-based approach to detect disease modules in a top-down fashion. NAR Genom Bioinform 2022; 4:lqac093. [PMCID: PMC9706483 DOI: 10.1093/nargab/lqac093] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/24/2022] [Revised: 10/14/2022] [Accepted: 11/15/2022] [Indexed: 12/02/2022] Open
Abstract
A vast scenario of potential disease mechanisms and remedies is yet to be discovered. The field of Network Medicine has grown thanks to the massive amount of high-throughput data and the emerging evidence that disease-related proteins form ‘disease modules’. Relying on prior disease knowledge, network-based disease module detection algorithms aim at connecting the list of known disease associated genes by exploiting interaction networks. Most existing methods extend disease modules by iteratively adding connector genes in a bottom-up fashion, while top-down approaches remain largely unexplored. We have created TOPAS, an iterative approach that aims at connecting the largest number of seed nodes in a top-down fashion through connectors that guarantee the highest flow of a Random Walk with Restart in a network of functional associations. We used a corpus of 382 manually selected functional gene sets to benchmark our algorithm against SCA, DIAMOnD, MaxLink and ROBUST across four interactomes. We demonstrate that TOPAS outperforms competing methods in terms of Seed Recovery Rate, Seed to Connector Ratio and consistency during module detection. We also show that TOPAS achieves competitive performance in terms of biological relevance of detected modules and scalability.
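For readers unfamiliar with the Random Walk with Restart mentioned in this abstract, the following is a minimal sketch of that primitive only. It is not the TOPAS algorithm and does not reproduce its connector selection. It assumes an undirected, unweighted network in which every node has at least one neighbour; the function name, the toy graph and the restart probability of 0.5 are all hypothetical.

```python
def random_walk_with_restart(adjacency, seeds, restart=0.5, tol=1e-10, max_iter=1000):
    """Return the steady-state visiting probability of each node.

    adjacency: dict mapping node -> list of neighbouring nodes
               (assumed undirected, every node with at least one neighbour)
    seeds:     set of nodes the walker jumps back to when it restarts
    restart:   probability of restarting at a seed in each step
    """
    nodes = sorted(adjacency)
    # The restart distribution is uniform over the seed nodes.
    p0 = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        nxt = {n: restart * p0[n] for n in nodes}
        for u in nodes:
            # Each node spreads its non-restarting mass evenly over its neighbours.
            share = (1.0 - restart) * p[u] / len(adjacency[u])
            for v in adjacency[u]:
                nxt[v] += share
        delta = sum(abs(nxt[n] - p[n]) for n in nodes)
        p = nxt
        if delta < tol:
            break
    return p


# Toy path graph A - B - C - D with a single seed node A.
graph = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
scores = random_walk_with_restart(graph, {"A"})
print(scores)
```

On this toy graph the scores decay with distance from the seed, which is the property that makes such a walk usable for ranking candidate connector nodes by proximity to known disease genes.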
Affiliation(s)
- Davide Buzzao
- Department of Biochemistry and Biophysics, Stockholm University, Science for Life Laboratory, Box 1031, 171 21 Solna, Sweden
- Dimitri Guala
- Department of Biochemistry and Biophysics, Stockholm University, Science for Life Laboratory, Box 1031, 171 21 Solna, Sweden
4
Seçilmiş D, Nelander S, Sonnhammer ELL. Optimal Sparsity Selection Based on an Information Criterion for Accurate Gene Regulatory Network Inference. Front Genet 2022; 13:855770. [PMID: 35923701 PMCID: PMC9340570 DOI: 10.3389/fgene.2022.855770] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/15/2022] [Accepted: 05/30/2022] [Indexed: 11/25/2022] Open
Abstract
Accurate inference of gene regulatory networks (GRNs) is important to unravel unknown regulatory mechanisms and processes, which can lead to the identification of treatment targets for genetic diseases. A variety of GRN inference methods have been proposed that, under suitable data conditions, perform well in benchmarks that consider the entire spectrum of false-positives and -negatives. However, it is very challenging to predict which single network sparsity gives the most accurate GRN. Lacking criteria for sparsity selection, a simplistic solution is to pick the GRN that has a certain number of links per gene, which is guessed to be reasonable. However, this does not guarantee finding the GRN that has the correct sparsity or is the most accurate one. In this study, we provide a general approach for identifying the most accurate and sparsity-wise relevant GRN within the entire space of possible GRNs. The algorithm, called SPA, applies a “GRN information criterion” (GRNIC) that is inspired by two commonly used model selection criteria, Akaike and Bayesian Information Criterion (AIC and BIC) but adapted to GRN inference. The results show that the approach can, in most cases, find the GRN whose sparsity is close to the true sparsity and close to as accurate as possible with the given GRN inference method and data. The datasets and source code can be found at https://bitbucket.org/sonnhammergrni/spa/.
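The abstract does not give the GRNIC formula, so it is not reproduced here. For orientation only, these are the standard definitions of the two criteria it is said to be inspired by, where k is the number of free model parameters, n the number of observations and L-hat the maximized likelihood:

```latex
\mathrm{AIC} = 2k - 2\ln\hat{L}, \qquad \mathrm{BIC} = k\ln n - 2\ln\hat{L}
```

Both trade goodness of fit against model size and the lower value is preferred. How SPA adapts this trade-off to network sparsity is described in the paper itself.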
Affiliation(s)
- Deniz Seçilmiş
- Department of Biochemistry and Biophysics, Science for Life Laboratory, Stockholm University, Solna, Sweden
- Sven Nelander
- Science for Life Laboratory, Department of Immunology, Genetics and Pathology, Uppsala University, Uppsala, Sweden
- Erik L. L. Sonnhammer
- Department of Biochemistry and Biophysics, Science for Life Laboratory, Stockholm University, Solna, Sweden
- *Correspondence: Erik L. L. Sonnhammer,
5
Guala D, Sonnhammer ELL. Corrigendum: Network Crosstalk as a Basis for Drug Repurposing. Front Genet 2022; 13:921286. [PMID: 35656321 PMCID: PMC9151565 DOI: 10.3389/fgene.2022.921286] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/15/2022] [Accepted: 04/19/2022] [Indexed: 11/13/2022] Open
Affiliation(s)
- Dimitri Guala
- Department of Biochemistry and Biophysics, Stockholm University, Science for Life Laboratory, Solna, Sweden
- Merck AB, Solna, Sweden
- Erik L L Sonnhammer
- Department of Biochemistry and Biophysics, Stockholm University, Science for Life Laboratory, Solna, Sweden
6
Castresana-Aguirre M, Guala D, Sonnhammer ELL. Benefits and Challenges of Pre-clustered Network-Based Pathway Analysis. Front Genet 2022; 13:855766. [PMID: 35620466 PMCID: PMC9127507 DOI: 10.3389/fgene.2022.855766] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 01/15/2022] [Accepted: 04/25/2022] [Indexed: 12/13/2022] Open
Abstract
Functional analysis of gene sets derived from experiments is typically done by pathway annotation. Although many algorithms exist for analyzing the association between a gene set and a pathway, an issue which is generally ignored is that gene sets often represent multiple pathways. In such cases an association to a pathway is weakened by the presence of genes associated with other pathways. A way to counteract this is to cluster the gene set into more homogenous parts before performing pathway analysis on each module. We explored whether network-based pre-clustering of a query gene set can improve pathway analysis. The methods MCL, Infomap, and MGclus were used to cluster the gene set projected onto the FunCoup network. We characterized how well these methods are able to detect individual pathways in multi-pathway gene sets, and applied each of the clustering methods in combination with four pathway analysis methods: Gene Enrichment Analysis, BinoX, NEAT, and ANUBIX. Using benchmarks constructed from the KEGG pathway database we found that clustering can be beneficial by increasing the sensitivity of pathway analysis methods and by providing deeper insights of biological mechanisms related to the phenotype under study. However, keeping a high specificity is a challenge. For ANUBIX, clustering caused a minor loss of specificity, while for BinoX and NEAT it caused an unacceptable loss of specificity. GEA had very low sensitivity both before and after clustering. The choice of clustering method only had a minor effect on the results. We show examples of this approach and conclude that clustering can improve overall pathway annotation performance, but should only be used if the used enrichment method has a low false positive rate.
Affiliation(s)
- Miguel Castresana-Aguirre
- Department of Biochemistry and Biophysics, Science for Life Laboratory, Stockholm University, Stockholm, Sweden
- Dimitri Guala
- Department of Biochemistry and Biophysics, Science for Life Laboratory, Stockholm University, Stockholm, Sweden
- Erik L L Sonnhammer
- Department of Biochemistry and Biophysics, Science for Life Laboratory, Stockholm University, Stockholm, Sweden
7
Seçilmiş D, Hillerton T, Sonnhammer ELL. GRNbenchmark - a web server for benchmarking directed gene regulatory network inference methods. Nucleic Acids Res 2022; 50:W398-W404. [PMID: 35609981 PMCID: PMC9252735 DOI: 10.1093/nar/gkac377] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 03/28/2022] [Revised: 04/20/2022] [Accepted: 05/19/2022] [Indexed: 11/30/2022] Open
Abstract
Accurate inference of gene regulatory networks (GRN) is an essential component of systems biology, and there is a constant development of new inference methods. The most common approach to assess accuracy for publications is to benchmark the new method against a selection of existing algorithms. This often leads to a very limited comparison, potentially biasing the results, which may stem from tuning the benchmark's properties or incorrect application of other methods. These issues can be avoided by a web server with a broad range of data properties and inference algorithms, that makes it easy to perform comprehensive benchmarking of new methods, and provides a more objective assessment. Here we present https://GRNbenchmark.org/ - a new web server for benchmarking GRN inference methods, which provides the user with a set of benchmarks with several datasets, each spanning a range of properties including multiple noise levels. As soon as the web server has performed the benchmarking, the accuracy results are made privately available to the user via interactive summary plots and underlying curves. The user can then download these results for any purpose, and decide whether or not to make them public to share with the community.
Affiliation(s)
- Deniz Seçilmiş
- Department of Biochemistry and Biophysics, Stockholm University, Science for Life Laboratory, Box 1031, 17121 Solna, Sweden
- Thomas Hillerton
- Department of Biochemistry and Biophysics, Stockholm University, Science for Life Laboratory, Box 1031, 17121 Solna, Sweden
- Erik L L Sonnhammer
- Department of Biochemistry and Biophysics, Stockholm University, Science for Life Laboratory, Box 1031, 17121 Solna, Sweden
8
Persson E, Sonnhammer ELL. InParanoid-DIAMOND: faster orthology analysis with the InParanoid algorithm. Bioinformatics 2022; 38:2918-2919. [PMID: 35561192 PMCID: PMC9113356 DOI: 10.1093/bioinformatics/btac194] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 06/17/2021] [Revised: 03/14/2022] [Accepted: 03/29/2022] [Indexed: 02/03/2023] Open
Abstract
Summary: Predicting orthologs, genes in different species having shared ancestry, is an important task in bioinformatics. Orthology prediction tools are required to make accurate and fast predictions, in order to analyze large amounts of data within a feasible time frame. InParanoid is a well-known algorithm for orthology analysis, shown to perform well in benchmarks, but having the major limitation of long runtimes on large datasets. Here, we present an update to the InParanoid algorithm that can use the faster tool DIAMOND instead of BLAST for the homolog search step. We show that it reduces the runtime by 94%, while still obtaining similar performance in the Quest for Orthologs benchmark. Availability and implementation: The source code is available at https://bitbucket.org/sonnhammergroup/inparanoid. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Emma Persson
- Department of Biochemistry and Biophysics, Stockholm University, Science for Life Laboratory, 17121 Solna, Sweden
9
Ogris C, Castresana-Aguirre M, Sonnhammer ELL. PathwAX II: Network-based pathway analysis with interactive visualization of network crosstalk. Bioinformatics 2022; 38:2659-2660. [PMID: 35266519 PMCID: PMC9048662 DOI: 10.1093/bioinformatics/btac153] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/23/2021] [Revised: 02/03/2022] [Accepted: 03/09/2022] [Indexed: 11/28/2022] Open
Abstract
Motivation: Pathway annotation tools are indispensable for the interpretation of a wide range of experiments in life sciences. Network-based algorithms have recently been developed which are more sensitive than traditional overlap-based algorithms, but there is still a lack of good online tools for network-based pathway analysis. Results: We present PathwAX II, a pathway analysis web tool based on network crosstalk analysis using the BinoX algorithm. It offers several new features compared with the first version, including interactive graphical network visualization of the crosstalk between a query gene set and an enriched pathway, and the addition of Reactome pathways. Availability and implementation: PathwAX II is available at http://pathwax.sbc.su.se. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Christoph Ogris
- Department of Biochemistry and Biophysics, Stockholm University, Science for Life Laboratory, Box 1031, 17121 Solna, Sweden
- Institute of Computational Biology, Helmholtz Center Munich, Ingolstädter Landstr. 1, 85764 Neuherberg, Germany
- Miguel Castresana-Aguirre
- Department of Biochemistry and Biophysics, Stockholm University, Science for Life Laboratory, Box 1031, 17121 Solna, Sweden
- Erik L L Sonnhammer
- Department of Biochemistry and Biophysics, Stockholm University, Science for Life Laboratory, Box 1031, 17121 Solna, Sweden
10
Guala D, Sonnhammer ELL. Network Crosstalk as a Basis for Drug Repurposing. Front Genet 2022; 13:792090. [PMID: 35350247 PMCID: PMC8958038 DOI: 10.3389/fgene.2022.792090] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/09/2021] [Accepted: 01/27/2022] [Indexed: 11/23/2022] Open
Abstract
The need for systematic drug repurposing has seen a steady increase over the past decade and may be particularly valuable to quickly remedy unexpected pandemics. The abundance of functional interaction data has allowed mapping of substantial parts of the human interactome modeled using functional association networks, favoring network-based drug repurposing. Network crosstalk-based approaches have never been tested for drug repurposing despite their success in the related and more mature field of pathway enrichment analysis. We have, therefore, evaluated the top performing crosstalk-based approaches for drug repurposing. Additionally, the volume of new interaction data as well as more sophisticated network integration approaches compelled us to construct a new benchmark for performance assessment of network-based drug repurposing tools, which we used to compare network crosstalk-based methods with a state-of-the-art technique. We find that network crosstalk-based drug repurposing is able to rival the state-of-the-art method and in some cases outperform it.
Affiliation(s)
- Dimitri Guala
- Science for Life Laboratory, Department of Biochemistry and Biophysics, Stockholm University, Solna, Sweden
- Merck AB, Solna, Sweden
- Erik L. L. Sonnhammer
- Science for Life Laboratory, Department of Biochemistry and Biophysics, Stockholm University, Solna, Sweden
- *Correspondence: Erik L. L. Sonnhammer,
11
Hillerton T, Seçilmiş D, Nelander S, Sonnhammer ELL. Fast and accurate gene regulatory network inference by normalized least squares regression. Bioinformatics 2022; 38:2263-2268. [PMID: 35176145 PMCID: PMC9004640 DOI: 10.1093/bioinformatics/btac103] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 08/13/2021] [Revised: 01/10/2022] [Accepted: 02/15/2022] [Indexed: 02/03/2023] Open
Abstract
Motivation: Inferring an accurate gene regulatory network (GRN) has long been a key goal in the field of systems biology. To do this, it is important to find a suitable balance between the maximum number of true-positive and the minimum number of false-positive interactions. Another key feature is that the inference method can handle the large size of modern experimental data, meaning the method needs to be both fast and accurate. The Least Squares Cut-Off (LSCO) method can fulfill both these criteria; however, as it is based on least squares, it is vulnerable to known issues of amplifying extreme values, small or large. In GRNs, this manifests itself as genes that are erroneously hyper-connected to a large fraction of all genes due to extremely low value fold changes. Results: We developed a GRN inference method called Least Squares Cut-Off with Normalization (LSCON) that tackles this problem. LSCON extends the LSCO algorithm by regularization to avoid hyper-connected genes and thereby reduce false positives. The regularization used is based on normalization, which removes effects of extreme values on the fit. We benchmarked LSCON and compared it to Genie3, LASSO, LSCO and Ridge regression, in terms of accuracy, speed and tendency to predict hyper-connected genes. The results show that LSCON achieves better or equal accuracy compared to LASSO, the best existing method, especially for data with extreme values. Thanks to the speed of least squares regression, LSCON does this an order of magnitude faster than LASSO. Availability and implementation: Data: https://bitbucket.org/sonnhammergrni/lscon; Code: https://bitbucket.org/sonnhammergrni/genespider. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Thomas Hillerton
- Department of Biochemistry and Biophysics, Stockholm University, Science for Life Laboratory, 17121 Solna, Sweden
- Deniz Seçilmiş
- Department of Biochemistry and Biophysics, Stockholm University, Science for Life Laboratory, 17121 Solna, Sweden
- Sven Nelander
- Science for Life Laboratory, Department of Immunology, Genetics and Pathology, Uppsala University, 75185 Uppsala, Sweden
12
Zhivkoplias EK, Vavulov O, Hillerton T, Sonnhammer ELL. Generation of Realistic Gene Regulatory Networks by Enriching for Feed-Forward Loops. Front Genet 2022; 13:815692. [PMID: 35222536 PMCID: PMC8872634 DOI: 10.3389/fgene.2022.815692] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/15/2021] [Accepted: 01/13/2022] [Indexed: 11/13/2022] Open
Abstract
The regulatory relationships between genes and proteins in a cell form a gene regulatory network (GRN) that controls the cellular response to changes in the environment. A number of inference methods to reverse engineer the original GRN from large-scale expression data have recently been developed. However, the absence of ground-truth GRNs when evaluating the performance makes realistic simulations of GRNs necessary. One aspect of this is that local network motif analysis of real GRNs indicates that the feed-forward loop (FFL) is significantly enriched. To simulate this properly, we developed a novel motif-based preferential attachment algorithm, FFLatt, which outperformed the popular GeneNetWeaver network generation tool in reproducing the FFL motif occurrence observed in literature-based biological GRNs. It also preserves important topological properties such as scale-free topology, sparsity, and average in/out-degree per node. We conclude that FFLatt is well-suited as a network generation module for a benchmarking framework with the aim to provide fair and robust performance evaluation of GRN inference methods.
Affiliation(s)
- Erik K. Zhivkoplias
- Department of Biochemistry and Biophysics, Science for Life Laboratory, Stockholm University, Solna, Sweden
- Oleg Vavulov
- Bioinformatics Institute, St. Petersburg, Russia
- Thomas Hillerton
- Department of Biochemistry and Biophysics, Science for Life Laboratory, Stockholm University, Solna, Sweden
- Erik L. L. Sonnhammer
- Department of Biochemistry and Biophysics, Science for Life Laboratory, Stockholm University, Solna, Sweden
- *Correspondence: Erik L. L. Sonnhammer,
13
Rivero-García I, Castresana-Aguirre M, Guglielmo L, Guala D, Sonnhammer ELL. Drug repurposing improves disease targeting 11-fold and can be augmented by network module targeting, applied to COVID-19. Sci Rep 2021; 11:20687. [PMID: 34667255 PMCID: PMC8526804 DOI: 10.1038/s41598-021-99721-y] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 03/09/2021] [Accepted: 09/30/2021] [Indexed: 12/14/2022] Open
Abstract
This analysis presents a systematic evaluation of the extent of therapeutic opportunities that can be obtained from drug repurposing by connecting drug targets with disease genes. When using FDA-approved indications as a reference level we found that drug repurposing can offer an average of an 11-fold increase in disease coverage, with the maximum number of diseases covered per drug being increased from 134 to 167 after extending the drug targets with their high confidence first neighbors. Additionally, by network analysis to connect drugs to disease modules we found that drugs on average target 4 disease modules, yet the similarity between disease modules targeted by the same drug is generally low and the maximum number of disease modules targeted per drug increases from 158 to 229 when drug targets are neighbor-extended. Moreover, our results highlight that drug repurposing is more dependent on target proteins being shared between diseases than on polypharmacological properties of drugs. We apply our drug repurposing and network module analysis to COVID-19 and show that Fostamatinib is the drug with the highest module coverage.
Affiliation(s)
- Inés Rivero-García
- Department of Biochemistry and Biophysics, Stockholm University, Science for Life Laboratory, Box 1031, 17121 Solna, Sweden
- Miguel Castresana-Aguirre
- Department of Biochemistry and Biophysics, Stockholm University, Science for Life Laboratory, Box 1031, 17121 Solna, Sweden
- Luca Guglielmo
- Department of Biochemistry and Biophysics, Stockholm University, Science for Life Laboratory, Box 1031, 17121 Solna, Sweden
- Dimitri Guala
- Department of Biochemistry and Biophysics, Stockholm University, Science for Life Laboratory, Box 1031, 17121 Solna, Sweden
- Erik L. L. Sonnhammer
- Department of Biochemistry and Biophysics, Stockholm University, Science for Life Laboratory, Box 1031, 17121 Solna, Sweden
14
Castresana-Aguirre M, Persson E, Sonnhammer ELL. PathBIX - a web server for network-based pathway annotation with adaptive null models. Bioinform Adv 2021; 1:vbab010. [PMID: 36700096 PMCID: PMC9710673 DOI: 10.1093/bioadv/vbab010] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 06/08/2021] [Accepted: 06/30/2021] [Indexed: 01/28/2023]
Abstract
Motivation: Pathway annotation is a vital tool for interpreting and giving meaning to experimental data in life sciences. Numerous tools exist for this task, where the most recent generation of pathway enrichment analysis tools, network-based methods, utilize biological networks to gain a richer source of information as a basis of the analysis than merely the gene content. Network-based methods use the network crosstalk between the query gene set and the genes in known pathways, and compare this to a null model of random expectation. Results: We developed PathBIX, a novel web application for network-based pathway analysis, based on the recently published ANUBIX algorithm which has been shown to be more accurate than previous network-based methods. The PathBIX website performs pathway annotation for 21 species, and utilizes prefetched and preprocessed network data from FunCoup 5.0 networks and pathway data from three databases: KEGG, Reactome, and WikiPathways. Availability: https://pathbix.sbc.su.se/. Contact: erik.sonnhammer@scilifelab.se. Supplementary information: Supplementary data are available at Bioinformatics Advances online.
Affiliation(s)
- Miguel Castresana-Aguirre
- Department of Biochemistry and Biophysics, Science for Life Laboratory, Stockholm University, Stockholm 17121, Sweden
- Emma Persson
- Department of Biochemistry and Biophysics, Science for Life Laboratory, Stockholm University, Stockholm 17121, Sweden
- Erik L L Sonnhammer
- Department of Biochemistry and Biophysics, Science for Life Laboratory, Stockholm University, Stockholm 17121, Sweden. To whom correspondence should be addressed.
15
Seçilmiş D, Hillerton T, Nelander S, Sonnhammer ELL. Inferring the experimental design for accurate gene regulatory network inference. Bioinformatics 2021; 37:3553-3559. [PMID: 33978748 PMCID: PMC8545292 DOI: 10.1093/bioinformatics/btab367] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 11/02/2020] [Revised: 03/29/2021] [Accepted: 05/11/2021] [Indexed: 11/17/2022] Open
Abstract
Motivation: Accurate inference of gene regulatory interactions is of importance for understanding the mechanisms of underlying biological processes. For gene expression data gathered from targeted perturbations, gene regulatory network (GRN) inference methods that use the perturbation design are the top performing methods. However, the connection between the perturbation design and gene expression can be obfuscated due to problems, such as experimental noise or off-target effects, limiting the methods’ ability to reconstruct the true GRN. Results: In this study, we propose an algorithm, IDEMAX, to infer the effective perturbation design from gene expression data in order to eliminate the potential risk of fitting a disconnected perturbation design to gene expression. We applied IDEMAX to synthetic data from two different data generation tools, GeneNetWeaver and GeneSPIDER, and assessed its effect on the experiment design matrix as well as the accuracy of the GRN inference, followed by application to a real dataset. The results show that our approach consistently improves the accuracy of GRN inference compared to using the intended perturbation design when much of the signal is hidden by noise, which is often the case for real data. Availability and implementation: https://bitbucket.org/sonnhammergrni/idemax. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Deniz Seçilmiş: Department of Biochemistry and Biophysics, Stockholm University, Science for Life Laboratory, Box 1031, Solna, 17121, Sweden
- Thomas Hillerton: Department of Biochemistry and Biophysics, Stockholm University, Science for Life Laboratory, Box 1031, Solna, 17121, Sweden
- Sven Nelander: Science for Life Laboratory, Department of Immunology, Genetics and Pathology, Uppsala University, Uppsala, Sweden
- Erik L L Sonnhammer: Department of Biochemistry and Biophysics, Stockholm University, Science for Life Laboratory, Box 1031, Solna, 17121, Sweden
16
Persson E, Castresana-Aguirre M, Buzzao D, Guala D, Sonnhammer ELL. FunCoup 5: Functional Association Networks in All Domains of Life, Supporting Directed Links and Tissue-Specificity. J Mol Biol 2021; 433:166835. [PMID: 33539890 DOI: 10.1016/j.jmb.2021.166835] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 10/01/2020] [Revised: 12/18/2020] [Accepted: 01/15/2021] [Indexed: 02/07/2023]
Abstract
FunCoup (https://funcoup.sbc.su.se) is one of the most comprehensive functional association networks of genes/proteins available. Functional associations are inferred by integrating different types of evidence using a redundancy-weighted naïve Bayesian approach, combined with orthology transfer. FunCoup's high coverage comes from using eleven different types of evidence, and extensive transfer of information between species. Since the latest update of the database, the availability of source data has improved drastically, and user expectations on a tool for functional associations have grown. To meet these requirements, we have made a new release of FunCoup with updated source data and improved functionality. FunCoup 5 now includes 22 species from all domains of life, and the source data for evidences, gold standards, and genomes have been updated to the latest available versions. In this new release, directed regulatory links inferred from transcription factor binding can be visualized in the network viewer for the human interactome. Another new feature is the possibility to filter by genes expressed in a certain tissue in the network viewer. FunCoup 5 further includes the SARS-CoV-2 proteome, allowing users to visualize and analyze interactions between SARS-CoV-2 and human proteins in order to better understand COVID-19. This new release of FunCoup constitutes a major advance for the users, with updated sources, new species and improved functionality for analysis of the networks.
Affiliation(s)
- Emma Persson: Department of Biochemistry and Biophysics, Stockholm University, Science for Life Laboratory, Box 1031, 17121 Solna, Sweden
- Miguel Castresana-Aguirre: Department of Biochemistry and Biophysics, Stockholm University, Science for Life Laboratory, Box 1031, 17121 Solna, Sweden
- Davide Buzzao: Department of Biochemistry and Biophysics, Stockholm University, Science for Life Laboratory, Box 1031, 17121 Solna, Sweden
- Dimitri Guala: Department of Biochemistry and Biophysics, Stockholm University, Science for Life Laboratory, Box 1031, 17121 Solna, Sweden
- Erik L L Sonnhammer: Department of Biochemistry and Biophysics, Stockholm University, Science for Life Laboratory, Box 1031, 17121 Solna, Sweden
17
Mistry J, Chuguransky S, Williams L, Qureshi M, Salazar GA, Sonnhammer ELL, Tosatto SCE, Paladin L, Raj S, Richardson LJ, Finn RD, Bateman A. Pfam: The protein families database in 2021. Nucleic Acids Res 2021; 49:D412-D419. [PMID: 33125078 PMCID: PMC7779014 DOI: 10.1093/nar/gkaa913] [Citation(s) in RCA: 2297] [Impact Index Per Article: 765.7] [Received: 09/11/2020] [Revised: 10/01/2020] [Accepted: 10/06/2020] [Indexed: 12/19/2022] [Open Access]
Abstract
The Pfam database is a widely used resource for classifying protein sequences into families and domains. Since Pfam was last described in this journal, over 350 new families have been added in Pfam 33.1 and numerous improvements have been made to existing entries. To facilitate research on COVID-19, we have revised the Pfam entries that cover the SARS-CoV-2 proteome, and built new entries for regions that were not covered by Pfam. We have reintroduced Pfam-B which provides an automatically generated supplement to Pfam and contains 136 730 novel clusters of sequences that are not yet matched by a Pfam family. The new Pfam-B is based on a clustering by the MMseqs2 software. We have compared all of the regions in the RepeatsDB to those in Pfam and have started to use the results to build and refine Pfam repeat families. Pfam is freely available for browsing and download at http://pfam.xfam.org/.
Affiliation(s)
- Jaina Mistry: European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK
- Sara Chuguransky: European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK
- Lowri Williams: European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK
- Matloob Qureshi: European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK
- Gustavo A Salazar: European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK
- Erik L L Sonnhammer: Department of Biochemistry and Biophysics, Science for Life Laboratory, Stockholm University, Box 1031, 17121 Solna, Sweden
- Silvio C E Tosatto: Department of Biomedical Sciences, University of Padua, 35131 Padova, Italy
- Lisanna Paladin: Department of Biomedical Sciences, University of Padua, 35131 Padova, Italy
- Shriya Raj: European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK
- Lorna J Richardson: European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK
- Robert D Finn: European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK
- Alex Bateman: European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK
18
Mistry J, Chuguransky S, Williams L, Qureshi M, Salazar GA, Sonnhammer ELL, Tosatto SCE, Paladin L, Raj S, Richardson LJ, Finn RD, Bateman A. Pfam: The protein families database in 2021. Nucleic Acids Res 2021. [PMID: 33125078 DOI: 10.6019/tol.pfam_fams-t.2018.00001.1] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 05/14/2023] [Open Access]
19
Seçilmiş D, Hillerton T, Morgan D, Tjärnberg A, Nelander S, Nordling TEM, Sonnhammer ELL. Uncovering cancer gene regulation by accurate regulatory network inference from uninformative data. NPJ Syst Biol Appl 2020; 6:37. [PMID: 33168813 PMCID: PMC7652823 DOI: 10.1038/s41540-020-00154-6] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 08/28/2019] [Accepted: 10/15/2020] [Indexed: 01/11/2023] [Open Access]
Abstract
The interactions among the components of a living cell that constitute the gene regulatory network (GRN) can be inferred from perturbation-based gene expression data. Such networks are useful for providing mechanistic insights of a biological system. In order to explore the feasibility and quality of GRN inference at a large scale, we used the L1000 data where ~1000 genes have been perturbed and their expression levels have been quantified in 9 cancer cell lines. We found that these datasets have a very low signal-to-noise ratio (SNR) level causing them to be too uninformative to infer accurate GRNs. We developed a gene reduction pipeline in which we eliminate uninformative genes from the system using a selection criterion based on SNR, until reaching an informative subset. The results show that our pipeline can identify an informative subset in an overall uninformative dataset, allowing inference of accurate subset GRNs. The accurate GRNs were functionally characterized and potential novel cancer-related regulatory interactions were identified.
Affiliation(s)
- Deniz Seçilmiş: Department of Biochemistry and Biophysics, Stockholm University, Science for Life Laboratory, Box 1031, 17121, Solna, Sweden
- Thomas Hillerton: Department of Biochemistry and Biophysics, Stockholm University, Science for Life Laboratory, Box 1031, 17121, Solna, Sweden
- Daniel Morgan: Department of Biochemistry and Biophysics, Stockholm University, Science for Life Laboratory, Box 1031, 17121, Solna, Sweden
- Andreas Tjärnberg: Center for Developmental Genetics, New York University, New York, NY, USA
- Sven Nelander: Science for Life Laboratory, Department of Immunology, Genetics and Pathology, Uppsala University, Uppsala, Sweden
- Torbjörn E M Nordling: Department of Mechanical Engineering, National Cheng Kung University, Tainan, Taiwan
- Erik L L Sonnhammer: Department of Biochemistry and Biophysics, Stockholm University, Science for Life Laboratory, Box 1031, 17121, Solna, Sweden
20
Castresana-Aguirre M, Sonnhammer ELL. Pathway-specific model estimation for improved pathway annotation by network crosstalk. Sci Rep 2020; 10:13585. [PMID: 32788619 PMCID: PMC7423893 DOI: 10.1038/s41598-020-70239-z] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Received: 10/08/2019] [Accepted: 07/06/2020] [Indexed: 12/23/2022] [Open Access]
Abstract
Pathway enrichment analysis is the most common approach for understanding which biological processes are affected by altered gene activities under specific conditions. However, it has been challenging to find a method that efficiently avoids false positives while keeping a high sensitivity. We here present a new network-based method, ANUBIX, based on sampling random gene sets against intact pathways. Benchmarking shows that ANUBIX is considerably more accurate than previous network crosstalk based methods, which have the drawback of modelling pathways as random gene sets. We demonstrate that ANUBIX does not have a bias for finding certain pathways, which previous methods do, and show that ANUBIX finds biologically relevant pathways that are missed by other methods.
Affiliation(s)
- Miguel Castresana-Aguirre: Department of Biochemistry and Biophysics, Science for Life Laboratory, Stockholm University, Box 1031, 17121, Solna, Sweden
- Erik L L Sonnhammer: Department of Biochemistry and Biophysics, Science for Life Laboratory, Stockholm University, Box 1031, 17121, Solna, Sweden
21
Abstract
BACKGROUND: Fusion transcripts are involved in tumourigenesis and play a crucial role in tumour heterogeneity, tumour evolution and cancer treatment resistance. However, fusion transcripts have not been studied at high spatial resolution in tissue sections due to the lack of full-length transcripts with spatial information. New high-throughput technologies like spatial transcriptomics measure the transcriptome of tissue sections on almost single-cell level. While this technique does not allow for direct detection of fusion transcripts, we show that they can be inferred using the relative poly(A) tail abundance of the involved parental genes.
METHOD: We present a new method STfusion, which uses spatial transcriptomics to infer the presence and absence of poly(A) tails. A fusion transcript lacks a poly(A) tail for the 5' gene and has an elevated number of poly(A) tails for the 3' gene. Its expression level is defined by the upstream promoter of the 5' gene. STfusion measures the difference between the observed and expected number of poly(A) tails with a novel C-score.
RESULTS: We verified the STfusion ability to predict fusion transcripts on HeLa cells with known fusions. STfusion and C-score applied to clinical prostate cancer data revealed the spatial distribution of the cis-SAGe SLC45A3-ELK4 in 12 tissue sections with almost single-cell resolution. The cis-SAGe occurred in disease areas, e.g. inflamed, prostatic intraepithelial neoplastic, or cancerous areas, and occasionally in normal glands.
CONCLUSIONS: STfusion detects fusion transcripts in cancer cell line and clinical tissue data, and distinguishes chimeric transcripts from chimeras caused by trans-splicing events. With STfusion and the use of C-scores, fusion transcripts can be spatially localised in clinical tissue sections on almost single cell level.
Affiliation(s)
- Stefanie Friedrich: Science for Life Laboratory, Department of Biochemistry and Biophysics, Stockholm University, Box 1031, 17121, Solna, Sweden
- Erik L L Sonnhammer: Science for Life Laboratory, Department of Biochemistry and Biophysics, Stockholm University, Box 1031, 17121, Solna, Sweden
22
Friedrich S, Barbulescu R, Helleday T, Sonnhammer ELL. MetaCNV - a consensus approach to infer accurate copy numbers from low coverage data. BMC Med Genomics 2020; 13:76. [PMID: 32487140 PMCID: PMC7268502 DOI: 10.1186/s12920-020-00731-y] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 09/25/2019] [Accepted: 05/20/2020] [Indexed: 12/23/2022] [Open Access]
Abstract
Background: The majority of copy number callers requires high read coverage data that is often achieved with elevated material input, which increases the heterogeneity of tissue samples. However, to gain insights into smaller areas within a tissue sample, e.g. a cancerous area in a heterogeneous tissue sample, less material is used for sequencing, which results in lower read coverage. Therefore, more focus needs to be put on copy number calling that is sensitive enough for low coverage data.
Results: We present MetaCNV, a copy number caller that infers reliable copy numbers for human genomes with a consensus approach. MetaCNV specializes in low coverage data, but also performs well on normal and high coverage data. MetaCNV integrates the results of multiple copy number callers and infers absolute and unbiased copy numbers for the entire genome. MetaCNV is based on a meta-model that bypasses the weaknesses of current calling models while combining the strengths of existing approaches. Here we apply MetaCNV based on ReadDepth, SVDetect, and CNVnator to real and simulated datasets in order to demonstrate how the approach improves copy number calling.
Conclusions: MetaCNV, available at https://bitbucket.org/sonnhammergroup/metacnv, provides accurate copy number prediction on low coverage data and performs well on high coverage data.
Affiliation(s)
- Stefanie Friedrich: Department of Biochemistry and Biophysics, Science for Life Laboratory, Stockholm University, Box 1031, 17121, Solna, Sweden
- Remus Barbulescu: Department of Biochemistry and Biophysics, Science for Life Laboratory, Stockholm University, Box 1031, 17121, Solna, Sweden
- Thomas Helleday: Department of Oncology-Pathology, Science for Life Laboratory, Karolinska Institutet, Solna, Sweden
- Erik L L Sonnhammer: Department of Biochemistry and Biophysics, Science for Life Laboratory, Stockholm University, Box 1031, 17121, Solna, Sweden
23
El-Gebali S, Mistry J, Bateman A, Eddy SR, Luciani A, Potter SC, Qureshi M, Richardson LJ, Salazar GA, Smart A, Sonnhammer ELL, Hirsh L, Paladin L, Piovesan D, Tosatto SCE, Finn RD. The Pfam protein families database in 2019. Nucleic Acids Res 2019; 47:D427-D432. [PMID: 30357350 PMCID: PMC6324024 DOI: 10.1093/nar/gky995] [Citation(s) in RCA: 2924] [Impact Index Per Article: 731.0] [Received: 09/28/2018] [Accepted: 10/09/2018] [Indexed: 12/11/2022] [Open Access]
Abstract
The last few years have witnessed significant changes in Pfam (https://pfam.xfam.org). The number of families has grown substantially to a total of 17,929 in release 32.0. New additions have been coupled with efforts to improve existing families, including refinement of domain boundaries, their classification into Pfam clans, as well as their functional annotation. We recently began to collaborate with the RepeatsDB resource to improve the definition of tandem repeat families within Pfam. We carried out a significant comparison to the structural classification database, namely the Evolutionary Classification of Protein Domains (ECOD) that led to the creation of 825 new families based on their set of uncharacterized families (EUFs). Furthermore, we also connected Pfam entries to the Sequence Ontology (SO) through mapping of the Pfam type definitions to SO terms. Since Pfam has many community contributors, we recently enabled the linking between authorship of all Pfam entries with the corresponding authors’ ORCID identifiers. This effectively permits authors to claim credit for their Pfam curation and link them to their ORCID record.
Affiliation(s)
- Sara El-Gebali: European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Jaina Mistry: European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Alex Bateman: European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Sean R Eddy: HHMI, Harvard University, 16 Divinity Ave Cambridge, MA 02138 USA
- Aurélien Luciani: European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Simon C Potter: European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Matloob Qureshi: European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Lorna J Richardson: European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Gustavo A Salazar: European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Alfredo Smart: European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Erik L L Sonnhammer: Science for Life Laboratory, Department of Biochemistry and Biophysics, Stockholm University, 17121 Solna, Sweden
- Layla Hirsh: Department of Biomedical Sciences, University of Padua, 35131 Padova, Italy; Dept. of Engineering, Pontificia Universidad Católica del Perú 1801, San Miguel 15088, Lima, Perú
- Lisanna Paladin: Department of Biomedical Sciences, University of Padua, 35131 Padova, Italy
- Damiano Piovesan: Department of Biomedical Sciences, University of Padua, 35131 Padova, Italy
- Silvio C E Tosatto: Department of Biomedical Sciences, University of Padua, 35131 Padova, Italy
- Robert D Finn: European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
24
Abstract
Background: Orthology inference is normally based on full-length protein sequences. However, most proteins contain independently folding and recurring regions, domains. The domain architecture of a protein is vital for its function, and recombination events mean individual domains can have different evolutionary histories. It has previously been shown that orthologous proteins may differ in domain architecture, creating challenges for orthology inference methods operating on full-length sequences. We have developed Domainoid, a new tool aiming to overcome these challenges faced by full-length orthology methods by inferring orthology on the domain level. It employs the InParanoid algorithm on single domains separately, to infer groups of orthologous domains.
Results: This domain-oriented approach allows detection of discordant domain orthologs, cases where different domains on the same protein have different evolutionary histories. In addition to domain level analysis, protein level orthology based on the fraction of domains that are orthologous can be inferred. Domainoid orthology assignments were compared to those yielded by the conventional full-length approach InParanoid, and were validated in a standard benchmark.
Conclusions: Our results show that domain-based orthology inference can reveal many orthologous relationships that are not found by full-length sequence approaches.
Availability: https://bitbucket.org/sonnhammergroup/domainoid/
Affiliation(s)
- Emma Persson: Department of Biochemistry and Biophysics, Science for Life Laboratory, Stockholm University, Box 1031, 17121, Solna, Sweden
- Mateusz Kaduk: Department of Biochemistry and Biophysics, Science for Life Laboratory, Stockholm University, Box 1031, 17121, Solna, Sweden
- Sofia K Forslund: Experimental and Clinical Research Center, a joint cooperation of Max-Delbrück Center for Molecular Medicine and Charité-Universitätsmedizin Berlin, 13125, Berlin, Germany; European Molecular Biology Laboratory, Structural and Computational Biology Unit, 69117, Heidelberg, Germany
- Erik L L Sonnhammer: Department of Biochemistry and Biophysics, Science for Life Laboratory, Stockholm University, Box 1031, 17121, Solna, Sweden
25
Ogris C, Guala D, Sonnhammer ELL. FunCoup 4: new species, data, and visualization. Nucleic Acids Res 2018; 46:D601-D607. [PMID: 29165593 PMCID: PMC5755233 DOI: 10.1093/nar/gkx1138] [Citation(s) in RCA: 61] [Impact Index Per Article: 12.2] [Received: 09/08/2017] [Accepted: 10/31/2017] [Indexed: 01/22/2023] [Open Access]
Abstract
This release of the FunCoup database (http://funcoup.sbc.su.se) is the fourth generation of one of the most comprehensive databases for genome-wide functional association networks. These functional associations are inferred via integrating various data types using a naive Bayesian algorithm and orthology based information transfer across different species. This approach provides high coverage of the included genomes as well as high quality of inferred interactions. In this update of FunCoup we introduce four new eukaryotic species: Schizosaccharomyces pombe, Plasmodium falciparum, Bos taurus, Oryza sativa and open the database to the prokaryotic domain by including networks for Escherichia coli and Bacillus subtilis. The latter allows us to also introduce a new class of functional association between genes - co-occurrence in the same operon. We also supplemented the existing classes of functional association: metabolic, signaling, complex and physical protein interaction with up-to-date information. In this release we switched to InParanoid v8 as the source of orthology and base for calculation of phylogenetic profiles. While populating all other evidence types with new data we introduce a new evidence type based on quantitative mass spectrometry data. Finally, the new JavaScript based network viewer provides the user an intuitive and responsive platform to further evaluate the results.
Affiliation(s)
- Christoph Ogris: Stockholm Bioinformatics Center, Department of Biochemistry and Biophysics, Stockholm University, Science for Life Laboratory, Box 1031, 17121 Solna, Sweden
- Dimitri Guala: Stockholm Bioinformatics Center, Department of Biochemistry and Biophysics, Stockholm University, Science for Life Laboratory, Box 1031, 17121 Solna, Sweden
- Erik L L Sonnhammer: Stockholm Bioinformatics Center, Department of Biochemistry and Biophysics, Stockholm University, Science for Life Laboratory, Box 1031, 17121 Solna, Sweden
26
Guala D, Ogris C, Müller N, Sonnhammer ELL. Genome-wide functional association networks: background, data & state-of-the-art resources. Brief Bioinform 2019; 21:1224-1237. [PMID: 31281921 PMCID: PMC7373183 DOI: 10.1093/bib/bbz064] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.0] [Received: 02/05/2019] [Revised: 04/29/2019] [Accepted: 05/04/2019] [Indexed: 02/06/2023] [Open Access]
Abstract
The vast amount of experimental data from recent advances in the field of high-throughput biology begs for integration into more complex data structures such as genome-wide functional association networks. Such networks have been used for elucidation of the interplay of intra-cellular molecules to make advances ranging from the basic science understanding of evolutionary processes to the more translational field of precision medicine. The allure of the field has resulted in rapid growth of the number of available network resources, each with unique attributes exploitable to answer different biological questions. Unfortunately, the high volume of network resources makes it impossible for the intended user to select an appropriate tool for their particular research question. The aim of this paper is to provide an overview of the underlying data and representative network resources as well as to mention methods of integration, allowing a customized approach to resource selection. Additionally, this report will provide a primer for researchers venturing into the field of network integration.
Affiliation(s)
- Dimitri Guala: Science for Life Laboratory, Stockholm Bioinformatics Center, Department of Biochemistry and Biophysics, Stockholm University, Box 1031, 17121 Solna, Sweden
- Christoph Ogris: Computational Cell Maps, Institute of Computational Biology, Helmholtz Center Munich, Ingolstädter Landstr. 1, 85764 Neuherberg, Germany
- Nikola Müller: Computational Cell Maps, Institute of Computational Biology, Helmholtz Center Munich, Ingolstädter Landstr. 1, 85764 Neuherberg, Germany
- Erik L L Sonnhammer: Science for Life Laboratory, Stockholm Bioinformatics Center, Department of Biochemistry and Biophysics, Stockholm University, Box 1031, 17121 Solna, Sweden
27
Morgan D, Tjärnberg A, Nordling TEM, Sonnhammer ELL. A generalized framework for controlling FDR in gene regulatory network inference. Bioinformatics 2018; 35:1026-1032. [DOI: 10.1093/bioinformatics/bty764] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Received: 05/21/2018] [Revised: 08/23/2018] [Accepted: 08/28/2018] [Indexed: 12/23/2022] [Open Access]
Affiliation(s)
- Daniel Morgan: Department of Biochemistry and Biophysics, Stockholm Bioinformatics Center, Science for Life Laboratory, Stockholm University, Stockholm, Sweden
- Andreas Tjärnberg: Department of Physics, Chemistry and Biology/Bioinformatics, Linköping University, Linköping, Sweden
- Torbjörn E M Nordling: Department of Mechanical Engineering, National Cheng Kung University, Tainan, Taiwan
- Erik L L Sonnhammer: Department of Biochemistry and Biophysics, Stockholm Bioinformatics Center, Science for Life Laboratory, Stockholm University, Stockholm, Sweden
28
Guala D, Bernhem K, Blal HA, Jans D, Lundberg E, Brismar H, Sonnhammer ELL. Experimental validation of predicted cancer genes using FRET. Methods Appl Fluoresc 2018; 6:035007. [PMID: 29570091 DOI: 10.1088/2050-6120/aab932] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 11/11/2022]
Abstract
Huge amounts of data are generated in genome wide experiments, designed to investigate diseases with complex genetic causes. Follow up of all potential leads produced by such experiments is currently cost prohibitive and time consuming. Gene prioritization tools alleviate these constraints by directing further experimental efforts towards the most promising candidate targets. Recently a gene prioritization tool called MaxLink was shown to outperform other widely used state-of-the-art prioritization tools in a large scale in silico benchmark. An experimental validation of predictions made by MaxLink has however been lacking. In this study we used Fluorescence Resonance Energy Transfer, an established experimental technique for detection of protein-protein interactions, to validate potential cancer genes predicted by MaxLink. Our results provide confidence in the use of MaxLink for selection of new targets in the battle with polygenic diseases.
Affiliation(s)
- Dimitri Guala
- Science for Life Laboratory, Stockholm Bioinformatics Center, Department of Biochemistry and Biophysics, Stockholm University, Box 1031, 17121 Solna, Sweden
|
29
|
Carreras-Puigvert J, Zitnik M, Jemth AS, Carter M, Unterlass JE, Hallström B, Loseva O, Karem Z, Calderón-Montaño JM, Lindskog C, Edqvist PH, Matuszewski DJ, Ait Blal H, Berntsson RPA, Häggblad M, Martens U, Studham M, Lundgren B, Wählby C, Sonnhammer ELL, Lundberg E, Stenmark P, Zupan B, Helleday T. A comprehensive structural, biochemical and biological profiling of the human NUDIX hydrolase family. Nat Commun 2017; 8:1541. [PMID: 29142246 PMCID: PMC5688067 DOI: 10.1038/s41467-017-01642-w] [Citation(s) in RCA: 86] [Impact Index Per Article: 12.3] [Received: 10/12/2016] [Accepted: 10/06/2017] [Indexed: 01/04/2023] Open
Abstract
The NUDIX enzymes are involved in cellular metabolism and homeostasis, as well as mRNA processing. Although highly conserved throughout all organisms, their biological roles and biochemical redundancies remain largely unclear. To address this, we globally resolve their individual properties and inter-relationships. We purify 18 of the human NUDIX proteins and screen 52 substrates, providing a substrate redundancy map. Using crystal structures, we generate sequence alignment analyses revealing four major structural classes. To a certain extent, their substrate preference redundancies correlate with structural classes, thus linking structure and activity relationships. To elucidate interdependence among the NUDIX hydrolases, we pairwise deplete them generating an epistatic interaction map, evaluate cell cycle perturbations upon knockdown in normal and cancer cells, and analyse their protein and mRNA expression in normal and cancer tissues. Using a novel FUSION algorithm, we integrate all data creating a comprehensive NUDIX enzyme profile map, which will prove fundamental to understanding their biological functionality.
Affiliation(s)
- Jordi Carreras-Puigvert
- Division of Translational Medicine and Chemical Biology, Science for Life Laboratory, Department of Molecular Biochemistry and Biophysics, Karolinska Institutet, Stockholm, 171 65, Sweden.
| | - Marinka Zitnik
- Faculty of Computer and Information Science, University of Ljubljana, SI-1000, Ljubljana, Slovenia
- Department of Computer Science, Stanford University, Palo Alto, CA, 94305, USA
| | - Ann-Sofie Jemth
- Division of Translational Medicine and Chemical Biology, Science for Life Laboratory, Department of Molecular Biochemistry and Biophysics, Karolinska Institutet, Stockholm, 171 65, Sweden
| | - Megan Carter
- Department of Biochemistry and Biophysics, Stockholm University, 106 91, Stockholm, Sweden
| | - Judith E Unterlass
- Division of Translational Medicine and Chemical Biology, Science for Life Laboratory, Department of Molecular Biochemistry and Biophysics, Karolinska Institutet, Stockholm, 171 65, Sweden
| | - Björn Hallström
- Cell Profiling-Affinity Proteomics, Science for Life Laboratory, KTH-Royal Institute of Technology, Stockholm, 17165, Sweden
| | - Olga Loseva
- Division of Translational Medicine and Chemical Biology, Science for Life Laboratory, Department of Molecular Biochemistry and Biophysics, Karolinska Institutet, Stockholm, 171 65, Sweden
| | - Zhir Karem
- Division of Translational Medicine and Chemical Biology, Science for Life Laboratory, Department of Molecular Biochemistry and Biophysics, Karolinska Institutet, Stockholm, 171 65, Sweden
| | - José Manuel Calderón-Montaño
- Division of Translational Medicine and Chemical Biology, Science for Life Laboratory, Department of Molecular Biochemistry and Biophysics, Karolinska Institutet, Stockholm, 171 65, Sweden
| | - Cecilia Lindskog
- Department of Immunology, Genetics and Pathology, Science for Life Laboratory, 751 85, Uppsala, Sweden
| | - Per-Henrik Edqvist
- Department of Immunology, Genetics and Pathology, Science for Life Laboratory, 751 85, Uppsala, Sweden
| | - Damian J Matuszewski
- Centre for Image Analysis and Science for Life Laboratory, Uppsala University, Uppsala, 751 05, Sweden
| | - Hammou Ait Blal
- Cell Profiling-Affinity Proteomics, Science for Life Laboratory, KTH-Royal Institute of Technology, Stockholm, 17165, Sweden
| | - Ronnie P A Berntsson
- Department of Biochemistry and Biophysics, Stockholm University, 106 91, Stockholm, Sweden
| | - Maria Häggblad
- Biochemical and Cellular Screening Facility, Science for Life Laboratory, Department of Biochemistry and Biophysics, Stockholm University, Stockholm, 171 65, Sweden
| | - Ulf Martens
- Biochemical and Cellular Screening Facility, Science for Life Laboratory, Department of Biochemistry and Biophysics, Stockholm University, Stockholm, 171 65, Sweden
| | - Matthew Studham
- Stockholm Bioinformatics Center, Science for Life Laboratory, Department of Biochemistry and Biophysics, Stockholm University, Box 1031, 171 21, Solna, Sweden
| | - Bo Lundgren
- Biochemical and Cellular Screening Facility, Science for Life Laboratory, Department of Biochemistry and Biophysics, Stockholm University, Stockholm, 171 65, Sweden
| | - Carolina Wählby
- Centre for Image Analysis and Science for Life Laboratory, Uppsala University, Uppsala, 751 05, Sweden
| | - Erik L L Sonnhammer
- Stockholm Bioinformatics Center, Science for Life Laboratory, Department of Biochemistry and Biophysics, Stockholm University, Box 1031, 171 21, Solna, Sweden
| | - Emma Lundberg
- Cell Profiling-Affinity Proteomics, Science for Life Laboratory, KTH-Royal Institute of Technology, Stockholm, 17165, Sweden
| | - Pål Stenmark
- Department of Biochemistry and Biophysics, Stockholm University, 106 91, Stockholm, Sweden
| | - Blaz Zupan
- Faculty of Computer and Information Science, University of Ljubljana, SI-1000, Ljubljana, Slovenia
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, 77030, USA
| | - Thomas Helleday
- Division of Translational Medicine and Chemical Biology, Science for Life Laboratory, Department of Molecular Biochemistry and Biophysics, Karolinska Institutet, Stockholm, 171 65, Sweden.
|
30
|
Abstract
In order to maximize the use of results from high-throughput experimental studies, e.g. GWAS, for identification and diagnostics of new disease-associated genes, it is important to have properly analyzed and benchmarked gene prioritization tools. While prospective benchmarks are underpowered to provide statistically significant results in their attempt to differentiate the performance of gene prioritization tools, a strategy for retrospective benchmarking has been missing, and new tools usually only provide internal validations. The Gene Ontology (GO) contains genes clustered around annotation terms. This intrinsic property of GO can be utilized in construction of robust benchmarks that are objective with respect to the problem domain. We demonstrate how this can be achieved for network-based gene prioritization tools, utilizing the FunCoup network. We use cross-validation and a set of appropriate performance measures to compare state-of-the-art gene prioritization algorithms: three based on network diffusion (NetRank and two implementations of Random Walk with Restart) and MaxLink, which utilizes network neighborhood. Our benchmark suite provides a systematic and objective way to compare the multitude of available and future gene prioritization tools, enabling researchers to select the best gene prioritization tool for the task at hand, and helping to guide the development of more accurate methods.
Affiliation(s)
- Dimitri Guala
- Stockholm Bioinformatics Center, Department of Biochemistry and Biophysics, Stockholm University, Science for Life Laboratory, Box 1031, 17121 Solna, Sweden
| | - Erik L L Sonnhammer
- Stockholm Bioinformatics Center, Department of Biochemistry and Biophysics, Stockholm University, Science for Life Laboratory, Box 1031, 17121 Solna, Sweden
|
31
|
Kaduk M, Riegler C, Lemp O, Sonnhammer ELL. HieranoiDB: a database of orthologs inferred by Hieranoid. Nucleic Acids Res 2017; 45:D687-D690. [PMID: 27742821 PMCID: PMC5210627 DOI: 10.1093/nar/gkw923] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.0] [Received: 09/01/2016] [Revised: 09/30/2016] [Accepted: 10/05/2016] [Indexed: 02/04/2023] Open
Abstract
HieranoiDB (http://hieranoiDB.sbc.su.se) is a freely available on-line database for hierarchical groups of orthologs inferred by the Hieranoid algorithm. It infers orthologs at each node in a species guide tree with the InParanoid algorithm as it progresses from the leaves to the root. Here we present a database HieranoiDB with a web interface that makes it easy to search and visualize the output of Hieranoid, and to download it in various formats. Searching can be performed using protein description, identifier or sequence. In this first version, orthologs are available for the 66 Quest for Orthologs reference proteomes. The ortholog trees are shown graphically and interactively with marked speciation and duplication nodes that show the inferred evolutionary scenario, and allow for correct extraction of predicted orthologs from the Hieranoid trees.
Affiliation(s)
- Mateusz Kaduk
- Stockholm Bioinformatics Center, Department of Biochemistry and Biophysics, Stockholm University, Science for Life Laboratory, Box 1031, 17121 Solna, Sweden
| | - Christian Riegler
- Stockholm Bioinformatics Center, Department of Biochemistry and Biophysics, Stockholm University, Science for Life Laboratory, Box 1031, 17121 Solna, Sweden
- FH OÖ - University of Applied Sciences Upper Austria, Hagenberg 4232, Austria
| | - Oliver Lemp
- Stockholm Bioinformatics Center, Department of Biochemistry and Biophysics, Stockholm University, Science for Life Laboratory, Box 1031, 17121 Solna, Sweden
- FH OÖ - University of Applied Sciences Upper Austria, Hagenberg 4232, Austria
| | - Erik L L Sonnhammer
- Stockholm Bioinformatics Center, Department of Biochemistry and Biophysics, Stockholm University, Science for Life Laboratory, Box 1031, 17121 Solna, Sweden
|
32
|
Tjärnberg A, Morgan DC, Studham M, Nordling TEM, Sonnhammer ELL. GeneSPIDER – gene regulatory network inference benchmarking with controlled network and data properties. Mol BioSyst 2017; 13:1304-1312. [DOI: 10.1039/c7mb00058h] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.1] [Indexed: 01/07/2023]
Abstract
A key question in network inference, that has not been properly answered, is what accuracy can be expected for a given biological dataset and inference method.
Affiliation(s)
- Andreas Tjärnberg
- Stockholm Bioinformatics Center
- Science for Life Laboratory
- Sweden
- Department of Biochemistry and Biophysics
- Stockholm University
| | - Daniel C. Morgan
- Stockholm Bioinformatics Center
- Science for Life Laboratory
- Sweden
- Department of Biochemistry and Biophysics
- Stockholm University
| | - Matthew Studham
- Stockholm Bioinformatics Center
- Science for Life Laboratory
- Sweden
- Department of Biochemistry and Biophysics
- Stockholm University
| | - Torbjörn E. M. Nordling
- Stockholm Bioinformatics Center
- Science for Life Laboratory
- Sweden
- Department of Mechanical Engineering
- National Cheng Kung University
| | - Erik L. L. Sonnhammer
- Stockholm Bioinformatics Center
- Science for Life Laboratory
- Sweden
- Department of Biochemistry and Biophysics
- Stockholm University
|
33
|
Ogris C, Guala D, Helleday T, Sonnhammer ELL. A novel method for crosstalk analysis of biological networks: improving accuracy of pathway annotation. Nucleic Acids Res 2016; 45:e8. [PMID: 27664219 PMCID: PMC5314790 DOI: 10.1093/nar/gkw849] [Citation(s) in RCA: 29] [Impact Index Per Article: 3.6] [Received: 06/03/2016] [Revised: 08/17/2016] [Accepted: 08/23/2016] [Indexed: 12/13/2022] Open
Abstract
Analyzing gene expression patterns is a mainstay for gaining functional insights into biological systems. A plethora of tools exist to identify significant enrichment of pathways for a set of differentially expressed genes. Most tools analyze gene overlap between gene sets and are therefore severely hampered by the current state of pathway annotation, yet at the same time they run a high risk of false assignments. A way to improve both true positive and false positive rates (FPRs) is to use a functional association network and instead look for enrichment of network connections between gene sets. We present a new network crosstalk analysis method, BinoX, that determines the statistical significance of network link enrichment or depletion between gene sets, using the binomial distribution. This is a much more appropriate statistical model than previous methods have employed, and as a result BinoX yields substantially better true positive and FPRs than was possible before. A number of benchmarks were performed to assess the accuracy of BinoX and competing methods. We demonstrate examples of how BinoX finds many biologically meaningful pathway annotations for gene sets from cancer and other diseases, which are not found by other methods. BinoX is available at http://sonnhammer.org/BinoX.
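The idea of testing link enrichment or depletion with a binomial distribution can be illustrated with a minimal sketch. This is not the BinoX implementation: the function names and the inputs `n_possible` (number of possible links between the two gene sets) and `p_link` (per-link probability under the null model, which a real tool would have to estimate, for example from randomized networks) are hypothetical and chosen only for illustration.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k successes in n Bernoulli(p) trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def crosstalk_pvalue(observed, n_possible, p_link):
    """One-sided binomial test for links between two gene sets.

    Hypothetical inputs (assumptions, not taken from the paper):
      observed   -- number of network links seen between the sets
      n_possible -- number of links that could exist between them
      p_link     -- null probability that any one possible link exists
    Returns ("enrichment" | "depletion", p-value).
    """
    expected = n_possible * p_link
    if observed >= expected:
        # Upper tail: P(X >= observed)
        tail = sum(binom_pmf(i, n_possible, p_link)
                   for i in range(observed, n_possible + 1))
        return "enrichment", min(tail, 1.0)
    # Lower tail: P(X <= observed)
    tail = sum(binom_pmf(i, n_possible, p_link)
               for i in range(0, observed + 1))
    return "depletion", min(tail, 1.0)


if __name__ == "__main__":
    print(crosstalk_pvalue(3, 3, 0.5))  # all possible links present
    print(crosstalk_pvalue(0, 4, 0.5))  # no links although two are expected
```

The sketch reports the tail in the direction of the deviation from the expected link count, which mirrors the abstract's distinction between enrichment and depletion; how the null parameters are obtained is left open here.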
Affiliation(s)
- Christoph Ogris
- Stockholm Bioinformatics Center, Department of Biochemistry and Biophysics, Stockholm University, Science for Life Laboratory, Box 1031, 17121 Solna, Sweden
| | - Dimitri Guala
- Stockholm Bioinformatics Center, Department of Biochemistry and Biophysics, Stockholm University, Science for Life Laboratory, Box 1031, 17121 Solna, Sweden
| | - Thomas Helleday
- Division of Translational Medicine and Chemical Biology, Department of Medical Biochemistry and Biophysics, Karolinska Institutet, Science for Life Laboratory, Box 1031, 17121 Solna, Sweden
| | - Erik L L Sonnhammer
- Stockholm Bioinformatics Center, Department of Biochemistry and Biophysics, Stockholm University, Science for Life Laboratory, Box 1031, 17121 Solna, Sweden
|
34
|
Abstract
Motivation: Over the last decades, vast numbers of sequences were deposited in public databases. Bioinformatics tools allow homology and consequently functional inference for these sequences. New profile-based homology search tools have been introduced, allowing reliable detection of remote homologs, but have not been systematically benchmarked. To provide such a comparison, which can guide bioinformatics workflows, we extend and apply our previously developed benchmark approach to evaluate the ‘next generation’ of profile-based approaches, including CS-BLAST, HHSEARCH and PHMMER, in comparison with the non-profile based search tools NCBI-BLAST, USEARCH, UBLAST and FASTA. Method: We generated challenging benchmark datasets based on protein domain architectures within either the PFAM + Clan, SCOP/Superfamily or CATH/Gene3D domain definition schemes. From each dataset, homologous and non-homologous protein pairs were aligned using each tool, and standard performance metrics calculated. We further measured congruence of domain architecture assignments in the three domain databases. Results: CS-BLAST and PHMMER had overall highest accuracy. FASTA, UBLAST and USEARCH showed large trade-offs of accuracy for speed optimization. Conclusion: Profile methods are superior at inferring remote homologs but the difference in accuracy between methods is relatively small. PHMMER and CS-BLAST stand out with the highest accuracy, yet still at a reasonable computational cost. Additionally, we show that less than 0.1% of Swiss-Prot protein pairs considered homologous by one database are considered non-homologous by another, implying that these classifications represent equivalent underlying biological phenomena, differing mostly in coverage and granularity. Availability and Implementation: Benchmark datasets and all scripts are placed at http://sonnhammer.org/download/Homology_benchmark.
Contact: forslund@embl.de Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Ganapathi Varma Saripella
- Science for Life Laboratory, Stockholm Bioinformatics Center, Department of Biochemistry and Biophysics, Stockholm University, Stockholm SE-10691, Sweden
| | - Erik L L Sonnhammer
- Science for Life Laboratory, Stockholm Bioinformatics Center, Department of Biochemistry and Biophysics, Stockholm University, Stockholm SE-10691, Sweden
| | - Kristoffer Forslund
- European Molecular Biology Laboratory, Structural and Computational Biology Unit, Heidelberg 69117, Germany
|
35
|
Ogris C, Helleday T, Sonnhammer ELL. PathwAX: a web server for network crosstalk based pathway annotation. Nucleic Acids Res 2016; 44:W105-9. [PMID: 27151197 PMCID: PMC4987909 DOI: 10.1093/nar/gkw356] [Citation(s) in RCA: 34] [Impact Index Per Article: 4.3] [Received: 02/10/2016] [Accepted: 04/19/2016] [Indexed: 12/22/2022] Open
Abstract
Pathway annotation of gene lists is often used to functionally analyse biomolecular data such as gene expression in order to establish which processes are activated in a given experiment. Databases such as KEGG or GO represent collections of how genes are known to be organized in pathways, and the challenge is to compare a given gene list with the known pathways such that all true relations are identified. Most tools apply statistical measures to the gene overlap between the gene list and pathway. It is however problematic to avoid false negatives and false positives when only using the gene overlap. The pathwAX web server (http://pathwAX.sbc.su.se/) applies a different approach which is based on network crosstalk. It uses the comprehensive network FunCoup to analyse network crosstalk between a query gene list and KEGG pathways. PathwAX runs the BinoX algorithm, which employs Monte-Carlo sampling of randomized networks and estimates a binomial distribution, for estimating the statistical significance of the crosstalk. This results in substantially higher accuracy than gene overlap methods. The system was optimized for speed and allows interactive web usage. We illustrate the usage and output of pathwAX.
Affiliation(s)
- Christoph Ogris
- Stockholm Bioinformatics Center, Department of Biochemistry and Biophysics, Stockholm University, Science for Life Laboratory, Box 1031, 17121 Solna, Sweden
| | - Thomas Helleday
- Division of Translational Medicine and Chemical Biology, Department of Medical Biochemistry and Biophysics, Karolinska Institutet, Science for Life Laboratory, Box 1031, 17121 Solna, Sweden
| | - Erik L L Sonnhammer
- Stockholm Bioinformatics Center, Department of Biochemistry and Biophysics, Stockholm University, Science for Life Laboratory, Box 1031, 17121 Solna, Sweden
|
36
|
Tjärnberg A, Nordling TEM, Studham M, Nelander S, Sonnhammer ELL. Avoiding pitfalls in L1-regularised inference of gene networks. Mol BioSyst 2015; 11:287-96. [DOI: 10.1039/c4mb00419a] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.7] [Indexed: 01/15/2023]
Abstract
L1 regularisation methods fail to infer the correct network even when the data are so informative that all existing links can be proven to exist.
Affiliation(s)
- Andreas Tjärnberg
- Stockholm Bioinformatics Centre
- Science for Life Laboratory
- 17121 Solna
- Sweden
- Department of Biochemistry and Biophysics
| | - Torbjörn E. M. Nordling
- Stockholm Bioinformatics Centre
- Science for Life Laboratory
- 17121 Solna
- Sweden
- Department of Immunology
| | - Matthew Studham
- Stockholm Bioinformatics Centre
- Science for Life Laboratory
- 17121 Solna
- Sweden
| | - Sven Nelander
- Department of Immunology
- Genetics and Pathology
- Uppsala University
- Rudbeck laboratory
- 75185 Uppsala
| | - Erik L. L. Sonnhammer
- Stockholm Bioinformatics Centre
- Science for Life Laboratory
- 17121 Solna
- Sweden
- Department of Biochemistry and Biophysics
|
37
|
Abstract
The InParanoid database (http://InParanoid.sbc.su.se) provides a user interface to orthologs inferred by the InParanoid algorithm. As there are now international efforts to curate and standardize complete proteomes, we have switched to using these resources rather than gathering and curating the proteomes ourselves. InParanoid release 8 is based on the 66 reference proteomes that the ‘Quest for Orthologs’ community has agreed on using, plus 207 additional proteomes from the UniProt complete proteomes—in total 273 species. These represent 246 eukaryotes, 20 bacteria and seven archaea. Compared to the previous release, this increases the number of species by 173% and the number of pairwise species comparisons by 650%. In turn, the number of ortholog groups has increased by 423%. We present the contents and usages of InParanoid 8, and a detailed analysis of how the proteome content has changed since the previous release.
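The figures quoted in this abstract can be cross-checked with a few lines of arithmetic. The sketch below makes two assumptions that are not stated in the abstract: that the previous release size can be back-calculated from the "173%" increase, and that a "pairwise species comparison" means one unordered pair of distinct species, i.e. n(n-1)/2 comparisons for n species. All function and variable names are illustrative.

```python
def pairwise_comparisons(n_species):
    """Number of unordered pairs of distinct species (assumed definition)."""
    return n_species * (n_species - 1) // 2


def percent_increase(old, new):
    """Relative growth from old to new, in percent."""
    return 100.0 * (new - old) / old


# Counts stated in the abstract.
reference_proteomes = 66
additional_proteomes = 207
eukaryotes, bacteria, archaea = 246, 20, 7

current_species = reference_proteomes + additional_proteomes
assert current_species == eukaryotes + bacteria + archaea  # both sum to 273

# Inferred, not stated: a 173% increase means current = 2.73 * previous.
previous_species = round(current_species / 2.73)

species_growth = percent_increase(previous_species, current_species)
pair_growth = percent_increase(
    pairwise_comparisons(previous_species),
    pairwise_comparisons(current_species),
)

if __name__ == "__main__":
    print(previous_species, current_species)
    print(round(species_growth), round(pair_growth))
```

Under these assumptions the pairwise growth comes out at roughly 650%, consistent with the abstract, which illustrates how the number of comparisons grows quadratically with the number of proteomes.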
Affiliation(s)
- Erik L L Sonnhammer
- Stockholm Bioinformatics Center, Department of Biochemistry and Biophysics, Stockholm University, Science for Life Laboratory, Box 1031, SE-17121 Solna, Sweden
| | - Gabriel Östlund
- Stockholm Bioinformatics Center, Department of Biochemistry and Biophysics, Stockholm University, Science for Life Laboratory, Box 1031, SE-17121 Solna, Sweden
|
38
|
Studham ME, Tjärnberg A, Nordling TEM, Nelander S, Sonnhammer ELL. Functional association networks as priors for gene regulatory network inference. Bioinformatics 2014; 30:i130-8. [PMID: 24931976 PMCID: PMC4058914 DOI: 10.1093/bioinformatics/btu285] [Citation(s) in RCA: 55] [Impact Index Per Article: 5.5] [Indexed: 12/11/2022]
Abstract
Motivation: Gene regulatory network (GRN) inference reveals the influences genes have on one another in cellular regulatory systems. If the experimental data are inadequate for reliable inference of the network, informative priors have been shown to improve the accuracy of inferences. Results: This study explores the potential of undirected, confidence-weighted networks, such as those in functional association databases, as a prior source for GRN inference. Such networks often erroneously indicate symmetric interaction between genes and may contain mostly correlation-based interaction information. Despite these drawbacks, our testing on synthetic datasets indicates that even noisy priors reflect some causal information that can improve GRN inference accuracy. Our analysis on yeast data indicates that using the functional association databases FunCoup and STRING as priors can give a small improvement in GRN inference accuracy with biological data. Contact: matthew.studham@scilifelab.se Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Matthew E Studham
- Stockholm Bioinformatics Centre, Science for Life Laboratory, SE-171 65 Solna, Sweden, Department of Biochemistry and Biophysics, Stockholm University, SE-106 91 Stockholm, Sweden, Department of Immunology, Genetics and Pathology, Uppsala University, Rudbeck Laboratory, SE-751 05 Uppsala, Sweden and Swedish eScience Research Center, SE-100 44 Stockholm, Sweden
| | - Andreas Tjärnberg
- Stockholm Bioinformatics Centre, Science for Life Laboratory, SE-171 65 Solna, Sweden, Department of Biochemistry and Biophysics, Stockholm University, SE-106 91 Stockholm, Sweden, Department of Immunology, Genetics and Pathology, Uppsala University, Rudbeck Laboratory, SE-751 05 Uppsala, Sweden and Swedish eScience Research Center, SE-100 44 Stockholm, Sweden
| | - Torbjörn E M Nordling
- Stockholm Bioinformatics Centre, Science for Life Laboratory, SE-171 65 Solna, Sweden, Department of Biochemistry and Biophysics, Stockholm University, SE-106 91 Stockholm, Sweden, Department of Immunology, Genetics and Pathology, Uppsala University, Rudbeck Laboratory, SE-751 05 Uppsala, Sweden and Swedish eScience Research Center, SE-100 44 Stockholm, Sweden
| | - Sven Nelander
- Stockholm Bioinformatics Centre, Science for Life Laboratory, SE-171 65 Solna, Sweden, Department of Biochemistry and Biophysics, Stockholm University, SE-106 91 Stockholm, Sweden, Department of Immunology, Genetics and Pathology, Uppsala University, Rudbeck Laboratory, SE-751 05 Uppsala, Sweden and Swedish eScience Research Center, SE-100 44 Stockholm, Sweden
| | - Erik L L Sonnhammer
- Stockholm Bioinformatics Centre, Science for Life Laboratory, SE-171 65 Solna, Sweden, Department of Biochemistry and Biophysics, Stockholm University, SE-106 91 Stockholm, Sweden, Department of Immunology, Genetics and Pathology, Uppsala University, Rudbeck Laboratory, SE-751 05 Uppsala, Sweden and Swedish eScience Research Center, SE-100 44 Stockholm, Sweden
|
39
|
Sonnhammer ELL, Gabaldón T, Sousa da Silva AW, Martin M, Robinson-Rechavi M, Boeckmann B, Thomas PD, Dessimoz C. Big data and other challenges in the quest for orthologs. Bioinformatics 2014; 30:2993-8. [PMID: 25064571 PMCID: PMC4201156 DOI: 10.1093/bioinformatics/btu492] [Citation(s) in RCA: 98] [Impact Index Per Article: 9.8] [Indexed: 01/29/2023] Open
Abstract
Given the rapid increase of species with a sequenced genome, the need to identify orthologous genes between them has emerged as a central bioinformatics task. Many different methods exist for orthology detection, which makes it difficult to decide which one to choose for a particular application. Here, we review the latest developments and issues in the orthology field, and summarize the most recent results reported at the third ‘Quest for Orthologs’ meeting. We focus on community efforts such as the adoption of reference proteomes, standard file formats and benchmarking. Progress in these areas is good, and they are already beneficial to both orthology consumers and providers. However, a major current issue is that the massive increase in complete proteomes poses computational challenges to many of the ortholog database providers, as most orthology inference algorithms scale at least quadratically with the number of proteomes. The Quest for Orthologs consortium is an open community with a number of working groups that join efforts to enhance various aspects of orthology analysis, such as defining standard formats and datasets, documenting community resources and benchmarking. Availability and implementation: All such materials are available at http://questfororthologs.org. Contact: erik.sonnhammer@scilifelab.se or c.dessimoz@ucl.ac.uk
Affiliation(s)
- Erik L L Sonnhammer
- Stockholm Bioinformatics Center, Science for Life Laboratory, Box 1031, SE-17121 Solna, Sweden, Swedish eScience Research Center, Stockholm, Department of Biochemistry and Biophysics, Stockholm University, SE-106 91 Stockholm, Sweden, Bioinformatics and Genomics Programme, Centre for Genomic Regulation (CRG), 08003 Barcelona, Spain, Universitat Pompeu Fabra (UPF), 08003 Barcelona, Spain, Institució Catalana de Recerca i Estudis Avançats (ICREA), 08010 Barcelona, Spain, EMBL-European Bioinformatics Institute, Hinxton CB10 1SD, UK, Department of Ecology and Evolution, University of Lausanne, Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland, SwissProt, Swiss Institute of Bioinformatics, 1211 Geneva, Switzerland, Division of Bioinformatics, Department of Preventive Medicine, University of Southern California, Los Angeles, CA 90089, USA and Department of Genetics, Evolution and Environment, and Department of Computer Science, University College London, Gower St, London WC1E 6BT, UK Stockholm Bioinformatics Center, Science for Life Laboratory, Box 1031, SE-17121 Solna, Sweden, Swedish eScience Research Center, Stockholm, Department of Biochemistry and Biophysics, Stockholm University, SE-106 91 Stockholm, Sweden, Bioinformatics and Genomics Programme, Centre for Genomic Regulation (CRG), 08003 Barcelona, Spain, Universitat Pompeu Fabra (UPF), 08003 Barcelona, Spain, Institució Catalana de Recerca i Estudis Avançats (ICREA), 08010 Barcelona, Spain, EMBL-European Bioinformatics Institute, Hinxton CB10 1SD, UK, Department of Ecology and Evolution, University of Lausanne, Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland, SwissProt, Swiss Institute of Bioinformatics, 1211 Geneva, Switzerland, Division of Bioinformatics, Department of Preventive Medicine, University of Southern California, Los Angeles, CA 90089, USA and Department of Genetics, Evolution and Environment, and Department of Computer Science, University College London, Gower St, London
- Toni Gabaldón
- Stockholm Bioinformatics Center, Science for Life Laboratory, Box 1031, SE-17121 Solna, Sweden, Swedish eScience Research Center, Stockholm, Department of Biochemistry and Biophysics, Stockholm University, SE-106 91 Stockholm, Sweden, Bioinformatics and Genomics Programme, Centre for Genomic Regulation (CRG), 08003 Barcelona, Spain, Universitat Pompeu Fabra (UPF), 08003 Barcelona, Spain, Institució Catalana de Recerca i Estudis Avançats (ICREA), 08010 Barcelona, Spain, EMBL-European Bioinformatics Institute, Hinxton CB10 1SD, UK, Department of Ecology and Evolution, University of Lausanne, Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland, SwissProt, Swiss Institute of Bioinformatics, 1211 Geneva, Switzerland, Division of Bioinformatics, Department of Preventive Medicine, University of Southern California, Los Angeles, CA 90089, USA and Department of Genetics, Evolution and Environment, and Department of Computer Science, University College London, Gower St, London WC1E 6BT, UK
- Alan W Sousa da Silva
- Stockholm Bioinformatics Center, Science for Life Laboratory, Box 1031, SE-17121 Solna, Sweden, Swedish eScience Research Center, Stockholm, Department of Biochemistry and Biophysics, Stockholm University, SE-106 91 Stockholm, Sweden, Bioinformatics and Genomics Programme, Centre for Genomic Regulation (CRG), 08003 Barcelona, Spain, Universitat Pompeu Fabra (UPF), 08003 Barcelona, Spain, Institució Catalana de Recerca i Estudis Avançats (ICREA), 08010 Barcelona, Spain, EMBL-European Bioinformatics Institute, Hinxton CB10 1SD, UK, Department of Ecology and Evolution, University of Lausanne, Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland, SwissProt, Swiss Institute of Bioinformatics, 1211 Geneva, Switzerland, Division of Bioinformatics, Department of Preventive Medicine, University of Southern California, Los Angeles, CA 90089, USA and Department of Genetics, Evolution and Environment, and Department of Computer Science, University College London, Gower St, London WC1E 6BT, UK
- Maria Martin
- Stockholm Bioinformatics Center, Science for Life Laboratory, Box 1031, SE-17121 Solna, Sweden, Swedish eScience Research Center, Stockholm, Department of Biochemistry and Biophysics, Stockholm University, SE-106 91 Stockholm, Sweden, Bioinformatics and Genomics Programme, Centre for Genomic Regulation (CRG), 08003 Barcelona, Spain, Universitat Pompeu Fabra (UPF), 08003 Barcelona, Spain, Institució Catalana de Recerca i Estudis Avançats (ICREA), 08010 Barcelona, Spain, EMBL-European Bioinformatics Institute, Hinxton CB10 1SD, UK, Department of Ecology and Evolution, University of Lausanne, Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland, SwissProt, Swiss Institute of Bioinformatics, 1211 Geneva, Switzerland, Division of Bioinformatics, Department of Preventive Medicine, University of Southern California, Los Angeles, CA 90089, USA and Department of Genetics, Evolution and Environment, and Department of Computer Science, University College London, Gower St, London WC1E 6BT, UK
- Marc Robinson-Rechavi
- Stockholm Bioinformatics Center, Science for Life Laboratory, Box 1031, SE-17121 Solna, Sweden, Swedish eScience Research Center, Stockholm, Department of Biochemistry and Biophysics, Stockholm University, SE-106 91 Stockholm, Sweden, Bioinformatics and Genomics Programme, Centre for Genomic Regulation (CRG), 08003 Barcelona, Spain, Universitat Pompeu Fabra (UPF), 08003 Barcelona, Spain, Institució Catalana de Recerca i Estudis Avançats (ICREA), 08010 Barcelona, Spain, EMBL-European Bioinformatics Institute, Hinxton CB10 1SD, UK, Department of Ecology and Evolution, University of Lausanne, Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland, SwissProt, Swiss Institute of Bioinformatics, 1211 Geneva, Switzerland, Division of Bioinformatics, Department of Preventive Medicine, University of Southern California, Los Angeles, CA 90089, USA and Department of Genetics, Evolution and Environment, and Department of Computer Science, University College London, Gower St, London WC1E 6BT, UK
- Brigitte Boeckmann
- Stockholm Bioinformatics Center, Science for Life Laboratory, Box 1031, SE-17121 Solna, Sweden, Swedish eScience Research Center, Stockholm, Department of Biochemistry and Biophysics, Stockholm University, SE-106 91 Stockholm, Sweden, Bioinformatics and Genomics Programme, Centre for Genomic Regulation (CRG), 08003 Barcelona, Spain, Universitat Pompeu Fabra (UPF), 08003 Barcelona, Spain, Institució Catalana de Recerca i Estudis Avançats (ICREA), 08010 Barcelona, Spain, EMBL-European Bioinformatics Institute, Hinxton CB10 1SD, UK, Department of Ecology and Evolution, University of Lausanne, Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland, SwissProt, Swiss Institute of Bioinformatics, 1211 Geneva, Switzerland, Division of Bioinformatics, Department of Preventive Medicine, University of Southern California, Los Angeles, CA 90089, USA and Department of Genetics, Evolution and Environment, and Department of Computer Science, University College London, Gower St, London WC1E 6BT, UK
- Paul D Thomas
- Stockholm Bioinformatics Center, Science for Life Laboratory, Box 1031, SE-17121 Solna, Sweden, Swedish eScience Research Center, Stockholm, Department of Biochemistry and Biophysics, Stockholm University, SE-106 91 Stockholm, Sweden, Bioinformatics and Genomics Programme, Centre for Genomic Regulation (CRG), 08003 Barcelona, Spain, Universitat Pompeu Fabra (UPF), 08003 Barcelona, Spain, Institució Catalana de Recerca i Estudis Avançats (ICREA), 08010 Barcelona, Spain, EMBL-European Bioinformatics Institute, Hinxton CB10 1SD, UK, Department of Ecology and Evolution, University of Lausanne, Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland, SwissProt, Swiss Institute of Bioinformatics, 1211 Geneva, Switzerland, Division of Bioinformatics, Department of Preventive Medicine, University of Southern California, Los Angeles, CA 90089, USA and Department of Genetics, Evolution and Environment, and Department of Computer Science, University College London, Gower St, London WC1E 6BT, UK
- Christophe Dessimoz
- Stockholm Bioinformatics Center, Science for Life Laboratory, Box 1031, SE-17121 Solna, Sweden, Swedish eScience Research Center, Stockholm, Department of Biochemistry and Biophysics, Stockholm University, SE-106 91 Stockholm, Sweden, Bioinformatics and Genomics Programme, Centre for Genomic Regulation (CRG), 08003 Barcelona, Spain, Universitat Pompeu Fabra (UPF), 08003 Barcelona, Spain, Institució Catalana de Recerca i Estudis Avançats (ICREA), 08010 Barcelona, Spain, EMBL-European Bioinformatics Institute, Hinxton CB10 1SD, UK, Department of Ecology and Evolution, University of Lausanne, Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland, SwissProt, Swiss Institute of Bioinformatics, 1211 Geneva, Switzerland, Division of Bioinformatics, Department of Preventive Medicine, University of Southern California, Los Angeles, CA 90089, USA and Department of Genetics, Evolution and Environment, and Department of Computer Science, University College London, Gower St, London WC1E 6BT, UK
|
40
|
Abstract
Orthologs are an indispensable bridge to transfer biological knowledge between species, from protein annotations to sophisticated disease models. However, orthology assignment is not trivial. A large number of resources now exist, each with its own idiosyncrasies. The goal of this review is to compare their contents and clarify which database is most suited for a certain task.
Affiliation(s)
- Andrey Alexeyenko
- Stockholm Bioinformatics Center, Albanova, Stockholm University, SE-106 91, Stockholm, Sweden
- Julia Lindberg
- Stockholm Bioinformatics Center, Albanova, Stockholm University, SE-106 91, Stockholm, Sweden
- Asa Pérez-Bercoff
- Stockholm Bioinformatics Center, Albanova, Stockholm University, SE-106 91, Stockholm, Sweden
- Erik L L Sonnhammer
- Stockholm Bioinformatics Center, Albanova, Stockholm University, SE-106 91, Stockholm, Sweden.
|
41
|
Guala D, Sjölund E, Sonnhammer ELL. MaxLink: network-based prioritization of genes tightly linked to a disease seed set. Bioinformatics 2014; 30:2689-90. [DOI: 10.1093/bioinformatics/btu344] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.2] [Indexed: 01/28/2023]
|
42
|
Finn RD, Bateman A, Clements J, Coggill P, Eberhardt RY, Eddy SR, Heger A, Hetherington K, Holm L, Mistry J, Sonnhammer ELL, Tate J, Punta M. Pfam: the protein families database. Nucleic Acids Res 2013; 42:D222-30. [PMID: 24288371 PMCID: PMC3965110 DOI: 10.1093/nar/gkt1223] [Citation(s) in RCA: 4192] [Impact Index Per Article: 381.1] [Indexed: 01/17/2023]
Abstract
Pfam, available via servers in the UK (http://pfam.sanger.ac.uk/) and the USA (http://pfam.janelia.org/), is a widely used database of protein families, containing 14 831 manually curated entries in the current release, version 27.0. Since the last update article 2 years ago, we have generated 1182 new families and maintained sequence coverage of the UniProt Knowledgebase (UniProtKB) at nearly 80%, despite a 50% increase in the size of the underlying sequence database. Since our 2012 article describing Pfam, we have also undertaken a comprehensive review of the features that are provided by Pfam over and above the basic family data. For each feature, we determined the relevance, computational burden, usage statistics and the functionality of the feature in a website context. As a consequence of this review, we have removed some features, enhanced others and developed new ones to meet the changing demands of computational biology. Here, we describe the changes to Pfam content. Notably, we now provide family alignments based on four different representative proteome sequence data sets and a new interactive DNA search interface. We also discuss the mapping between Pfam and known 3D structures.
Affiliation(s)
- Robert D Finn
- HHMI Janelia Farm Research Campus, 19700 Helix Drive, Ashburn, VA 20147 USA, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK, Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK, MRC Functional Genomics Unit, Department of Physiology, Anatomy and Genetics, University of Oxford, Oxford, OX1 3QX, UK, Institute of Biotechnology and Department of Biological and Environmental Sciences, University of Helsinki, PO Box 56 (Viikinkaari 5), 00014 Helsinki, Finland and Stockholm Bioinformatics Center, Swedish eScience Research Center, Department of Biochemistry and Biophysics, Science for Life Laboratory, Stockholm University, PO Box 1031, SE-17121 Solna, Sweden
|
43
|
Abstract
We present an update of the FunCoup database (http://FunCoup.sbc.su.se) of functional couplings, or functional associations, between genes and gene products. Identifying these functional couplings is an important step in the understanding of higher level mechanisms performed by complex cellular processes. FunCoup distinguishes between four classes of couplings: participation in the same signaling cascade, participation in the same metabolic process, co-membership in a protein complex and physical interaction. For each of these four classes, several types of experimental and statistical evidence are combined by Bayesian integration to predict genome-wide functional coupling networks. The FunCoup framework has been completely re-implemented to allow for more frequent future updates. It contains many improvements, such as a regularization procedure to automatically downweight redundant evidences and a novel method to incorporate phylogenetic profile similarity. Several datasets have been updated and new data have been added in FunCoup 3.0. Furthermore, we have developed a new Web site, which provides powerful tools to explore the predicted networks and to retrieve detailed information about the data underlying each prediction.
Affiliation(s)
- Thomas Schmitt
- Stockholm Bioinformatics Centre, Science for Life Laboratory, Box 1031, Solna SE-17121, Sweden, Department of Biochemistry and Biophysics, Stockholm University and Swedish eScience Research Center
|
44
|
Abstract
Accurate inference of orthologs is essential in many research fields such as comparative genomics, molecular evolution, and genome annotation. Existing methods for genome-scale orthology inference are mostly based on all-versus-all similarity searches that scale quadratically with the number of species. This limits their application to the increasing number of available large-scale datasets. Here, we present Hieranoid, a new orthology inference method using a hierarchical approach. Hieranoid performs pairwise orthology analysis using InParanoid at each node in a guide tree as it progresses from its leaves to the root. This concept reduces the total runtime complexity from a quadratic to a linear function of the number of species. The tree hierarchy provides a natural structure in multi-species ortholog groups, and the aggregation of multiple sequences allows for multiple alignment similarity searching techniques, which can yield more accurate ortholog groups. Using the recently published orthobench benchmark, Hieranoid showed the overall best performance. Our progressive approach presents a new way to infer orthologs that combines efficient graph-based methodology with aspects of compute-intensive tree-based methods. The linear scaling with the number of species is a major advantage for large-scale applications and makes Hieranoid well suited to cope with vast amounts of sequenced genomes in the future. Hieranoid is open source and can be downloaded at Hieranoid.sbc.su.se.
Affiliation(s)
- Fabian Schreiber
- Stockholm Bioinformatics Center, Science for Life Laboratory, Box 1031, SE-17121 Solna, Sweden; Department of Biochemistry and Biophysics, Stockholm University, SE-10691 Stockholm, Sweden.
- Erik L L Sonnhammer
- Stockholm Bioinformatics Center, Science for Life Laboratory, Box 1031, SE-17121 Solna, Sweden; Department of Biochemistry and Biophysics, Stockholm University, SE-10691 Stockholm, Sweden; Swedish e-Science Research Center, SE-10044 Stockholm, Sweden
|
45
|
Abstract
Network analysis is an important tool for functional annotation of genes and proteins. A common approach to discern structure in a global network is to infer network clusters, or modules, and assume a functional coherence within each module, which may represent a complex or a pathway. It is, however, not trivial to define optimal modules. Although many methods have been proposed, it is unclear which methods perform best in general. It seems that most methods produce far from optimal results but in different ways. MGclus is a new algorithm designed to detect modules with a strongly interconnected neighborhood in large-scale biological interaction networks. In our benchmarks we found MGclus to outperform other methods when applied to random graphs with varying degrees of noise, and to perform equally or better when applied to biological protein interaction networks. MGclus is implemented in Java and utilizes the JGraphT graph library. It has an easy-to-use command-line interface and is available for download from .
Affiliation(s)
- Oliver Frings
- Stockholm Bioinformatics Centre, Science for Life Laboratory, Box 1031, SE-17121 Solna, Sweden
|
46
|
McCormack T, Frings O, Alexeyenko A, Sonnhammer ELL. Statistical assessment of crosstalk enrichment between gene groups in biological networks. PLoS One 2013; 8:e54945. [PMID: 23372799 PMCID: PMC3553069 DOI: 10.1371/journal.pone.0054945] [Citation(s) in RCA: 32] [Impact Index Per Article: 2.9] [Received: 10/14/2011] [Accepted: 12/21/2012] [Indexed: 11/19/2022]
Abstract
Motivation: Analyzing groups of functionally coupled genes or proteins in the context of global interaction networks has become an important aspect of bioinformatic investigations. Assessing the statistical significance of crosstalk enrichment between or within groups of genes can be a valuable tool for functional annotation of experimental gene sets. Results: Here we present CrossTalkZ, a statistical method and software to assess the significance of crosstalk enrichment between pairs of gene or protein groups in large biological networks. We demonstrate that the standard z-score is generally an appropriate and unbiased statistic. We further evaluate the ability of four different methods to reliably recover crosstalk within known biological pathways. We conclude that the methods preserving the second-order topological network properties perform best. Finally, we show how CrossTalkZ can be used to annotate experimental gene sets using known pathway annotations and that its performance at this task is superior to gene enrichment analysis (GEA). Availability and Implementation: CrossTalkZ (available at http://sonnhammer.sbc.su.se/download/software/CrossTalkZ/) is implemented in C++, easy to use, fast, accepts various input file formats, and produces a number of statistics. These include z-score, p-value, false discovery rate, and a test of normality for the null distributions.
Affiliation(s)
- Theodore McCormack
- Stockholm Bioinformatics Centre, Science for Life Laboratory, Solna, Sweden
- Department of Biochemistry and Biophysics, Stockholm University, Stockholm, Sweden
- Oliver Frings
- Stockholm Bioinformatics Centre, Science for Life Laboratory, Solna, Sweden
- Department of Biochemistry and Biophysics, Stockholm University, Stockholm, Sweden
- Andrey Alexeyenko
- Stockholm Bioinformatics Centre, Science for Life Laboratory, Solna, Sweden
- School of Biotechnology, Royal Institute of Technology, Stockholm, Sweden
- Erik L. L. Sonnhammer
- Stockholm Bioinformatics Centre, Science for Life Laboratory, Solna, Sweden
- Department of Biochemistry and Biophysics, Stockholm University, Stockholm, Sweden
- Swedish eScience Research Center, Stockholm, Sweden
|
47
|
Frings O, Mank JE, Alexeyenko A, Sonnhammer ELL. Network analysis of functional genomics data: application to avian sex-biased gene expression. ScientificWorldJournal 2012; 2012:130491. [PMID: 23319882 PMCID: PMC3540752 DOI: 10.1100/2012/130491] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 09/25/2012] [Accepted: 11/25/2012] [Indexed: 12/03/2022]
Abstract
Gene expression analysis is often used to investigate the molecular and functional underpinnings of a phenotype. However, differential expression of individual genes is limited in that it does not consider how the genes interact with each other in networks. To address this shortcoming we propose a number of network-based analyses that give additional functional insights into the studied process. These were applied to a dataset of sex-specific gene expression in the chicken gonad and brain at different developmental stages. We first constructed a global chicken interaction network. Combining the network with the expression data showed that most sex-biased genes tend to have lower network connectivity, that is, act within local network environments, although some interesting exceptions were found. Genes of the same sex bias were generally more strongly connected with each other than expected. We further studied the fates of duplicated sex-biased genes and found that there is a significant trend to keep the same pattern of sex bias after duplication. We also identified sex-biased modules in the network, which reveal pathways or complexes involved in sex-specific processes. Altogether, this work integrates evolutionary genomics with systems biology in a novel way, offering new insights into the modular nature of sex-biased genes.
Affiliation(s)
- Oliver Frings
- Stockholm Bioinformatics Centre, Science for Life Laboratory, Box 1031, SE-171 21 Solna, Sweden
|
48
|
Abstract
The identification of orthologs (gene pairs descended from a common ancestor through speciation, rather than duplication) has emerged as an essential component of many bioinformatics applications, ranging from the annotation of new genomes to experimental target prioritization. Yet, the development and application of orthology inference methods is hampered by the lack of consensus on source proteomes, file formats and benchmarks. The second ‘Quest for Orthologs’ meeting brought together stakeholders from various communities to address these challenges. We report on achievements and outcomes of this meeting, focusing on topics of particular relevance to the research community at large. The Quest for Orthologs consortium is an open community that welcomes contributions from all researchers interested in orthology research and applications. Contact: dessimoz@ebi.ac.uk
|
49
|
Punta M, Coggill PC, Eberhardt RY, Mistry J, Tate J, Boursnell C, Pang N, Forslund K, Ceric G, Clements J, Heger A, Holm L, Sonnhammer ELL, Eddy SR, Bateman A, Finn RD. The Pfam protein families database. Nucleic Acids Res 2011; 40:D290-301. [PMID: 22127870 PMCID: PMC3245129 DOI: 10.1093/nar/gkr1065] [Citation(s) in RCA: 2852] [Impact Index Per Article: 219.4] [Indexed: 12/28/2022]
Abstract
Pfam is a widely used database of protein families, currently containing more than 13,000 manually curated protein families as of release 26.0. Pfam is available via servers in the UK (http://pfam.sanger.ac.uk/), the USA (http://pfam.janelia.org/) and Sweden (http://pfam.sbc.su.se/). Here, we report on changes that have occurred since our 2010 NAR paper (release 24.0). Over the last 2 years, we have generated 1840 new families and increased coverage of the UniProt Knowledgebase (UniProtKB) to nearly 80%. Notably, we have taken the step of opening up the annotation of our families to the Wikipedia community, by linking Pfam families to relevant Wikipedia pages and encouraging the Pfam and Wikipedia communities to improve and expand those pages. We continue to improve the Pfam website and add new visualizations, such as the 'sunburst' representation of taxonomic distribution of families. In this work we additionally address two topics that will be of particular interest to the Pfam community. First, we explain the definition and use of family-specific, manually curated gathering thresholds. Second, we discuss some of the features of domains of unknown function (also known as DUFs), which constitute a rapidly growing class of families within Pfam.
Affiliation(s)
- Marco Punta
- Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton CB10 1SA, UK.
|
50
|
Abstract
FunCoup (http://FunCoup.sbc.su.se) is a database that maintains and visualizes global gene/protein networks of functional coupling that have been constructed by Bayesian integration of diverse high-throughput data. FunCoup achieves high coverage by orthology-based integration of data sources from different model organisms and from different platforms. We here present release 2.0, in which the data sources have been updated and the methodology has been refined. It contains a new data type, Genetic Interaction, and three new species: chicken, dog and zebrafish. As FunCoup extensively transfers functional coupling information between species, the new input datasets have considerably improved both coverage and quality of the networks. The number of high-confidence network links has increased dramatically. For instance, the human network has more than eight times as many links above confidence 0.5 as the previous release. FunCoup provides facilities for analysing the conservation of subnetworks in multiple species. We here explain how to do comparative interactomics on the FunCoup website.
Affiliation(s)
- Andrey Alexeyenko
- School of Biotechnology, Royal Institute of Technology, Science for Life Laboratory, Box 1031, SE-17121 Solna, Sweden
|