1
Deng C, Li HD, Zhang LS, Liu Y, Li Y, Wang J. Identifying new cancer genes based on the integration of annotated gene sets via hypergraph neural networks. Bioinformatics 2024; 40:i511-i520. [PMID: 38940121 PMCID: PMC11211849 DOI: 10.1093/bioinformatics/btae257] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/29/2024] Open
Abstract
MOTIVATION: Identifying cancer genes remains a significant challenge in cancer genomics research. Annotated gene sets encode functional associations among multiple genes, and cancer genes have been shown to cluster in hallmark signaling pathways and biological processes. The knowledge of annotated gene sets is critical for discovering cancer genes but remains to be fully exploited.
RESULTS: Here, we present the DIsease-Specific Hypergraph neural network (DISHyper), a hypergraph-based computational method that integrates the knowledge from multiple types of annotated gene sets to predict cancer genes. First, our benchmark results demonstrate that DISHyper outperforms the existing state-of-the-art methods and highlight the advantages of employing hypergraphs for representing annotated gene sets. Second, we validate the accuracy of DISHyper-predicted cancer genes using functional validation results and multiple independent functional genomics data. Third, our model predicts 44 novel cancer genes, and subsequent analysis shows their significant associations with multiple types of cancers. Overall, our study provides a new perspective for discovering cancer genes and reveals previously undiscovered cancer genes.
AVAILABILITY AND IMPLEMENTATION: DISHyper is freely available for download at https://github.com/genemine/DISHyper.
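The idea of representing annotated gene sets as a hypergraph can be sketched as follows: each gene is a node and each gene set is a hyperedge that may join any number of genes. This is a minimal illustration only, not the DISHyper implementation. The function names, the toy gene sets, and the symmetric normalization (a commonly used hypergraph-convolution operator) are assumptions introduced here.

```python
import numpy as np

def incidence_matrix(gene_sets):
    """Build a genes x gene-sets incidence matrix H (H[i, j] = 1 if gene i is in set j)."""
    genes = sorted({g for members in gene_sets.values() for g in members})
    index = {g: i for i, g in enumerate(genes)}
    H = np.zeros((len(genes), len(gene_sets)))
    for j, members in enumerate(gene_sets.values()):
        for g in members:
            H[index[g], j] = 1.0
    return genes, H

def hypergraph_operator(H):
    """Assumed normalization: Dv^-1/2 H De^-1 H^T Dv^-1/2.

    Requires every gene to be in at least one set and every set to be non-empty.
    """
    dv = H.sum(axis=1)  # node degrees: number of sets containing each gene
    de = H.sum(axis=0)  # hyperedge degrees: number of genes in each set
    dv_inv_sqrt = np.diag(1.0 / np.sqrt(dv))
    de_inv = np.diag(1.0 / de)
    return dv_inv_sqrt @ H @ de_inv @ H.T @ dv_inv_sqrt

# Toy gene sets, made up for illustration (not taken from the paper).
toy_sets = {"set_1": ["TP53", "MDM2"], "set_2": ["TP53", "KRAS", "BRAF"]}
genes, H = incidence_matrix(toy_sets)
A = hypergraph_operator(H)
```

Multiplying `A` by a gene feature matrix averages features over shared gene sets, which is the basic smoothing step a hypergraph neural network layer would build on.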
Affiliation(s)
- Chao Deng
  - School of Computer Science and Engineering, Central South University, Changsha, 410083, China
  - Hunan Provincial Key Lab on Bioinformatics, Central South University, Changsha, 410083, China
- Hong-Dong Li
  - School of Computer Science and Engineering, Central South University, Changsha, 410083, China
  - Hunan Provincial Key Lab on Bioinformatics, Central South University, Changsha, 410083, China
- Li-Shen Zhang
  - School of Computer Science and Engineering, Central South University, Changsha, 410083, China
  - Hunan Provincial Key Lab on Bioinformatics, Central South University, Changsha, 410083, China
- Yiwei Liu
  - School of Computer Science and Engineering, Central South University, Changsha, 410083, China
  - Hunan Provincial Key Lab on Bioinformatics, Central South University, Changsha, 410083, China
- Yaohang Li
  - Department of Computer Science, Old Dominion University, Norfolk, VA 23529-0001, United States
- Jianxin Wang
  - School of Computer Science and Engineering, Central South University, Changsha, 410083, China
  - Hunan Provincial Key Lab on Bioinformatics, Central South University, Changsha, 410083, China
2
Huang Y, Chen F, Sun H, Zhong C. Exploring gene-patient association to identify personalized cancer driver genes by linear neighborhood propagation. BMC Bioinformatics 2024; 25:34. [PMID: 38254011 PMCID: PMC10804660 DOI: 10.1186/s12859-024-05662-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/29/2023] [Accepted: 01/18/2024] [Indexed: 01/24/2024] Open
Abstract
BACKGROUND: Driver genes play a vital role in the development of cancer. Identifying driver genes is critical for diagnosing and understanding cancer. However, challenges remain in identifying personalized driver genes due to tumor heterogeneity. Although many computational methods have been developed to solve this problem, few efforts have been undertaken to explore gene-patient associations to identify personalized driver genes.
RESULTS: Here we propose a method called LPDriver to identify personalized cancer driver genes by applying a linear neighborhood propagation model to individual genetic data. LPDriver builds a personalized gene network based on the genetic data of individual patients, extracts the gene-patient associations from the bipartite graph of the personalized gene network, and uses a linear neighborhood propagation model to mine gene-patient associations to detect personalized driver genes. The experimental results demonstrate that, compared with existing methods, our method shows competitive performance and predicts cancer driver genes more accurately. Furthermore, these results also show that, besides revealing novel driver genes that have been reported to be related to cancer, LPDriver is also able to identify personalized cancer driver genes for individual patients by their network characteristics even if the mutation data of genes are hidden.
CONCLUSIONS: LPDriver can provide an effective approach to predict personalized cancer driver genes, which could promote the diagnosis and treatment of cancer. The source code and data are freely available at https://github.com/hyr0771/LPDriver.
Affiliation(s)
- Yiran Huang
  - School of Computer, Electronics and Information, Guangxi University, Nanning, 530004, China
  - Key Laboratory of Parallel, Distributed and Intelligent Computing in Guangxi Universities and Colleges, Guangxi University, Nanning, 530004, China
  - Guangxi Key Laboratory of Multimedia Communications and Network Technology, Guangxi University, Nanning, 530004, China
- Fuhao Chen
  - School of Computer, Electronics and Information, Guangxi University, Nanning, 530004, China
- Hongtao Sun
  - School of Computer, Electronics and Information, Guangxi University, Nanning, 530004, China
- Cheng Zhong
  - School of Computer, Electronics and Information, Guangxi University, Nanning, 530004, China
  - Key Laboratory of Parallel, Distributed and Intelligent Computing in Guangxi Universities and Colleges, Guangxi University, Nanning, 530004, China
  - Guangxi Key Laboratory of Multimedia Communications and Network Technology, Guangxi University, Nanning, 530004, China
3
Visonà G, Bouzigon E, Demenais F, Schweikert G. Network propagation for GWAS analysis: a practical guide to leveraging molecular networks for disease gene discovery. Brief Bioinform 2024; 25:bbae014. [PMID: 38340090 PMCID: PMC10858647 DOI: 10.1093/bib/bbae014] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/16/2023] [Revised: 12/28/2023] [Accepted: 01/08/2024] [Indexed: 02/12/2024] Open
Abstract
MOTIVATION: Genome-wide association studies (GWAS) have enabled large-scale analysis of the role of genetic variants in human disease. Despite impressive methodological advances, subsequent clinical interpretation and application remain challenging when GWAS suffer from a lack of statistical power. In recent years, however, the use of information diffusion algorithms with molecular networks has led to fruitful insights on disease genes.
RESULTS: We present an overview of the design choices and pitfalls that prove crucial in the application of network propagation methods to GWAS summary statistics. We highlight general trends from the literature, and present benchmark experiments to expand on these insights, selecting three diseases and five molecular networks as a case study. We verify that the use of gene-level scores based on GWAS P-values offers advantages over the selection of a set of 'seed' disease genes not weighted by the associated P-values if the GWAS summary statistics are of sufficient quality. Beyond that, the size and the density of the networks prove to be important factors for consideration. Finally, we explore several ensemble methods and show that combining multiple networks may improve the network propagation approach.
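The contrast between P-value-weighted gene scores and an unweighted seed set can be illustrated with a short sketch. It is not the authors' pipeline: the function name, the `-log10(P)` transform, the Bonferroni-style cutoff, and the toy P-values are all assumptions chosen for illustration.

```python
import math

def initial_scores(gene_pvalues, alpha=0.05):
    """Return (weighted, binary) starting vectors for network propagation.

    gene_pvalues maps gene name -> gene-level P-value (must be > 0).
    """
    # Assumption: a Bonferroni cutoff over the supplied genes defines the seed set.
    cutoff = alpha / len(gene_pvalues)
    # Weighted scores keep the strength of evidence for every gene.
    weighted = {g: -math.log10(p) for g, p in gene_pvalues.items()}
    # Binary seeds keep only membership in the significant set.
    binary = {g: 1.0 if p < cutoff else 0.0 for g, p in gene_pvalues.items()}
    return weighted, binary

# Toy gene-level P-values, made up for illustration.
toy = {"GENE_A": 1e-8, "GENE_B": 0.01, "GENE_C": 0.5}
weighted, binary = initial_scores(toy)
```

In this toy input, `GENE_A` and `GENE_B` receive the same binary seed value although their P-values differ by six orders of magnitude. That difference is the information the weighted scores retain.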
Affiliation(s)
- Giovanni Visonà
  - Empirical Inference, Max-Planck Institute for Intelligent Systems, Tübingen 72076, Germany
4
Boyd SS, Slawson C, Thompson JA. AMEND: active module identification using experimental data and network diffusion. BMC Bioinformatics 2023; 24:277. [PMID: 37415126 DOI: 10.1186/s12859-023-05376-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/18/2023] [Accepted: 06/02/2023] [Indexed: 07/08/2023] Open
Abstract
BACKGROUND: Molecular interaction networks have become an important tool in providing context to the results of various omics experiments. For example, by integrating transcriptomic data and protein-protein interaction (PPI) networks, one can better understand how the altered expression levels of several genes relate to one another. The challenge then becomes how to determine, in the context of the interaction network, the subset(s) of genes that best capture the main mechanisms underlying the experimental conditions. Different algorithms have been developed to address this challenge, each with specific biological questions in mind. One emerging area of interest is to determine which genes are equivalently or inversely changed between different experiments. The equivalent change index (ECI) is a recently proposed metric that measures the extent to which a gene is equivalently or inversely regulated between two experiments. The goal of this work is to develop an algorithm that makes use of the ECI and powerful network analysis techniques to identify a connected subset of genes that are highly relevant to the experimental conditions.
RESULTS: To address the above goal, we developed a method called Active Module identification using Experimental data and Network Diffusion (AMEND). The AMEND algorithm is designed to find a subset of connected genes in a PPI network that have large experimental values. It makes use of random walk with restart to create gene weights, and a heuristic solution to the Maximum-weight Connected Subgraph problem using these weights. This is performed iteratively until an optimal subnetwork (i.e., active module) is found. AMEND was compared to two current methods, NetCore and DOMINO, using two gene expression datasets.
CONCLUSION: The AMEND algorithm is an effective, fast, and easy-to-use method for identifying network-based active modules. It returned connected subnetworks with the largest median ECI by magnitude, capturing distinct but related functional groups of genes. Code is freely available at https://github.com/samboyd0/AMEND.
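Random walk with restart, the diffusion step named in the abstract, can be sketched in a few lines. This is a generic textbook formulation, not the AMEND code. The function name, the restart probability of 0.5, the column normalization, and the toy path graph are assumptions made for illustration. Seed values are assumed non-negative (for example, absolute experimental scores).

```python
import numpy as np

def random_walk_with_restart(adj, seeds, restart=0.5, tol=1e-10, max_iter=1000):
    """Iterate p <- (1 - r) * W p + r * p0 until the L1 change drops below tol."""
    adj = np.asarray(adj, dtype=float)
    col_sums = adj.sum(axis=0)
    col_sums[col_sums == 0] = 1.0      # avoid division by zero for isolated nodes
    W = adj / col_sums                 # column-normalized transition matrix
    p0 = np.asarray(seeds, dtype=float)
    p0 = p0 / p0.sum()                 # seed weights as a probability vector
    p = p0.copy()
    for _ in range(max_iter):
        p_next = (1.0 - restart) * (W @ p) + restart * p0
        if np.abs(p_next - p).sum() < tol:
            return p_next
        p = p_next
    return p

# Toy path graph 0 - 1 - 2 with all seed weight on node 0 (illustrative only).
toy_adj = [[0, 1, 0],
           [1, 0, 1],
           [0, 1, 0]]
scores = random_walk_with_restart(toy_adj, seeds=[1.0, 0.0, 0.0])
```

The resulting scores decay with distance from the seed node. A module-finding step such as the Maximum-weight Connected Subgraph heuristic described above would then operate on weights of this kind.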
Affiliation(s)
- Samuel S Boyd
  - Department of Biostatistics and Data Science, University of Kansas Medical Center, 3901 Rainbow Blvd., Kansas City, KS, 66103, USA
  - University of Kansas Cancer Center, Kansas City, KS, USA
- Chad Slawson
  - Department of Biochemistry, University of Kansas Medical Center, 3901 Rainbow Blvd., Kansas City, KS, 66103, USA
  - University of Kansas Cancer Center, Kansas City, KS, USA
  - University of Kansas Alzheimer's Disease Research Center, Fairway, KS, USA
- Jeffrey A Thompson
  - Department of Biostatistics and Data Science, University of Kansas Medical Center, 3901 Rainbow Blvd., Kansas City, KS, 66103, USA
  - University of Kansas Cancer Center, Kansas City, KS, USA
5
Heer M, Giudice L, Mengoni C, Giugno R, Rico D. Esearch3D: propagating gene expression in chromatin networks to illuminate active enhancers. Nucleic Acids Res 2023; 51:e55. [PMID: 37021559 PMCID: PMC10250221 DOI: 10.1093/nar/gkad229] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/05/2022] [Revised: 03/06/2023] [Accepted: 04/03/2023] [Indexed: 04/07/2023] Open
Abstract
Most cell type-specific genes are regulated by the interaction of enhancers with their promoters. The identification of enhancers is not trivial as enhancers are diverse in their characteristics and dynamic in their interaction partners. We present Esearch3D, a new method that exploits network theory approaches to identify active enhancers. Our work is based on the fact that enhancers act as a source of regulatory information to increase the rate of transcription of their target genes and that the flow of this information is mediated by the folding of chromatin in the three-dimensional (3D) nuclear space between the enhancer and the target gene promoter. Esearch3D reverse engineers this flow of information to calculate the likelihood of enhancer activity in intergenic regions by propagating the transcription levels of genes across 3D genome networks. Regions predicted to have high enhancer activity are shown to be enriched in annotations indicative of enhancer activity. These include: enhancer-associated histone marks, bidirectional CAGE-seq, STARR-seq, P300, RNA polymerase II and expression quantitative trait loci (eQTLs). Esearch3D leverages the relationship between chromatin architecture and transcription, allowing the prediction of active enhancers and an understanding of the complex underpinnings of regulatory networks. The method is available at: https://github.com/InfOmics/Esearch3D and https://doi.org/10.5281/zenodo.7737123.
Affiliation(s)
- Maninder Heer
  - Biosciences Institute, Faculty of Medical Sciences, Newcastle University, Newcastle upon Tyne, UK
- Luca Giudice
  - Department of Computer Science, University of Verona, Strada le Grazie 15, 37134, Verona, Italy
  - A.I. Virtanen Institute for Molecular Sciences, University of Eastern Finland, Kuopio, Finland
- Claudia Mengoni
  - Department of Computer Science, University of Verona, Strada le Grazie 15, 37134, Verona, Italy
- Rosalba Giugno
  - Department of Computer Science, University of Verona, Strada le Grazie 15, 37134, Verona, Italy
- Daniel Rico
  - Biosciences Institute, Faculty of Medical Sciences, Newcastle University, Newcastle upon Tyne, UK
6
Bouabid C, Rabhi S, Thedinga K, Barel G, Tnani H, Rabhi I, Benkahla A, Herwig R, Guizani-Tabbane L. Host M-CSF induced gene expression drives changes in susceptible and resistant mice-derived BMdMs upon Leishmania major infection. Front Immunol 2023; 14:1111072. [PMID: 37187743 PMCID: PMC10175952 DOI: 10.3389/fimmu.2023.1111072] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 11/29/2022] [Accepted: 04/11/2023] [Indexed: 05/17/2023] Open
Abstract
Leishmaniases are a group of diseases with different clinical manifestations. Macrophage-Leishmania interactions are central to the course of the infection. The outcome of the disease depends not only on the pathogenicity and virulence of the parasite, but also on the activation state, the genetic background, and the underlying complex interaction networks operative in the host macrophages. Mouse models, with mouse strains showing contrasting responses to parasite infection, have been very helpful in exploring the mechanisms underlying differences in disease progression. Here we analyzed previously generated dynamic transcriptome data obtained from Leishmania major (L. major) infected bone marrow derived macrophages (BMdMs) from resistant and susceptible mice. We first identified differentially expressed genes (DEGs) between the M-CSF differentiated macrophages derived from the two hosts, and found a differential basal transcriptome profile independent of Leishmania infection. These host signatures, in which 75% of the genes are directly or indirectly related to the immune system, may account for the differences in the immune response to infection between the two strains. To gain further insights into the underlying biological processes induced by L. major infection driven by the M-CSF DEGs, we mapped the time-resolved expression profiles onto a large protein-protein interaction (PPI) network and performed network propagation to identify modules of interacting proteins that agglomerate infection response signals for each strain. This analysis revealed profound differences in the resulting response networks related to immune signaling and metabolism that were validated by qRT-PCR time series experiments, leading to plausible and provable hypotheses for the differences in disease pathophysiology. In summary, we demonstrate that the host's gene expression background determines to a large degree its response to L. major infection, and that gene expression analysis combined with network propagation is an effective approach to help identify dynamically altered mouse strain-specific networks that hold mechanistic information about these contrasting responses to infection.
Affiliation(s)
- Cyrine Bouabid
  - Laboratory of Medical Parasitology, Biotechnology and Biomolecules (PMBB), Institut Pasteur de Tunis, Tunis, Tunisia
  - Faculty of Sciences of Tunis, Université de Tunis El Manar, Tunis, Tunisia
- Sameh Rabhi
  - Laboratory of Medical Parasitology, Biotechnology and Biomolecules (PMBB), Institut Pasteur de Tunis, Tunis, Tunisia
- Kristina Thedinga
  - Department Computational Molecular Biology, Max Planck Institute for Molecular Genetics, Berlin, Germany
- Gal Barel
  - Department Computational Molecular Biology, Max Planck Institute for Molecular Genetics, Berlin, Germany
- Hedia Tnani
  - Laboratory de BioInformatic, BioMathematic and BioStatistic (BIMS), Institut Pasteur de Tunis, Tunis, Tunisia
- Imen Rabhi
  - Laboratory of Medical Parasitology, Biotechnology and Biomolecules (PMBB), Institut Pasteur de Tunis, Tunis, Tunisia
  - Higher Institute of Biotechnology at Sidi-Thabet (ISBST), Biotechnopole Sidi-Thabet- University of Manouba, Sidi-Thabet, Tunisia
- Alia Benkahla
  - Laboratory de BioInformatic, BioMathematic and BioStatistic (BIMS), Institut Pasteur de Tunis, Tunis, Tunisia
- Ralf Herwig
  - Department Computational Molecular Biology, Max Planck Institute for Molecular Genetics, Berlin, Germany
- Lamia Guizani-Tabbane
  - Laboratory of Medical Parasitology, Biotechnology and Biomolecules (PMBB), Institut Pasteur de Tunis, Tunis, Tunisia
  - *Correspondence: Lamia Guizani-Tabbane,
7
Hoehe MR, Herwig R. Analysis of 1276 Haplotype-Resolved Genomes Allows Characterization of Cis- and Trans-Abundant Genes. Methods Mol Biol 2023; 2590:237-272. [PMID: 36335503 DOI: 10.1007/978-1-0716-2819-5_15] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/16/2023]
Abstract
Many methods for haplotyping have materialized, but their application on a significant scale has been rare to date. Here we summarize analyses that were carried out in 1092 genomes from the 1000 Genomes Consortium and validated in an unprecedented number of 184 PGP genomes that have been experimentally haplotype-resolved by application of the Long-Fragment Read (LFR) technology. These analyses provided first insights into the diplotypic nature of human genomes and its potential functional implications. Thus, protein-changing variants were not randomly distributed between the two homologues of 18,121 autosomal protein-coding genes but occurred significantly more frequently in cis than in trans configurations in virtually each of the 1276 phased genomes. This resulted in global cis/trans ratios of ~60:40, establishing "cis abundance" as a universal characteristic of diploid human genomes. This phenomenon was based on two different classes of genes, a larger one exhibiting cis configurations of protein-changing variants in excess, so-called "cis-abundant" genes, and a smaller one of "trans-abundant" genes. These two gene classes, which together constitute a common diplotypic exome, were further functionally distinguished by means of gene ontology (GO) and pathway enrichment analysis. Moreover, they were distinguishable in terms of their effects on the human interactome, where they constitute distinct cis and trans modules, as shown with network propagation on a large integrated protein-protein interaction network. These analyses, recently performed with updated database and analysis tools, further consolidated the characterization of cis- and trans-abundant genes while expanding previous results. In this chapter, we present the key results along with the materials and methods to motivate readers to investigate these findings independently and gain further insights into the diplotypic nature of genes and genomes.
Affiliation(s)
- Margret R Hoehe
  - Department of Computational Molecular Biology, Max Planck Institute for Molecular Genetics, Berlin, Germany
- Ralf Herwig
  - Department of Computational Molecular Biology, Max Planck Institute for Molecular Genetics, Berlin, Germany
8
Atlas of interactions between SARS-CoV-2 macromolecules and host proteins. Cell Insight 2022; 2:100068. [PMID: 37192911 PMCID: PMC9670597 DOI: 10.1016/j.cellin.2022.100068] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/28/2022] [Revised: 10/30/2022] [Accepted: 11/04/2022] [Indexed: 11/18/2022]
Abstract
The proteins and RNAs of viruses extensively interact with host proteins after infection. We collected and reanalyzed all available datasets of protein-protein and RNA-protein interactions related to SARS-CoV-2. We investigated the reproducibility of those interactions and applied strict filters to identify high-confidence interactions. We systematically analyzed the interaction network and identified preferred subcellular localizations of viral proteins, some of which, such as ORF8 in the ER and ORF7A/B in the ER membrane, were validated using dual fluorescence imaging. Moreover, we showed that viral proteins frequently interact with host machinery related to protein processing in the ER and vesicle-associated processes. Integrating the protein and RNA interactomes, we found that SARS-CoV-2 RNA and its N protein closely interacted with stress granules including 40 core factors, of which we specifically validated G3BP1, IGF2BP1, and MOV10 using RIP and Co-IP assays. Combining CRISPR screening results, we further identified 86 antiviral and 62 proviral factors and associated drugs. Using network diffusion, we found 44 additional interacting proteins, including two previously validated proviral factors. Furthermore, we showed that this atlas could be applied to identify the complications associated with COVID-19. All data are available in the AIMaP database (https://mvip.whu.edu.cn/aimap/) for users to easily explore the interaction map.
9
Network-Based Approaches for Disease-Gene Association Prediction Using Protein-Protein Interaction Networks. Int J Mol Sci 2022; 23:ijms23137411. [PMID: 35806415 PMCID: PMC9266751 DOI: 10.3390/ijms23137411] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Received: 05/09/2022] [Revised: 06/25/2022] [Accepted: 06/30/2022] [Indexed: 01/02/2023] Open
Abstract
Genome-wide association studies (GWAS) can be used to infer genome intervals that are involved in genetic diseases. However, investigating a large number of putative mutations for GWAS is resource- and time-intensive. Network-based computational approaches are being used for efficient disease-gene association prediction. Network-based methods rest on the assumption that the genes causing the same diseases are located close to each other in a molecular network, such as a protein-protein interaction (PPI) network. In this survey, we provide an overview of network-based disease-gene association prediction methods in three categories: graph-theoretic algorithms, machine learning algorithms, and an integration of these two. We experimented with six selected methods to compare their prediction performance using a heterogeneous network constructed by combining a genome-wide weighted PPI network, an ontology-based disease network, and disease-gene associations. The experiment was conducted in two different settings according to the presence and absence of known disease-associated genes. The results revealed that HerGePred, an integrative method, performed best in the presence of known disease-associated genes, whereas PRINCE, which adopts a network propagation algorithm, was the most competitive in the absence of known disease-associated genes. Overall, the results demonstrated that the integrative methods performed better than the methods using graph theory only, and the methods using a heterogeneous network performed better than those using a homogeneous PPI network only.
10
Thedinga K, Herwig R. Gradient tree boosting and network propagation for the identification of pan-cancer survival networks. STAR Protoc 2022; 3:101353. [PMID: 35509973 PMCID: PMC9059156 DOI: 10.1016/j.xpro.2022.101353] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/09/2022] Open
Abstract
Cancer survival prediction is typically done with uninterpretable machine learning techniques, e.g., gradient tree boosting. Therefore, additional steps are needed to infer biological plausibility of the predictions. Here, we describe a protocol that combines pan-cancer survival prediction with XGBoost tree-ensemble learning and subsequent propagation of the learned feature weights on protein interaction networks. This protocol is based on TCGA transcriptome data of 8,024 patients from 25 cancer types but can easily be adapted to cancer patient data from other sources. For complete details on the use and execution of this protocol, please refer to Thedinga and Herwig (2022).
Highlights
- Efficient pan-cancer survival prediction with XGBoost
- Network propagation with NetCore improves biological plausibility of features
- Combined approach identifies pan-cancer survival networks
11
Bernett J, Krupke D, Sadegh S, Baumbach J, Fekete SP, Kacprowski T, List M, Blumenthal DB. Robust disease module mining via enumeration of diverse prize-collecting Steiner trees. Bioinformatics 2022; 38:1600-1606. [PMID: 34984440 DOI: 10.1093/bioinformatics/btab876] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 10/06/2021] [Revised: 11/29/2021] [Accepted: 12/31/2021] [Indexed: 02/03/2023] Open
Abstract
MOTIVATION: Disease module mining methods (DMMMs) extract subgraphs that constitute candidate disease mechanisms from molecular interaction networks such as protein-protein interaction (PPI) networks. Irrespective of the employed models, DMMMs typically include non-robust steps in their workflows, i.e. the computed subnetworks vary when running the DMMMs multiple times on equivalent input. This lack of robustness has a negative effect on the trustworthiness of the obtained subnetworks and is hence detrimental for the widespread adoption of DMMMs in the biomedical sciences.
RESULTS: To overcome this problem, we present a new DMMM called ROBUST (robust disease module mining via enumeration of diverse prize-collecting Steiner trees). In a large-scale empirical evaluation, we show that ROBUST outperforms competing methods in terms of robustness, scalability and, in most settings, functional relevance of the produced modules, measured via KEGG (Kyoto Encyclopedia of Genes and Genomes) gene set enrichment scores and overlap with DisGeNET disease genes.
AVAILABILITY AND IMPLEMENTATION: A Python 3 implementation and scripts to reproduce the results reported in this article are available on GitHub: https://github.com/bionetslab/robust, https://github.com/bionetslab/robust-eval.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Judith Bernett
  - Chair of Experimental Bioinformatics, TUM School of Life Sciences, Technical University of Munich, 85354 Freising, Germany
- Dominik Krupke
  - Department of Computer Science, TU Braunschweig, 38106 Braunschweig, Germany
- Sepideh Sadegh
  - Chair of Experimental Bioinformatics, TUM School of Life Sciences, Technical University of Munich, 85354 Freising, Germany
  - Institute for Computational Systems Biology, University of Hamburg, 22607 Hamburg, Germany
- Jan Baumbach
  - Institute for Computational Systems Biology, University of Hamburg, 22607 Hamburg, Germany
  - Department of Mathematics and Computer Science, University of Southern Denmark, 5230 Odense, Denmark
- Sándor P Fekete
  - Department of Computer Science, TU Braunschweig, 38106 Braunschweig, Germany
  - Braunschweig Integrated Centre of Systems Biology (BRICS), 38106 Braunschweig, Germany
- Tim Kacprowski
  - Braunschweig Integrated Centre of Systems Biology (BRICS), 38106 Braunschweig, Germany
  - Division Data Science in Biomedicine, Peter L. Reichertz Institute for Medical Informatics, Technical University of Braunschweig and Hannover Medical School, 38106 Braunschweig, Germany
- Markus List
  - Chair of Experimental Bioinformatics, TUM School of Life Sciences, Technical University of Munich, 85354 Freising, Germany
- David B Blumenthal
  - Department Artificial Intelligence in Biomedical Engineering (AIBE), Friedrich-Alexander University Erlangen-Nürnberg (FAU), 91052 Erlangen, Germany
12
Liu S, Lu Y, Geng D. Molecular Subgroup Classification in Alzheimer's Disease by Transcriptomic Profiles. J Mol Neurosci 2022; 72:866-879. [PMID: 35080766 DOI: 10.1007/s12031-021-01957-w] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 09/03/2021] [Accepted: 12/08/2021] [Indexed: 12/19/2022]
Abstract
Alzheimer's disease (AD) is a progressive cognitive disorder that occurs worldwide, and the lack of disease-modifying targets and pathways is a pressing issue. This study aimed to provide new targets and pathways by performing molecular subgroup classification. After normalizing the collected data, the subgroup number was confirmed with consensus clustering. Comparisons of clinical features among subgroups were conducted to clarify the clinical traits of each subgroup. Subgroup-specific genes were identified and used for weighted gene coexpression network analysis (WGCNA). Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) analyses were carried out. Next, gene set enrichment analysis (GSEA) was performed. Protein-protein interaction networks were built to screen core genes in each subgroup, and Spearman correlation analysis was performed between these genes and clinical traits. Sequencing profiles of 1068 AD samples collected from 2 datasets were classified into 3 subgroups. Clinical comparisons revealed that patients in subgroup III tended to be younger, while their pathological grades were the most severe. WGCNA detected four gene modules, and the turquoise module, where the dopaminergic synapse pathway was enriched, was related to subgroup I. The neurotrophin signaling pathway and TGF-beta signaling pathway were robustly enriched in the blue and brown modules, respectively, in subgroup III. Moreover, 3 hub genes in subgroup I were negatively correlated with the sum of neurofibrillary tangle (Nft) density. Conversely, hub genes in subgroups II and III exhibited positive correlations with the sum of Nft density. These results provide new pathways and targets for AD treatment.
Affiliation(s)
- Sha Liu
- Department of Neurology, Affiliated Hospital of Xuzhou Medical University, West Huaihai Road 99, Xuzhou, 221002, Jiangsu, China
- Yan Lu
- Department of Neurology, The Municipal Hospital, Xuzhou Medical University, Xuzhou, 221116, Jiangsu, China
- Deqin Geng
- Department of Neurology, Affiliated Hospital of Xuzhou Medical University, West Huaihai Road 99, Xuzhou, 221002, Jiangsu, China
|
13
|
A gradient tree boosting and network propagation derived pan-cancer survival network of the tumor microenvironment. iScience 2022; 25:103617. [PMID: 35106465 PMCID: PMC8786644 DOI: 10.1016/j.isci.2021.103617] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 08/09/2021] [Revised: 11/12/2021] [Accepted: 12/09/2021] [Indexed: 12/22/2022] Open
Abstract
Predicting cancer survival from molecular data is an important aspect of biomedical research because it allows quantifying patient risks and thus individualizing therapy. We introduce XGBoost tree ensemble learning to predict survival from transcriptome data of 8,024 patients from 25 different cancer types and show highly competitive performance with state-of-the-art methods. To further improve plausibility of the machine learning approach we conducted two additional steps. In the first step, we applied pan-cancer training and showed that it substantially improves prognosis compared with cancer subtype-specific training. In the second step, we applied network propagation and inferred a pan-cancer survival network consisting of 103 genes. This network highlights cross-cohort features and is predictive for the tumor microenvironment and immune status of the patients. Our work demonstrates that pan-cancer learning combined with network propagation generalizes over multiple cancer types and identifies biologically plausible features that can serve as biomarkers for monitoring cancer survival.
Highlights
- Highly performing cancer survival prediction with XGBoost
- Pan-cancer training outperforms single-cohort training
- Combined approach consisting of machine learning and network propagation
- Tumor microenvironment is most strongly involved in cancer survival prediction
|
14
|
Kamburov A, Herwig R. ConsensusPathDB 2022: molecular interactions update as a resource for network biology. Nucleic Acids Res 2021; 50:D587-D595. [PMID: 34850110 PMCID: PMC8728246 DOI: 10.1093/nar/gkab1128] [Citation(s) in RCA: 31] [Impact Index Per Article: 10.3] [Received: 09/15/2021] [Revised: 10/21/2021] [Accepted: 11/04/2021] [Indexed: 01/01/2023] Open
Abstract
Molecular interactions are key drivers of biological function. Providing interaction resources to the research community is important since they allow functional interpretation and network-based analysis of molecular data. ConsensusPathDB (http://consensuspathdb.org) is a meta-database combining interactions of diverse types from 31 public resources for humans, 16 for mice and 14 for yeasts. Using ConsensusPathDB, researchers commonly evaluate lists of genes, proteins and metabolites against sets of molecular interactions defined by pathways, Gene Ontology and network neighborhoods and retrieve complex molecular neighborhoods formed by heterogeneous interaction types. Furthermore, the integrated protein–protein interaction network is used as a basis for propagation methods. Here, we present the 2022 update of ConsensusPathDB, highlighting content growth, additional functionality and improved database stability. For example, the number of human molecular interactions increased to 859 848 connecting 200 499 unique physical entities such as genes/proteins, metabolites and drugs. Furthermore, we integrated regulatory datasets in the form of transcription factor–, microRNA– and enhancer–gene target interactions, thus providing novel functionality in the context of overrepresentation and enrichment analyses. We specifically emphasize the use of the integrated protein–protein interaction network as a scaffold for network inferences, present topological characteristics of the network and discuss strengths and shortcomings of such approaches.
Affiliation(s)
- Atanas Kamburov
- R&D Digital Technologies Department, Bayer AG, Berlin 13353, Germany
- Ralf Herwig
- Department of Computational Molecular Biology, Max Planck Institute for Molecular Genetics, Berlin 14195, Germany
|
15
|
Charmpi K, Chokkalingam M, Johnen R, Beyer A. Optimizing network propagation for multi-omics data integration. PLoS Comput Biol 2021; 17:e1009161. [PMID: 34762640 PMCID: PMC8664198 DOI: 10.1371/journal.pcbi.1009161] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 05/30/2021] [Revised: 12/10/2021] [Accepted: 10/12/2021] [Indexed: 01/11/2023] Open
Abstract
Network propagation refers to a class of algorithms that integrate information from input data across connected nodes in a given network. These algorithms have wide applications in systems biology, protein function prediction, inferring condition-specifically altered sub-networks, and prioritizing disease genes. Despite the popularity of network propagation, there is a lack of comparative analyses of different algorithms on real data and little guidance on how to select and parameterize the various algorithms. Here, we address this problem by analyzing different combinations of network normalization and propagation methods and by demonstrating schemes for the identification of optimal parameter settings on real proteome and transcriptome data. Our work highlights the risk of a ‘topology bias’ caused by the incorrect use of network normalization approaches. Capitalizing on the fact that network propagation is a regularization approach, we show that minimizing the bias-variance tradeoff can be utilized for selecting optimal parameters. The application to real multi-omics data demonstrated that optimal parameters could also be obtained by either maximizing the agreement between different omics layers (e.g. proteome and transcriptome) or by maximizing the consistency between biological replicates. Furthermore, we exemplified the utility and robustness of network propagation on multi-omics datasets for identifying ageing-associated genes in brain and liver tissues of rats and for elucidating molecular mechanisms underlying prostate cancer progression. Overall, this work compares different network propagation approaches and it presents strategies for how to use network propagation algorithms to optimally address a specific research question at hand.
Author summary
Modern technologies enable the simultaneous measurement of tens of thousands of molecules in biological samples. Algorithms called network propagation or network smoothing are frequently used to integrate such data with already known molecular interaction data, such as protein and gene interaction networks. These methods distribute the information on molecular perturbations within the network and help identify network regions that are enriched for many perturbed (affected) molecules. Despite the popularity of these methods, there is a lack of guidance on how to optimally use them. Here, we highlight possible pitfalls when using incorrect network normalization methods. Further, we present different ways for optimizing the smoothing parameters used during network smoothing: the first approach maximizes the consistency between replicate measurements within a dataset; the second one maximizes the consistency between different types of ‘omics’ measurements, such as proteomics and transcriptomics. Using two multi-omics datasets, one from a cohort of prostate cancer patients, the other one from an ageing study on rat brain and liver tissues, we exemplify the effects of these strategies on real data.
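As an illustration of the kind of algorithm this abstract refers to, the sketch below implements one common variant of network propagation, a random walk with restart on an undirected graph. It is a minimal example under stated assumptions and is not taken from the paper: the function name `propagate`, the symmetric degree normalization, the default `alpha`, and the toy graph are all hypothetical choices made for illustration.

```python
import math

def propagate(adj, scores, alpha=0.5, tol=1e-9, max_iter=1000):
    """Random walk with restart on an undirected graph (illustrative only).

    adj    : dict mapping node -> set of neighbour nodes (must be symmetric)
    scores : dict mapping node -> initial score (missing nodes count as 0)
    alpha  : smoothing parameter in [0, 1); larger values spread scores further
    """
    nodes = sorted(adj)
    deg = {u: len(adj[u]) for u in nodes}
    f0 = {u: float(scores.get(u, 0.0)) for u in nodes}
    f = dict(f0)
    for _ in range(max_iter):
        new = {}
        for u in nodes:
            # Assumed normalisation: each edge u-v is weighted by
            # 1 / sqrt(deg(u) * deg(v)) (symmetric degree normalisation).
            spread = sum(f[v] / math.sqrt(deg[u] * deg[v]) for v in adj[u])
            # Mix the propagated signal with the original input scores.
            new[u] = alpha * spread + (1.0 - alpha) * f0[u]
        delta = max(abs(new[u] - f[u]) for u in nodes)
        f = new
        if delta < tol:  # stop once the scores no longer change
            break
    return f

# Toy path graph a - b - c - d with a single perturbed node "a".
toy_graph = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
smoothed = propagate(toy_graph, {"a": 1.0}, alpha=0.5)
```

On the toy path graph the smoothed score decays with distance from the perturbed node. Setting `alpha=0.0` returns the input scores unchanged, which shows the role of the smoothing parameter that the abstract proposes to tune.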
Affiliation(s)
- Konstantina Charmpi
- CECAD Cologne Excellence Cluster on Cellular Stress Responses in Aging Associated Diseases, Cologne, Germany
- Manopriya Chokkalingam
- CECAD Cologne Excellence Cluster on Cellular Stress Responses in Aging Associated Diseases, Cologne, Germany
- Ronja Johnen
- CECAD Cologne Excellence Cluster on Cellular Stress Responses in Aging Associated Diseases, Cologne, Germany
- Andreas Beyer
- CECAD Cologne Excellence Cluster on Cellular Stress Responses in Aging Associated Diseases, Cologne, Germany
- Center for Molecular Medicine Cologne (CMMC), Medical Faculty, University of Cologne, Cologne, Germany
- Institute for Genetics, Faculty of Mathematics and Natural Sciences, University of Cologne, Cologne, Germany
|
16
|
Lazareva O, Baumbach J, List M, Blumenthal DB. On the limits of active module identification. Brief Bioinform 2021; 22:6189770. [PMID: 33782690 DOI: 10.1093/bib/bbab066] [Citation(s) in RCA: 18] [Impact Index Per Article: 6.0] [Received: 01/11/2021] [Revised: 01/29/2021] [Indexed: 12/12/2022] Open
Abstract
In network and systems medicine, active module identification methods (AMIMs) are widely used for discovering candidate molecular disease mechanisms. To this end, AMIMs combine network analysis algorithms with molecular profiling data, most commonly, by projecting gene expression data onto generic protein-protein interaction (PPI) networks. Although active module identification has led to various novel insights into complex diseases, there is increasing awareness in the field that the combination of gene expression data and PPI network is problematic because up-to-date PPI networks have a very small diameter and are subject to both technical and literature bias. In this paper, we report the results of an extensive study where we analyzed for the first time whether widely used AMIMs really benefit from using PPI networks. Our results clearly show that, except for the recently proposed AMIM DOMINO, the tested AMIMs do not produce biologically more meaningful candidate disease modules on widely used PPI networks than on random networks with the same node degrees. AMIMs hence mainly learn from the node degrees and mostly fail to exploit the biological knowledge encoded in the edges of the PPI networks. This has far-reaching consequences for the field of active module identification. In particular, we suggest that novel algorithms are needed which overcome the degree bias of most existing AMIMs and/or work with customized, context-specific networks instead of generic PPI networks.
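The comparison against "random networks with the same node degrees" can be pictured with a degree-preserving rewiring. The sketch below uses repeated double edge swaps, a standard way to randomize a simple undirected graph while keeping every node's degree fixed. It is an assumption for illustration only: the abstract does not state which generator the authors used, and the function name `degree_preserving_rewire`, the gene labels, and the swap count are hypothetical.

```python
import random
from collections import Counter

def degree_preserving_rewire(edges, n_swaps, seed=0):
    """Randomise a simple undirected graph by double edge swaps.

    edges   : iterable of (u, v) pairs, no self-loops or duplicate edges
    n_swaps : number of successful swaps to attempt to reach
    """
    rng = random.Random(seed)
    edge_list = [tuple(sorted(e)) for e in edges]
    edge_set = set(edge_list)
    done, attempts = 0, 0
    # Bound the number of attempts so small graphs cannot loop forever.
    while done < n_swaps and attempts < 100 * n_swaps:
        attempts += 1
        i, j = rng.sample(range(len(edge_list)), 2)
        (a, b), (c, d) = edge_list[i], edge_list[j]
        if rng.random() < 0.5:  # pick one of the two possible pairings
            c, d = d, c
        # Proposed swap: replace a-b and c-d with a-d and c-b.
        if a == d or c == b:
            continue  # would create a self-loop
        e1, e2 = tuple(sorted((a, d))), tuple(sorted((c, b)))
        if e1 in edge_set or e2 in edge_set:
            continue  # would create a duplicate edge
        edge_set -= {edge_list[i], edge_list[j]}
        edge_set |= {e1, e2}
        edge_list[i], edge_list[j] = e1, e2
        done += 1
    return edge_list

def degrees(edge_list):
    """Count how many edges touch each node."""
    return Counter(node for edge in edge_list for node in edge)

# Hypothetical six-gene ring used only to demonstrate the invariant.
ring = [("g1", "g2"), ("g2", "g3"), ("g3", "g4"),
        ("g4", "g5"), ("g5", "g6"), ("g6", "g1")]
rewired = degree_preserving_rewire(ring, n_swaps=10, seed=1)
```

Every node keeps its original degree after rewiring, so any method whose output is unchanged on the rewired network is, in the abstract's terms, learning from node degrees and not from the specific edges.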
Affiliation(s)
- Olga Lazareva
- Chair of Experimental Bioinformatics, Technical University of Munich, Freising, Germany
- Jan Baumbach
- Chair of Experimental Bioinformatics, Technical University of Munich, Freising, Germany
- Chair of Computational Systems Biology, University of Hamburg, Hamburg, Germany
- Department of Mathematics and Computer Science, University of Southern Denmark, Odense, Denmark
- Markus List
- Chair of Experimental Bioinformatics, Technical University of Munich, Freising, Germany
- David B Blumenthal
- Chair of Experimental Bioinformatics, Technical University of Munich, Freising, Germany
|