1
Cantor E, Guauque-Olarte S, León R, Chabert S, Salas R. Knowledge-slanted random forest method for high-dimensional data and small sample size with a feature selection application for gene expression data. BioData Min 2024; 17:34. [PMID: 39256872 PMCID: PMC11389072 DOI: 10.1186/s13040-024-00388-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/29/2024] [Accepted: 09/02/2024] [Indexed: 09/12/2024] Open
Abstract
The use of prior knowledge in the machine learning framework has been considered a potential tool to handle the curse of dimensionality in genetic and genomics data. Although random forest (RF) represents a flexible non-parametric approach with several advantages, it can provide poor accuracy in high-dimensional settings, mainly in scenarios with small sample sizes. We propose a knowledge-slanted RF that integrates biological networks as prior knowledge into the model to improve its performance and explainability, exemplifying its use for selecting and identifying relevant genes. Knowledge-slanted RF combines two stages. First, prior knowledge represented by graphs is translated by running a random walk with restart algorithm to determine the relevance of each gene based on its connections and location on a protein-protein interaction network. Then, each relevance is used to modify the probability of drawing a gene as a candidate split-feature in the conventional RF. Experiments on simulated datasets with very small sample sizes (n ≤ 30), comparing knowledge-slanted RF against conventional RF and logistic lasso regression, suggest improved precision in outcome prediction. The knowledge-slanted RF was completed with the introduction of a modified version of the Boruta feature selection algorithm. Finally, knowledge-slanted RF identified more biologically relevant genes, offering a higher level of explainability for users than conventional RF. These findings were corroborated in one real case study identifying genes relevant to calcific aortic valve stenosis.
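The two stages described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy network, the restart probability of 0.3, and the small uniform mixing term `eps` (added here so that genes unreachable from the seeds keep a non-zero chance of being drawn) are all assumptions made for illustration.

```python
import numpy as np

def random_walk_with_restart(adj, seeds, restart=0.3, tol=1e-10, max_iter=1000):
    """Stationary visiting probabilities of a random walk with restart."""
    adj = np.asarray(adj, dtype=float)
    col_sums = adj.sum(axis=0)
    col_sums[col_sums == 0] = 1.0            # isolated nodes: avoid division by zero
    W = adj / col_sums                       # column-normalized transition matrix
    p0 = np.asarray(seeds, dtype=float)
    p0 = p0 / p0.sum()                       # restart distribution over seed genes
    p = p0.copy()
    for _ in range(max_iter):
        p_next = (1 - restart) * (W @ p) + restart * p0
        if np.abs(p_next - p).sum() < tol:
            return p_next
        p = p_next
    return p

# Hypothetical 5-gene network: a path 0-1-2-3, with gene 4 isolated.
adj = np.array([
    [0, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
])
seeds = [1, 0, 0, 0, 0]                      # gene 0 is the only seed (assumption)
relevance = random_walk_with_restart(adj, seeds)

# Stage 2: turn relevance into feature-sampling probabilities.
eps = 0.05                                   # assumed uniform mixing weight
probs = (1 - eps) * relevance / relevance.sum() + eps / len(relevance)

# Draw 2 candidate split-features for one tree node, weighted by relevance.
rng = np.random.default_rng(0)
candidates = rng.choice(len(probs), size=2, replace=False, p=probs)
```

In a conventional RF the candidate features at each split are drawn uniformly; here the draw is biased toward genes that sit close to the seed in the network.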
Affiliation(s)
- Erika Cantor
  - Department of clinical epidemiology and biostatistics, Pontificia Universidad Javeriana, Bogotá, 110221, Colombia
- Sandra Guauque-Olarte
  - Department of basic sciences and oral medicine, Universidad Nacional de Colombia, Bogotá, 16486, Colombia
- Roberto León
  - Department of Computer Science, Universidad Técnica Federico Santa María, Santiago de Chile, 8940897, Chile
- Steren Chabert
  - School of Biomedical Engineering, Universidad de Valparaiso, Valparaíso, 2360102, Chile
  - Millennium Science Initiative Intelligent Healthcare Engineering, Santiago de Chile, 7820436, Chile
  - Center of Interdisciplinary Biomedical and Engineering Research for Health - MEDING, Universidad de Valparaiso, Valparaíso, 2360102, Chile
- Rodrigo Salas
  - School of Biomedical Engineering, Universidad de Valparaiso, Valparaíso, 2360102, Chile
  - Millennium Science Initiative Intelligent Healthcare Engineering, Santiago de Chile, 7820436, Chile
  - Center of Interdisciplinary Biomedical and Engineering Research for Health - MEDING, Universidad de Valparaiso, Valparaíso, 2360102, Chile
2
Saranya KR, Vimina ER, Pinto FR. TransNeT-CGP: A cluster-based comorbid gene prioritization by integrating transcriptomics and network-topological features. Comput Biol Chem 2024; 110:108038. [PMID: 38461796 DOI: 10.1016/j.compbiolchem.2024.108038] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/06/2023] [Revised: 01/11/2024] [Accepted: 02/25/2024] [Indexed: 03/12/2024]
Abstract
The local disruptions caused by the genes of one disease can influence the pathways associated with other diseases, resulting in comorbidity. For gene therapies, it is necessary to prioritize the key genes that regulate common biological mechanisms to tackle the issues caused by overlapping diseases. This work proposes a clustering-based computational approach for prioritizing the comorbid genes within the overlapping disease modules by analyzing Protein-Protein Interaction networks. For this, a sub-network with gene interactions of the disease pair was extracted from the interactome. The edge weights are assigned by combining the pairwise gene expression correlation and betweenness centrality scores. Further, a weighted graph clustering algorithm is applied, and dominant nodes of high-density clusters are ranked based on clustering coefficients and neighborhood connectivity. Case studies based on neurodegenerative diseases, such as the Amyotrophic Lateral Sclerosis-Spinal Muscular Atrophy (ALS-SMA) pair, and cancers, such as the Ovarian Carcinoma-Invasive Ductal Breast Carcinoma (OC-IDBC) pair, were conducted to examine the efficacy of the proposed method. To identify the mechanistic role of top-ranked genes, we used functional and pathway enrichment analysis, connectivity analysis with the leave-one-out (LOO) method, analysis of associated disease-related protein complexes, and prioritization tools such as TOPPGENE and Heml2.0. From pathway analysis, it was observed that the top 10 genes obtained using the proposed method were associated with 10 pathways in ALS-SMA comorbidity and 15 in the case of OC-IDBC, whereas the similar methods SAPDSB and S2B yielded 4 and 6 pathways, respectively, for ALS-SMA and 9 and 10, respectively, for OC-IDBC. In both case studies, 70% of the disease-specific benchmark protein complexes were linked to top-ranked genes of the proposed method, compared with 55% and 60% for SAPDSB and S2B, respectively. Additionally, it was found that the removal of the top 10 genes disconnects the network into 14 distinct components in the case of ALS-SMA and 9 in the case of OC-IDBC. The experimental results show that the proposed method can be effectively used for identifying key genes in comorbidity and can offer insights into the intricate molecular relationships driving comorbid diseases.
Affiliation(s)
- K R Saranya
  - Department of Computer Science & IT, School of Computing, Amrita Vishwa Vidyapeetham, Kochi Campus, India
- E R Vimina
  - Department of Computer Science & IT, School of Computing, Amrita Vishwa Vidyapeetham, Kochi Campus, India
- F R Pinto
  - Chemistry and Biochemistry Department, Faculty of Sciences, University of Lisbon, Portugal
3
da Silva Rosa SC, Barzegar Behrooz A, Guedes S, Vitorino R, Ghavami S. Prioritization of genes for translation: a computational approach. Expert Rev Proteomics 2024; 21:125-147. [PMID: 38563427 DOI: 10.1080/14789450.2024.2337004] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/26/2023] [Accepted: 02/21/2024] [Indexed: 04/04/2024]
Abstract
INTRODUCTION Gene identification for genetic diseases is critical for the development of new diagnostic approaches and personalized treatment options. Prioritization of gene translation is an important consideration in the molecular biology field, allowing researchers to focus on the most promising candidates for further investigation. AREAS COVERED In this paper, we discussed different approaches to prioritize genes for translation, including the use of computational tools and machine learning algorithms, as well as experimental techniques such as knockdown and overexpression studies. We also explored the potential biases and limitations of these approaches and proposed strategies to improve the accuracy and reliability of gene prioritization methods. Although numerous computational methods have been developed for this purpose, there is a need for computational methods that incorporate tissue-specific information to enable more accurate prioritization of candidate genes. Such methods should provide tissue-specific predictions, insights into underlying disease mechanisms, and more accurate prioritization of genes. EXPERT OPINION Using advanced computational tools and machine learning algorithms to prioritize genes, we can identify potential targets for therapeutic intervention of complex diseases. This represents an up-and-coming method for drug development and personalized medicine.
Affiliation(s)
- Simone C da Silva Rosa
  - Department of Human Anatomy and Cell Science, Max Rady College of Medicine, Rady Faculty of Health Science, University of Manitoba, Winnipeg, Canada
- Amir Barzegar Behrooz
  - Department of Human Anatomy and Cell Science, Max Rady College of Medicine, Rady Faculty of Health Science, University of Manitoba, Winnipeg, Canada
  - Electrophysiology Research Center, Neuroscience Institute, Tehran University of Medical Sciences, Tehran, Iran
- Sofia Guedes
  - LAQV/REQUIMTE, Department of Chemistry, University of Aveiro, Aveiro, Portugal
- Rui Vitorino
  - LAQV/REQUIMTE, Department of Chemistry, University of Aveiro, Aveiro, Portugal
  - Department of Medical Sciences, Institute of Biomedicine-iBiMED, University of Aveiro, Aveiro, Portugal
  - UnIC@RISE, Department of Surgery and Physiology, Faculty of Medicine of the University of Porto, Porto, Portugal
- Saeid Ghavami
  - Department of Human Anatomy and Cell Science, Max Rady College of Medicine, Rady Faculty of Health Science, University of Manitoba, Winnipeg, Canada
  - Faculty of Medicine in Zabrze, Academia of Silesia, Katowice, Poland
  - Research Institute of Oncology and Hematology, Cancer Care Manitoba, University of Manitoba, Winnipeg, Canada
4
Zhang P, Zhang W, Sun W, Xu J, Hu H, Wang L, Wong L. Identification of gene biomarkers for brain diseases via multi-network topological semantics extraction and graph convolutional network. BMC Genomics 2024; 25:175. [PMID: 38350848 PMCID: PMC10865627 DOI: 10.1186/s12864-024-09967-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/06/2023] [Accepted: 01/03/2024] [Indexed: 02/15/2024] Open
Abstract
BACKGROUND Brain diseases pose a significant threat to human health, and various network-based methods have been proposed for identifying gene biomarkers associated with these diseases. However, the brain is a complex system, and extracting topological semantics from different brain networks is necessary yet challenging to identify pathogenic genes for brain diseases. RESULTS In this study, we present a multi-network representation learning framework called M-GBBD for the identification of gene biomarkers in brain diseases. Specifically, we collected multi-omics data to construct eleven networks from different perspectives. M-GBBD extracts the spatial distributions of features from these networks and iteratively optimizes them using Kullback-Leibler divergence to fuse the networks into a common semantic space that represents the gene network for the brain. Subsequently, a graph consisting of both gene and large-scale disease proximity networks learns representations through graph convolution techniques and predicts whether a gene is associated with brain diseases while providing association scores. Experimental results demonstrate that M-GBBD outperforms several baseline methods. Furthermore, our analysis supported by bioinformatics revealed CAMP, identified by M-GBBD, as a gene significantly associated with Alzheimer's disease. CONCLUSION Collectively, M-GBBD provides valuable insights into identifying gene biomarkers for brain diseases and serves as a promising framework for brain network representation learning.
Affiliation(s)
- Ping Zhang
  - College of Information Science and Engineering, Zaozhuang University, Zaozhuang, 277100, Shandong, China
  - College of Informatics, Huazhong Agricultural University, Wuhan, 430070, China
- Weihan Zhang
  - CAS Key Laboratory of Plant Germplasm Enhancement and Specialty Agriculture, Wuhan Botanical Garden, The Innovative Academy of Seed Design, Chinese Academy of Sciences, Hubei Hongshan Laboratory, Wuhan, 430074, China
- Weicheng Sun
  - College of Informatics, Huazhong Agricultural University, Wuhan, 430070, China
- Jinsheng Xu
  - College of Informatics, Huazhong Agricultural University, Wuhan, 430070, China
- Hua Hu
  - College of Information Science and Engineering, Zaozhuang University, Zaozhuang, 277100, Shandong, China
- Lei Wang
  - College of Information Science and Engineering, Zaozhuang University, Zaozhuang, 277100, Shandong, China
  - Guangxi Key Lab of Human-Machine Interaction and Intelligent Decision, Guangxi Academy of Sciences, Nanning, 530007, China
- Leon Wong
  - College of Big Data and Internet, Shenzhen Technology University, Shenzhen, 518118, China
5
Visonà G, Bouzigon E, Demenais F, Schweikert G. Network propagation for GWAS analysis: a practical guide to leveraging molecular networks for disease gene discovery. Brief Bioinform 2024; 25:bbae014. [PMID: 38340090 PMCID: PMC10858647 DOI: 10.1093/bib/bbae014] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/16/2023] [Revised: 12/28/2023] [Accepted: 01/08/2024] [Indexed: 02/12/2024] Open
Abstract
MOTIVATION Genome-wide association studies (GWAS) have enabled large-scale analysis of the role of genetic variants in human disease. Despite impressive methodological advances, subsequent clinical interpretation and application remain challenging when GWAS suffer from a lack of statistical power. In recent years, however, the use of information diffusion algorithms with molecular networks has led to fruitful insights on disease genes. RESULTS We present an overview of the design choices and pitfalls that prove crucial in the application of network propagation methods to GWAS summary statistics. We highlight general trends from the literature and present benchmark experiments that expand on these insights, using three diseases and five molecular networks as case studies. We verify that, if the GWAS summary statistics are of sufficient quality, gene-level scores based on GWAS P-values offer advantages over selecting a set of 'seed' disease genes that are not weighted by their associated P-values. Beyond that, the size and the density of the networks prove to be important factors for consideration. Finally, we explore several ensemble methods and show that combining multiple networks may improve the network propagation approach.
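The contrast between P-value-weighted gene scores and unweighted seed sets can be made concrete with a short sketch of the two ways of building the initial vector for propagation. The `-log10(P)` transform and the `5e-8` cutoff are common conventions assumed here for illustration; the abstract does not state which transform or threshold the authors used, and the P-values are made up.

```python
import numpy as np

def weighted_scores(p_values):
    """Gene-level scores proportional to -log10(P), normalized to sum to 1."""
    p = np.clip(np.asarray(p_values, dtype=float), 1e-300, 1.0)
    s = -np.log10(p)
    total = s.sum()
    return s / total if total > 0 else s

def binary_seeds(p_values, threshold=5e-8):
    """Unweighted seed set: every gene below the threshold gets equal mass."""
    p = np.asarray(p_values, dtype=float)
    seeds = (p < threshold).astype(float)
    total = seeds.sum()
    return seeds / total if total > 0 else seeds

# Hypothetical gene-level P-values for four genes.
p_values = [1e-12, 1e-9, 1e-4, 0.5]
w = weighted_scores(p_values)   # keeps the ordering and magnitude of evidence
b = binary_seeds(p_values)      # only records which genes pass the cutoff
```

With binary seeds the third gene (P = 1e-4) contributes nothing to the propagation, whereas the weighted vector still carries its sub-threshold signal.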
Affiliation(s)
- Giovanni Visonà
  - Empirical Inference, Max-Planck Institute for Intelligent Systems, Tübingen 72076, Germany
6
Berber I, Erten C, Kazan H. Predator: Predicting the Impact of Cancer Somatic Mutations on Protein-Protein Interactions. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2023; 20:3163-3172. [PMID: 37030791 DOI: 10.1109/tcbb.2023.3262119] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/19/2023]
Abstract
Since many biological processes are governed by protein-protein interactions, understanding which mutations lead to a disruption in these interactions is profoundly important for cancer research. Most of the existing methods focus on the stability of the protein without considering the specific effects of a mutation on its interactions with other proteins. Here, we focus on somatic mutations that appear on the interface regions of the protein and predict the interactions that would be affected by a mutation of interest. We build an ensemble model, Predator, that classifies the interface mutations as disruptive or nondisruptive based on the predicted effects of mutations on specific protein-protein interactions. We show that Predator outperforms existing approaches in the literature in terms of prediction accuracy. We then apply Predator to various TCGA cancer cohorts and perform comprehensive analysis at the cohort, patient, and gene levels to determine the genes whose interface mutations tend to disrupt their interactions. The predictions obtained by Predator shed light on interesting patterns in several genes for each cohort regarding their potential as cancer drivers. Our analyses further reveal that the identified genes and their frequently disrupted partners exhibit patterns of mutual exclusivity across the cancer cohorts under study.
7
Gao Z, Pan Y, Ding P, Xu R. A knowledge graph-based disease-gene prediction system using multi-relational graph convolution networks. AMIA ... Annual Symposium Proceedings. AMIA Symposium 2023; 2022:468-476. [PMID: 37128437 PMCID: PMC10148306] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/03/2023]
Abstract
Identifying disease-gene associations is important for understanding the molecular mechanisms of diseases and for finding diagnostic markers and therapeutic targets. Many computational methods have been proposed to predict disease-related genes by integrating different biological databases into heterogeneous networks. However, it remains a challenging task to leverage heterogeneous topological and semantic information from multi-source biological data to enhance disease-gene prediction. In this study, we propose a knowledge graph-based disease-gene prediction system (GenePredict-KG) by modeling semantic relations extracted from various genotypic and phenotypic databases. We first constructed a knowledge graph that comprised 2,292,609 associations between 73,358 entities for 14 types of phenotypic and genotypic relations and 7 entity types. We developed a knowledge graph embedding model to learn low-dimensional representations of entities and relations, and utilized these embeddings to infer new disease-gene interactions. We compared GenePredict-KG with several state-of-the-art models using multiple evaluation metrics. GenePredict-KG achieved high performance [AUROC (the area under the receiver operating characteristic curve) = 0.978, AUPR (the area under the precision-recall curve) = 0.343 and MRR (the mean reciprocal rank) = 0.244], outperforming other state-of-the-art methods.
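Of the three metrics reported above, the mean reciprocal rank is the least familiar, so a small worked example may help. This is a generic sketch of the standard definition, with made-up gene identifiers; it does not reproduce the authors' evaluation protocol.

```python
def mean_reciprocal_rank(ranked_lists, true_items):
    """Average of 1/rank of the true item over all queries (0 if it is absent)."""
    total = 0.0
    for ranking, truth in zip(ranked_lists, true_items):
        if truth in ranking:
            total += 1.0 / (ranking.index(truth) + 1)   # ranks are 1-based
    return total / len(ranked_lists)

# Three hypothetical disease queries; the true gene is ranked 1st, 2nd, and 3rd.
rankings = [
    ["G1", "G2", "G3"],
    ["G2", "G1", "G3"],
    ["G3", "G2", "G1"],
]
truths = ["G1", "G1", "G1"]
mrr = mean_reciprocal_rank(rankings, truths)   # (1 + 1/2 + 1/3) / 3
```

An MRR of 0.244, as reported in the abstract, corresponds roughly to the true gene appearing around rank four on average in reciprocal terms.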
Affiliation(s)
- Zhenxiang Gao
  - Center for Artificial Intelligence in Drug Discovery, Case Western Reserve University School of Medicine, Cleveland, OH, USA
- Yiheng Pan
  - Center for Artificial Intelligence in Drug Discovery, Case Western Reserve University School of Medicine, Cleveland, OH, USA
- Pingjian Ding
  - Center for Artificial Intelligence in Drug Discovery, Case Western Reserve University School of Medicine, Cleveland, OH, USA
- Rong Xu
  - Center for Artificial Intelligence in Drug Discovery, Case Western Reserve University School of Medicine, Cleveland, OH, USA
8
Altuntas V. Diffusion Alignment Coefficient (DAC): A Novel Similarity Metric for Protein-Protein Interaction Network. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2023; 20:894-903. [PMID: 35737632 DOI: 10.1109/tcbb.2022.3185406] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/04/2023]
Abstract
Interaction networks can be used to predict the functions of unknown proteins using known interactions and proteins with known functions. Many graph theory or diffusion-based methods have been proposed, using the assumption that the topological properties of a protein in a network are related to its biological function. Here we seek to improve function prediction by finding more similar neighbors with a new diffusion-based alignment technique to overcome the topological information loss of the node. In this study, we introduce the Diffusion Alignment Coefficient (DAC) algorithm, which combines diffusion, longest common subsequence, and longest common substring techniques to measure the similarity of two nodes in protein interaction networks. As a proof of concept, our experiments, conducted on real PPI networks of S. cerevisiae and Homo sapiens, demonstrated that our method obtained better results than competitors for MIPS and MSigDB Collections hallmark gene set functional categories. This is the first study to develop a measure of node function similarity using alignment to consider the positions of nodes in protein-protein interaction networks. According to the experimental results, the use of spatial information belonging to the nodes in the network has a positive effect on the detection of more functionally similar neighboring nodes.
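The two alignment primitives named in this abstract, longest common subsequence and longest common substring, can be sketched as follows. How DAC combines them with diffusion is not described here, so the example only computes both measures on two hypothetical orderings of proteins (for instance, neighbors sorted by diffusion score); the orderings and the interpretation are assumptions.

```python
def lcs_subsequence_length(a, b):
    """Length of the longest common subsequence (order kept, gaps allowed)."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def lcs_substring_length(a, b):
    """Length of the longest common contiguous run shared by both sequences."""
    best = 0
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0] * (len(b) + 1)
        for j, y in enumerate(b, 1):
            if x == y:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best

# Hypothetical diffusion-ranked neighbor lists of two proteins u and v.
order_u = ["P1", "P2", "P3", "P4", "P5"]
order_v = ["P1", "P3", "P4", "P2", "P5"]

subseq = lcs_subsequence_length(order_u, order_v)   # P1, P3, P4, P5
substr = lcs_substring_length(order_u, order_v)     # P3, P4
```

The subsequence measure tolerates insertions between shared proteins, while the substring measure rewards only uninterrupted agreement, which is why the two can differ on the same pair of lists.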
9
Jagodnik KM, Shvili Y, Bartal A. HetIG-PreDiG: A Heterogeneous Integrated Graph Model for Predicting Human Disease Genes based on gene expression. PLoS One 2023; 18:e0280839. [PMID: 36791052 PMCID: PMC9931161 DOI: 10.1371/journal.pone.0280839] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 06/12/2022] [Accepted: 01/10/2023] [Indexed: 02/16/2023] Open
Abstract
Graph analytical approaches permit identifying novel genes involved in complex diseases, but are limited by (i) inferring structural network similarity of connected gene nodes, ignoring potentially relevant unconnected nodes; (ii) using homogeneous graphs, missing gene-disease associations' complexity; (iii) relying on disease/gene-phenotype associations' similarities, involving highly incomplete data; (iv) using binary classification, with gene-disease edges as positive training samples, and non-associated gene and disease nodes as negative samples that may include currently unknown disease genes; or (v) reporting predicted novel associations without systematically evaluating their accuracy. Addressing these limitations, we develop the Heterogeneous Integrated Graph for Predicting Disease Genes (HetIG-PreDiG) model that includes gene-gene, gene-disease, and gene-tissue associations. We predict novel disease genes using low-dimensional representation of nodes accounting for network structure, and extending beyond network structure using the developed Gene-Disease Prioritization Score (GDPS) reflecting the degree of gene-disease association via gene co-expression data. For negative training samples, we select non-associated gene and disease nodes with lower GDPS that are less likely to be affiliated. We evaluate the developed model's success in predicting novel disease genes by analyzing the prediction probabilities of gene-disease associations. HetIG-PreDiG successfully predicts (Micro-F1 = 0.95) gene-disease associations, outperforming baseline models, and is validated using published literature, thus advancing our understanding of complex genetic diseases.
Affiliation(s)
- Kathleen M. Jagodnik
  - The School of Business Administration, Bar-Ilan University, Ramat Gan, Israel
  - Department of Psychiatry, Harvard Medical School, Boston, MA, United States of America
  - Department of Psychiatry, Massachusetts General Hospital, Boston, MA, United States of America
- Yael Shvili
  - Department of Surgery A, Meir Medical Center, Kfar Sava, Israel
- Alon Bartal
  - The School of Business Administration, Bar-Ilan University, Ramat Gan, Israel
10
An integrated network representation of multiple cancer-specific data for graph-based machine learning. NPJ Syst Biol Appl 2022; 8:14. [PMID: 35487924 PMCID: PMC9054771 DOI: 10.1038/s41540-022-00226-9] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 09/29/2021] [Accepted: 04/04/2022] [Indexed: 12/20/2022] Open
Abstract
Genomic profiles of cancer cells provide valuable information on genetic alterations in cancer. Several recent studies employed these data to predict the response of cancer cell lines to drug treatment. Nonetheless, due to the multifactorial phenotypes and intricate mechanisms of cancer, accurately predicting the effect of pharmacotherapy on a specific cell line from genetic information alone is problematic. Emphasizing the system-level complexity of cancer, we devised a procedure to integrate multiple heterogeneous data, including biological networks, genomics, inhibitor profiling, and gene-disease associations, into a unified graph structure. In order to construct compact yet information-rich cancer-specific networks, we developed a novel graph reduction algorithm. Driven not only by the topological information but also by biological knowledge, the graph reduction increases the feature-only entropy while preserving the valuable graph-feature information. Subsequent comparative benchmarking simulations employing a tissue-level cross-validation protocol demonstrate that the accuracy of a graph-based predictor of drug efficacy is 0.68, which is notably higher than the accuracies measured for more traditional, matrix-based techniques on the same data. Overall, the non-Euclidean representation of the cancer-specific data improves the performance of machine learning in predicting the response of cancer to pharmacotherapy. The generated data are freely available to the academic community at https://osf.io/dzx7b/.
11
Du J, Lin D, Yuan R, Chen X, Liu X, Yan J. Graph Embedding Based Novel Gene Discovery Associated With Diabetes Mellitus. Front Genet 2021; 12:779186. [PMID: 34899863 PMCID: PMC8657768 DOI: 10.3389/fgene.2021.779186] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/18/2021] [Accepted: 10/20/2021] [Indexed: 11/25/2022] Open
Abstract
Diabetes mellitus is a group of complex metabolic disorders that has affected hundreds of millions of patients worldwide. The underlying pathogenesis of the various types of diabetes is still unclear, which hinders the development of more efficient therapies. Although many genes have been found to be associated with diabetes mellitus, more novel genes still need to be discovered to obtain a complete picture of the underlying mechanism. With the development of complex molecular networks, network-based disease-gene prediction methods have been widely proposed. However, most existing methods are based on the hypothesis of guilt-by-association and often handcraft node features based on local topological structures. Advances in graph embedding techniques have enabled automatic global feature extraction from molecular networks. Inspired by the successful applications of cutting-edge graph embedding methods to complex diseases, we proposed a computational framework to investigate novel genes associated with diabetes mellitus. There are three main steps in the framework: network feature extraction based on graph embedding methods; feature denoising and regeneration using a stacked autoencoder; and disease-gene prediction based on machine learning classifiers. We compared the performance of different graph embedding methods and machine learning classifiers and designed the best workflow for predicting genes associated with diabetes mellitus. Functional enrichment analysis based on Human Phenotype Ontology (HPO), KEGG, and GO biological process, together with a publication search, further evaluated the predicted novel genes.
Affiliation(s)
- Jing Yan
  - Zhejiang Hospital, Hangzhou, China
  - Zhejiang Provincial Key Lab of Geriatrics, Zhejiang Hospital, Hangzhou, China
12
Mohsen H, Gunasekharan V, Qing T, Seay M, Surovtseva Y, Negahban S, Szallasi Z, Pusztai L, Gerstein MB. Network propagation-based prioritization of long tail genes in 17 cancer types. Genome Biol 2021; 22:287. [PMID: 34620211 PMCID: PMC8496153 DOI: 10.1186/s13059-021-02504-x] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 04/01/2021] [Accepted: 09/17/2021] [Indexed: 12/15/2022] Open
Abstract
BACKGROUND The diversity of genomic alterations in cancer poses challenges to fully understanding the etiologies of the disease. Recent interest in infrequent mutations, in genes that reside in the "long tail" of the mutational distribution, uncovered new genes with significant implications in cancer development. The study of cancer-relevant genes often requires integrative approaches pooling together multiple types of biological data. Network propagation methods demonstrate high efficacy in achieving this integration. Yet, the majority of these methods focus their assessment on detecting known cancer genes or identifying altered subnetworks. In this paper, we introduce a network propagation approach that focuses entirely on prioritizing long tail genes with potential functional impact on cancer development. RESULTS We identify sets of often overlooked, rarely to moderately mutated genes whose biological interactions significantly propel their mutation-frequency-based rank upwards during propagation in 17 cancer types. We call these sets "upward mobility genes" and hypothesize that their significant rank improvement indicates functional importance. We report new cancer-pathway associations based on upward mobility genes that were not previously identified using driver genes alone, validate their role in cancer cell survival in vitro using extensive genome-wide RNAi and CRISPR data repositories, and further conduct in vitro functional screenings resulting in the validation of 18 previously unreported genes. CONCLUSION Our analysis extends the spectrum of cancer-relevant genes and identifies novel potential therapeutic targets.
Affiliation(s)
- Hussein Mohsen
  - Computational Biology & Bioinformatics Program, Yale University, New Haven, CT, 06511, USA
- Tao Qing
  - Breast Medical Oncology, Yale School of Medicine, New Haven, CT, 06511, USA
- Montrell Seay
  - Yale Center for Molecular Discovery, Yale University, West Haven, CT, 06516, USA
- Yulia Surovtseva
  - Yale Center for Molecular Discovery, Yale University, West Haven, CT, 06516, USA
- Sahand Negahban
  - Department of Statistics & Data Science, Yale University, New Haven, CT, 06511, USA
- Zoltan Szallasi
  - Children's Hospital Informatics Program, Harvard-MIT Division of Health Sciences and Technology, Harvard Medical School, Boston, MA, 02115, USA
- Lajos Pusztai
  - Breast Medical Oncology, Yale School of Medicine, New Haven, CT, 06511, USA
- Mark B Gerstein
  - Computational Biology & Bioinformatics Program, Yale University, New Haven, CT, 06511, USA
  - Department of Statistics & Data Science, Yale University, New Haven, CT, 06511, USA
  - Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT, 06511, USA
  - Department of Computer Science, Yale University, New Haven, CT, 06511, USA
13
Zhang W, Wang SL, Liu Y. Identification of Cancer Driver Modules Based on Graph Clustering from Multiomics Data. J Comput Biol 2021; 28:1007-1020. [PMID: 34529511 DOI: 10.1089/cmb.2021.0052] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/01/2023] Open
Abstract
A major challenge in cancer genomics is to identify cancer driver genes and modules. Most existing methods to identify cancer driver modules (iCDM) identify groups of genes whose somatic mutational patterns exhibit either mutual exclusivity or high coverage of patient samples, without considering other biological information from multiomics data sets. Here we integrate mutual exclusivity, coverage, and protein-protein interaction information to construct an edge-weighted network, and present a graph clustering approach based on symmetric non-negative matrix factorization to iCDM. iCDM was tested on pan-cancer data and the results were compared with those from several advanced computational methods. Our approach outperformed other methods in recovering known cancer driver modules, and the identified driver modules showed high accuracy in classifying normal and tumor samples.
Affiliation(s)
- Wei Zhang
  - College of Computer Engineering and Applied Mathematics, Changsha University, Changsha, China
  - Hunan Province Key Laboratory of Industrial Internet Technology and Security, Changsha University, Changsha, China
- Shu-Lin Wang
  - College of Computer Science and Electronics Engineering, Hunan University, Changsha, China
- Yue Liu
  - College of Computer Science and Electronics Engineering, Hunan University, Changsha, China
|
14
|
Coşkun M, Koyutürk M. Node Similarity Based Graph Convolution for Link Prediction in Biological Networks. Bioinformatics 2021; 37:4501-4508. [PMID: 34152393 PMCID: PMC8652026 DOI: 10.1093/bioinformatics/btab464] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Received: 12/09/2020] [Revised: 05/20/2021] [Accepted: 06/17/2021] [Indexed: 01/17/2023] Open
Abstract
BACKGROUND Link prediction is an important and well-studied problem in network biology. Recently, graph representation learning methods, including Graph Convolutional Network (GCN)-based node embedding have drawn increasing attention in link prediction. MOTIVATION An important component of GCN-based network embedding is the convolution matrix, which is used to propagate features across the network. Existing algorithms use the degree-normalized adjacency matrix for this purpose, as this matrix is closely related to the graph Laplacian, capturing the spectral properties of the network. In parallel, it has been shown that GCNs with a single layer can generate more robust embeddings by reducing the number of parameters. Laplacian-based convolution is not well suited to single layered GCNs, as it limits the propagation of information to immediate neighbors of a node. RESULTS Capitalizing on the rich literature on unsupervised link prediction, we propose using node similarity based convolution matrices in GCNs to compute node embeddings for link prediction. We consider eight representative node similarity measures (Common Neighbors, Jaccard Index, Adamic-Adar, Resource Allocation, Hub Depressed Index, Hub Promoted Index, Sorenson Index, Salton Index) for this purpose. We systematically compare the performance of the resulting algorithms against GCNs that use the degree-normalized adjacency matrix for convolution, as well as other link prediction algorithms. In our experiments, we use three link prediction tasks involving biomedical networks: drug-disease association (DDA) prediction, drug-drug interaction (DDI) prediction, protein-protein interaction (PPI) prediction. Our results show that node similarity-based convolution matrices significantly improve the link prediction performance of GCN-based embeddings. 
CONCLUSION As sophisticated machine learning frameworks are increasingly employed in biological applications, historically well-established methods can be useful in making a head-start. AVAILABILITY Our method, SiGraC, is implemented as a Python library and is freely available at https://github.com/mustafaCoskunAgu/SiGraC. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
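For orientation, the eight node similarity measures named in the abstract above can be computed for a single node pair as in the following sketch. It uses the standard textbook definitions on an unweighted, undirected graph without self-loops; the function name `similarity_scores` and the toy graph are assumptions for illustration, and the sketch does not reproduce how SiGraC assembles these scores into a convolution matrix.

```python
import math


def similarity_scores(adj, u, v):
    """Neighbourhood-based similarity scores for two distinct nodes u and v.

    adj: dict mapping node -> set of neighbours (symmetric, no self-loops).
    """
    nu, nv = adj[u], adj[v]
    common = nu & nv
    cn = len(common)
    ku, kv = len(nu), len(nv)
    return {
        "common_neighbors": cn,
        "jaccard": cn / len(nu | nv) if (nu | nv) else 0.0,
        # A common neighbour has degree >= 2, so the logarithm is positive.
        "adamic_adar": sum(1 / math.log(len(adj[w])) for w in common),
        "resource_allocation": sum(1 / len(adj[w]) for w in common),
        "hub_depressed": cn / max(ku, kv) if max(ku, kv) else 0.0,
        "hub_promoted": cn / min(ku, kv) if min(ku, kv) else 0.0,
        "sorensen": 2 * cn / (ku + kv) if (ku + kv) else 0.0,
        "salton": cn / math.sqrt(ku * kv) if ku * kv else 0.0,
    }


# Hypothetical toy graph (assumption for illustration).
adj = {
    "a": {"c", "d"},
    "b": {"c", "d", "e"},
    "c": {"a", "b"},
    "d": {"a", "b", "e"},
    "e": {"b", "d"},
}
scores = similarity_scores(adj, "a", "b")
```

Nodes "a" and "b" are not directly linked but share two neighbours, so every measure assigns the pair a non-zero score; the measures differ only in how they normalize or weight those shared neighbours.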
Affiliation(s)
- Mustafa Coşkun
  - Department of Computer Engineering, Abdullah Gül University
  - Hakkari University, Kayseri, 38080, Turkey
- Mehmet Koyutürk
  - Department of Computer and Data Sciences
  - Center for Proteomics and Bioinformatics, Case Western Reserve University, Cleveland, OH, 44106, USA
|
15
|
Coşkun M, Baggag A, Koyutürk M. Fast computation of Katz index for efficient processing of link prediction queries. Data Min Knowl Discov 2021. [DOI: 10.1007/s10618-021-00754-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/30/2022]
|
16
|
Luo P, Chen B, Liao B, Wu F. Predicting disease‐associated genes: Computational methods, databases, and evaluations. WIRES DATA MINING AND KNOWLEDGE DISCOVERY 2021; 11. [DOI: 10.1002/widm.1383] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 07/28/2019] [Accepted: 06/13/2020] [Indexed: 09/09/2024]
Abstract
Complex diseases are associated with a set of genes (called disease genes), the identification of which can help scientists uncover the mechanisms of diseases and develop new drugs and treatment strategies. Due to the huge cost and time of experimental identification techniques, many computational algorithms have been proposed to predict disease genes. Although several review publications in recent years have discussed many computational methods, some of them focus on cancer driver genes while others focus on biomolecular networks, which only cover a specific aspect of existing methods. In this review, we summarize existing methods and classify them into three categories based on their rationales. Then, the algorithms, biological data, and evaluation methods used in the computational prediction are discussed. Finally, we highlight the limitations of existing methods and point out some future directions for improving these algorithms. This review could help investigators understand the principles of existing methods, and thus develop new methods to advance the computational prediction of disease genes. This article is categorized under: Technologies > Machine Learning; Technologies > Prediction; Algorithmic Development > Biological Data Mining.
Affiliation(s)
- Ping Luo
  - Division of Biomedical Engineering, University of Saskatchewan, Saskatoon, Canada
  - Princess Margaret Cancer Centre, University Health Network, Toronto, Canada
- Bolin Chen
  - School of Computer Science and Technology, Northwestern Polytechnical University, China
- Bo Liao
  - School of Mathematics and Statistics, Hainan Normal University, Haikou, China
- Fang‐Xiang Wu
  - Department of Mechanical Engineering and Department of Computer Science, University of Saskatchewan, Saskatoon, Canada
|
17
|
Erten C, Houdjedj A, Kazan H. Ranking cancer drivers via betweenness-based outlier detection and random walks. BMC Bioinformatics 2021; 22:62. [PMID: 33568049 PMCID: PMC7877041 DOI: 10.1186/s12859-021-03989-w] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 03/10/2020] [Accepted: 01/31/2021] [Indexed: 12/04/2022] Open
Abstract
BACKGROUND Recent cancer genomic studies have generated detailed molecular data on a large number of cancer patients. A key remaining problem in cancer genomics is the identification of driver genes. RESULTS We propose BetweenNet, a computational approach that integrates genomic data with a protein-protein interaction network to identify cancer driver genes. BetweenNet utilizes a measure based on betweenness centrality on patient-specific networks to identify the so-called outlier genes that correspond to dysregulated genes for each patient. Setting up the relationship between the mutated genes and the outliers through a bipartite graph, it employs a random-walk process on the graph, which provides the final prioritization of the mutated genes. We compare BetweenNet against state-of-the-art cancer gene prioritization methods on lung, breast, and pan-cancer datasets. CONCLUSIONS Our evaluations show that BetweenNet is better at recovering known cancer genes based on multiple reference databases. Additionally, we show that the GO terms and the reference pathways enriched in BetweenNet-ranked genes overlap with those enriched in known cancer genes significantly more than the overlaps achieved by the rankings of the alternative methods.
Affiliation(s)
- Cesim Erten
  - Department of Computer Engineering, Antalya Bilim University, Antalya, Turkey
- Aissa Houdjedj
  - Electrical and Computer Engineering Graduate Program, Antalya Bilim University, Antalya, Turkey
- Hilal Kazan
  - Department of Computer Engineering, Antalya Bilim University, Antalya, Turkey
|
18
|
Identifying patient-specific flow of signal transduction perturbed by multiple single-nucleotide alterations. QUANTITATIVE BIOLOGY 2020. [DOI: 10.1007/s40484-020-0227-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/26/2022]
|
19
|
Baali I, Erten C, Kazan H. DriveWays: a method for identifying possibly overlapping driver pathways in cancer. Sci Rep 2020; 10:21971. [PMID: 33319839 PMCID: PMC7738685 DOI: 10.1038/s41598-020-78852-8] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 04/14/2020] [Accepted: 11/19/2020] [Indexed: 11/22/2022] Open
Abstract
The majority of the previous methods for identifying cancer driver modules output nonoverlapping modules. This assumption is biologically inaccurate, as genes can participate in multiple molecular pathways. This is particularly true for cancer-associated genes, as many of them are network hubs connecting functionally distinct sets of genes. It is important to provide combinatorial optimization problem definitions modeling this biological phenomenon and to suggest efficient algorithms for their solution. We provide a formal definition of the Overlapping Driver Module Identification in Cancer (ODMIC) problem. We show that the problem is NP-hard. We propose a seed-and-extend based heuristic named DriveWays that identifies overlapping cancer driver modules from the graph built from the IntAct PPI network. DriveWays incorporates mutual exclusivity, coverage, and the network connectivity information of the genes. We show that DriveWays outperforms the state-of-the-art methods in recovering well-known cancer driver genes on TCGA pan-cancer data. Additionally, DriveWays' output modules show a stronger enrichment for the reference pathways in almost all cases. Overall, we show that enabling modules to overlap improves the recovery of functional pathways filtered with known cancer drivers, which essentially constitute the reference set of cancer-related pathways.
Affiliation(s)
- Ilyes Baali
  - Electrical and Computer Engineering Graduate Program, Antalya Bilim University, 07190, Antalya, Turkey
- Cesim Erten
  - Department of Computer Engineering, Antalya Bilim University, 07190, Antalya, Turkey
- Hilal Kazan
  - Department of Computer Engineering, Antalya Bilim University, 07190, Antalya, Turkey
|
20
|
Ata SK, Wu M, Fang Y, Ou-Yang L, Kwoh CK, Li XL. Recent advances in network-based methods for disease gene prediction. Brief Bioinform 2020; 22:6023077. [PMID: 33276376 DOI: 10.1093/bib/bbaa303] [Citation(s) in RCA: 32] [Impact Index Per Article: 8.0] [Received: 07/20/2020] [Revised: 09/29/2020] [Accepted: 10/10/2020] [Indexed: 01/28/2023] Open
Abstract
Disease-gene association through genome-wide association study (GWAS) is an arduous task for researchers. Investigating single nucleotide polymorphisms that correlate with specific diseases needs statistical analysis of associations. Considering the huge number of possible mutations, in addition to its high cost, another important drawback of GWAS analysis is the large number of false positives. Thus, researchers search for more evidence to cross-check their results through different sources. To provide the researchers with alternative and complementary low-cost disease-gene association evidence, computational approaches come into play. Since molecular networks are able to capture complex interplay among molecules in diseases, they become one of the most extensively used data for disease-gene association prediction. In this survey, we aim to provide a comprehensive and up-to-date review of network-based methods for disease gene prediction. We also conduct an empirical analysis on 14 state-of-the-art methods. To summarize, we first elucidate the task definition for disease gene prediction. Secondly, we categorize existing network-based efforts into network diffusion methods, traditional machine learning methods with handcrafted graph features and graph representation learning methods. Thirdly, an empirical analysis is conducted to evaluate the performance of the selected methods across seven diseases. We also provide distinguishing findings about the discussed methods based on our empirical analysis. Finally, we highlight potential research directions for future studies on disease gene prediction.
Affiliation(s)
- Sezin Kircali Ata
  - School of Computer Science and Engineering, Nanyang Technological University (NTU)
- Min Wu
  - Institute for Infocomm Research (I2R), A*STAR, Singapore
- Yuan Fang
  - School of Information Systems, Singapore Management University, Singapore
- Le Ou-Yang
  - College of Electronics and Information Engineering, Shenzhen University, Shenzhen, China
- Xiao-Li Li
  - Department head and principal scientist at I2R, A*STAR, Singapore
|
21
|
Ahmed R, Baali I, Erten C, Hoxha E, Kazan H. MEXCOwalk: mutual exclusion and coverage based random walk to identify cancer modules. Bioinformatics 2020; 36:872-879. [PMID: 31432076 DOI: 10.1093/bioinformatics/btz655] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 03/25/2019] [Revised: 07/03/2019] [Accepted: 08/18/2019] [Indexed: 12/25/2022] Open
Abstract
MOTIVATION Genomic analyses from large cancer cohorts have revealed the mutational heterogeneity problem which hinders the identification of driver genes based only on mutation profiles. One way to tackle this problem is to incorporate the fact that genes act together in functional modules. The connectivity knowledge present in existing protein-protein interaction (PPI) networks together with mutation frequencies of genes and the mutual exclusivity of cancer mutations can be utilized to increase the accuracy of identifying cancer driver modules. RESULTS We present a novel edge-weighted random walk-based approach that incorporates connectivity information in the form of protein-protein interactions (PPIs), mutual exclusivity and coverage to identify cancer driver modules. MEXCOwalk outperforms several state-of-the-art computational methods on TCGA pan-cancer data in terms of recovering known cancer genes, providing modules that are capable of classifying normal and tumor samples and that are enriched for mutations in specific cancer types. Furthermore, the risk scores determined with output modules can stratify patients into low-risk and high-risk groups in multiple cancer types. MEXCOwalk identifies modules containing both well-known cancer genes and putative cancer genes that are rarely mutated in the pan-cancer data. The data, the source code and useful scripts are available at: https://github.com/abu-compbio/MEXCOwalk. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
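The two mutation-based quantities mentioned in the abstract above, coverage and mutual exclusivity, can be made concrete with a small sketch. It assumes one common formalization: coverage is the number of patients with a mutation in at least one gene of a set, and mutual exclusivity is that coverage divided by the sum of the per-gene patient counts. The function names and the toy mutation data are hypothetical, and the way MEXCOwalk turns these quantities into edge weights for its random walk is not reproduced here.

```python
def coverage(genes, mutated_in):
    """Number of patients carrying a mutation in at least one of the genes.

    mutated_in: dict mapping gene -> set of patient identifiers.
    """
    patients = set()
    for g in genes:
        patients |= mutated_in[g]
    return len(patients)


def mutual_exclusivity(genes, mutated_in):
    """Coverage divided by the summed per-gene patient counts.

    The value is 1.0 when no patient has mutations in two genes of the set
    and decreases as the mutation sets overlap.
    """
    total = sum(len(mutated_in[g]) for g in genes)
    return coverage(genes, mutated_in) / total if total else 0.0


# Hypothetical mutation data (assumption for illustration).
mutated_in = {
    "G1": {"p1", "p2"},
    "G2": {"p3", "p4"},
    "G3": {"p1", "p2", "p3"},
}

exclusive_pair = mutual_exclusivity(["G1", "G2"], mutated_in)
overlapping_pair = mutual_exclusivity(["G1", "G3"], mutated_in)
```

Here the pair G1 and G2 covers four patients with no overlap, while G1 and G3 share two patients, so the second pair covers fewer patients and scores lower on mutual exclusivity.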
Affiliation(s)
- Rafsan Ahmed
  - Electrical and Computer Engineering Graduate Program, Department of Computer Engineering, Antalya Bilim University, Antalya 07190, Turkey
- Ilyes Baali
  - Electrical and Computer Engineering Graduate Program, Department of Computer Engineering, Antalya Bilim University, Antalya 07190, Turkey
- Cesim Erten
  - Department of Computer Engineering, Antalya Bilim University, Antalya 07190, Turkey
- Evis Hoxha
  - Department of Computer Engineering, Antalya Bilim University, Antalya 07190, Turkey
- Hilal Kazan
  - Department of Computer Engineering, Antalya Bilim University, Antalya 07190, Turkey
|
22
|
Abstract
Since the initial success of genome-wide association studies (GWAS) in 2005, tens of thousands of genetic variants have been identified for hundreds of human diseases and traits. In a GWAS, genotype information at up to millions of genetic markers is collected from up to hundreds of thousands of individuals, together with their phenotype information. Several scientific goals can be accomplished through the analysis of GWAS data, including the identification of variants, genes, and pathways associated with diseases and traits of interest; the inference of the genetic architecture of these traits; and the development of genetic risk prediction models. In this review, we provide an overview of the statistical challenges in achieving these goals and recent progress in statistical methodology to address these challenges.
Affiliation(s)
- Ning Sun
  - Department of Biostatistics, Yale School of Public Health, New Haven, Connecticut 06520, USA
- Hongyu Zhao
  - Department of Biostatistics, Yale School of Public Health, New Haven, Connecticut 06520, USA
|
23
|
Altuntas V, Gok M, Kahveci T. Stability Analysis of Biological Networks' Diffusion State. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2020; 17:1406-1418. [PMID: 30452376 DOI: 10.1109/tcbb.2018.2881887] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 06/09/2023]
Abstract
Computational knowledge acquired from noisy networks is not reliable, and the network topology determines the reliability. Protein-protein interaction networks have uncertain topologies and noise, with false positive and false negative edges at high rates. In this study, we analyze the effects of mutations in a network's topology on the diffusion state of that network. To evaluate the sensitivity of the diffusion state, we derive fitness measures based on the mathematically defined stability of a network. Searching for an influential set of edges in a network is a difficult problem. We handle the computational challenge by developing a novel metaheuristic optimization method, and we find influential mutations time-efficiently. Our experiments, conducted on both synthetic and real networks from public databases, demonstrated that our method obtained better results than competitors for all types of network topologies. This is the first time that diffusion has been evaluated under topological mutations. Our analysis identifies significant results about the stability of biological and synthetic networks and their diffusion state. In particular, mutations in protein-protein interaction network topologies have a significant influence on the diffusion state of the network. Network stability is more affected by the network model than by the network size.
|
24
|
Singha M, Pu L, Shawky A, Busch K, Wu H, Ramanujam J, Brylinski M. GraphGR: A graph neural network to predict the effect of pharmacotherapy on the cancer cell growth. [DOI: 10.1101/2020.05.20.107458] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 09/02/2023]
Abstract
Genomic profiles of cancer cells provide valuable information on genetic alterations in cancer. Several recent studies employed these data to predict the response of cancer cell lines to treatment with drugs. Nonetheless, due to the multifactorial phenotypes and intricate mechanisms of cancer, the accurate prediction of the effect of pharmacotherapy on a specific cell line based on the genetic information alone is problematic. High prediction accuracies reported in the literature likely result from significant overlaps among training, validation, and testing sets, making many predictors inapplicable to new data. To address these issues, we developed GraphGR, a graph neural network with sophisticated attention propagation mechanisms to predict the therapeutic effects of kinase inhibitors across various tumors. Emphasizing the system-level complexity of cancer, GraphGR integrates multiple heterogeneous data, such as biological networks, genomics, inhibitor profiling, and gene-disease associations, into a unified graph structure. In order to construct diverse and information-rich cancer-specific networks, we devised a novel graph reduction protocol based not only on the topological information but also on the biological knowledge. The performance of GraphGR, properly cross-validated at the tissue level, is 0.83 in terms of the area under the receiver operating characteristic curve, which is notably higher than that measured for other approaches on the same data. Finally, several new predictions are validated against the biomedical literature, demonstrating that GraphGR generalizes well to unseen data, i.e., it can predict therapeutic effects across a variety of cancer cell lines and inhibitors. GraphGR is freely available to the academic community at https://github.com/pulimeng/GraphGR.
|
25
|
Blatti C, Emad A, Berry MJ, Gatzke L, Epstein M, Lanier D, Rizal P, Ge J, Liao X, Sobh O, Lambert M, Post CS, Xiao J, Groves P, Epstein AT, Chen X, Srinivasan S, Lehnert E, Kalari KR, Wang L, Weinshilboum RM, Song JS, Jongeneel CV, Han J, Ravaioli U, Sobh N, Bushell CB, Sinha S. Knowledge-guided analysis of "omics" data using the KnowEnG cloud platform. PLoS Biol 2020; 18:e3000583. [PMID: 31971940 PMCID: PMC6977717 DOI: 10.1371/journal.pbio.3000583] [Citation(s) in RCA: 21] [Impact Index Per Article: 5.3] [Received: 05/29/2019] [Accepted: 12/19/2019] [Indexed: 12/19/2022] Open
Abstract
We present Knowledge Engine for Genomics (KnowEnG), a free-to-use computational system for analysis of genomics data sets, designed to accelerate biomedical discovery. It includes tools for popular bioinformatics tasks such as gene prioritization, sample clustering, gene set analysis, and expression signature analysis. The system specializes in "knowledge-guided" data mining and machine learning algorithms, in which user-provided data are analyzed in light of prior information about genes, aggregated from numerous knowledge bases and encoded in a massive "Knowledge Network." KnowEnG adheres to "FAIR" principles (findable, accessible, interoperable, and reuseable): its tools are easily portable to diverse computing environments, run on the cloud for scalable and cost-effective execution, and are interoperable with other computing platforms. The analysis tools are made available through multiple access modes, including a web portal with specialized visualization modules. We demonstrate the KnowEnG system's potential value in democratization of advanced tools for the modern genomics era through several case studies that use its tools to recreate and expand upon the published analysis of cancer data sets.
Affiliation(s)
- Charles Blatti
  - Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, Illinois, United States of America
- Amin Emad
  - Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, Illinois, United States of America
  - Department of Electrical and Computer Engineering, McGill University, Montreal, Canada
- Matthew J. Berry
  - National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign, Urbana, Illinois, United States of America
- Lisa Gatzke
  - National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign, Urbana, Illinois, United States of America
- Milt Epstein
  - Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, Illinois, United States of America
- Daniel Lanier
  - Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, Illinois, United States of America
- Pramod Rizal
  - National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign, Urbana, Illinois, United States of America
- Jing Ge
  - Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, Illinois, United States of America
- Xiaoxia Liao
  - National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign, Urbana, Illinois, United States of America
- Omar Sobh
  - Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, Illinois, United States of America
- Mike Lambert
  - National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign, Urbana, Illinois, United States of America
- Corey S. Post
  - Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, Illinois, United States of America
- Jinfeng Xiao
  - Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, Illinois, United States of America
- Peter Groves
  - National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign, Urbana, Illinois, United States of America
- Aidan T. Epstein
  - Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, Illinois, United States of America
- Xi Chen
  - Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, Illinois, United States of America
- Subhashini Srinivasan
  - Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, Illinois, United States of America
- Erik Lehnert
  - Seven Bridges Genomics, Charlestown, Massachusetts, United States of America
- Krishna R. Kalari
  - Department of Health Sciences Research, Mayo Clinic, Rochester, Minnesota, United States of America
- Liewei Wang
  - Department of Molecular Pharmacology and Experimental Therapeutics, Mayo Clinic, Rochester, Minnesota, United States of America
- Richard M. Weinshilboum
  - Department of Molecular Pharmacology and Experimental Therapeutics, Mayo Clinic, Rochester, Minnesota, United States of America
- Jun S. Song
  - Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, Illinois, United States of America
  - Department of Physics, University of Illinois at Urbana-Champaign, Urbana, Illinois, United States of America
  - Cancer Center at Illinois, University of Illinois at Urbana-Champaign, Urbana, Illinois, United States of America
- C. Victor Jongeneel
  - Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, Illinois, United States of America
- Jiawei Han
  - Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, Illinois, United States of America
  - Cancer Center at Illinois, University of Illinois at Urbana-Champaign, Urbana, Illinois, United States of America
- Umberto Ravaioli
  - Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, Urbana, Illinois, United States of America
- Nahil Sobh
  - Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, Illinois, United States of America
- Colleen B. Bushell
  - National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign, Urbana, Illinois, United States of America
- Saurabh Sinha
  - Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, Illinois, United States of America
  - Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, Illinois, United States of America
  - Cancer Center at Illinois, University of Illinois at Urbana-Champaign, Urbana, Illinois, United States of America
|
26
|
Cowman T, Coşkun M, Grama A, Koyutürk M. Integrated querying and version control of context-specific biological networks. Database (Oxford) 2020; 2020:baaa018. [PMID: 32294194 PMCID: PMC7158887 DOI: 10.1093/database/baaa018] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 08/19/2019] [Revised: 01/13/2020] [Accepted: 02/21/2020] [Indexed: 01/26/2023]
Abstract
MOTIVATION Biomolecular data stored in public databases is increasingly specialized to organisms, context/pathology and tissue type, potentially resulting in significant overhead for analyses. These networks are often specializations of generic interaction sets, presenting opportunities for reducing storage and computational cost. Therefore, it is desirable to develop effective compression and storage techniques, along with efficient algorithms and a flexible query interface capable of operating on compressed data structures. Current graph databases offer varying levels of support for network integration. However, these solutions do not provide efficient methods for the storage and querying of versioned networks. RESULTS We present VerTIoN, a framework consisting of novel data structures and associated query mechanisms for integrated querying of versioned context-specific biological networks. As a use case for our framework, we study network proximity queries in which the user can select and compose a combination of tissue-specific and generic networks. Using our compressed version tree data structure, in conjunction with state-of-the-art numerical techniques, we demonstrate real-time querying of large network databases. CONCLUSION Our results show that it is possible to support flexible queries defined on heterogeneous networks composed at query time while drastically reducing response time for multiple simultaneous queries. The flexibility offered by VerTIoN in composing integrated network versions opens significant new avenues for the utilization of ever increasing volume of context-specific network data in a broad range of biomedical applications. AVAILABILITY AND IMPLEMENTATION VerTIoN is implemented as a C++ library and is available at http://compbio.case.edu/omics/software/vertion and https://github.com/tjcowman/vertion. CONTACT tyler.cowman@case.edu.
Collapse
Affiliation(s)
- Tyler Cowman
- Department of Computer and Data Sciences, Case Western Reserve University, Cleveland, OH 44106, USA
| | - Mustafa Coşkun
- Department of Computer Engineering, Abdullah Gül University, Kayseri 38080, Turkey
| | - Ananth Grama
- Department of Computer Science, Purdue University, West Lafayette, IN 47906, USA
| | - Mehmet Koyutürk
- Department of Computer and Data Sciences, Case Western Reserve University, Cleveland, OH 44106, USA
- Center for Proteomics and Bioinformatics, Case Western Reserve University, Cleveland, OH 44106, USA
| |
|
27
|
Biological Network Approaches and Applications in Rare Disease Studies. Genes (Basel) 2019; 10:797. [PMID: 31614842 PMCID: PMC6827097 DOI: 10.3390/genes10100797] [Citation(s) in RCA: 25] [Impact Index Per Article: 5.0] [Received: 09/03/2019] [Revised: 09/30/2019] [Accepted: 10/10/2019] [Indexed: 12/26/2022] Open
Abstract
Network biology has the capability to integrate, represent, interpret, and model complex biological systems by collectively accommodating biological omics data, biological interactions and associations, graph theory, statistical measures, and visualizations. Biological networks have recently been shown to be very useful for studies that decipher biological mechanisms and disease etiologies and for studies that predict therapeutic responses, at both the molecular and system levels. In this review, we briefly summarize the general framework of biological network studies, including data resources, network construction methods, statistical measures, network topological properties, and visualization tools. We also introduce several recent biological network applications and methods for the studies of rare diseases.
|
28
|
Zolotareva O, Kleine M. A Survey of Gene Prioritization Tools for Mendelian and Complex Human Diseases. J Integr Bioinform 2019; 16:/j/jib.ahead-of-print/jib-2018-0069/jib-2018-0069.xml. [PMID: 31494632 PMCID: PMC7074139 DOI: 10.1515/jib-2018-0069] [Citation(s) in RCA: 21] [Impact Index Per Article: 4.2] [Received: 09/06/2018] [Accepted: 07/12/2019] [Indexed: 12/16/2022] Open
Abstract
Modern high-throughput experiments provide us with numerous potential associations between genes and diseases. Experimental validation of all the discovered associations, let alone all the possible interactions between them, is time-consuming and expensive. To facilitate the discovery of causative genes, various approaches for prioritization of genes according to their relevance for a given disease have been developed. In this article, we explain the gene prioritization problem and provide an overview of computational tools for gene prioritization. Among about a hundred published gene prioritization tools, we select and briefly describe the 14 most up-to-date and user-friendly. We also discuss the advantages and disadvantages of existing tools, the challenges of their validation, and the directions for future research.
Affiliation(s)
- Olga Zolotareva
- Bielefeld University, Faculty of Technology and Center for Biotechnology, International Research Training Group "Computational Methods for the Analysis of the Diversity and Dynamics of Genomes" and Genome Informatics, Universitätsstraße 25, Bielefeld, Germany
| | - Maren Kleine
- Bielefeld University, Faculty of Technology, Bioinformatics/Medical Informatics Department, Universitätsstraße 25, Bielefeld, Germany
| |
|
29
|
McGillivray P, Clarke D, Meyerson W, Zhang J, Lee D, Gu M, Kumar S, Zhou H, Gerstein M. Network Analysis as a Grand Unifier in Biomedical Data Science. Annu Rev Biomed Data Sci 2018. [DOI: 10.1146/annurev-biodatasci-080917-013444] [Citation(s) in RCA: 25] [Impact Index Per Article: 4.2] [Indexed: 11/09/2022]
Abstract
Biomedical data scientists study many types of networks, ranging from those formed by neurons to those created by molecular interactions. People often criticize these networks as uninterpretable diagrams termed hairballs; however, here we show that molecular biological networks can be interpreted in several straightforward ways. First, we can break down a network into smaller components, focusing on individual pathways and modules. Second, we can compute global statistics describing the network as a whole. Third, we can compare networks. These comparisons can be within the same context (e.g., between two gene regulatory networks) or cross-disciplinary (e.g., between regulatory networks and governmental hierarchies). The latter comparisons can transfer a formalism, such as that for Markov chains, from one context to another or relate our intuitions in a familiar setting (e.g., social networks) to the relatively unfamiliar molecular context. Finally, key aspects of molecular networks are dynamics and evolution, i.e., how they evolve over time and how genetic variants affect them. By studying the relationships between variants in networks, we can begin to interpret many common diseases, such as cancer and heart disease.
Affiliation(s)
- Patrick McGillivray
- Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut 06520, USA
| | - Declan Clarke
- Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut 06520, USA
| | - William Meyerson
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, Connecticut 06520, USA
| | - Jing Zhang
- Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut 06520, USA
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, Connecticut 06520, USA
| | - Donghoon Lee
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, Connecticut 06520, USA
| | - Mengting Gu
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, Connecticut 06520, USA
- Department of Computer Science, Yale University, New Haven, Connecticut 06520, USA
| | - Sushant Kumar
- Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut 06520, USA
| | - Holly Zhou
- Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut 06520, USA
| | - Mark Gerstein
- Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut 06520, USA
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, Connecticut 06520, USA
- Department of Computer Science, Yale University, New Haven, Connecticut 06520, USA
| |
|
30
|
Maxwell S, Chance MR, Koyutürk M. Linearity of network proximity measures: implications for set-based queries and significance testing. Bioinformatics 2018; 33:1354-1361. [PMID: 28453667 DOI: 10.1093/bioinformatics/btw733] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Received: 07/16/2016] [Accepted: 11/17/2016] [Indexed: 12/18/2022] Open
Abstract
Motivation: In recent years, various network proximity measures have been proposed to facilitate the use of biomolecular interaction data in a broad range of applications. These applications include functional annotation, disease gene prioritization, comparative analysis of biological systems and prediction of new interactions. In such applications, a major task is the scoring or ranking of the nodes in the network in terms of their proximity to a given set of 'seed' nodes (e.g. a group of proteins that are identified to be associated with a disease, or are differentially expressed in a certain condition). Many different network proximity measures are utilized for this purpose, and these measures are quite diverse in terms of the benefits they offer. Results: We propose a unifying framework for characterizing network proximity measures for set-based queries. We observe that many existing measures are linear, in that the proximity of a node to a set of nodes can be represented as an aggregation of its proximity to the individual nodes in the set. Based on this observation, we propose methods for processing of set-based proximity queries that take advantage of sparse local proximity information. In addition, we provide an analytical framework for characterizing the distribution of proximity scores based on reference models that accurately capture the characteristics of the seed set (e.g. degree distribution and biological function). The resulting framework facilitates computation of exact figures for the statistical significance of network proximity scores, enabling assessment of the accuracy of Monte Carlo simulation based estimation methods. Availability and Implementation: Implementations of the methods in this paper are available at https://bioengine.case.edu/crosstalker which includes a robust visualization for results viewing. Contact: stm@case.edu or mxk331@case.edu. Supplementary information: Supplementary data are available at Bioinformatics online.
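The linearity property described in this abstract can be illustrated with a small sketch. It assumes random walk with restart as the proximity measure, a hypothetical four-node toy graph, and a restart probability of 0.3; the function names (`rwr_matrix`, `rwr_iterative`, `set_proximity`) are illustrative and are not taken from the authors' implementation.

```python
import numpy as np


def column_normalize(adj):
    """Column-stochastic transition matrix (assumes no isolated nodes)."""
    adj = np.asarray(adj, dtype=float)
    return adj / adj.sum(axis=0)


def rwr_matrix(adj, restart=0.3):
    """Closed-form proximity matrix: column j holds the scores for single seed j."""
    W = column_normalize(adj)
    n = W.shape[0]
    return restart * np.linalg.inv(np.eye(n) - (1.0 - restart) * W)


def rwr_iterative(adj, restart_vec, restart=0.3, n_iter=200):
    """Random walk with restart computed by fixed-point iteration."""
    W = column_normalize(adj)
    p = restart_vec.copy()
    for _ in range(n_iter):
        p = (1.0 - restart) * (W @ p) + restart * restart_vec
    return p


def set_proximity(P, seeds):
    """Set-based query answered by averaging precomputed single-seed columns."""
    return P[:, seeds].mean(axis=1)


# Hypothetical toy network: edges 0-1, 1-2, 1-3, 2-3.
ADJ = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 1],
    [0, 1, 0, 1],
    [0, 1, 1, 0],
])
seeds = [0, 2]

# Direct query: restart uniformly over the seed set.
r = np.zeros(4)
r[seeds] = 1.0 / len(seeds)
direct = rwr_iterative(ADJ, r)

# Linear aggregation of per-seed proximities.
aggregated = set_proximity(rwr_matrix(ADJ), seeds)

print(np.allclose(direct, aggregated))
```

Because the single-seed columns can be stored once, a set-based query under this assumption reduces to averaging a few columns rather than rerunning the walk.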
Affiliation(s)
| | - Mark R Chance
- Center for Proteomics and Bioinformatics; Department of Nutrition
| | - Mehmet Koyutürk
- Center for Proteomics and Bioinformatics; Department of Electrical Engineering and Computer Science, Case Western Reserve University, Cleveland, OH 44106, USA
| |
|
31
|
Lin L, Yang T, Fang L, Yang J, Yang F, Zhao J. Gene gravity-like algorithm for disease gene prediction based on phenotype-specific network. BMC SYSTEMS BIOLOGY 2017; 11:121. [PMID: 29212543 PMCID: PMC5718078 DOI: 10.1186/s12918-017-0519-9] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Received: 05/10/2017] [Accepted: 11/24/2017] [Indexed: 01/24/2023]
Abstract
Background: Polygenic diseases are usually caused by the dysfunction of multiple genes. Unravelling such disease genes is crucial to fully understand the genetic landscape of diseases at the molecular level. With the advent of the ‘omic’ data era, network-based methods have prominently boosted disease gene discovery. However, how to make better use of different types of data for the prediction of disease genes remains a challenge. Results: In this study, we improved the performance of disease gene prediction by integrating the similarity of disease phenotype, biological function and network topology. First, for each phenotype, a phenotype-specific network was constructed by mapping the phenotype similarity information of the given phenotype onto the protein-protein interaction (PPI) network. Then, we developed a gene gravity-like algorithm to score candidate genes based on not only topological similarity but also functional similarity. We tested the proposed network and algorithm by conducting leave-one-out and leave-10%-out cross validation and compared them with state-of-the-art algorithms. The results favored the phenotype-specific network as well as the gene gravity-like algorithm. Finally, we tested the predictive capacity of the proposed algorithms on a test gene set derived from the DisGeNET database. Also, potential disease genes of three polygenic diseases, obesity, prostate cancer and lung cancer, were predicted by the proposed methods. We found that the predicted disease genes are highly consistent with literature and database evidence. Conclusions: The good performance of phenotype-specific networks indicates that phenotype similarity information has a positive effect on the prediction of disease genes. The proposed gene gravity-like algorithm outperforms the Random Walk with Restart (RWR) algorithm, indicating the predictive capacity gained by combining topological similarity with functional similarity. Our work offers insight into the discovery of disease genes by fusing multiple similarities of genes and diseases.
Affiliation(s)
- Limei Lin
- Department of Mathematics, Army Logistics University of PLA, Chongqing, China
| | - Tinghong Yang
- Department of Mathematics, Army Logistics University of PLA, Chongqing, China
| | - Ling Fang
- Department of Mathematics, Army Logistics University of PLA, Chongqing, China
| | - Jian Yang
- School of Pharmacy, Second Military Medical University, Shanghai, China
| | - Fan Yang
- Department of Mathematics, Army Logistics University of PLA, Chongqing, China
| | - Jing Zhao
- Institute of Interdisciplinary Complex Research, Shanghai University of Traditional Chinese Medicine, Shanghai, China.
| |
|
32
|
Huang R, He Y, Sun B, Liu B. Bioinformatic Analysis Identifies Three Potentially Key Differentially Expressed Genes in Peripheral Blood Mononuclear Cells of Patients with Takayasu's Arteritis. CELL JOURNAL 2017; 19:647-653. [PMID: 29105401 PMCID: PMC5672105 DOI: 10.22074/cellj.2018.4991] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/29/2016] [Accepted: 05/09/2017] [Indexed: 11/29/2022]
Abstract
Objective: This study aimed to identify several potentially key genes associated with the pathogenesis of Takayasu’s arteritis (TA). This identification may lead to a deeper mechanistic understanding of TA etiology and pave the way for potential therapeutic approaches.
Materials and Methods: In this experimental study, the microarray dataset GSE33910, which includes expression data for peripheral blood mononuclear cell (PBMC) samples isolated from TA patients and normal volunteers, was downloaded from the publicly accessible Gene Expression Omnibus (GEO) database. Differentially expressed genes (DEGs) were identified in PBMCs of TA patients compared with normal controls. Gene ontology (GO) enrichment analysis of DEGs and analysis of protein-protein interaction (PPI) network were carried out. Several hub proteins were extracted from the PPI network based on node degrees and random walk algorithm. Additionally, transcription factors (TFs) were predicted and the corresponding regulatory network was constructed.
Results: A total of 932 DEGs (372 up- and 560 down-regulated genes) were identified in PBMCs from TA patients. Interestingly, up-regulated and down-regulated genes were involved in different GO terms and pathways. A PPI network of proteins encoded by DEGs was constructed and RHOA, FOS, EGR1, and GNB1 were considered to be hub proteins with both higher random walk score and node degree. A total of 13 TFs were predicted to be differentially expressed. A total of 49 DEGs had been reported to be associated with TA in the Comparative Toxicogenomics Database (CTD). The only TA marker gene in the CTD database was NOS2, confirmed by three studies. However, NOS2 was not significantly altered in the analyzed microarray dataset. Nevertheless, NOS3 was a significantly down-regulated gene and was involved in the platelet activation pathway.
Conclusion: RHOA, FOS, and EGR1 are potential candidate genes for the diagnosis and therapy of TA.
Affiliation(s)
- Renping Huang
- Department of General Surgery, The First Affiliated Hospital of Harbin Medical University, Harbin, China
| | - Yang He
- Department of Anesthesiology, The First Affiliated Hospital of Harbin Medical University, Harbin, China
| | - Bei Sun
- Department of Pancreatic and Biliary Surgery, The First Affiliated Hospital of Harbin Medical University, Harbin, China
| | - Bing Liu
- Department of General Surgery, The First Affiliated Hospital of Harbin Medical University, Harbin, China.
| |
|
33
|
Grewal N, Singh S, Chand T. Effect of Aggregation Operators on Network-Based Disease Gene Prioritization: A Case Study on Blood Disorders. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2017; 14:1276-1287. [PMID: 29220322 DOI: 10.1109/tcbb.2016.2599155] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Indexed: 06/07/2023]
Abstract
Owing to the innate noise in biological data sources, a single source or a single measure does not suffice for effective disease gene prioritization. The integration of multiple data sources or the aggregation of multiple measures is therefore needed. Aggregation operators combine multiple related data values into a single value such that the combined value reflects all the individual values. In this paper, we apply fuzzy aggregation to network-based disease gene prioritization and investigate its effect under noise conditions. The study was conducted for a set of 15 blood disorders by fusing four different network measures, computed from the protein interaction network, using a selected set of aggregation operators and ranking the genes on the basis of the aggregated value. The aggregation operator-based rankings were compared with the "Random walk with restart" gene prioritization method. The impact of noise was also investigated by adding varying proportions of noise to the seed set. The results reveal that, for all the selected blood disorders, the Mean of Maximal operator outperformed the other aggregation operators for noisy as well as non-noisy data.
|
34
|
Abstract
Biological networks are powerful resources for the discovery of genes and genetic modules that drive disease. Fundamental to network analysis is the concept that genes underlying the same phenotype tend to interact; this principle can be used to combine and to amplify signals from individual genes. Recently, numerous bioinformatic techniques have been proposed for genetic analysis using networks, based on random walks, information diffusion and electrical resistance. These approaches have been applied successfully to identify disease genes, genetic modules and drug targets. In fact, all these approaches are variations of a unifying mathematical machinery - network propagation - suggesting that it is a powerful data transformation method of broad utility in genetic research.
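Network propagation as summarized above can be sketched in a few lines. The example assumes one common formulation, the iteration F ← αWF + (1 − α)Y with a symmetrically normalized adjacency matrix W and prior vector Y; the toy path graph, the value α = 0.5, and the function name `propagate` are hypothetical choices for illustration only.

```python
import numpy as np


def propagate(adj, prior, alpha=0.5, tol=1e-12, max_iter=1000):
    """Iterate F <- alpha * W @ F + (1 - alpha) * prior until convergence.

    W is the symmetrically normalized adjacency D^{-1/2} A D^{-1/2};
    every node is assumed to have at least one edge.
    """
    adj = np.asarray(adj, dtype=float)
    inv_sqrt_deg = 1.0 / np.sqrt(adj.sum(axis=1))
    W = adj * inv_sqrt_deg[:, None] * inv_sqrt_deg[None, :]
    prior = np.asarray(prior, dtype=float)
    scores = prior.copy()
    for _ in range(max_iter):
        updated = alpha * (W @ scores) + (1.0 - alpha) * prior
        if np.abs(updated - scores).max() < tol:
            return updated
        scores = updated
    return scores


# Hypothetical toy network: a path 0-1-2-3-4 with prior signal only on gene 0.
PATH = np.array([
    [0, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [0, 0, 0, 1, 0],
])
prior = np.array([1.0, 0.0, 0.0, 0.0, 0.0])
scores = propagate(PATH, prior)

# Genes closer to the seed receive more of the propagated signal.
print(np.round(scores, 4))
```

In this sketch the genes with no prior signal still receive non-zero scores, which is the amplification effect the abstract refers to.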
|
35
|
Ruffalo M, Koyutürk M, Sharan R. Network-Based Integration of Disparate Omic Data To Identify "Silent Players" in Cancer. PLoS Comput Biol 2015; 11:e1004595. [PMID: 26683094 PMCID: PMC4684294 DOI: 10.1371/journal.pcbi.1004595] [Citation(s) in RCA: 54] [Impact Index Per Article: 6.0] [Received: 05/11/2015] [Accepted: 10/12/2015] [Indexed: 11/18/2022] Open
Abstract
Development of high-throughput monitoring technologies enables interrogation of cancer samples at various levels of cellular activity. Capitalizing on these developments, various public efforts such as The Cancer Genome Atlas (TCGA) generate disparate omic data for large patient cohorts. As demonstrated by recent studies, these heterogeneous data sources provide the opportunity to gain insights into the molecular changes that drive cancer pathogenesis and progression. However, these insights are limited by the vast search space and as a result low statistical power to make new discoveries. In this paper, we propose methods for integrating disparate omic data using molecular interaction networks, with a view to gaining mechanistic insights into the relationship between molecular changes at different levels of cellular activity. Namely, we hypothesize that genes that play a role in cancer development and progression may be implicated by neither frequent mutation nor differential expression, and that network-based integration of mutation and differential expression data can reveal these “silent players”. For this purpose, we utilize network-propagation algorithms to simulate the information flow in the cell at a sample-specific resolution. We then use the propagated mutation and expression signals to identify genes that are not necessarily mutated or differentially expressed genes, but have an essential role in tumor development and patient outcome. We test the proposed method on breast cancer and glioblastoma multiforme data obtained from TCGA. Our results show that the proposed method can identify important proteins that are not readily revealed by molecular data, providing insights beyond what can be gleaned by analyzing different types of molecular data in isolation.
Identification of cancer-related genes is an important task, made more difficult by heterogeneity between samples and even within individual patients. Methods for identifying disease-related genes typically focus on individual data sets such as mutational and differential expression data, and therefore are limited to genes that are implicated by each data set in isolation. In this work we propose a method that uses protein interaction network information to integrate mutational and differential expression data on a sample-specific level, and combine this information across samples in ways that respect the commonalities and differences between distinct mutation and differential expression profiles. We use this information to identify genes that are associated with cancer but not readily identifiable by mutations or differential expression alone. Our method highlights the features that significantly predict a gene’s association with cancer, shows improved predictive power in recovering cancer-related genes in known pathways, and identifies genes that are neither frequently mutated nor differentially expressed but show significant association with survival.
Affiliation(s)
- Matthew Ruffalo
- Department of Electrical Engineering and Computer Science, Case Western Reserve University, Cleveland, Ohio, United States of America
| | - Mehmet Koyutürk
- Department of Electrical Engineering and Computer Science, Case Western Reserve University, Cleveland, Ohio, United States of America
- Center for Proteomics and Bioinformatics, Case Western Reserve University, Cleveland, Ohio, United States of America
| | - Roded Sharan
- School of Computer Science, Tel Aviv University, Tel Aviv, Israel
| |
|
36
|
Ma C, Chen Y, Wilkins D, Chen X, Zhang J. An unsupervised learning approach to find ovarian cancer genes through integration of biological data. BMC Genomics 2015; 16 Suppl 9:S3. [PMID: 26328548 PMCID: PMC4547402 DOI: 10.1186/1471-2164-16-s9-s3] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Indexed: 11/10/2022] Open
Abstract
Cancer is a disease characterized largely by the accumulation of out-of-control somatic mutations during the lifetime of a patient. Distinguishing driver mutations from passenger mutations has posed a challenge in modern cancer research. With the advanced development of microarray experiments and clinical studies, a large number of candidate cancer genes have been extracted, and distinguishing the informative genes among them is essential. We propose a framework that finds informative cancer genes using mutation data from ovarian cancers. In our model we utilize the patient gene mutation profile, gene expression data and a gene-gene interaction network to construct a graphical representation of genes and patients. Markov processes for mutations and patients are triggered separately. After this process, cancer genes are prioritized automatically by examining their scores at their stationary distributions in the eigenvector. Extensive experiments demonstrate that the integration of heterogeneous sources of information is essential in finding important cancer genes.
|
37
|
Browne F, Wang H, Zheng H. A computational framework for the prioritization of disease-gene candidates. BMC Genomics 2015; 16 Suppl 9:S2. [PMID: 26330267 PMCID: PMC4547404 DOI: 10.1186/1471-2164-16-s9-s2] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.0] [Indexed: 12/18/2022] Open
Abstract
Background: The identification of genes and uncovering the role they play in diseases is an important and complex challenge. Genome-wide linkage and association studies have made advancements in identifying genetic variants that underpin human disease. An important challenge now is to identify meaningful disease-associated genes from a long list of candidate genes implicated by these analyses. The application of gene prioritization can enhance our understanding of disease mechanisms and aid in the discovery of drug targets. The integration of protein-protein interaction networks along with disease datasets and contextual information is an important tool in unraveling the molecular basis of diseases. Results: In this paper we propose a computational pipeline for the prioritization of disease-gene candidates. Diverse heterogeneous data are integrated, including gene expression, protein-protein interaction networks, ontology-based similarity, topological measures and tissue-specific data. The pipeline was applied to prioritize Alzheimer's Disease (AD) genes, whereby a list of 32 prioritized genes was generated. This approach correctly identified key AD susceptible genes: PSEN1 and TRAF1. Biological process enrichment analysis revealed that the prioritized genes are modulated in AD pathogenesis, including regulation of neurogenesis and generation of neurons. Relatively high predictive performance (AUC: 0.70) was observed when classifying AD and normal gene expression profiles from individuals using leave-one-out cross validation. Conclusions: This work provides a foundation for future investigation of diverse heterogeneous data integration for disease-gene prioritization.
|
38
|
ProSim: A Method for Prioritizing Disease Genes Based on Protein Proximity and Disease Similarity. BIOMED RESEARCH INTERNATIONAL 2015; 2015:213750. [PMID: 26339594 PMCID: PMC4538409 DOI: 10.1155/2015/213750] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Received: 12/15/2014] [Accepted: 01/16/2015] [Indexed: 01/19/2023]
Abstract
Predicting disease genes for a particular genetic disease is very challenging in bioinformatics. Based on current research studies, this challenge can be tackled via network-based approaches. Furthermore, it has been highlighted that it is necessary to consider disease similarity along with the protein's proximity to disease genes in a protein-protein interaction (PPI) network in order to improve the accuracy of disease gene prioritization. In this study we propose a new algorithm called proximity disease similarity algorithm (ProSim), which takes both of the aforementioned properties into consideration, to prioritize disease genes. To illustrate the proposed algorithm, we have conducted six case studies, namely, prostate cancer, Alzheimer's disease, diabetes mellitus type 2, breast cancer, colorectal cancer, and lung cancer. We employed leave-one-out cross validation, mean enrichment, tenfold cross validation, and ROC curves to evaluate our proposed method and other existing methods. The results show that our proposed method outperforms existing methods such as PRINCE, RWR, and DADA.
|
39
|
Ayati M, Erten S, Chance MR, Koyutürk M. MOBAS: identification of disease-associated protein subnetworks using modularity-based scoring. EURASIP JOURNAL ON BIOINFORMATICS & SYSTEMS BIOLOGY 2015; 2015:7. [PMID: 28194175 PMCID: PMC5270451 DOI: 10.1186/s13637-015-0025-6] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.4] [Received: 02/06/2015] [Accepted: 06/02/2015] [Indexed: 11/23/2022]
Abstract
Network-based analyses are commonly used as powerful tools to interpret the findings of genome-wide association studies (GWAS) in a functional context. In particular, identification of disease-associated functional modules, i.e., highly connected protein-protein interaction (PPI) subnetworks with high aggregate disease association, is shown to be promising in uncovering the functional relationships among genes and proteins associated with diseases. An important issue in this regard is the scoring of subnetworks by integrating two quantities: disease association of individual gene products and network connectivity among proteins. Current scoring schemes either disregard the level of connectivity and focus on the aggregate disease association of connected proteins or use a linear combination of these two quantities. However, such scoring schemes may produce arbitrarily large subnetworks which are often not statistically significant or require tuning of parameters that are used to weigh the contributions of network connectivity and disease association. Here, we propose a parameter-free scoring scheme that aims to score subnetworks by assessing the disease association of interactions between pairs of gene products. We also incorporate the statistical significance of network connectivity and disease association into the scoring function. We test the proposed scoring scheme on a GWAS dataset for two complex diseases, type II diabetes (T2D) and psoriasis (PS). Our results suggest that subnetworks identified by commonly used methods may fail tests of statistical significance after correction for multiple hypothesis testing. In contrast, the proposed scoring scheme yields highly significant subnetworks, which contain biologically relevant proteins that cannot be identified by analysis of genome-wide association data alone. We also show that the proposed scoring scheme identifies subnetworks that are reproducible across different cohorts, and it can robustly recover relevant subnetworks at lower sampling rates.
Affiliation(s)
- Marzieh Ayati
- Department of Electrical Engineering and Computer Science, Case Western Reserve University, 10900 Euclid Ave., Cleveland, 44106 OH USA
| | - Sinan Erten
- Department of Electrical Engineering and Computer Science, Case Western Reserve University, 10900 Euclid Ave., Cleveland, 44106 OH USA
| | - Mark R Chance
- Center for Proteomics and Bioinformatics, Case Western Reserve University, 10900 Euclid Ave., Cleveland, 44106 OH USA
| | - Mehmet Koyutürk
- Department of Electrical Engineering and Computer Science, Case Western Reserve University, 10900 Euclid Ave., Cleveland, 44106 OH USA; Center for Proteomics and Bioinformatics, Case Western Reserve University, 10900 Euclid Ave., Cleveland, 44106 OH USA
| |
|
40
|
Fang M, Hu X, Wang Y, Zhao J, Shen X, He T. NDRC: A Disease-Causing Genes Prioritized Method Based on Network Diffusion and Rank Concordance. IEEE Trans Nanobioscience 2015; 14:521-7. [PMID: 26080386 DOI: 10.1109/tnb.2015.2443852] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Indexed: 11/09/2022]
Abstract
Disease-causing gene prioritization is very important for understanding disease mechanisms and for biomedical applications, such as drug design. Previous studies have shown that promising candidate genes are mostly ranked according to their relatedness to known disease genes or closely related disease genes. Therefore, a dangling gene (isolated gene) with no edges in the network cannot be effectively prioritized. These approaches tend to prioritize genes that are highly connected in the PPI network while performing poorly when applied to loosely connected disease genes. To address these problems, we propose a new disease-causing gene prioritization method based on network diffusion and rank concordance (NDRC). The method is evaluated by leave-one-out cross validation on 1931 diseases in which at least one gene is known to be involved, and it is able to rank the true causal gene first in 849 of all 2542 cases. The experimental results suggest that NDRC significantly outperforms other existing methods such as RWR, VAVIEN, DADA and PRINCE at identifying loosely connected disease genes and successfully proposes dangling genes as potential candidate disease genes. Furthermore, we apply the NDRC method to study three representative diseases: Meckel syndrome 1, Protein C deficiency and Peroxisome biogenesis disorder 1A (Zellweger). Our study has also found that certain complex disease-causing genes can be divided into several modules that are closely associated with different disease phenotypes.
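The leave-one-out evaluation mentioned in this abstract can be sketched as follows. This is not the NDRC method: the scorer is a deliberately simple stand-in that counts seed genes among a candidate's direct neighbors, and the toy graph, gene labels, and function names (`neighbor_score`, `loocv_ranks`) are hypothetical.

```python
from typing import Callable, Dict, Set

Graph = Dict[str, Set[str]]


def neighbor_score(graph: Graph, seeds: Set[str], gene: str) -> float:
    """Stand-in scorer (not NDRC): number of seed genes adjacent to `gene`."""
    return float(len(graph.get(gene, set()) & seeds))


def loocv_ranks(
    graph: Graph,
    disease_genes: Set[str],
    scorer: Callable[[Graph, Set[str], str], float] = neighbor_score,
) -> Dict[str, int]:
    """Hold out each known disease gene in turn and record its rank.

    The remaining disease genes act as seeds; every non-seed gene in the
    graph is a candidate. Ties are counted against the held-out gene, so the
    reported rank is pessimistic.
    """
    ranks: Dict[str, int] = {}
    for held_out in sorted(disease_genes):
        seeds = disease_genes - {held_out}
        candidates = [g for g in graph if g not in seeds]
        scores = {g: scorer(graph, seeds, g) for g in candidates}
        target = scores[held_out]
        ranks[held_out] = sum(1 for g in candidates if scores[g] >= target)
    return ranks


# Hypothetical toy interaction network.
GRAPH: Graph = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C", "E"},
    "E": {"D"},
}

ranks = loocv_ranks(GRAPH, {"A", "B", "C"})
top1 = sum(1 for r in ranks.values() if r == 1)
print(ranks, f"ranked first in {top1} of {len(ranks)} cases")
```

Any other scoring function with the same signature, for example one based on network diffusion, could be passed as `scorer` to compare methods under the same protocol.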
|
41
|
Biased random walk model for the prioritization of drug resistance associated proteins. Sci Rep 2015; 5:10857. [PMID: 26039373 PMCID: PMC4454201 DOI: 10.1038/srep10857] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/13/2014] [Accepted: 04/30/2015] [Indexed: 01/07/2023] Open
Abstract
Multi-drug resistance is the main cause of treatment failure in cancer patients. How to identify molecules underlying drug resistance from multi-omics data remains a great challenge. Here, we introduce a data biased strategy, ProteinRank, to prioritize drug-resistance associated proteins in cancer cells. First, we identified differentially expressed proteins in Adriamycin and Vincristine resistant gastric cancer cells compared to their parental cells using iTRAQ combined with LC-MS/MS experiments, and then mapped them to human protein-protein interaction network; second, we applied ProteinRank to analyze the whole network and rank proteins similar to known drug resistance related proteins. Cross validations demonstrated a better performance of ProteinRank compared to the method without usage of MS data. Further validations confirmed the altered expressions or activities of several top ranked proteins. Functional study showed PIM3 or CAV1 silencing was sufficient to reverse the drug resistance phenotype. These results indicated ProteinRank could prioritize key proteins related to drug resistance in gastric cancer and provided important clues for cancer research.
|
42
|
Cao M, Pietras CM, Feng X, Doroschak KJ, Schaffner T, Park J, Zhang H, Cowen LJ, Hescott BJ. New directions for diffusion-based network prediction of protein function: incorporating pathways with confidence. Bioinformatics 2014; 30:i219-27. [PMID: 24931987 PMCID: PMC4058952 DOI: 10.1093/bioinformatics/btu263] [Citation(s) in RCA: 95] [Impact Index Per Article: 9.5] [Reference Citation Analysis] [Abstract] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/31/2022]
Abstract
Motivation: It has long been hypothesized that incorporating models of network noise as well as edge directions and known pathway information into the representation of protein–protein interaction (PPI) networks might improve their utility for functional inference. However, a simple way to do this has not been obvious. We find that diffusion state distance (DSD), our recent diffusion-based metric for measuring dissimilarity in PPI networks, has natural extensions that incorporate confidence, directions and can even express coherent pathways by calculating DSD on an augmented graph. Results: We define three incremental versions of DSD which we term cDSD, caDSD and capDSD, where the capDSD matrix incorporates confidence, known directed edges, and pathways into the measure of how similar each pair of nodes is according to the structure of the PPI network. We test four popular function prediction methods (majority vote, weighted majority vote, multi-way cut and functional flow) using these different matrices on the Baker’s yeast PPI network in cross-validation. The best performing method is weighted majority vote using capDSD. We then test the performance of our augmented DSD methods on an integrated heterogeneous set of protein association edges from the STRING database. The superior performance of capDSD in this context confirms that treating the pathways as probabilistic units is more powerful than simply incorporating pathway edges independently into the network. Availability: All source code for calculating the confidences, for extracting pathway information from KEGG XML files, and for calculating the cDSD, caDSD and capDSD matrices is available from http://dsd.cs.tufts.edu/capdsd Contact: lenore.cowen@tufts.edu or benjamin.hescott@tufts.edu Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Mengfei Cao
- Department of Computer Science, Tufts University, Medford, MA 02155, USA and Department of Computer Science, University of Minnesota, Minneapolis, MN 55455, USA
- Christopher M Pietras
- Department of Computer Science, Tufts University, Medford, MA 02155, USA and Department of Computer Science, University of Minnesota, Minneapolis, MN 55455, USA
- Xian Feng
- Department of Computer Science, Tufts University, Medford, MA 02155, USA and Department of Computer Science, University of Minnesota, Minneapolis, MN 55455, USA
- Kathryn J Doroschak
- Department of Computer Science, Tufts University, Medford, MA 02155, USA and Department of Computer Science, University of Minnesota, Minneapolis, MN 55455, USA
- Thomas Schaffner
- Department of Computer Science, Tufts University, Medford, MA 02155, USA and Department of Computer Science, University of Minnesota, Minneapolis, MN 55455, USA
- Jisoo Park
- Department of Computer Science, Tufts University, Medford, MA 02155, USA and Department of Computer Science, University of Minnesota, Minneapolis, MN 55455, USA
- Hao Zhang
- Department of Computer Science, Tufts University, Medford, MA 02155, USA and Department of Computer Science, University of Minnesota, Minneapolis, MN 55455, USA
- Lenore J Cowen
- Department of Computer Science, Tufts University, Medford, MA 02155, USA and Department of Computer Science, University of Minnesota, Minneapolis, MN 55455, USA
- Benjamin J Hescott
- Department of Computer Science, Tufts University, Medford, MA 02155, USA and Department of Computer Science, University of Minnesota, Minneapolis, MN 55455, USA
|
43
|
Zhang SW, Shao DD, Zhang SY, Wang YB. Prioritization of candidate disease genes by enlarging the seed set and fusing information of the network topology and gene expression. MOLECULAR BIOSYSTEMS 2014; 10:1400-8. [PMID: 24695957 DOI: 10.1039/c3mb70588a] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 01/13/2023]
Abstract
The identification of disease genes is very important not only to provide greater understanding of gene function and the cellular mechanisms which drive human disease, but also to enhance human disease diagnosis and treatment. Recently, high-throughput techniques have been applied to detect dozens or even hundreds of candidate genes. However, experimental approaches to validate the many candidates are usually time-consuming, tedious and expensive, and sometimes lack reproducibility. Therefore, numerous theoretical and computational methods (e.g. network-based approaches) have been developed to prioritize candidate disease genes. Many network-based approaches implicitly utilize the observation that genes causing the same or similar diseases tend to correlate with each other in gene-protein relationship networks. Of these network approaches, the random walk with restart algorithm (RWR) is considered to be a state-of-the-art approach. To further improve the performance of RWR, we propose a novel method named ESFSC to identify disease-related genes, by enlarging the seed set according to the centrality of disease genes in a network and fusing information from the protein-protein interaction (PPI) network topological similarity and the gene expression correlation. The ESFSC algorithm restarts at all of the nodes in the seed set, which consists of the known disease genes and their k-nearest neighbor nodes, then walks in the global network separately guided by the similarity transition matrix constructed from PPI network topological similarity properties and by the correlational transition matrix constructed from the gene expression profiles. As a result, all the genes in the network are ranked by a weighted fusion of the RWR results obtained under the two types of transition matrices. Comprehensive simulation results for 10 diseases with 97 known disease genes collected from the Online Mendelian Inheritance in Man (OMIM) database show that ESFSC outperforms existing methods for prioritizing candidate disease genes. The top prediction results for Alzheimer's disease are consistent with previous literature reports.
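The random walk with restart step that this abstract builds on can be sketched roughly as follows. This is a minimal illustration of generic RWR plus a simple weighted fusion of two score vectors, not the ESFSC implementation: the function names, the restart probability of 0.3, the fusion weight `alpha`, and the toy path graph are all assumptions made for the example.

```python
import numpy as np

def random_walk_with_restart(adjacency, seeds, restart=0.3, tol=1e-10, max_iter=1000):
    """Generic RWR: iterate p <- (1 - r) * W p + r * p0 until convergence.

    adjacency: square non-negative matrix (hypothetical input format).
    seeds: indices of the seed nodes; the walk restarts uniformly among them.
    """
    A = np.asarray(adjacency, dtype=float)
    col_sums = A.sum(axis=0)
    col_sums[col_sums == 0] = 1.0          # avoid division by zero for isolated nodes
    W = A / col_sums                       # column-normalised transition matrix
    p0 = np.zeros(A.shape[0])
    p0[list(seeds)] = 1.0 / len(seeds)     # uniform restart distribution over seeds
    p = p0.copy()
    for _ in range(max_iter):
        p_next = (1.0 - restart) * (W @ p) + restart * p0
        if np.abs(p_next - p).sum() < tol:
            return p_next
        p = p_next
    return p

def fuse_scores(scores_a, scores_b, alpha=0.5):
    """Weighted fusion of two score vectors (alpha is an illustrative assumption)."""
    return alpha * np.asarray(scores_a) + (1.0 - alpha) * np.asarray(scores_b)

# Toy example: a path graph 0 - 1 - 2 - 3 with node 0 as the only seed.
path = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
])
scores = random_walk_with_restart(path, seeds=[0])
ranking = np.argsort(-scores)              # nodes ordered from most to least relevant
```

In this toy graph the scores decay with distance from the seed, which is the behaviour that makes RWR usable as a relevance ranking. In a setting like the one the abstract describes, one would run the walk once per transition matrix and combine the two score vectors with something like `fuse_scores`.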
Affiliation(s)
- Shao-Wu Zhang
- College of Automation, Northwestern Polytechnical University, 710072, Xi'an, China.
|
44
|
Abstract
Bioinformatics aids in the understanding of the biological processes of living beings and the genetic architecture of human diseases. The discovery of disease-related genes improves the diagnosis and therapy design for the disease. To save the cost and time involved in the experimental verification of the candidate genes, computational methods are employed for ranking the genes according to their likelihood of being associated with the disease. Only top-ranked genes are then verified experimentally. A variety of methods have been conceived by the researchers for the prioritization of the disease candidate genes, which differ in the data source being used or the scoring function used for ranking the genes. A review of various aspects of computational disease gene prioritization and its research issues is presented in this article. The aspects covered are gene prioritization process, data sources used, types of prioritization methods, and performance assessment methods. This article provides a brief overview and acts as a quick guide for disease gene prioritization.
Affiliation(s)
- Nivit Gill
- Punjabi University Regional Centre for IT and Management, Mohali, Punjab, India
|
45
|
Leiserson MDM, Eldridge JV, Ramachandran S, Raphael BJ. Network analysis of GWAS data. Curr Opin Genet Dev 2013; 23:602-10. [PMID: 24287332 PMCID: PMC3867794 DOI: 10.1016/j.gde.2013.09.003] [Citation(s) in RCA: 64] [Impact Index Per Article: 5.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/19/2013] [Revised: 09/19/2013] [Accepted: 09/23/2013] [Indexed: 02/07/2023]
Abstract
Genome-wide association studies (GWAS) identify genetic variants that distinguish a control population from a population with a specific trait. Two challenges in GWAS are: (1) identification of the causal variant within a longer haplotype that is associated with the trait; (2) identification of causal variants for polygenic traits that are caused by variants in multiple genes within a pathway. We review recent methods that use information in protein-protein and protein-DNA interaction networks to address these two challenges.
Affiliation(s)
- Mark D M Leiserson
- Department of Computer Science, Brown University, Providence, RI 02912, United States; Center for Computational Molecular Biology, Brown University, Providence, RI 02912, United States
|
46
|
Li P, Hua X, Zhang Z, Li J, Wang J. Characterization of regulatory features of housekeeping and tissue-specific regulators within tissue regulatory networks. BMC SYSTEMS BIOLOGY 2013; 7:112. [PMID: 24172660 PMCID: PMC3843562 DOI: 10.1186/1752-0509-7-112] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 07/29/2013] [Accepted: 10/28/2013] [Indexed: 01/10/2023]
Abstract
Background: Transcription factors (TFs) and miRNAs are essential for the regulation of gene expression; however, the global view of human gene regulatory networks remains poorly understood. For example, how is the expression of so many genes regulated by limited cohorts of regulators and how are genes differentially expressed in different tissues despite the genetic code being the same in all tissues? Results: We analyzed the network properties of housekeeping and tissue-specific genes in gene regulatory networks from seven human tissues. Our results show that different classes of genes behave quite differently in these networks. Tissue-specific miRNAs show a higher average target number compared with non-tissue specific miRNAs, which indicates that tissue-specific miRNAs tend to regulate different sets of targets. Tissue-specific TFs exhibit higher in-degree, out-degree, cluster coefficient and betweenness values, indicating that they occupy central positions in the regulatory network and that they transfer genetic information from upstream genes to downstream genes more quickly than other TFs. Housekeeping TFs tend to have higher cluster coefficients compared with other genes that are neither housekeeping nor tissue specific, indicating that housekeeping TFs tend to regulate their targets synergistically. Several topological properties of disease-associated miRNAs and genes were found to be significantly different from those of non-disease-associated miRNAs and genes. Conclusions: Tissue-specific miRNAs, TFs and disease genes have particular topological properties within the transcriptional regulatory networks of the seven human tissues examined. The tendency of tissue-specific miRNAs to regulate different sets of genes shows that a particular tissue-specific miRNA and its target gene set may form a regulatory module to execute particular functions in the process of tissue differentiation. The regulatory patterns of tissue-specific TFs reflect their vital role in regulatory networks and their importance to biological functions in their respective tissues. The topological differences between disease and non-disease genes may aid the discovery of new disease genes or drug targets. Determining the network properties of these regulatory factors will help define the basic principles of human gene regulation and the molecular mechanisms of disease.
Affiliation(s)
- Jie Li
- The State Key Laboratory of Pharmaceutical Biotechnology, Jiangsu Engineering Research Center for MicroRNA Biology and Biotechnology, School of Life Science, Nanjing University, Nanjing, China.
|
47
|
Cao M, Zhang H, Park J, Daniels NM, Crovella ME, Cowen LJ, Hescott B. Going the distance for protein function prediction: a new distance metric for protein interaction networks. PLoS One 2013; 8:e76339. [PMID: 24194834 PMCID: PMC3806810 DOI: 10.1371/journal.pone.0076339] [Citation(s) in RCA: 81] [Impact Index Per Article: 7.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/17/2013] [Accepted: 08/23/2013] [Indexed: 01/17/2023] Open
Abstract
In protein-protein interaction (PPI) networks, functional similarity is often inferred based on the function of directly interacting proteins, or more generally, some notion of interaction network proximity among proteins in a local neighborhood. Prior methods typically measure proximity as the shortest-path distance in the network, but this has only a limited ability to capture fine-grained neighborhood distinctions, because most proteins are close to each other, and there are many ties in proximity. We introduce diffusion state distance (DSD), a new metric based on a graph diffusion property, designed to capture finer-grained distinctions in proximity for transfer of functional annotation in PPI networks. We present a tool that, given a PPI network as input, outputs the DSD distances between every pair of proteins. We show that replacing the shortest-path metric with DSD improves the performance of classical function prediction methods across the board.
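To make the idea of a diffusion-based distance concrete, here is a small sketch. It assumes the commonly cited formulation of DSD, which the abstract itself does not spell out: each node is described by the expected number of visits a k-step random walk starting there makes to every other node, and the distance between two nodes is the L1 distance between those visit profiles. The function name, the default of 5 steps, and the toy graph are illustrative choices, and the code is not the authors' tool.

```python
import numpy as np

def diffusion_state_distance(adjacency, steps=5):
    """Pairwise L1 distance between k-step expected-visit profiles.

    Assumes an undirected graph with no isolated nodes (every degree > 0).
    """
    A = np.asarray(adjacency, dtype=float)
    degrees = A.sum(axis=1, keepdims=True)
    P = A / degrees                        # row-stochastic transition matrix
    n = A.shape[0]
    visits = np.zeros((n, n))              # visits[u, v]: expected visits to v from u
    P_power = np.eye(n)                    # P^0, i.e. the walk's starting position
    for _ in range(steps + 1):
        visits += P_power
        P_power = P_power @ P
    # Broadcast to compare every pair of rows, then sum absolute differences.
    return np.abs(visits[:, None, :] - visits[None, :, :]).sum(axis=2)

# Toy example: a path graph 0 - 1 - 2 - 3.
path = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
])
dsd = diffusion_state_distance(path, steps=5)
```

Because the result is a real-valued comparison of whole visit profiles rather than an integer hop count, ties between node pairs are much less likely than with shortest-path distance, which is the motivation the abstract gives.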
Affiliation(s)
- Mengfei Cao
- Department of Computer Science, Tufts University, Medford, Massachusetts, United States of America
- Hao Zhang
- Department of Computer Science, Tufts University, Medford, Massachusetts, United States of America
- Jisoo Park
- Department of Computer Science, Tufts University, Medford, Massachusetts, United States of America
- Noah M. Daniels
- Department of Computer Science, Tufts University, Medford, Massachusetts, United States of America
- Mark E. Crovella
- Department of Computer Science, Boston University, Boston, Massachusetts, United States of America
- Lenore J. Cowen
- Department of Computer Science, Tufts University, Medford, Massachusetts, United States of America
- * E-mail: (LJC); (BH)
- Benjamin Hescott
- Department of Computer Science, Tufts University, Medford, Massachusetts, United States of America
- * E-mail: (LJC); (BH)
|
48
|
Abstract
High-throughput technologies have been applied to investigate the underlying mechanisms of complex diseases, identify disease associations and help to improve treatment. However, it is challenging to derive biological insight from conventional single-gene-based analysis of "omics" data from high-throughput experiments due to sample and patient heterogeneity. To address these challenges, many novel pathway- and network-based approaches have been developed to integrate various "omics" data, such as gene expression, copy number alteration, genome-wide association studies, and interaction data. This review covers recent methodological developments in pathway analysis for the detection of dysregulated interactions and disease-associated subnetworks, prioritization of candidate disease genes, and disease classification. For each application, we also discuss the associated challenges and potential future directions.
|
49
|
Csermely P, Korcsmáros T, Kiss HJM, London G, Nussinov R. Structure and dynamics of molecular networks: a novel paradigm of drug discovery: a comprehensive review. Pharmacol Ther 2013; 138:333-408. [PMID: 23384594 PMCID: PMC3647006 DOI: 10.1016/j.pharmthera.2013.01.016] [Citation(s) in RCA: 512] [Impact Index Per Article: 46.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/22/2013] [Accepted: 01/22/2013] [Indexed: 02/02/2023]
Abstract
Despite considerable progress in genome- and proteome-based high-throughput screening methods and in rational drug design, the increase in approved drugs in the past decade did not match the increase of drug development costs. Network description and analysis not only give a systems-level understanding of drug action and disease complexity, but can also help to improve the efficiency of drug design. We give a comprehensive assessment of the analytical tools of network topology and dynamics. The state-of-the-art use of chemical similarity, protein structure, protein-protein interaction, signaling, genetic interaction and metabolic networks in the discovery of drug targets is summarized. We propose that network targeting follows two basic strategies. The "central hit strategy" selectively targets central nodes/edges of the flexible networks of infectious agents or cancer cells to kill them. The "network influence strategy" works against other diseases, where an efficient reconfiguration of rigid networks needs to be achieved by targeting the neighbors of central nodes/edges. It is shown how network techniques can help in the identification of single-target, edgetic, multi-target and allo-network drug target candidates. We review the recent boom in network methods helping hit identification, lead selection optimizing drug efficacy, as well as minimizing side-effects and drug toxicity. Successful network-based drug development strategies are shown through the examples of infections, cancer, metabolic diseases, neurodegenerative diseases and aging. Summarizing >1200 references we suggest an optimized protocol of network-aided drug development, and provide a list of systems-level hallmarks of drug quality. Finally, we highlight network-related drug development trends helping to achieve these hallmarks by a cohesive, global approach.
Affiliation(s)
- Peter Csermely
- Department of Medical Chemistry, Semmelweis University, P.O. Box 260, H-1444 Budapest 8, Hungary.
|
50
|
Ghersi D, Singh M. Disentangling function from topology to infer the network properties of disease genes. BMC SYSTEMS BIOLOGY 2013; 7:5. [PMID: 23324116 PMCID: PMC3614482 DOI: 10.1186/1752-0509-7-5] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 10/22/2012] [Accepted: 01/04/2013] [Indexed: 12/20/2022]
Abstract
BACKGROUND The topological features of disease genes within interaction networks are the subject of intense study, as they shed light on common mechanisms of pathology and are useful for uncovering additional disease genes. Computational analyses typically try to uncover whether disease genes exhibit distinct network features, as compared to all genes. RESULTS We demonstrate that the functional composition of disease gene sets is an important confounding factor in these types of analyses. We consider five disease sets and show that while they indeed have distinct topological features, they are also enriched in functions that a priori exhibit distinct network properties. To address this, we develop a computational framework to assess the network properties of disease genes based on a sampling algorithm that generates control gene sets that are functionally similar to the disease set. Using our function-constrained sampling approach, we demonstrate that for most of the topological properties studied, disease genes are more similar to sets of genes with similar functional make-up than they are to randomly selected genes; this suggests that these observed differences in topological properties reflect not only the distinguishing network features of disease genes but also their functional composition. Nevertheless, we also highlight many cases where disease genes have distinct topological properties even when accounting for function. CONCLUSIONS Our approach is an important first step in extracting the residual topological differences in disease genes when accounting for function, and leads to new insights into the network properties of disease genes.
Affiliation(s)
- Dario Ghersi
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08540, USA
|