1
Martins S, Coletti R, Lopes MB. Disclosing transcriptomics network-based signatures of glioma heterogeneity using sparse methods. BioData Min 2023; 16:26. [PMID: 37752578 PMCID: PMC10523751 DOI: 10.1186/s13040-023-00341-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/21/2023] [Accepted: 08/13/2023] [Indexed: 09/28/2023] Open
Abstract
Gliomas are primary malignant brain tumors with poor survival and high resistance to available treatments. Improving the molecular understanding of glioma and disclosing novel biomarkers of tumor development and progression could help to find novel targeted therapies for this type of cancer. Public databases such as The Cancer Genome Atlas (TCGA) provide an invaluable source of molecular information on cancer tissues. Machine learning tools show promise in dealing with the high dimension of omics data and extracting relevant information from it. In this work, network inference and clustering methods, namely Joint Graphical lasso and Robust Sparse K-means Clustering, were applied to RNA-sequencing data from TCGA glioma patients to identify shared and distinct gene networks among different types of glioma (glioblastoma, astrocytoma, and oligodendroglioma) and disclose new patient groups and the relevant genes behind groups' separation. The results obtained suggest that astrocytoma and oligodendroglioma have more similarities compared with glioblastoma, highlighting the molecular differences between glioblastoma and the other glioma subtypes. After a comprehensive literature search on the relevant genes pointed out by our analysis, we identified potential candidates for biomarkers of glioma. Further molecular validation of these genes is encouraged to understand their potential role in diagnosis and in the design of novel therapies.
Affiliation(s)
- Sofia Martins
- NOVA School of Science and Technology, NOVA University of Lisbon, Caparica, 2829-516, Portugal
- Roberta Coletti
- Center for Mathematics and Applications (NOVA Math), NOVA School of Science and Technology, Caparica, 2829-516, Portugal.
- Marta B Lopes
- NOVA School of Science and Technology, NOVA University of Lisbon, Caparica, 2829-516, Portugal.
- Center for Mathematics and Applications (NOVA Math), NOVA School of Science and Technology, Caparica, 2829-516, Portugal.
- NOVA Laboratory for Computer Science and Informatics (NOVA LINCS), NOVA School of Science and Technology, Caparica, 2829-516, Portugal.
- UNIDEMI, Department of Mechanical and Industrial Engineering, NOVA School of Science and Technology, Caparica, 2829-516, Portugal.
2
Buck L, Schmidt T, Feist M, Schwarzfischer P, Kube D, Oefner PJ, Zacharias HU, Altenbuchinger M, Dettmer K, Gronwald W, Spang R. Anomaly detection in mixed high-dimensional molecular data. Bioinformatics 2023; 39:btad501. [PMID: 37584673 PMCID: PMC10457663 DOI: 10.1093/bioinformatics/btad501] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/09/2023] [Revised: 07/21/2023] [Accepted: 08/14/2023] [Indexed: 08/17/2023] Open
Abstract
MOTIVATION Mixed molecular data combines continuous and categorical features of the same samples, such as OMICS profiles with genotypes, diagnoses, or patient sex. Like all high-dimensional molecular data, it is prone to incorrect values that can stem from various sources, for example the technical limitations of the measurement devices, errors in the sample preparation, or contamination. Most anomaly detection algorithms identify complete samples as outliers or anomalies. However, in most cases, not all measurements of those samples are erroneous; only a few one-dimensional features within the samples are incorrect. These one-dimensional data errors are continuous measurements that are located either outside or inside the normal ranges of their features but in both cases show atypical values given all other continuous and categorical features in the sample. Additionally, categorical anomalies can occur, for example, when the genotype or diagnosis was submitted wrongly. RESULTS We introduce ADMIRE (Anomaly Detection using MIxed gRaphical modEls), a novel approach for the detection and correction of anomalies in mixed high-dimensional data. Here, we focus on the detection of single (one-dimensional) data errors in the categorical and continuous features of a sample. To this end, the joint distribution of continuous and categorical features is learned by mixed graphical models; anomalies are detected by the difference between measured values and model-based estimates and are corrected using imputation. We evaluated ADMIRE in simulation and by screening for anomalies in one of our own metabolic datasets. In simulation experiments, ADMIRE outperformed the state-of-the-art methods of Local Outlier Factor, stray, and Isolation Forest. AVAILABILITY AND IMPLEMENTATION All data and code are available at https://github.com/spang-lab/adadmire. ADMIRE is implemented in a Python package called adadmire which can be found at https://pypi.org/project/adadmire.
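The idea of flagging a single cell by comparing its measured value with a model-based estimate can be sketched in a few lines. This is only an illustration under stated assumptions: it handles continuous features only and swaps in ordinary least squares for the mixed graphical model, so it is not the ADMIRE algorithm and does not use the adadmire package. The function name `flag_cell_anomalies` and the threshold value are hypothetical.

```python
import numpy as np

def flag_cell_anomalies(X, threshold=5.0):
    """Flag single cells that disagree with a prediction from the other features.

    X is a samples x features array of continuous values. Each feature is
    predicted from all remaining features by ordinary least squares (an
    assumed stand-in for a mixed graphical model), and a cell is flagged
    when its standardized residual exceeds `threshold` in absolute value.
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    flags = np.zeros((n, p), dtype=bool)
    for j in range(p):
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n), others])  # intercept + other features
        coef, *_ = np.linalg.lstsq(design, X[:, j], rcond=None)
        resid = X[:, j] - design @ coef
        z = resid / resid.std()
        flags[:, j] = np.abs(z) > threshold
    return flags

# Toy data: the third feature is almost exactly the sum of the first two.
rng = np.random.default_rng(1)
n = 300
x0 = rng.normal(size=n)
x1 = rng.normal(size=n)
x2 = x0 + x1 + 0.05 * rng.normal(size=n)
X = np.column_stack([x0, x1, x2])

# Inject one error: a shift smaller than the feature's own standard
# deviation, but inconsistent with x0 and x1 in that sample.
X[10, 2] += 1.0

flags = flag_cell_anomalies(X)
print(np.argwhere(flags))
```

Because every feature is regressed on the others, the remaining cells of the affected sample can also be flagged in this naive version; deciding which cell in a row is actually wrong needs a more careful model than this sketch.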
Affiliation(s)
- Lena Buck
- Department of Statistical Bioinformatics, University of Regensburg, 93040 Regensburg, Germany
- Tobias Schmidt
- Department of Statistical Bioinformatics, University of Regensburg, 93040 Regensburg, Germany
- Maren Feist
- Department of Hematology and Medical Oncology, University Medicine Gottingen, 37075 Gottingen, Germany
- Dieter Kube
- Department of Hematology and Medical Oncology, University Medicine Gottingen, 37075 Gottingen, Germany
- Peter J Oefner
- Institute of Functional Genomics, University of Regensburg, 93040 Regensburg, Germany
- Helena U Zacharias
- Peter L. Reichertz Institute for Medical Informatics of TU Braunschweig and Hannover Medical School, Hannover Medical School, 30625 Hannover, Germany
- Michael Altenbuchinger
- Department of Medical Bioinformatics, University Medical Center Göttingen, 37075 Göttingen, Germany
- Katja Dettmer
- Institute of Functional Genomics, University of Regensburg, 93040 Regensburg, Germany
- Wolfram Gronwald
- Institute of Functional Genomics, University of Regensburg, 93040 Regensburg, Germany
- Rainer Spang
- Department of Statistical Bioinformatics, University of Regensburg, 93040 Regensburg, Germany
3
Saint-Antoine M, Singh A. Benchmarking Gene Regulatory Network Inference Methods on Simulated and Experimental Data. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.05.12.540581. [PMID: 37215029 PMCID: PMC10197678 DOI: 10.1101/2023.05.12.540581] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/24/2023]
Abstract
Although the challenge of gene regulatory network inference has been studied for more than a decade, it is still unclear how well network inference methods work when applied to real data. Attempts to benchmark these methods on experimental data have yielded mixed results, in which sometimes even the best methods fail to outperform random guessing, and in other cases they perform reasonably well. So, one of the most valuable contributions one can currently make to the field of network inference is to benchmark methods on experimental data for which the true underlying network is already known, and report the results so that we can get a clearer picture of their efficacy. In this paper, we report results from the first, to our knowledge, benchmarking of network inference methods on single cell E. coli transcriptomic data. We report a moderate level of accuracy for the methods, better than random chance but still far from perfect. We also find that some methods that were quite strong and accurate on microarray and bulk RNA-seq data did not perform as well on the single cell data. Additionally, we benchmark a simple network inference method (Pearson correlation), on data generated through computer simulations in order to draw conclusions about general best practices in network inference studies. We predict that network inference would be more accurate using proteomic data rather than transcriptomic data, which could become relevant if high-throughput proteomic experimental methods are developed in the future. We also show through simulations that using a simplified model of gene expression that skips the mRNA step tends to substantially overestimate the accuracy of network inference methods, and advise against using this model for future in silico benchmarking studies.
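A benchmark of the simple Pearson-correlation method mentioned above can be sketched as follows, assuming a known ground-truth network is available. The sketch treats the network as undirected and scores each gene pair by absolute correlation; the toy data, the function names `pearson_edge_scores` and `edge_auroc`, and the choice of AUROC as the accuracy measure are assumptions for illustration, not details taken from the preprint.

```python
import numpy as np

def pearson_edge_scores(expr):
    """Score every gene pair by |Pearson r|. expr is a cells x genes array."""
    expr = np.asarray(expr, dtype=float)
    scores = np.abs(np.corrcoef(expr, rowvar=False))
    np.fill_diagonal(scores, 0.0)  # ignore self-edges
    return scores

def edge_auroc(scores, truth):
    """AUROC of edge scores against a binary, symmetric ground-truth adjacency.

    Computed as the probability that a true edge outscores a non-edge
    (ties count one half), using each unordered gene pair once.
    """
    iu = np.triu_indices_from(scores, k=1)
    s = scores[iu]
    t = np.asarray(truth)[iu].astype(bool)
    pos, neg = s[t], s[~t]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (pos.size * neg.size)

# Toy data: gene 1 follows gene 0; genes 2 and 3 are independent noise.
rng = np.random.default_rng(0)
n_cells = 200
g0 = rng.normal(size=n_cells)
g1 = g0 + 0.1 * rng.normal(size=n_cells)
g2 = rng.normal(size=n_cells)
g3 = rng.normal(size=n_cells)
expr = np.column_stack([g0, g1, g2, g3])

truth = np.zeros((4, 4), dtype=int)
truth[0, 1] = truth[1, 0] = 1  # the only true edge

scores = pearson_edge_scores(expr)
auc = edge_auroc(scores, truth)
print(auc)
```

An AUROC near 0.5 corresponds to random guessing, which is the baseline the abstract compares methods against.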
Affiliation(s)
- Michael Saint-Antoine
- Center for Bioinformatics and Computational Biology, University of Delaware, Newark, DE USA 19716
- Abhyudai Singh
- Department of Electrical and Computer Engineering, Biomedical Engineering, Mathematical Sciences, Center for Bioinformatics and Computational Biology, University of Delaware, Newark, DE USA 19716
4
Aldirawi H, Morales FG. Univariate and Multivariate Statistical Analysis of Microbiome Data: An Overview. Appl Microbiol 2023. [DOI: 10.3390/applmicrobiol3020023] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/30/2023]
Abstract
Microbiome data is high dimensional, sparse, compositional, and over-dispersed. Therefore, modeling microbiome data is very challenging and it is an active research area. Microbiome analysis has become a progressing area of research as microorganisms constitute a large part of life. Since many methods of microbiome data analysis have been presented, this review summarizes the challenges, methods used, and the advantages and disadvantages of those methods, to serve as an updated guide for those in the field. This review also compared different methods of analysis to progress the development of newer methods.
5
Vásquez AR, Márquez Urbina JU, González Farías G, Escarela G. Controlling the false discovery rate by a Latent Gaussian Copula Knockoff procedure. Comput Stat 2023. [DOI: 10.1007/s00180-023-01346-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/28/2023]
6
Chen Y, Zhang XF, Ou-Yang L. Inferring cancer common and specific gene networks via multi-layer joint graphical model. Comput Struct Biotechnol J 2023; 21:974-990. [PMID: 36733706 PMCID: PMC9873583 DOI: 10.1016/j.csbj.2023.01.017] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/17/2022] [Revised: 01/08/2023] [Accepted: 01/14/2023] [Indexed: 01/19/2023] Open
Abstract
Cancer is a complex disease caused primarily by genetic variants. Reconstructing gene networks within tumors is essential for understanding the functional regulatory mechanisms of carcinogenesis. Advances in high-throughput sequencing technologies have provided tremendous opportunities for inferring gene networks via computational approaches. However, due to the heterogeneity of the same cancer type and the similarities between different cancer types, it remains a challenge to systematically investigate the commonalities and specificities between gene networks of different cancer types, which is a crucial step towards precision cancer diagnosis and treatment. In this study, we propose a new sparse regularized multi-layer decomposition graphical model to jointly estimate the gene networks of multiple cancer types. Our model can handle various types of gene expression data and decomposes each cancer-type-specific network into three components, i.e., globally shared, partially shared and cancer-type-unique components. By identifying the globally and partially shared gene network components, our model can explore the heterogeneous similarities between different cancer types, and our identified cancer-type-unique components can help to reveal the regulatory mechanisms unique to each cancer type. Extensive experiments on synthetic data illustrate the effectiveness of our model in joint estimation of multiple gene networks. We also apply our model to two real data sets to infer the gene networks of multiple cancer subtypes or cell lines. By analyzing our estimated globally shared, partially shared, and cancer-type-unique components, we identified a number of important genes associated with common and specific regulatory mechanisms across different cancer types.
Affiliation(s)
- Yuanxiao Chen
- Guangdong Key Laboratory of Intelligent Information Processing, Shenzhen Key Laboratory of Media Security, and Guangdong Laboratory of Artificial Intelligence and Digital Economy(SZ), Shenzhen University, Shenzhen, China
- Xiao-Fei Zhang
- School of Mathematics and Statistics & Hubei Key Laboratory of Mathematical Sciences, Central China Normal University, Wuhan, China
- Le Ou-Yang
- Guangdong Key Laboratory of Intelligent Information Processing, Shenzhen Key Laboratory of Media Security, and Guangdong Laboratory of Artificial Intelligence and Digital Economy(SZ), Shenzhen University, Shenzhen, China. Corresponding author.
7
Seal S, Li Q, Basner EB, Saba LM, Kechris K. RCFGL: Rapid Condition adaptive Fused Graphical Lasso and application to modeling brain region co-expression networks. PLoS Comput Biol 2023; 19:e1010758. [PMID: 36607897 PMCID: PMC9821764 DOI: 10.1371/journal.pcbi.1010758] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/09/2022] [Accepted: 11/24/2022] [Indexed: 01/07/2023] Open
Abstract
Inferring gene co-expression networks is a useful process for understanding gene regulation and pathway activity. The networks are usually undirected graphs where genes are represented as nodes and an edge represents a significant co-expression relationship. When expression data of multiple (p) genes in multiple (K) conditions (e.g., treatments, tissues, strains) are available, joint estimation of networks harnessing shared information across them can significantly increase the power of analysis. In addition, examining condition-specific patterns of co-expression can provide insights into the underlying cellular processes activated in a particular condition. Condition adaptive fused graphical lasso (CFGL) is an existing method that incorporates condition specificity in a fused graphical lasso (FGL) model for estimating multiple co-expression networks. However, with computational complexity of O(p^2 K log K), the current implementation of CFGL is prohibitively slow even for a moderate number of genes and can only be used for a maximum of three conditions. In this paper, we propose a faster alternative to CFGL named rapid condition adaptive fused graphical lasso (RCFGL). In RCFGL, we incorporate the condition specificity into another popular model for joint network estimation, known as fused multiple graphical lasso (FMGL). We use a more efficient algorithm in the iterative steps compared to CFGL, enabling faster computation with complexity of O(p^2 K) and making it easily generalizable for more than three conditions. We also present a novel screening rule to determine if the full network estimation problem can be broken down into estimation of smaller disjoint sub-networks, thereby reducing the complexity further. We demonstrate the computational advantage and superior performance of our method compared to two non-condition adaptive methods, FGL and FMGL, and one condition adaptive method, CFGL, in both a simulation study and real data analysis. We used RCFGL to jointly estimate the gene co-expression networks in different brain regions (conditions) using a cohort of heterogeneous stock rats. We also provide an accommodating C and Python based package that implements RCFGL.
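For orientation, the fused graphical lasso (FGL) that the abstract refers to is commonly written as the penalized log-likelihood problem below. The notation is an assumption made here rather than something given in the abstract: $\Theta^{(k)}$ is the precision matrix of condition $k$ with entries $\theta^{(k)}_{ij}$, $S^{(k)}$ is the sample covariance matrix, $n_k$ is the sample size, and $\lambda_1, \lambda_2 \ge 0$ are tuning parameters. The condition-adaptive weights that distinguish CFGL and RCFGL are not shown.

```latex
\max_{\{\Theta^{(k)} \succ 0\}} \;
\sum_{k=1}^{K} n_k \left[ \log\det \Theta^{(k)}
  - \operatorname{tr}\!\left( S^{(k)} \Theta^{(k)} \right) \right]
- \lambda_1 \sum_{k=1}^{K} \sum_{i \neq j} \left| \theta^{(k)}_{ij} \right|
- \lambda_2 \sum_{k < k'} \sum_{i,j}
  \left| \theta^{(k)}_{ij} - \theta^{(k')}_{ij} \right|
```

The first penalty encourages sparsity within each network, and the second pulls corresponding entries of different conditions toward each other, which is how information is shared across the K networks.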
Affiliation(s)
- Souvik Seal
- Department of Biostatistics and Informatics, Colorado School of Public Health, University of Colorado Anschutz Medical Campus, Aurora, Colorado, United States of America
- Qunhua Li
- Department of Statistics, Pennsylvania State University, University Park, Pennsylvania, United States of America
- Elle Butler Basner
- Department of Statistics, Pennsylvania State University, University Park, Pennsylvania, United States of America
- Laura M. Saba
- Skaggs School of Pharmacy and Pharmaceutical Sciences, University of Colorado Anschutz Medical Campus, Aurora, Colorado, United States of America
- Katerina Kechris
- Department of Biostatistics and Informatics, Colorado School of Public Health, University of Colorado Anschutz Medical Campus, Aurora, Colorado, United States of America
8
Elgart M, Goodman MO, Isasi C, Chen H, Morrison AC, de Vries PS, Xu H, Manichaikul AW, Guo X, Franceschini N, Psaty BM, Rich SS, Rotter JI, Lloyd-Jones DM, Fornage M, Correa A, Heard-Costa NL, Vasan RS, Hernandez R, Kaplan RC, Redline S, Sofer T. Correlations between complex human phenotypes vary by genetic background, gender, and environment. Cell Rep Med 2022; 3:100844. [PMID: 36513073 PMCID: PMC9797952 DOI: 10.1016/j.xcrm.2022.100844] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 11/10/2021] [Revised: 07/11/2022] [Accepted: 11/09/2022] [Indexed: 12/15/2022]
Abstract
We develop a closed-form Haseman-Elston estimator for genetic and environmental correlation coefficients between complex phenotypes, which we term HEc, that is as precise as GCTA yet ∼20× faster. We estimate genetic and environmental correlations between over 7,000 phenotype pairs in subgroups from the Trans-Omics in Precision Medicine (TOPMed) program. We demonstrate substantial differences in both heritabilities and genetic correlations for multiple phenotypes and phenotype pairs between individuals of self-reported Black, Hispanic/Latino, and White backgrounds. We similarly observe differences in many of the genetic and environmental correlations between genders. To estimate the contribution of genetics to the observed phenotypic correlation, we introduce "fractional genetic correlation" as the fraction of phenotypic correlation explained by genetics. Finally, we quantify the enrichment of correlations between phenotypic domains, each of which is comprised of multiple phenotypes. Altogether, we demonstrate that the observed correlations between complex human phenotypes depend on the genetic background of the individuals, their gender, and their environment.
Affiliation(s)
- Michael Elgart
- Division of Sleep and Circadian Disorders, Brigham and Women's Hospital, Boston, MA, USA; Department of Medicine, Harvard Medical School, Boston, MA, USA.
- Matthew O Goodman
- Division of Sleep and Circadian Disorders, Brigham and Women's Hospital, Boston, MA, USA; Department of Medicine, Harvard Medical School, Boston, MA, USA
- Carmen Isasi
- Department of Epidemiology and Population Health, Albert Einstein College of Medicine, Bronx, NY, USA
- Han Chen
- Human Genetics Center, Department of Epidemiology, Human Genetics, and Environmental Sciences, School of Public Health, The University of Texas Health Science Center at Houston, Houston, TX, USA; Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, USA
- Alanna C Morrison
- Human Genetics Center, Department of Epidemiology, Human Genetics, and Environmental Sciences, School of Public Health, The University of Texas Health Science Center at Houston, Houston, TX, USA
- Paul S de Vries
- Human Genetics Center, Department of Epidemiology, Human Genetics, and Environmental Sciences, School of Public Health, The University of Texas Health Science Center at Houston, Houston, TX, USA
- Huichun Xu
- Department of Medicine, University of Maryland School of Medicine, Baltimore, MD, USA
- Ani W Manichaikul
- Center for Public Health Genomics, University of Virginia, Charlottesville, VA, USA
- Xiuqing Guo
- The Institute for Translational Genomics and Population Sciences, Department of Pediatrics, The Lundquist Institute for Biomedical Innovation at Harbor-UCLA Medical Center, Torrance, CA, USA
- Nora Franceschini
- Department of Epidemiology, University of North Carolina, Chapel Hill, NC, USA
- Bruce M Psaty
- Cardiovascular Health Research Unit, Departments of Medicine, Epidemiology, and Health Services, University of Washington, Seattle, WA, USA
- Stephen S Rich
- Center for Public Health Genomics, University of Virginia School of Medicine, Charlottesville, VA, USA
- Jerome I Rotter
- The Institute for Translational Genomics and Population Sciences, Department of Pediatrics, The Lundquist Institute for Biomedical Innovation at Harbor-UCLA Medical Center, Torrance, CA, USA
- Myriam Fornage
- Human Genetics Center, Department of Epidemiology, Human Genetics, and Environmental Sciences, School of Public Health, The University of Texas Health Science Center at Houston, Houston, TX, USA; Brown Foundation Institute of Molecular Medicine, McGovern Medical School, University of Texas Health Science Center at Houston, Houston, TX, USA
- Adolfo Correa
- Department of Population Health Science, University of Mississippi Medical Center, Jackson, MS, USA
- Nancy L Heard-Costa
- Boston University and National Heart Lung and Blood Institute's Framingham Heart Study, Framingham, MA, USA; Department of Neurology, Boston University School of Medicine, Boston, MA, USA
- Ramachandran S Vasan
- Boston University and National Heart Lung and Blood Institute's Framingham Heart Study, Framingham, MA, USA; Preventive Medicine & Epidemiology, and Cardiovascular Medicine, Medicine, Boston University School of Medicine, and Epidemiology, Boston University School of Public Health, Boston, MA, USA
- Ryan Hernandez
- Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, San Francisco, CA, USA
- Robert C Kaplan
- Department of Epidemiology and Population Health, Albert Einstein College of Medicine, Bronx, NY, USA; Fred Hutchinson Cancer Research Center, Division of Public Health Sciences, Seattle, WA, USA
- Susan Redline
- Division of Sleep and Circadian Disorders, Brigham and Women's Hospital, Boston, MA, USA; Department of Medicine, Harvard Medical School, Boston, MA, USA
- Tamar Sofer
- Division of Sleep and Circadian Disorders, Brigham and Women's Hospital, Boston, MA, USA; Department of Medicine, Harvard Medical School, Boston, MA, USA; Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA.
9
Learning complex dependency structure of gene regulatory networks from high dimensional microarray data with Gaussian Bayesian networks. Sci Rep 2022; 12:18704. [PMID: 36333425 PMCID: PMC9636198 DOI: 10.1038/s41598-022-21957-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/15/2022] [Accepted: 10/06/2022] [Indexed: 11/06/2022] Open
Abstract
Reconstruction of Gene Regulatory Networks (GRNs) from gene expression data with Probabilistic Network Models (PNMs) is an open problem. Gene expression datasets consist of thousands of genes with relatively small sample sizes (i.e. are large-p-small-n). Moreover, dependencies of various orders coexist in the datasets. On the one hand, transcription factor encoding genes act like hubs and regulate target genes; on the other hand, target genes show local dependencies. In the field of Undirected Network Models (UNMs), a subclass of PNMs, the Glasso algorithm has been proposed to deal with high dimensional microarray datasets by forcing sparsity. To overcome the problem of the complex structure of interactions, modifications of the default Glasso algorithm have been developed that integrate the expected dependency structure in the UNMs beforehand. In this work we advocate the use of a simple score-based Hill Climbing algorithm (HC) that learns Gaussian Bayesian networks leaning on directed acyclic graphs. We compare HC with Glasso and variants in the UNM framework based on their capability to reconstruct GRNs from microarray data from the benchmarking synthetic dataset from the DREAM5 challenge and from real-world data from the Escherichia coli genome. We conclude that dependencies in complex data are learned best by the HC algorithm, presenting them most accurately and efficiently, simultaneously modelling strong local and weaker but significant global connections coexisting in the gene expression dataset. The HC algorithm adapts intrinsically to the complex dependency structure of the dataset, without forcing a specific structure in advance.
10
Leng J, Wu LY. Interaction-based transcriptome analysis via differential network inference. Brief Bioinform 2022; 23:6768051. [PMID: 36274239 PMCID: PMC9677477 DOI: 10.1093/bib/bbac466] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/08/2022] [Revised: 09/13/2022] [Accepted: 09/28/2022] [Indexed: 12/14/2022] Open
Abstract
Gene-based transcriptome analysis, such as differential expression analysis, can identify the key factors causing disease production, cell differentiation and other biological processes. However, this is not enough because basic life activities are mainly driven by the interactions between genes. Although there are already many differential network inference methods for identifying differential gene interactions, most studies still use only the information of nodes in the network for downstream analyses. To gain insight into differential gene interactions, we should perform interaction-based transcriptome analysis (IBTA) instead of gene-based analysis after obtaining the differential networks. In this paper, we illustrated a workflow of IBTA by developing a Co-hub Differential Network inference (CDN) algorithm, and a novel interaction-based metric, pivot APC2. We confirmed the superior performance of CDN through simulation experiments compared with other popular differential network inference algorithms. Furthermore, three case studies are given using colorectal cancer, COVID-19 and triple-negative breast cancer datasets to demonstrate the ability of our interaction-based analytical process to uncover causative mechanisms.
Affiliation(s)
- Jiacheng Leng
- IAM, MADIS, NCMIS, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China; School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing 100049, China
- Ling-Yun Wu
- Corresponding author. Ling-Yun Wu, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China; School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing 100049, China.
11
On principal graphical models with application to gene network. Comput Stat Data Anal 2022. [DOI: 10.1016/j.csda.2021.107344] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/20/2022]
12
Bodein A, Scott-Boyer MP, Perin O, Lê Cao KA, Droit A. Interpretation of network-based integration from multi-omics longitudinal data. Nucleic Acids Res 2021; 50:e27. [PMID: 34883510 PMCID: PMC8934642 DOI: 10.1093/nar/gkab1200] [Citation(s) in RCA: 24] [Impact Index Per Article: 8.0] [Received: 08/13/2021] [Revised: 10/19/2021] [Accepted: 11/22/2021] [Indexed: 12/26/2022] Open
Abstract
Multi-omics integration is key to fully understanding complex biological processes in a holistic manner. Furthermore, multi-omics combined with new longitudinal experimental designs can reveal dynamic relationships between omics layers and identify key players or interactions in system development or complex phenotypes. However, integration methods have to address various experimental designs and do not guarantee interpretable biological results. The new challenge of multi-omics integration is to solve interpretation and unlock the hidden knowledge within the multi-omics data. In this paper, we go beyond integration and propose a generic approach to face the interpretation problem. From multi-omics longitudinal data, this approach builds and explores hybrid multi-omics networks composed of both inferred and known relationships within and between omics layers. With smart node labelling and propagation analysis, this approach predicts regulation mechanisms and multi-omics functional modules. We applied the method to three case studies with various multi-omics designs and identified new multi-layer interactions involved in key biological functions that could not be revealed with single omics analysis. Moreover, we highlighted interplay in the kinetics that could help identify novel biological mechanisms. This method is available as an R package netOmics to readily suit any application.
Affiliation(s)
- Antoine Bodein
- Molecular Medicine Department, CHU de Québec Research Center, Université Laval, Québec, QC, Canada
- Marie-Pier Scott-Boyer
- Molecular Medicine Department, CHU de Québec Research Center, Université Laval, Québec, QC, Canada
- Olivier Perin
- Digital Sciences Department, L'Oréal Advanced Research, Aulnay-sous-bois, France
- Kim-Anh Lê Cao
- Melbourne Integrative Genomics, School of Mathematics and Statistics, University of Melbourne, Melbourne, VIC, Australia
- Arnaud Droit
- Molecular Medicine Department, CHU de Québec Research Center, Université Laval, Québec, QC, Canada
13
Abstract
Cancer is a genetic disease in which multiple genes are perturbed. Thus, information about the regulatory relationships between genes is necessary for the identification of biomarkers and therapeutic targets. In this review, methods for inference of gene regulatory networks (GRNs) from transcriptomics data that are used in cancer research are introduced. The methods are classified into three categories according to the analysis model. The first category includes methods that use pair-wise measures between genes, including correlation coefficient and mutual information. The second category includes methods that determine the genetic regulatory relationship using multivariate measures, which consider the expression profiles of all genes concurrently. The third category includes methods using supervised and integrative approaches. The supervised approach estimates the regulatory relationship using a supervised learning method that constructs a regression or classification model for predicting whether there is a regulatory relationship between genes with input data of gene expression profiles and class labels of prior biological knowledge. The integrative method is an expansion of the supervised method and uses more data and biological knowledge for predicting the regulatory relationship. Furthermore, simulation and experimental validation of the estimated GRNs are also discussed in this review. This review identified that most GRN inference methods are not specific for cancer transcriptome data, and such methods are required for better understanding of cancer pathophysiology. In addition, more systematic methods for validation of the estimated GRNs need to be developed in the context of cancer biology.
14
scLink: Inferring Sparse Gene Co-expression Networks from Single-cell Expression Data. GENOMICS PROTEOMICS & BIOINFORMATICS 2021; 19:475-492. [PMID: 34252628 PMCID: PMC8896229 DOI: 10.1016/j.gpb.2020.11.006] [Citation(s) in RCA: 17] [Impact Index Per Article: 5.7] [Received: 08/28/2020] [Revised: 10/23/2020] [Accepted: 12/26/2020] [Indexed: 11/23/2022]
Abstract
A system-level understanding of the regulation and coordination mechanisms of gene expression is essential for studying the complexity of biological processes in health and disease. With the rapid development of single-cell RNA sequencing technologies, it is now possible to investigate gene interactions in a cell type-specific manner. Here we propose the scLink method, which uses statistical network modeling to understand the co-expression relationships among genes and construct sparse gene co-expression networks from single-cell gene expression data. We use both simulation and real data studies to demonstrate the advantages of scLink and its ability to improve single-cell gene network analysis. The scLink R package is available at https://github.com/Vivianstats/scLink.
15
Yi H, Zhang Q, Sun Y, Ma S. Assisted estimation of gene expression graphical models. Genet Epidemiol 2021; 45:372-385. [PMID: 33527531 PMCID: PMC8137544 DOI: 10.1002/gepi.22377] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 08/20/2020] [Revised: 12/16/2020] [Accepted: 12/31/2020] [Indexed: 02/02/2023]
Abstract
In the study of gene expression data, network analysis has played a uniquely important role. To accommodate the high dimensionality and low sample size and to generate interpretable results, regularized estimation is usually conducted in the construction of gene expression Gaussian graphical models (GGMs). Here we use GeO-GGM to denote a gene-expression-only GGM. Gene expressions are regulated by regulators, and gene-expression-regulator GGMs (GeR-GGMs), which accommodate gene expressions as well as their regulators, have been constructed accordingly. In practical data analysis, with a "lack of information" caused by the large number of model parameters, limited sample size, and weak signals, the construction of both GeO-GGMs and GeR-GGMs is often unsatisfactory. In this article, we recognize that, given the regulation between gene expressions and regulators, the sparsity structures of a GeO-GGM and its GeR-GGM counterpart can satisfy a hierarchy. Accordingly, we propose a joint estimation that reinforces the hierarchical structure and uses the construction of a GeO-GGM to assist that of its GeR-GGM counterpart and vice versa. Consistency properties are rigorously established, and an effective computational algorithm is developed. In simulation, the assisted construction outperforms the separate construction of GeO-GGM and GeR-GGM. Two data sets from The Cancer Genome Atlas are analyzed, leading to findings different from those of the direct competitors.
Affiliation(s)
- Huangdi Yi
- Department of Biostatistics, Yale University
- Qingzhao Zhang
- Department of Statistics, School of Economics; Key Laboratory of Econometrics, Ministry of Education; The Wang Yanan Institute for Studies in Economics, Xiamen University
- Yifan Sun
- Center of Applied Statistics, School of Statistics, Renmin University of China
- Shuangge Ma
- Department of Biostatistics, Yale University
- Department of Statistics, School of Economics; Key Laboratory of Econometrics, Ministry of Education; The Wang Yanan Institute for Studies in Economics, Xiamen University
16
Hoang T, Lee J, Kim J. Differences in Dietary Patterns Identified by the Gaussian Graphical Model in Korean Adults With and Without a Self-Reported Cancer Diagnosis. J Acad Nutr Diet 2020; 121:1484-1496.e3. [PMID: 33288494 DOI: 10.1016/j.jand.2020.11.006] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 12/26/2019] [Revised: 11/04/2020] [Accepted: 11/10/2020] [Indexed: 01/02/2023]
Abstract
BACKGROUND: The synergistic effect of food groups on health outcomes is better captured by examining dietary patterns (DPs) than single food groups. In this context, a Gaussian graphical model (GGM) can identify pairwise correlations between food groups while adjusting for the remaining items. However, the application of GGMs in the nutritional field has not been widely investigated, especially in Korean adults.
OBJECTIVE: The aim of this study was to identify the major DPs of Korean adults by using a GGM and to examine the associations between the DP scores and the prevalence of self-reported cancer.
DESIGN: This cross-sectional study used baseline data from the 2007-2019 Cancer Screenee Cohort of the National Cancer Center, Korea.
PARTICIPANTS/SETTING: In total, 10,777 Korean adults who completed a questionnaire regarding their general medical history, including clinical test results, and a validated food frequency questionnaire were included.
MAIN OUTCOME MEASURES: The main outcome measure was the prevalence of self-reported cancer at baseline.
STATISTICAL ANALYSIS: DP networks were identified using a GGM. The GGM-identified networks were scored and categorized into tertiles, and their association with the prevalence of self-reported cancer was investigated using a multivariable logistic regression model.
RESULTS: The GGM identified the following 4 DP networks: principal, oil-sweet, meat, and fruit. After adjusting for covariates, the odds of moderate and high consumption of foods in the oil-sweet DP for participants who self-reported cancer were 25% and 34% lower than those for participants who did not report a cancer diagnosis (odds ratio [OR] = 0.75, 95% confidence interval [CI] = 0.62-0.90 and OR = 0.66, 95% CI = 0.53-0.81, respectively). Additionally, the odds of meat DP consumption in the self-reported cancer group were 29% lower than in participants who did not report a cancer diagnosis (OR = 0.71, 95% CI = 0.57-0.88). In contrast, an increase in the odds of fruit DP consumption was observed for self-reported cancer participants (OR = 1.34, 95% CI = 1.09-1.65). Similar results were observed among the female but not the male subjects.
CONCLUSIONS: GGM is a novel method that can distinguish the direct pairwise correlation of food groups and control for the indirect effect of other foods. Future large-scale longitudinal population-based studies are needed to build on these findings in general populations.
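The idea that a GGM reports the correlation between two items after adjusting for the others can be illustrated, for the special case of three variables, with a first-order partial correlation. The sketch below is not the study's analysis: the variables `x`, `y`, `z`, their toy values, and the helper names `pearson` and `partial_corr` are assumptions chosen so that `x` and `y` are related only through `z`.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def partial_corr(x, y, z):
    """Correlation between x and y after adjusting for a single variable z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Toy data (assumed): x and y each equal z plus noise, and the two noise
# terms are uncorrelated with z and with each other.
z = [1, 2, 3, 4, 5, 6, 7, 8]
x = [2, 1, 2, 5, 6, 5, 6, 9]
y = [2, 3, 2, 3, 4, 5, 8, 9]

marginal = pearson(x, y)         # about 0.84: x and y look strongly related
adjusted = partial_corr(x, y, z)  # approximately 0: no direct association
print(marginal, adjusted)
```

With more than three variables, the same adjusted quantity is usually obtained from the inverse covariance (precision) matrix, typically with a sparsity penalty when the number of variables is large.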
17
Wu N, Yin F, Ou-Yang L, Zhu Z, Xie W. Joint learning of multiple gene networks from single-cell gene expression data. Comput Struct Biotechnol J 2020; 18:2583-2595. [PMID: 33033579 PMCID: PMC7527714 DOI: 10.1016/j.csbj.2020.09.004] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/07/2020] [Revised: 08/31/2020] [Accepted: 09/01/2020] [Indexed: 11/24/2022] Open
Abstract
Inferring gene networks from gene expression data is important for understanding functional organizations within cells. With the accumulation of single-cell RNA sequencing (scRNA-seq) data, it is possible to infer gene networks at the single-cell level. However, due to the characteristics of scRNA-seq data, such as cellular heterogeneity and high sparsity caused by dropout events, traditional network inference methods may not be suitable for scRNA-seq data. In this study, we introduce a novel joint Gaussian copula graphical model (JGCGM) to jointly estimate multiple gene networks for multiple cell subgroups from scRNA-seq data. Our model can deal with non-Gaussian data with missing values and can identify the common and unique network structures of multiple cell subgroups, which makes it suitable for scRNA-seq data. Extensive experiments on synthetic data demonstrate that our proposed model outperforms the other state-of-the-art network inference models it was compared with. We apply our model to real scRNA-seq data sets to infer gene networks of different cell subgroups. Hub genes in the estimated gene networks are found to be biologically significant.
Affiliation(s)
- Nuosi Wu
- College of Electronics and Information Engineering, Shenzhen University, Shenzhen, China
- Fu Yin
- College of Electronics and Information Engineering, Shenzhen University, Shenzhen, China
- Le Ou-Yang
- College of Electronics and Information Engineering, Shenzhen University, Shenzhen, China
- Guangdong Key Laboratory of Intelligent Information Processing, Shenzhen Key Laboratory of Media Security, and Guangdong Laboratory of Artificial Intelligence and Digital Economy(SZ), Shenzhen University, Shenzhen, China
- Shenzhen Institute of Artificial Intelligence and Robotics for Society, Shenzhen, China
- Zexuan Zhu
- College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, China
- Weixin Xie
- College of Electronics and Information Engineering, Shenzhen University, Shenzhen, China
18
Kim AA, Rachid Zaim S, Subbian V. Assessing reproducibility and veracity across machine learning techniques in biomedicine: A case study using TCGA data. Int J Med Inform 2020; 141:104148. [DOI: 10.1016/j.ijmedinf.2020.104148] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 10/30/2019] [Revised: 03/22/2020] [Accepted: 04/16/2020] [Indexed: 11/28/2022]
19
Jiang S, Xiao G, Koh AY, Chen Y, Yao B, Li Q, Zhan X. HARMONIES: A Hybrid Approach for Microbiome Networks Inference via Exploiting Sparsity. Front Genet 2020; 11:445. [PMID: 32582274 PMCID: PMC7283552 DOI: 10.3389/fgene.2020.00445] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 12/16/2019] [Accepted: 04/14/2020] [Indexed: 12/19/2022] Open
Abstract
The human microbiome is a collection of microorganisms that form complex communities and collectively affect host health. Recent advances in next-generation sequencing technology have enabled high-throughput profiling of the human microbiome. This calls for a statistical model to construct microbial networks from microbiome sequencing count data. Microbiome count data are high-dimensional and suffer from uneven sampling depth, over-dispersion, and zero-inflation; these characteristics can bias network estimation and require specialized analytical tools. Here we propose a general framework, HARMONIES, Hybrid Approach foR MicrobiOme Network Inferences via Exploiting Sparsity, to infer a sparse microbiome network. HARMONIES first uses a zero-inflated negative binomial (ZINB) distribution to model the skewness and excess zeros in the microbiome data, and incorporates a stochastic process prior for sample-wise normalization. It then infers a sparse and stable network by imposing non-trivial regularizations based on the Gaussian graphical model. In comprehensive simulation studies, HARMONIES outperformed four other commonly used methods. When applied to published microbiome data from a colorectal cancer study, it discovered a novel community with disease-enriched bacteria. In summary, HARMONIES is a novel and useful statistical framework for microbiome network inference, and it is available at https://github.com/shuangj00/HARMONIES.
Affiliation(s)
- Shuang Jiang
- Department of Statistical Science, Southern Methodist University, Dallas, TX, United States
- Quantitative Biomedical Research Center, Department of Population and Data Sciences, University of Texas Southwestern Medical Center, Dallas, TX, United States
- Guanghua Xiao
- Quantitative Biomedical Research Center, Department of Population and Data Sciences, University of Texas Southwestern Medical Center, Dallas, TX, United States
- Andrew Y Koh
- Departments of Pediatrics, Departments of Microbiology, University of Texas Southwestern Medical Center, Dallas, TX, United States
- Yingfei Chen
- Lyda Hill Department of Bioinformatics, Bioinformatics High Performance Computing, University of Texas Southwestern Medical Center, Dallas, TX, United States
- Bo Yao
- Quantitative Biomedical Research Center, Department of Population and Data Sciences, University of Texas Southwestern Medical Center, Dallas, TX, United States
- Qiwei Li
- Department of Mathematical Sciences, The University of Texas at Dallas, Richardson, TX, United States
- Xiaowei Zhan
- Quantitative Biomedical Research Center, Department of Population and Data Sciences, University of Texas Southwestern Medical Center, Dallas, TX, United States
20
Saint-Antoine MM, Singh A. Network inference in systems biology: recent developments, challenges, and applications. Curr Opin Biotechnol 2020; 63:89-98. [PMID: 31927423 PMCID: PMC7308210 DOI: 10.1016/j.copbio.2019.12.002] [Citation(s) in RCA: 40] [Impact Index Per Article: 10.0] [Received: 10/31/2019] [Accepted: 12/03/2019] [Indexed: 12/12/2022]
Abstract
One of the most interesting, difficult, and potentially useful topics in computational biology is the inference of gene regulatory networks (GRNs) from expression data. Although researchers have been working on this topic for more than a decade and much progress has been made, it remains an unsolved problem and even the most sophisticated inference algorithms are far from perfect. In this paper, we review the latest developments in network inference, including state-of-the-art algorithms like PIDC, Phixer, and more. We also discuss unsolved computational challenges, including the optimal combination of algorithms, integration of multiple data sources, and pseudo-temporal ordering of static expression data. Lastly, we discuss some exciting applications of network inference in cancer research, and provide a list of useful software tools for researchers hoping to conduct their own network inference analyses.
Affiliation(s)
- Michael M Saint-Antoine
- Center for Bioinformatics and Computational Biology, University of Delaware, Newark, Delaware 19716, USA
- Abhyudai Singh
- Electrical and Computer Engineering, University of Delaware, Newark, Delaware 19716, USA
21
Abbas-Aghababazadeh F, Mo Q, Fridley BL. Statistical genomics in rare cancer. Semin Cancer Biol 2019; 61:1-10. [PMID: 31437624 DOI: 10.1016/j.semcancer.2019.08.021] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Received: 06/28/2019] [Revised: 08/14/2019] [Accepted: 08/17/2019] [Indexed: 12/26/2022]
Abstract
Rare cancers make up more than 20% of cancer cases. Because each is rare, less research has been conducted on them, resulting in worse outcomes for patients with rare cancers compared to common cancers. The ability to study rare cancers is limited by the difficulty of collecting a large enough set of patients to complete an adequately powered genomic study. In this manuscript we outline analytical approaches and public genomic datasets that have been used in genomic studies of rare cancers. These statistical analysis approaches and study designs include: gene set / pathway analyses, pedigree and consortium studies, meta-analysis or horizontal integration, and integration of multiple types of genomic information or vertical integration. We also discuss some of the publicly available resources that can be leveraged in rare cancer genomic studies.
Affiliation(s)
- Qianxing Mo
- Department of Biostatistics & Bioinformatics, Moffitt Cancer Center, Tampa, FL, 33612, USA
- Brooke L Fridley
- Department of Biostatistics & Bioinformatics, Moffitt Cancer Center, Tampa, FL, 33612, USA