1
Huynh-Thu VA, Geurts P. Unsupervised Gene Network Inference with Decision Trees and Random Forests. Methods Mol Biol 2019; 1883:195-215. [PMID: 30547401 DOI: 10.1007/978-1-4939-8882-2_8] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Indexed: 12/11/2022]
Abstract
In this chapter, we introduce the reader to a popular family of machine learning algorithms called decision trees. We then review several decision-tree-based approaches that have been developed for the inference of gene regulatory networks (GRNs). Decision trees have several properties that make them well suited to this problem: they can detect multivariate interacting effects between variables, are non-parametric, scale well, and have very few parameters. In particular, we describe in detail the GENIE3 algorithm, a state-of-the-art method for GRN inference.
Affiliation(s)
- Vân Anh Huynh-Thu
  - Department of Electrical Engineering and Computer Science, University of Liège, Liège, Belgium
- Pierre Geurts
  - Department of Electrical Engineering and Computer Science, University of Liège, Liège, Belgium
2
Hanson C, Cairns J, Wang L, Sinha S. Principled multi-omic analysis reveals gene regulatory mechanisms of phenotype variation. Genome Res 2018; 28:1207-1216. [PMID: 29898900 PMCID: PMC6071639 DOI: 10.1101/gr.227066.117] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.3] [Received: 06/30/2017] [Accepted: 05/31/2018] [Indexed: 12/12/2022]
Abstract
Recent studies have analyzed large-scale data sets of gene expression to identify genes associated with interindividual variation in phenotypes ranging from cancer subtypes to drug sensitivity, promising new avenues of research in personalized medicine. However, gene expression data alone is limited in its ability to reveal cis-regulatory mechanisms underlying phenotypic differences. In this study, we develop a new probabilistic model, called pGENMi, that integrates multi-omic data to investigate the transcriptional regulatory mechanisms underlying interindividual variation of a specific phenotype—that of cell line response to cytotoxic treatment. In particular, pGENMi simultaneously analyzes genotype, DNA methylation, gene expression, and transcription factor (TF)-DNA binding data, along with phenotypic measurements, to identify TFs regulating the phenotype. It does so by combining statistical information about expression quantitative trait loci (eQTLs) and expression-correlated methylation marks (eQTMs) located within TF binding sites, as well as observed correlations between gene expression and phenotype variation. Application of pGENMi to data from a panel of lymphoblastoid cell lines treated with 24 drugs, in conjunction with ENCODE TF ChIP data, yielded a number of known as well as novel (TF, Drug) associations. Experimental validations by TF knockdown confirmed 41% of the predicted and tested associations, compared to a 12% confirmation rate of tested nonassociations (controls). An extensive literature survey also corroborated 62% of the predicted associations above a stringent threshold. Moreover, associations predicted only when combining eQTL and eQTM data showed higher precision compared to an eQTL-only or eQTM-only analysis using pGENMi, further demonstrating the value of multi-omic integrative analysis.
Affiliation(s)
- Casey Hanson
  - Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, USA
- Junmei Cairns
  - Division of Clinical Pharmacology, Department of Molecular Pharmacology and Experimental Therapeutics, Mayo Clinic, Rochester, Minnesota 55905, USA
- Liewei Wang
  - Division of Clinical Pharmacology, Department of Molecular Pharmacology and Experimental Therapeutics, Mayo Clinic, Rochester, Minnesota 55905, USA
- Saurabh Sinha
  - Department of Computer Science and Institute of Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, USA
3
Mall R, Cerulo L, Garofano L, Frattini V, Kunji K, Bensmail H, Sabedot TS, Noushmehr H, Lasorella A, Iavarone A, Ceccarelli M. RGBM: regularized gradient boosting machines for identification of the transcriptional regulators of discrete glioma subtypes. Nucleic Acids Res 2018; 46:e39. [PMID: 29361062 PMCID: PMC6283452 DOI: 10.1093/nar/gky015] [Citation(s) in RCA: 28] [Impact Index Per Article: 4.7] [Received: 12/03/2017] [Accepted: 01/06/2018] [Indexed: 01/05/2023]
Abstract
We propose a generic framework for gene regulatory network (GRN) inference approached as a feature selection problem. GRNs obtained using Machine Learning techniques are often dense, whereas real GRNs are rather sparse. We use an optimal L-curve criterion, inspired by Tikhonov regularization, that utilizes the edge weight distribution for a given target gene to determine the optimal set of TFs associated with it. Our proposed framework allows the incorporation of a mechanistic active binding network based on cis-regulatory motif analysis. We evaluate our regularization framework in conjunction with two non-linear ML techniques, namely gradient boosting machines (GBM) and random-forests (GENIE), resulting in regularized feature-selection-based methods called RGBM and RGENIE, respectively. RGBM has been used to identify the main transcription factors that are causally involved as master regulators of the gene expression signature activated in the FGFR3-TACC3-positive glioblastoma. Here, we illustrate that RGBM identifies the main regulators of the molecular subtypes of brain tumors. Our analysis reveals the identity and corresponding biological activities of the master regulators characterizing the difference between G-CIMP-high and G-CIMP-low subtypes and between PA-like and LGm6-GBM, thus providing a clue to the yet undetermined nature of the transcriptional events among these subtypes.
Affiliation(s)
- Raghvendra Mall
  - Qatar Computing Research Institute, Hamad Bin Khalifa University, Doha, Qatar
- Luigi Cerulo
  - Department of Science and Technology, University of Sannio, Benevento, Italy
  - BIOGEM Istituto di Ricerche Genetiche “G. Salvatore”, Ariano Irpino, Italy
- Luciano Garofano
  - Department of Science and Technology, University of Sannio, Benevento, Italy
  - BIOGEM Istituto di Ricerche Genetiche “G. Salvatore”, Ariano Irpino, Italy
- Veronique Frattini
  - Institute for Cancer Genetics, Columbia University Medical Center, New York, NY 10032, USA
- Khalid Kunji
  - Qatar Computing Research Institute, Hamad Bin Khalifa University, Doha, Qatar
- Halima Bensmail
  - Qatar Computing Research Institute, Hamad Bin Khalifa University, Doha, Qatar
- Thais S Sabedot
  - Department of Neurosurgery, Brain Tumor Center, Henry Ford Health System, Detroit, MI, USA
  - Department of Genetics (CISBi/NAP), Department of Surgery and Anatomy, Ribeirão Preto Medical School, University of Sao Paulo, Monte Alegre, Ribeirao Preto, Brazil
- Houtan Noushmehr
  - Department of Neurosurgery, Brain Tumor Center, Henry Ford Health System, Detroit, MI, USA
  - Department of Genetics (CISBi/NAP), Department of Surgery and Anatomy, Ribeirão Preto Medical School, University of Sao Paulo, Monte Alegre, Ribeirao Preto, Brazil
- Anna Lasorella
  - Institute for Cancer Genetics, Columbia University Medical Center, New York, NY 10032, USA
  - Department of Pathology and Cell Biology, Columbia University Medical Center, New York, New York 10032, USA
  - Department of Pediatrics, Columbia University Medical Center, New York, New York 10032, USA
- Antonio Iavarone
  - Institute for Cancer Genetics, Columbia University Medical Center, New York, NY 10032, USA
  - Department of Pathology and Cell Biology, Columbia University Medical Center, New York, New York 10032, USA
  - Department of Neurology, Columbia University Medical Center, New York, New York 10032, USA
- Michele Ceccarelli
  - Department of Science and Technology, University of Sannio, Benevento, Italy
  - BIOGEM Istituto di Ricerche Genetiche “G. Salvatore”, Ariano Irpino, Italy
4
Multiple Linear Regression for Reconstruction of Gene Regulatory Networks in Solving Cascade Error Problems. Adv Bioinformatics 2017; 2017:4827171. [PMID: 28250767 PMCID: PMC5303608 DOI: 10.1155/2017/4827171] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Received: 06/29/2016] [Revised: 10/10/2016] [Accepted: 10/19/2016] [Indexed: 11/17/2022]
Abstract
Gene regulatory network (GRN) reconstruction is the process of identifying regulatory gene interactions from experimental data through computational analysis. One of the main reasons for the reduced performance of previous GRN methods has been the inaccurate prediction of cascade motifs. A cascade error is defined as the wrong prediction of a cascade motif, where an indirect interaction is misinterpreted as a direct interaction. Despite active research on various GRN prediction methods, discussion of specific methods to solve problems related to cascade errors is still lacking. In fact, the experiments conducted in past studies were not specifically geared towards proving the ability of GRN prediction methods to avoid cascade errors. Hence, this research proposes Multiple Linear Regression (MLR) to infer GRNs from gene expression data and to avoid wrongly inferring an indirect interaction (A → B → C) as a direct interaction (A → C). Since the number of observations in the real experimental datasets was far smaller than the number of predictors, some predictors were eliminated by extracting random subnetworks from global interaction networks via an established extraction method. In addition, the experiment was extended to assess the effectiveness of MLR in dealing with cascade errors by using a novel experimental procedure proposed in this work. The experiment revealed that the number of cascade errors was minimal. Apart from that, the Belsley collinearity test showed that multicollinearity greatly affected the datasets used in this experiment. All the tested subnetworks obtained satisfactory results, with AUROC values above 0.5.
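The cascade-error idea in this abstract can be illustrated with a small simulation. The sketch below is not the authors' procedure: it assumes a hypothetical three-gene chain A → B → C with Gaussian noise (the names `gene_a`, `gene_b`, `gene_c` and all parameters are made up for illustration) and fits ordinary least squares by solving the normal equations directly. Regressing C on A alone gives A a large coefficient, whereas regressing C on A and B together should leave A with a coefficient near zero, which is the behaviour a multiple regression is expected to show on a cascade.

```python
import random


def solve_linear(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def ols(predictors, target):
    """Least-squares coefficients (intercept first) for target ~ predictors."""
    rows = [[1.0] + list(p) for p in zip(*predictors)]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, target)) for i in range(k)]
    return solve_linear(xtx, xty)


# Hypothetical cascade A -> B -> C; A has no direct effect on C.
rng = random.Random(0)
n = 500
gene_a = [rng.gauss(0, 1) for _ in range(n)]
gene_b = [a + rng.gauss(0, 0.5) for a in gene_a]
gene_c = [b + rng.gauss(0, 0.5) for b in gene_b]

marginal = ols([gene_a], gene_c)       # C regressed on A only
joint = ols([gene_a, gene_b], gene_c)  # C regressed on A and B together

print("C ~ A:     coefficient of A =", round(marginal[1], 2))
print("C ~ A + B: coefficient of A =", round(joint[1], 2),
      "| coefficient of B =", round(joint[2], 2))
```

In this toy setting the indirect edge A → C disappears once B is included as a predictor; with collinear or undersampled real data the separation is less clean, which is consistent with the multicollinearity caveat in the abstract.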
5
Mohamed Salleh FH, Arif SM, Zainudin S, Firdaus-Raih M. Reconstructing gene regulatory networks from knock-out data using Gaussian Noise Model and Pearson Correlation Coefficient. Comput Biol Chem 2015; 59 Pt B:3-14. [DOI: 10.1016/j.compbiolchem.2015.04.012] [Citation(s) in RCA: 40] [Impact Index Per Article: 4.4] [Received: 09/09/2014] [Revised: 04/16/2015] [Accepted: 04/27/2015] [Indexed: 11/26/2022]
6
Zhang J, Le TD, Liu L, He J, Li J. A novel framework for inferring condition-specific TF and miRNA co-regulation of protein-protein interactions. Gene 2015; 577:55-64. [PMID: 26611531 DOI: 10.1016/j.gene.2015.11.023] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.9] [Received: 06/10/2015] [Revised: 10/16/2015] [Accepted: 11/17/2015] [Indexed: 12/11/2022]
Abstract
Recent studies have shown that transcription factors (TFs) and microRNAs (miRNAs), while independently regulating their downstream targets, also collaborate with each other to regulate gene expression. However, their synergistic roles in protein-protein interactions (PPIs) remain mostly unknown. In this paper, we present a novel framework (called CoRePPI) for inferring TF and miRNA co-regulation of PPIs. In particular, CoRePPI aims to discover the co-regulation specific to a condition of interest by using heterogeneous data, including miRNA and messenger RNA (mRNA) expression profiles, putative miRNA targets, TF targets and PPIs. CoRePPI first finds the network motifs indicating the co-regulation of PPIs by TFs and miRNAs in tumor and normal conditions separately. Then, by identifying the differential motifs found in one condition but not in the other, it builds the networks consisting of TFs, miRNAs and their co-regulated PPIs specific to each condition. To validate CoRePPI, we apply it to the Pan-Cancer dataset, which includes the expression profiles of 12 cancer types from TCGA. Through network topology analysis, we found that the tumor and normal CoRePPI networks are scale-free. Furthermore, the results of differential and intersected network analysis between the tumor and normal CoRePPI networks suggest that only a small fraction of the regulatory relationships between TFs and miRNAs are conserved in both conditions, but they co-regulate different downstream PPIs in tumor and normal conditions; and in different conditions the majority of the regulatory relationships between TFs and miRNAs are different, although they may regulate the same PPIs in their respective conditions. The CoRePPI sub-networks constructed for the three types of cancers (breast cancer, lung cancer and ovarian cancer) are all scale-free, and the intersection of these CoRePPI sub-networks can be utilized as the biomarker CoRePPI sub-network of the three types of cancers.
The PPI enrichment analyses of the tumor and normal CoRePPI networks suggest that the co-regulating TFs and miRNAs are significantly associated with specific biological processes, diseases and pathways. In addition, compared with two non-condition-specific approaches, the tumor CoRePPI network is found to have the most enriched cancer-related PPIs. Altogether, the results uncover the combined regulatory patterns of TFs and miRNAs on the PPIs, and may provide new insights for research in cancer-associated TFs and miRNAs.
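The step of keeping motifs "found in one condition but not in the other" amounts to a set difference. The following is a minimal sketch of that one step only, not of CoRePPI itself; it assumes each motif can be written as a hashable tuple `(TF, miRNA, protein 1, protein 2)`, and the function name and all identifiers in the example are hypothetical.

```python
def split_motifs(tumor_motifs, normal_motifs):
    """Partition motifs into tumor-specific, normal-specific, and shared sets."""
    tumor, normal = set(tumor_motifs), set(normal_motifs)
    return {
        "tumor_specific": tumor - normal,   # present only in the tumor condition
        "normal_specific": normal - tumor,  # present only in the normal condition
        "shared": tumor & normal,           # conserved in both conditions
    }


# Hypothetical motifs: (TF, miRNA, protein 1, protein 2)
tumor_motifs = [("TF1", "miR-a", "P1", "P2"), ("TF2", "miR-b", "P3", "P4")]
normal_motifs = [("TF2", "miR-b", "P3", "P4"), ("TF3", "miR-c", "P5", "P6")]

result = split_motifs(tumor_motifs, normal_motifs)
for name, motifs in result.items():
    print(name, sorted(motifs))
```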
Affiliation(s)
- Junpeng Zhang
  - School of Engineering, Dali University, Dali, Yunnan 671003, China
- Thuc Duy Le
  - School of Information Technology and Mathematical Sciences, University of South Australia, Mawson Lakes, SA 5095, Australia
- Lin Liu
  - School of Information Technology and Mathematical Sciences, University of South Australia, Mawson Lakes, SA 5095, Australia
- Jianfeng He
  - School of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, Yunnan, 650500, China
- Jiuyong Li
  - School of Information Technology and Mathematical Sciences, University of South Australia, Mawson Lakes, SA 5095, Australia
7
Huang X, Zi Z. Inferring cellular regulatory networks with Bayesian model averaging for linear regression (BMALR). Mol Biosyst 2015; 10:2023-30. [PMID: 24899235 DOI: 10.1039/c4mb00053f] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.8] [Indexed: 01/29/2023]
Abstract
Bayesian network and linear regression methods have been widely applied to reconstruct cellular regulatory networks. In this work, we propose a Bayesian model averaging for linear regression (BMALR) method to infer molecular interactions in biological systems. This method uses a new closed form solution to compute the posterior probabilities of the edges from regulators to the target gene within a hybrid framework of Bayesian model averaging and linear regression methods. We have assessed the performance of BMALR by benchmarking on both in silico DREAM datasets and real experimental datasets. The results show that BMALR achieves both high prediction accuracy and high computational efficiency across different benchmarks. A pre-processing of the datasets with the log transformation can further improve the performance of BMALR, leading to a new top overall performance. In addition, BMALR can achieve robust high performance in community predictions when it is combined with other competing methods. The proposed method BMALR is competitive compared to the existing network inference methods. Therefore, BMALR will be useful to infer regulatory interactions in biological networks. A free open source software tool for the BMALR algorithm is available at https://sites.google.com/site/bmalr4netinfer/.
Affiliation(s)
- Xun Huang
  - BIOSS Centre for Biological Signalling Studies, University of Freiburg, 79104, Freiburg, Germany
8
Vermeirssen V, De Clercq I, Van Parys T, Van Breusegem F, Van de Peer Y. Arabidopsis ensemble reverse-engineered gene regulatory network discloses interconnected transcription factors in oxidative stress. Plant Cell 2014; 26:4656-79. [PMID: 25549671 PMCID: PMC4311199 DOI: 10.1105/tpc.114.131417] [Citation(s) in RCA: 55] [Impact Index Per Article: 5.5] [Received: 08/23/2014] [Revised: 11/27/2014] [Accepted: 12/10/2014] [Indexed: 05/19/2023]
Abstract
The abiotic stress response in plants is complex and tightly controlled by gene regulation. We present an abiotic stress gene regulatory network of 200,014 interactions for 11,938 target genes by integrating four complementary reverse-engineering solutions through average rank aggregation on an Arabidopsis thaliana microarray expression compendium. This ensemble performed the most robustly in benchmarking and greatly expands upon the availability of interactions currently reported. Besides recovering 1182 known regulatory interactions, cis-regulatory motifs and coherent functionalities of target genes corresponded with the predicted transcription factors. We provide a valuable resource of 572 abiotic stress modules of coregulated genes with functional and regulatory information, from which we deduced functional relationships for 1966 uncharacterized genes and many regulators. Using gain- and loss-of-function mutants of seven transcription factors grown under control and salt stress conditions, we experimentally validated 141 out of 271 predictions (52% precision) for 102 selected genes and mapped 148 additional transcription factor-gene regulatory interactions (49% recall). We identified an intricate core oxidative stress regulatory network where NAC13, NAC053, ERF6, WRKY6, and NAC032 transcription factors interconnect and function in detoxification. Our work shows that ensemble reverse-engineering can generate robust biological hypotheses of gene regulation in a multicellular eukaryote that can be tested by medium-throughput experimental validation.
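Average rank aggregation, the integration step named in this abstract, can be sketched in a few lines. This is an illustrative version under stated assumptions, not the authors' pipeline: each method is assumed to supply a dictionary mapping a `(regulator, target)` edge to a score where higher means more confident, an edge a method did not score is assigned the worst possible rank, and ties within a method are broken by input order. The function name, method names and edges are all hypothetical.

```python
def average_rank_aggregation(score_tables):
    """Combine edge scores from several methods by averaging per-method ranks."""
    edges = sorted(set().union(*score_tables))
    worst = len(edges)  # rank given to edges a method did not score (assumption)
    rank_sums = {edge: 0.0 for edge in edges}
    for scores in score_tables:
        # Rank 1 is the highest-scoring edge for this method.
        ordered = sorted(scores, key=scores.get, reverse=True)
        ranks = {edge: i + 1 for i, edge in enumerate(ordered)}
        for edge in edges:
            rank_sums[edge] += ranks.get(edge, worst)
    averages = {edge: rank_sums[edge] / len(score_tables) for edge in edges}
    # Lowest average rank first; the edge name breaks ties deterministically.
    return sorted(averages.items(), key=lambda item: (item[1], item[0]))


# Hypothetical scores from three inference methods.
method_a = {("TF1", "g1"): 0.9, ("TF1", "g2"): 0.2, ("TF2", "g1"): 0.5}
method_b = {("TF1", "g1"): 0.7, ("TF2", "g1"): 0.8, ("TF2", "g2"): 0.1}
method_c = {("TF1", "g1"): 0.6, ("TF2", "g2"): 0.4, ("TF1", "g2"): 0.3}

consensus = average_rank_aggregation([method_a, method_b, method_c])
for edge, avg_rank in consensus:
    print(edge, round(avg_rank, 2))
```

Working on ranks rather than raw scores keeps methods with different score scales comparable, which is the usual motivation for this kind of aggregation.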
Affiliation(s)
- Vanessa Vermeirssen
  - Department of Plant Systems Biology, VIB, 9052 Gent, Belgium
  - Department of Plant Biotechnology and Bioinformatics, Ghent University, 9052 Gent, Belgium
- Inge De Clercq
  - Department of Plant Systems Biology, VIB, 9052 Gent, Belgium
  - Department of Plant Biotechnology and Bioinformatics, Ghent University, 9052 Gent, Belgium
- Thomas Van Parys
  - Department of Plant Systems Biology, VIB, 9052 Gent, Belgium
  - Department of Plant Biotechnology and Bioinformatics, Ghent University, 9052 Gent, Belgium
- Frank Van Breusegem
  - Department of Plant Systems Biology, VIB, 9052 Gent, Belgium
  - Department of Plant Biotechnology and Bioinformatics, Ghent University, 9052 Gent, Belgium
- Yves Van de Peer
  - Department of Plant Systems Biology, VIB, 9052 Gent, Belgium
  - Department of Plant Biotechnology and Bioinformatics, Ghent University, 9052 Gent, Belgium
  - Genomics Research Institute, University of Pretoria, Pretoria 0028, South Africa
9
Sławek J, Arodź T. ENNET: inferring large gene regulatory networks from expression data using gradient boosting. BMC Syst Biol 2013; 7:106. [PMID: 24148309 PMCID: PMC4015806 DOI: 10.1186/1752-0509-7-106] [Citation(s) in RCA: 28] [Impact Index Per Article: 2.5] [Received: 06/24/2013] [Accepted: 10/17/2013] [Indexed: 01/19/2023]
Abstract
BACKGROUND: The regulation of gene expression by transcription factors is a key determinant of cellular phenotypes. Deciphering genome-wide networks that capture which transcription factors regulate which genes is one of the major efforts towards understanding and accurate modeling of living systems. However, reverse-engineering the network from gene expression profiles remains a challenge, because the data are noisy, high dimensional and sparse, and the regulation is often obscured by indirect connections.
RESULTS: We introduce a gene regulatory network inference algorithm ENNET, which reverse-engineers networks of transcriptional regulation from a variety of expression profiles with a superior accuracy compared to the state-of-the-art methods. The proposed method relies on the boosting of regression stumps combined with a relative variable importance measure for the initial scoring of transcription factors with respect to each gene. Then, we propose a technique for using a distribution of the initial scores and information about knockouts to refine the predictions. We evaluated the proposed method on the DREAM3, DREAM4 and DREAM5 data sets and achieved higher accuracy than the winners of those competitions and other established methods.
CONCLUSIONS: Superior accuracy achieved on the three different benchmark data sets shows that ENNET is a top contender in the task of network inference. It is a versatile method that uses information about which gene was knocked-out in which experiment if it is available, but remains the top performer even without such information. ENNET is available for download from https://github.com/slawekj/ennet under the GNU GPLv3 license.
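The initial scoring step described in the RESULTS, boosting regression stumps and reading off a relative variable importance, can be sketched as follows. This is a simplified illustration under stated assumptions and not the ENNET implementation: it uses plain squared-error gradient boosting without subsampling, measures importance as the summed squared-error reduction of the stumps chosen for each candidate regulator, and omits the knockout-based refinement entirely. The function names, the parameters `rounds` and `learning_rate`, and the simulated data are all hypothetical.

```python
import random


def best_stump(columns, residual):
    """Find the single-feature threshold split that most reduces squared error."""
    n = len(residual)
    total = sum(residual)
    best = None  # (gain, feature index, threshold, left mean, right mean)
    for j, col in enumerate(columns):
        order = sorted(range(n), key=col.__getitem__)
        left_sum = 0.0
        for pos in range(n - 1):
            i, nxt = order[pos], order[pos + 1]
            left_sum += residual[i]
            if col[i] == col[nxt]:
                continue  # no threshold separates two equal values
            n_left, n_right = pos + 1, n - pos - 1
            right_sum = total - left_sum
            # Squared-error reduction relative to predicting the overall mean.
            gain = left_sum ** 2 / n_left + right_sum ** 2 / n_right - total ** 2 / n
            if best is None or gain > best[0]:
                best = (gain, j, (col[i] + col[nxt]) / 2,
                        left_sum / n_left, right_sum / n_right)
    return best


def boosted_importance(columns, target, rounds=50, learning_rate=0.1):
    """Relative importance of each candidate regulator for one target gene."""
    n = len(target)
    mean = sum(target) / n
    residual = [y - mean for y in target]
    importance = [0.0] * len(columns)
    for _ in range(rounds):
        stump = best_stump(columns, residual)
        if stump is None:
            break
        gain, j, threshold, left, right = stump
        importance[j] += gain
        col = columns[j]
        for i in range(n):
            step = left if col[i] <= threshold else right
            residual[i] -= learning_rate * step
    total = sum(importance)
    return [v / total for v in importance] if total > 0 else importance


# Simulated data: the target depends strongly on tf0, weakly on tf1, not on tf2.
rng = random.Random(1)
n = 100
tf0 = [rng.gauss(0, 1) for _ in range(n)]
tf1 = [rng.gauss(0, 1) for _ in range(n)]
tf2 = [rng.gauss(0, 1) for _ in range(n)]
target = [2.0 * a + 0.5 * b + rng.gauss(0, 0.1) for a, b in zip(tf0, tf1)]

scores = boosted_importance([tf0, tf1, tf2], target)
print([round(s, 3) for s in scores])
```

Repeating this scoring with every gene in turn as the target would give one column of a candidate edge-weight matrix; how those columns are then combined and refined is specific to the published method and is not reproduced here.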
Affiliation(s)
- Janusz Sławek
  - Department of Computer Science, Virginia Commonwealth University, Richmond, Virginia
- Tomasz Arodź
  - Department of Computer Science, Virginia Commonwealth University, Richmond, Virginia