1
Cingiz MÖ. k-Strong Inference Algorithm: A Hybrid Information Theory Based Gene Network Inference Algorithm. Mol Biotechnol 2023:10.1007/s12033-023-00929-2. [PMID: 37950851 DOI: 10.1007/s12033-023-00929-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/23/2023] [Accepted: 10/05/2023] [Indexed: 11/13/2023]
Abstract
Gene networks allow researchers to understand the underlying mechanisms between diseases and genes while reducing the need for wet lab experiments. Numerous gene network inference (GNI) algorithms have been presented in the literature to infer accurate gene networks. We proposed a hybrid GNI algorithm, the k-Strong Inference Algorithm (ksia), to infer more reliable and robust gene networks from omics datasets. To increase reliability, ksia integrates Pearson correlation coefficient (PCC) and Spearman rank correlation coefficient (SCC) scores to determine mutual information scores between molecules, which increases the diversity of relation predictions. To infer a more robust gene network, ksia applies three different elimination steps to remove redundant and spurious relations between genes. The performance of ksia was evaluated on the microbe microarrays database in an overlap analysis with other GNI algorithms, namely ARACNE, C3NET, CLR, and MRNET. Ksia inferred fewer relations because of its strict elimination steps. However, ksia generally performed better on Escherichia coli (E. coli) and Saccharomyces cerevisiae (yeast) gene expression datasets in terms of F-measure and precision values. The integration of association estimator scores and three elimination stages slightly increases the performance of ksia-based gene networks. Users can access the ksia R package and its user manual via https://github.com/ozgurcingiz/ksia .
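As a rough illustration of the two association estimators named in this abstract, the sketch below computes PCC and SCC for one pair of expression profiles in plain Python. The function names and the toy profiles are hypothetical, and this is not the ksia implementation: how ksia combines the two scores into mutual information scores is not described here, so that step is left out.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical toy profiles: gene_b is a monotone, non-linear function of gene_a.
gene_a = [1.0, 2.0, 3.0, 4.0, 5.0]
gene_b = [1.0, 4.0, 9.0, 16.0, 25.0]
pcc = pearson(gene_a, gene_b)
scc = spearman(gene_a, gene_b)
```

On this toy pair the rank-based score is maximal while the linear score is slightly lower, which is one reason the two estimators can disagree about which relations to keep.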
Affiliation(s)
- Mustafa Özgür Cingiz
- Computer Engineering Department, Faculty of Engineering and Natural Sciences, Bursa Technical University, Mimar Sinan Campus, Yildirim, 16310, Bursa, Turkey.
2
Altay G, Zapardiel-Gonzalo J, Peters B. RNA-seq preprocessing and sample size considerations for gene network inference. bioRxiv 2023:2023.01.02.522518. [PMID: 36711979 PMCID: PMC9881880 DOI: 10.1101/2023.01.02.522518] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/18/2023]
Abstract
Background: Gene network inference (GNI) methods have the potential to reveal functional relationships between different genes and their products. Most GNI algorithms have been developed for microarray gene expression datasets, and their application to RNA-seq data is relatively recent. As the characteristics of RNA-seq data are different from those of microarray data, it is an unanswered question what preprocessing methods for RNA-seq data should be applied prior to GNI to attain optimal performance, or what the required sample size for RNA-seq data is to obtain reliable GNI estimates. Results: We ran 9144 analyses of 7 different RNA-seq datasets to evaluate 300 different preprocessing combinations that include data transformations, normalizations and association estimators. We found that there was no single best performing preprocessing combination but that there were several good ones. The performance varied widely over various datasets, which emphasized the importance of choosing an appropriate preprocessing configuration before GNI. Two preprocessing combinations appeared promising in general: first, log-2 TPM (transcripts per million) with variance-stabilizing transformation (VST) and the Pearson correlation coefficient (PCC) association estimator; second, raw RNA-seq count data with PCC. Along with these two, we also identified 18 other good preprocessing combinations. Any of these combinations might perform best in different datasets. Therefore, the GNI performances of these approaches should be measured on any new dataset to select the best performing one for it. In terms of the required biological sample size of RNA-seq data, we found that between 30 and 85 samples were required to generate reliable GNI estimates. Conclusions: This study provides practical recommendations on default choices for data preprocessing prior to GNI analysis of RNA-seq data to obtain optimal performance results.
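To make the "log-2 TPM" transformation mentioned in this abstract concrete, here is a minimal sketch in plain Python. The function names, the pseudocount of 1, and the toy counts are assumptions for illustration only; the VST step and the study's actual pipeline are not reproduced.

```python
from math import log2

def tpm(counts, lengths_kb):
    """Transcripts per million for one sample.

    counts: raw read counts per gene; lengths_kb: gene lengths in kilobases.
    """
    # Length-normalise first, then rescale so the sample sums to one million.
    rates = [c / l for c, l in zip(counts, lengths_kb)]
    total = sum(rates)
    return [r / total * 1e6 for r in rates]

def log2_tpm(counts, lengths_kb, pseudocount=1.0):
    """log2(TPM + pseudocount); the pseudocount is an assumed convention
    that avoids taking the logarithm of zero."""
    return [log2(v + pseudocount) for v in tpm(counts, lengths_kb)]

# Hypothetical sample: counts are proportional to gene length,
# so all three genes end up with the same TPM.
counts = [100, 200, 700]
lengths_kb = [1.0, 2.0, 7.0]
values = tpm(counts, lengths_kb)
logged = log2_tpm(counts, lengths_kb)
```

The transformed profiles of two genes across samples could then be passed to a PCC estimator to obtain the association score used for network inference.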
Affiliation(s)
- Gökmen Altay
- La Jolla Institute for Immunology, 9420 Athena Circle, La Jolla, CA 92037, USA
- Bjoern Peters
- La Jolla Institute for Immunology, 9420 Athena Circle, La Jolla, CA 92037, USA
3
Chatrabgoun O, Hosseinian-Far A, Daneshkhah A. Constructing gene regulatory networks from microarray data using non-Gaussian pair-copula Bayesian networks. J Bioinform Comput Biol 2020; 18:2050023. [PMID: 32706288 DOI: 10.1142/s0219720020500237] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Indexed: 12/30/2022]
Abstract
Many biological and biomedical research areas, such as drug design, require analyzing Gene Regulatory Networks (GRNs) to provide clear insight into and understanding of the cellular processes in live cells. Under a normality assumption for the genes, GRNs can be constructed by assessing the nonzero elements of the inverse covariance matrix. Nevertheless, such techniques are unable to deal with the non-normality, multi-modality and heavy-tailedness that are commonly seen in current massive genetic data. To relax this restrictive constraint, one can apply a copula function, which is a multivariate cumulative distribution function with uniform marginal distributions. However, since the dependency structures of different pairs of genes in a multivariate problem are very different, the regular multivariate copula will not allow for the construction of an appropriate model. The solution to this problem is to use Pair-Copula Constructions (PCCs), which are decompositions of a multivariate density into a cascade of bivariate copulas and therefore assign a different bivariate copula function to each local term. In this paper, we have constructed the inverse covariance matrix based on PCCs when the normality assumption can be moderately or severely violated, in order to capture a wide range of distributional features and complex dependency structures. To learn the non-Gaussian model for the considered GRN with non-Gaussian genomic data, we apply a modified version of the copula-based PC algorithm in which the normality assumption on the marginal densities is dropped. This paper also considers the Dynamic Time Warping (DTW) algorithm to determine the existence of a time delay relation between two genes. Breast cancer is one of the most common diseases in the world, and GRN analysis of its subtypes is considerably important, since revealing the differences in the GRNs of these subtypes can lead to new therapies and drugs. The findings of our research are used to construct GRNs with high performance for various subtypes of breast cancer, rather than simply using previous models.
Affiliation(s)
- O Chatrabgoun
- Department of Statistics, Malayer University, Malayer, Iran
- A Hosseinian-Far
- Department of Business Systems & Operations, University of Northampton, NN1 5PH, UK
- A Daneshkhah
- Faculty of Engineering, Environment & Computing, Coventry University, CV1 5FB, UK
4
Liang Y, Kelemen A. Dynamic modeling and network approaches for omics time course data: overview of computational approaches and applications. Brief Bioinform 2019; 19:1051-1068. [PMID: 28430854 DOI: 10.1093/bib/bbx036] [Citation(s) in RCA: 20] [Impact Index Per Article: 4.0] [Received: 10/21/2016] [Indexed: 12/23/2022]
Abstract
Inferring networks and dynamics of genes, proteins, cells and other biological entities from high-throughput biological omics data is a central and challenging issue in computational and systems biology. This is essential for understanding the complexity of human health, disease susceptibility and pathogenesis for Predictive, Preventive, Personalized and Participatory (P4) system and precision medicine. The delineation of the possible interactions of all genes/proteins in a genome/proteome is a task for which conventional experimental techniques are ill suited. Rapid and inexpensive computational and statistical methods are urgently needed that can identify, out of thousands, the interacting candidate disease genes or drug targets that can be further investigated or validated by experimentation. Moreover, identifying biological dynamic systems and simultaneously estimating the important kinetic, structural and functional parameters, which may not be experimentally accessible, could be important directions for drug-disease-gene network studies. In this article, we present an overview and comparison of recent developments in dynamic modeling and network approaches for time-course omics data, and their applications to various biological systems, health conditions and disease statuses. Moreover, various data reduction and analytical schemes, ranging from mathematical to computational to statistical methods, are compared, including their merits, drawbacks and limitations. The most recent software, associated web resources and other potentials for the compared methods are also presented and discussed in detail.
Affiliation(s)
- Yulan Liang
- Department of Family and Community Health, University of Maryland, Baltimore, MD, USA
- Arpad Kelemen
- Department of Family and Community Health, University of Maryland, Baltimore, MD, USA
5
Legeay M, Aubourg S, Renou JP, Duval B. Large scale study of anti-sense regulation by differential network analysis. BMC Syst Biol 2018; 12:95. [PMID: 30458828 PMCID: PMC6245689 DOI: 10.1186/s12918-018-0613-7] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 12/21/2022]
Abstract
Background: Systems biology aims to analyse regulation mechanisms within the cell. By mapping interactions observed in different situations, differential network analysis has shown its power to reveal specific cellular responses or specific dysfunctional regulations. In this work, we propose to explore on a large scale the role of natural anti-sense transcription in gene regulation mechanisms, and we focus our study on apple (Malus domestica) in the context of fruit ripening in cold storage. Results: We present a differential functional analysis of the sense and anti-sense transcriptomic data that reveals functional terms linked to the ripening process. To develop our differential network analysis, we introduce our inference method of an Extended Core Network; this method is inspired by C3NET, but extends the notion of significant interactions. By comparing two extended core networks, one inferred with sense data and the other inferred with sense and anti-sense data, our differential analysis is first performed on a local view and reveals AS-impacted genes, that is, genes that have important interactions impacted by anti-sense transcription. The motifs surrounding AS-impacted genes gather transcripts with functions mostly consistent with the biological context of the data used, and the method allows us to identify new actors involved in ripening and cold acclimation pathways and to decipher their interactions. Then, from a more global view, we compute minimal sub-networks that connect the AS-impacted genes using Steiner trees. Those Steiner trees allow us to study the rewiring of the AS-impacted genes in the network with anti-sense actors. Conclusion: Anti-sense transcription is usually ignored in transcriptomic studies. The large-scale differential analysis of apple data that we propose reveals that anti-sense regulation may have an important impact in several cellular stress response mechanisms. Our data mining process makes it possible to highlight specific interactions that deserve further experimental investigation.
Affiliation(s)
- Marc Legeay
- LERIA, Université d'Angers, 2 bd Lavoisier, Angers, 49045, France; IRHS, Agrocampus-Ouest, INRA, Université d'Angers, SFR 4207 QuaSaV, Beaucouzé, 49071, France
- Sébastien Aubourg
- IRHS, Agrocampus-Ouest, INRA, Université d'Angers, SFR 4207 QuaSaV, Beaucouzé, 49071, France
- Jean-Pierre Renou
- IRHS, Agrocampus-Ouest, INRA, Université d'Angers, SFR 4207 QuaSaV, Beaucouzé, 49071, France
- Béatrice Duval
- LERIA, Université d'Angers, 2 bd Lavoisier, Angers, 49045, France
6
Pirayre A, Couprie C, Duval L, Pesquet JC. BRANE Clust: Cluster-Assisted Gene Regulatory Network Inference Refinement. IEEE/ACM Trans Comput Biol Bioinform 2018; 15:850-860. [PMID: 28368827 DOI: 10.1109/tcbb.2017.2688355] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Indexed: 06/07/2023]
Abstract
Discovering meaningful gene interactions is crucial for the identification of novel regulatory processes in cells. Accurately building the related graphs remains challenging due to the large number of possible solutions from available data. Nonetheless, enforcing a priori knowledge on the graph structure, such as modularity, may reduce network indeterminacy issues. BRANE Clust (Biologically-Related A priori Network Enhancement with Clustering) refines gene regulatory network (GRN) inference using cluster information. It works as a post-processing tool for inference methods (e.g., CLR, GENIE3). In BRANE Clust, the clustering is based on the inversion of a system of linear equations involving a graph-Laplacian matrix that promotes a modular structure. Our approach is validated on DREAM4 and DREAM5 datasets with objective measures, showing significant comparative improvements. We provide additional insights on the discovery of novel regulatory or co-expressed links in the inferred Escherichia coli network, evaluated using the STRING database. The comparative pertinence of clustering is discussed computationally (SIMoNe, WGCNA, X-means) and biologically (RegulonDB). BRANE Clust software is available at: http://www-syscom.univ-mlv.fr/~pirayre/Codes-GRN-BRANE-clust.html.
7
Erdoğan C, Kurt Z, Diri B. Estimation of the proteomic cancer co-expression sub networks by using association estimators. PLoS One 2017; 12:e0188016. [PMID: 29145449 PMCID: PMC5690670 DOI: 10.1371/journal.pone.0188016] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 07/11/2017] [Accepted: 10/29/2017] [Indexed: 01/02/2023]
Abstract
In this study, the association estimators, which have a significant influence on gene network inference methods and are used for determining molecular interactions, were examined within the co-expression network inference concept. By using proteomic data from five different cancer types, the hub genes/proteins within the disease-associated gene-gene/protein-protein interaction sub networks were identified. Proteomic data from various cancer types were collected from The Cancer Proteome Atlas (TCPA). Nine correlation- and mutual information (MI)-based association estimators that are commonly used in the literature were compared in this study. As the gold standard to measure the association estimators’ performance, a multi-layer data integration platform on gene-disease associations (DisGeNET) and the Molecular Signatures Database (MSigDB) were used. Fisher's exact test was used to evaluate the performance of the association estimators by comparing the created co-expression networks with the disease-associated pathways. It was observed that the MI-based estimators provided more successful results than the Pearson and Spearman correlation approaches, which are used in the estimation of biological networks in the weighted correlation network analysis (WGCNA) package. In correlation-based methods, the best average success rate for five cancer types was 60%, while in MI-based methods the average success rate was 71% for the James-Stein Shrinkage (Shrink) and 64% for the Schurmann-Grassberger (SG) association estimator. Moreover, the hub genes and the inferred sub networks are presented for the consideration of researchers and experimentalists.
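The enrichment check described in this abstract can be illustrated with a one-sided Fisher's exact test on the overlap between the genes of an inferred sub network and a disease-associated gene set. The sketch below uses only the Python standard library; the function name, argument names, and toy numbers are hypothetical, and the study's own test configuration is not known from the abstract.

```python
from math import comb

def fisher_exact_greater(overlap, set_a, set_b, universe):
    """One-sided Fisher's exact test p-value.

    Probability of seeing at least `overlap` shared genes when a set of size
    `set_b` is drawn at random from `universe` genes, `set_a` of which belong
    to the reference gene set (hypergeometric upper tail).
    """
    total = comb(universe, set_b)
    upper = min(set_a, set_b)
    tail = sum(
        comb(set_a, k) * comb(universe - set_a, set_b - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total

# Hypothetical toy case: 10 measured genes, a 5-gene pathway,
# and a 5-gene sub network that matches the pathway exactly.
p_value = fisher_exact_greater(overlap=5, set_a=5, set_b=5, universe=10)
```

A small p-value suggests that the overlap is unlikely under random selection; with many sub networks and pathways, a multiple-testing correction would normally be applied on top of this.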
Affiliation(s)
- Cihat Erdoğan
- Department of Computer Engineering, Namik Kemal University, Tekirdag, Turkey
- Zeyneb Kurt
- Department of Integrative Biology and Physiology, University of California Los Angeles, Los Angeles, California, United States of America
- Department of Computer Engineering, Yildiz Technical University, Istanbul, Turkey
- Banu Diri
- Department of Computer Engineering, Yildiz Technical University, Istanbul, Turkey
8
Abstract
The inference of gene regulatory networks is an important process that contributes to a better understanding of biological and biomedical problems. These networks aim to capture the causal molecular interactions of biological processes and provide valuable information about normal cell physiology. In this book chapter, we introduce gene network inference (GNI) methods, namely C3NET, RN, ARACNE, CLR, and MRNET, and describe their components and working mechanisms. We present a comparison of the performance of these algorithms using the results of our previously published studies. According to the study results, which were obtained from simulated as well as expression data sets, the inference algorithm C3NET provides consistently better results than the other widely used methods.
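For orientation, the core selection idea usually attributed to C3NET is that each gene contributes at most one edge, namely the one to its neighbour with the highest mutual information. The sketch below illustrates only that step under simplifying assumptions: the function name, the fixed threshold standing in for a significance test, and the toy matrix are hypothetical, and this is not the published implementation.

```python
def max_mi_edges(mi, threshold=0.0):
    """For each gene, keep the undirected edge to its highest-MI neighbour.

    mi: symmetric matrix (list of lists) of mutual information scores.
    threshold: scores at or below this value are ignored (an assumed
    stand-in for a proper significance test).
    """
    edges = set()
    n = len(mi)
    for i in range(n):
        best_j, best = None, threshold
        for j in range(n):
            if j != i and mi[i][j] > best:
                best_j, best = j, mi[i][j]
        if best_j is not None:
            # Store each edge once, with the smaller index first.
            edges.add((min(i, best_j), max(i, best_j)))
    return edges

# Hypothetical 4-gene mutual information matrix.
mi = [
    [0.0, 0.9, 0.2, 0.1],
    [0.9, 0.0, 0.5, 0.3],
    [0.2, 0.5, 0.0, 0.4],
    [0.1, 0.3, 0.4, 0.0],
]
network = max_mi_edges(mi)
```

Because every gene adds at most one edge, the resulting network has no more edges than genes, which is what makes this kind of approach conservative compared with thresholding the whole matrix.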
9
Riccadonna S, Jurman G, Visintainer R, Filosi M, Furlanello C. DTW-MIC Coexpression Networks from Time-Course Data. PLoS One 2016; 11:e0152648. [PMID: 27031641 PMCID: PMC4816347 DOI: 10.1371/journal.pone.0152648] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Received: 10/30/2014] [Accepted: 03/17/2016] [Indexed: 01/01/2023]
Abstract
When modeling coexpression networks from high-throughput time course data, the Pearson Correlation Coefficient (PCC) is one of the most effective and popular similarity functions. However, its reliability is limited since it cannot capture non-linear interactions and time shifts. Here we propose to overcome these two issues by employing a novel similarity function, Dynamic Time Warping Maximal Information Coefficient (DTW-MIC), which combines a measure that captures functional interactions of signals (MIC) with a measure that identifies time lag (DTW). Using the Hamming-Ipsen-Mikhailov (HIM) metric to quantify network differences, the effectiveness of the DTW-MIC approach is demonstrated on four synthetic datasets and one transcriptomic dataset, also in comparison to TimeDelay ARACNE and Transfer Entropy.
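The time-lag component mentioned in this abstract, DTW, can be sketched with the classic dynamic-programming recurrence. The code below is a minimal plain-Python illustration with a hypothetical function name and toy signals; it covers DTW alone and does not reproduce the MIC term or the way DTW-MIC combines the two measures.

```python
def dtw_distance(a, b):
    """Dynamic Time Warping distance between two numeric sequences,
    using absolute difference as the local cost."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j],      # a[i-1] matched again (stretch b)
                cost[i][j - 1],      # b[j-1] matched again (stretch a)
                cost[i - 1][j - 1],  # advance both sequences
            )
    return cost[n][m]

# Hypothetical expression profiles: `delayed` is `signal` shifted by one time point.
signal = [0.0, 1.0, 2.0, 1.0, 0.0]
delayed = [0.0, 0.0, 1.0, 2.0, 1.0, 0.0]
shift_cost = dtw_distance(signal, delayed)
```

Because the warping path can repeat a time point, a pure delay like the one above is absorbed at no cost, whereas a point-by-point comparison of the two profiles would report a mismatch.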
Affiliation(s)
- Giuseppe Jurman
- Research and Innovation Centre, Fondazione Edmund Mach, San Michele all’Adige, Italy
- Roberto Visintainer
- Research and Innovation Centre, Fondazione Edmund Mach, San Michele all’Adige, Italy
- Michele Filosi
- Research and Innovation Centre, Fondazione Edmund Mach, San Michele all’Adige, Italy
- Cesare Furlanello
- Research and Innovation Centre, Fondazione Edmund Mach, San Michele all’Adige, Italy