1
McCain JSP, Britten GL, Hackett SR, Follows MJ, Li GW. Microbial reaction rate estimation using proteins and proteomes. bioRxiv: the preprint server for biology 2024:2024.08.13.607198. [PMID: 39185172 PMCID: PMC11343155 DOI: 10.1101/2024.08.13.607198] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/27/2024]
Abstract
Microbes transform their environments using diverse enzymatic reactions. However, it remains challenging to measure microbial reaction rates in natural environments. Despite advances in global quantification of enzyme abundances, the individual relationships between enzyme abundances and their reaction rates have not been systematically examined. Using matched proteomic and reaction rate data from microbial cultures, we show that the abundance of an enzyme is often insufficient to predict its corresponding reaction rate. However, we discovered that global proteomic measurements can be used to make accurate predictions of individual reaction rates (median R² = 0.78). Accurate rate predictions required only a small number of proteins, and they did not need explicit prior mechanistic knowledge or environmental context. These results indicate that proteomes are encoders of cellular reaction rates, potentially enabling proteomic measurements in situ to estimate the rates of microbially mediated reactions in natural systems.
Significance
One of the most basic phenotypes of a microbe is its set of associated reaction rates, but quantifying these rates in situ remains extremely challenging, especially in natural systems. We used molecular data and statistical models to estimate microbial rates in steady-state cultures. We found that many reaction rates are highly predictable using proteomic data, though single proteins are typically not informative for their associated reaction rates. This result suggests that gene expression data from complex microbial communities could be used to estimate in situ reaction rates, providing new clues into the lives and environmental function of microbes.
2
Chee FT, Harun S, Mohd Daud K, Sulaiman S, Nor Muhammad NA. Exploring gene regulation and biological processes in insects: Insights from omics data using gene regulatory network models. Progress in Biophysics and Molecular Biology 2024; 189:1-12. [PMID: 38604435 DOI: 10.1016/j.pbiomolbio.2024.04.002] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/02/2023] [Revised: 12/18/2023] [Accepted: 04/03/2024] [Indexed: 04/13/2024]
Abstract
A gene regulatory network (GRN) comprises complicated yet intertwined gene-regulator relationships. Understanding GRN dynamics helps unravel the complexity behind observed gene expression. Insect gene regulation is often complicated because of insects' complex life cycles and diverse ecological adaptations. The main aim of this review is to provide an update on current mathematical modelling methods for GRNs as applied to insect science. Several popular GRN architecture models are discussed, together with examples of applications in insect science. In the last part of this review, the models are compared in several respects, including network scalability, computational complexity, robustness to noise and biological relevance.
Affiliation(s)
- Fong Ting Chee
  - Institute of Systems Biology, Universiti Kebangsaan Malaysia, 43600 UKM Bangi, Selangor, Malaysia
- Sarahani Harun
  - Institute of Systems Biology, Universiti Kebangsaan Malaysia, 43600 UKM Bangi, Selangor, Malaysia
- Kauthar Mohd Daud
  - Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia, 43600 UKM Bangi, Selangor, Malaysia
- Suhaila Sulaiman
  - FGV R&D Sdn Bhd, FGV Innovation Center, PT23417 Lengkuk Teknologi, Bandar Baru Enstek, 71760 Nilai, Negeri Sembilan, Malaysia
- Nor Azlan Nor Muhammad
  - Institute of Systems Biology, Universiti Kebangsaan Malaysia, 43600 UKM Bangi, Selangor, Malaysia
3
Pinchas A, Ben-Gal I, Painsky A. A Comparative Analysis of Discrete Entropy Estimators for Large-Alphabet Problems. Entropy (Basel, Switzerland) 2024; 26:369. [PMID: 38785618 PMCID: PMC11120205 DOI: 10.3390/e26050369] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/15/2024] [Revised: 04/25/2024] [Accepted: 04/25/2024] [Indexed: 05/25/2024]
Abstract
This paper presents a comparative study of entropy estimation in a large-alphabet regime. A variety of entropy estimators have been proposed over the years, where each estimator is designed for a different setup with its own strengths and caveats. As a consequence, no estimator is known to be universally better than the others. This work addresses this gap by comparing twenty-one entropy estimators in the studied regime, starting with the simplest plug-in estimator and leading up to the most recent neural network-based and polynomial approximate estimators. Our findings show that the estimators' performance highly depends on the underlying distribution. Specifically, we distinguish between three types of distributions, ranging from uniform to degenerate distributions. For each class of distribution, we recommend the most suitable estimator. Further, we propose a sample-dependent approach, which again considers three classes of distribution, and report the top-performing estimators in each class. This approach provides a data-dependent framework for choosing the desired estimator in practical setups.
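To make the "simplest plug-in estimator" mentioned in this abstract concrete, here is a minimal sketch in Python. The plug-in estimator substitutes empirical frequencies into the Shannon entropy formula. The Miller-Madow correction shown beside it is a classical bias correction chosen here only for illustration; the abstract does not say whether it is among the twenty-one estimators compared. The function names are hypothetical.

```python
import math
from collections import Counter


def plugin_entropy(samples):
    """Plug-in (maximum-likelihood) entropy estimate in nats.

    Replaces the unknown probabilities with empirical frequencies.
    """
    counts = Counter(samples)
    n = len(samples)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def miller_madow_entropy(samples):
    """Plug-in estimate plus the Miller-Madow bias correction (m - 1) / (2n),

    where m is the number of distinct symbols actually observed.
    """
    n = len(samples)
    observed_symbols = len(set(samples))
    return plugin_entropy(samples) + (observed_symbols - 1) / (2 * n)


if __name__ == "__main__":
    data = list("abracadabra")
    print("plug-in:", plugin_entropy(data))
    print("Miller-Madow:", miller_madow_entropy(data))
```

In a large-alphabet regime many symbols are never observed, so the plug-in estimate tends to underestimate the true entropy; the correction term only partly compensates for this.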
Affiliation(s)
- Assaf Pinchas
  - School of Electrical Engineering, The Iby and Aladar Fleischman Faculty of Engineering, Tel Aviv University, Tel Aviv 6997801, Israel
- Irad Ben-Gal
  - Industrial Engineering Department, The Iby and Aladar Fleischman Faculty of Engineering, Tel Aviv University, Tel Aviv 6997801, Israel
- Amichai Painsky
  - Industrial Engineering Department, The Iby and Aladar Fleischman Faculty of Engineering, Tel Aviv University, Tel Aviv 6997801, Israel
4
Tian J, Lei J, Roeder K. From local to global gene co-expression estimation using single-cell RNA-seq data. Biometrics 2024; 80:ujae001. [PMID: 38465983 PMCID: PMC10926266 DOI: 10.1093/biomtc/ujae001] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/23/2023] [Revised: 10/01/2023] [Accepted: 01/15/2024] [Indexed: 03/12/2024]
Abstract
In genomics studies, the investigation of gene relationships often brings important biological insights. Currently, large heterogeneous datasets impose new challenges for statisticians because gene relationships are often local. They change from one sample point to another, may only exist in a subset of the sample, and can be nonlinear or even nonmonotone. Most previous dependence measures do not specifically target local dependence relationships, and the ones that do are computationally costly. In this paper, we explore a state-of-the-art network estimation technique that characterizes gene relationships at the single-cell level, under the name of cell-specific gene networks. We first show that averaging the cell-specific gene relationship over a population gives a novel univariate dependence measure, the averaged Local Density Gap (aLDG), that accumulates local dependence and can detect any nonlinear, nonmonotone relationship. Together with a consistent nonparametric estimator, we establish its robustness on both the population and empirical levels. Then, we show that averaging the cell-specific gene relationship over mini-batches determined by some external structure information (e.g., a spatial or temporal factor) better highlights meaningful local structure change points. We explore the application of aLDG and its mini-batch variant in many scenarios, including pairwise gene relationship estimation, bifurcating point detection in cell trajectory, and spatial transcriptomics structure visualization. Both simulations and real data analysis show that aLDG outperforms existing dependence measures.
Affiliation(s)
- Jinjin Tian
  - Department of Statistics and Data Science, Carnegie Mellon University, 15213, Pittsburgh, PA, United States
- Jing Lei
  - Department of Statistics and Data Science, Carnegie Mellon University, 15213, Pittsburgh, PA, United States
- Kathryn Roeder
  - Department of Statistics and Data Science, Carnegie Mellon University, 15213, Pittsburgh, PA, United States
5
Alanis-Lobato G, Bartlett TE, Huang Q, Simon CS, McCarthy A, Elder K, Snell P, Christie L, Niakan KK. MICA: a multi-omics method to predict gene regulatory networks in early human embryos. Life Sci Alliance 2024; 7:e202302415. [PMID: 37879938 PMCID: PMC10599980 DOI: 10.26508/lsa.202302415] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/04/2023] [Revised: 10/12/2023] [Accepted: 10/13/2023] [Indexed: 10/27/2023]
Abstract
Recent advances in single-cell omics have transformed characterisation of cell types in challenging-to-study biological contexts. In contexts with limited single-cell samples, such as the early human embryo, inference of transcription factor-gene regulatory network (GRN) interactions is especially difficult. Here, we assessed the application of different linear or non-linear GRN predictions to single-cell simulated and human embryo transcriptome datasets. We also compared how expression normalisation affects GRN predictions, finding that transcripts per million reads outperformed alternative methods. GRN inferences were more reproducible using a non-linear method based on mutual information (MI) applied to single-cell transcriptome datasets refined with chromatin accessibility (CA) (called MICA), compared with the alternative network prediction methods tested. MICA captures complex non-monotonic dependencies and feedback loops. Using MICA, we generated the first GRN inferences in early human development. MICA predicted co-localisation of the AP-1 transcription factor subunit proto-oncogene JUND and the TFAP2C transcription factor AP-2γ in early human embryos. Overall, our comparative analysis of GRN prediction methods defines a pipeline that can be applied to single-cell multi-omics datasets in especially challenging contexts to infer interactions between transcription factor expression and target gene regulation.
Affiliation(s)
- Qiulin Huang
  - Human Embryo and Stem Cell Laboratory, The Francis Crick Institute, London, UK
  - https://ror.org/013meh722 Department of Physiology, Development and Neuroscience, The Centre for Trophoblast Research, University of Cambridge, Cambridge, UK
- Claire S Simon
  - Human Embryo and Stem Cell Laboratory, The Francis Crick Institute, London, UK
- Afshan McCarthy
  - Human Embryo and Stem Cell Laboratory, The Francis Crick Institute, London, UK
- Kathy K Niakan
  - Human Embryo and Stem Cell Laboratory, The Francis Crick Institute, London, UK
  - https://ror.org/013meh722 Department of Physiology, Development and Neuroscience, The Centre for Trophoblast Research, University of Cambridge, Cambridge, UK
  - https://ror.org/013meh722 Wellcome - Medical Research Council Cambridge Stem Cell Institute, Jeffrey Cheah Biomedical Centre, University of Cambridge, Cambridge, UK
  - Epigenetics Programme, Babraham Institute, Cambridge, UK
6
Karell-Albo JA, Legón-Pérez CM, Socorro-Llanes R, Rojas O, Sosa-Gómez G. Complexity Reduction in Analyzing Independence between Statistical Randomness Tests Using Mutual Information. Entropy (Basel, Switzerland) 2023; 25:1545. [PMID: 37998237 PMCID: PMC10670732 DOI: 10.3390/e25111545] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/11/2023] [Revised: 11/09/2023] [Accepted: 11/13/2023] [Indexed: 11/25/2023]
Abstract
The advantages of using mutual information to evaluate the correlation between randomness tests have recently been demonstrated. However, it has been pointed out that the high complexity of this method limits its application in batteries with a greater number of tests. The main objective of this work is to reduce the complexity of the method based on mutual information for analyzing the independence between the statistical tests of randomness. The achieved complexity reduction is estimated theoretically and verified experimentally. A variant of the original method is proposed by modifying the step in which the significant values of the mutual information are determined. The correlation between the NIST battery tests was studied, and it was concluded that the modifications to the method do not significantly affect the ability to detect correlations. Due to the efficiency of the newly proposed method, its use is recommended to analyze other batteries of tests.
Affiliation(s)
- Jorge Augusto Karell-Albo
  - Instituto de Criptografía, Facultad de Matemática y Computación, Universidad de la Habana, Habana 10400, Cuba
- Carlos Miguel Legón-Pérez
  - Instituto de Criptografía, Facultad de Matemática y Computación, Universidad de la Habana, Habana 10400, Cuba
- Raisa Socorro-Llanes
  - Facultad de Ingeniería Informática, Universidad Tecnológica de la Habana José Antonio Echeverría (CUJAE), Habana 19390, Cuba
- Omar Rojas
  - Facultad de Ciencias Económicas y Empresariales, Universidad Panamericana, Álvaro del Portillo 49, Zapopan 45010, Jalisco, Mexico
- Guillermo Sosa-Gómez
  - Facultad de Ciencias Económicas y Empresariales, Universidad Panamericana, Álvaro del Portillo 49, Zapopan 45010, Jalisco, Mexico
7
Schiffthaler B, van Zalen E, Serrano AR, Street NR, Delhomme N. Seiðr: Efficient calculation of robust ensemble gene networks. Heliyon 2023; 9:e16811. [PMID: 37313140 PMCID: PMC10258422 DOI: 10.1016/j.heliyon.2023.e16811] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 08/13/2022] [Revised: 05/22/2023] [Accepted: 05/29/2023] [Indexed: 06/15/2023]
Abstract
Gene regulatory and gene co-expression networks are powerful research tools for identifying biological signal within high-dimensional gene expression data. In recent years, research has focused on addressing shortcomings of these techniques with regard to the low signal-to-noise ratio, non-linear interactions and dataset-dependent biases of published methods. Furthermore, it has been shown that aggregating networks from multiple methods provides improved results. Despite this, few usable and scalable software tools have been implemented to perform such best-practice analyses. Here, we present Seidr (stylized Seiðr), a software toolkit designed to assist scientists in gene regulatory and gene co-expression network inference. Seidr creates community networks to reduce algorithmic bias and utilizes noise-corrected network backboning to prune noisy edges in the networks. Using benchmarks in real-world conditions across three eukaryotic model organisms, Saccharomyces cerevisiae, Drosophila melanogaster, and Arabidopsis thaliana, we show that individual algorithms are biased toward functional evidence for certain gene-gene interactions. We further demonstrate that the community network is less biased, providing robust performance across different standards and comparisons for the model organisms. Finally, we apply Seidr to a network of drought stress in Norway spruce (Picea abies (L.) H. Karst) as an example application in a non-model species. We demonstrate the use of a network inferred using Seidr for identifying key components and communities and for suggesting gene function for non-annotated genes.
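The idea of aggregating networks from multiple inference methods into a community network can be sketched with a simple mean-rank scheme. This is a generic illustration under stated assumptions, not Seidr's implementation: the abstract does not describe which aggregation rules the toolkit offers, and the function names, the edge representation (tuples of gene names) and the worst-rank treatment of missing edges are all hypothetical choices made here.

```python
def rank_edges(scores):
    """Map each edge to its rank under one method (1 = strongest score).

    Ties are broken by insertion order because Python's sort is stable.
    """
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {edge: rank for rank, edge in enumerate(ordered, start=1)}


def aggregate_by_mean_rank(method_scores):
    """Combine several edge-score dictionaries into one mean-rank dictionary.

    An edge that a method did not score receives the worst possible rank
    (an assumption of this sketch). Lower mean rank = stronger consensus.
    """
    all_edges = set()
    for scores in method_scores:
        all_edges.update(scores)
    worst_rank = len(all_edges)
    rankings = [rank_edges(scores) for scores in method_scores]
    return {
        edge: sum(r.get(edge, worst_rank) for r in rankings) / len(rankings)
        for edge in all_edges
    }


if __name__ == "__main__":
    correlation_like = {("A", "B"): 0.9, ("A", "C"): 0.2, ("B", "C"): 0.5}
    mi_like = {("A", "B"): 0.3, ("A", "C"): 0.1, ("B", "C"): 0.7}
    consensus = aggregate_by_mean_rank([correlation_like, mi_like])
    for edge, mean_rank in sorted(consensus.items(), key=lambda kv: kv[1]):
        print(edge, mean_rank)
```

Working on ranks rather than raw scores sidesteps the fact that different inference methods report scores on incomparable scales.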
Affiliation(s)
- Bastian Schiffthaler
  - Department of Plant Physiology, Umea Plant Science Center, Umea University, Umea, Sweden
- Elena van Zalen
  - Department of Plant Physiology, Umea Plant Science Center, Umea University, Umea, Sweden
- Alonso R. Serrano
  - Department of Plant Physiology, Umea Plant Science Center, Swedish University of Agricultural Sciences, Umea, Sweden
- Nathaniel R. Street
  - Department of Plant Physiology, Umea Plant Science Center, Umea University, Umea, Sweden
- Nicolas Delhomme
  - Department of Plant Physiology, Umea Plant Science Center, Swedish University of Agricultural Sciences, Umea, Sweden
8
Ito Y, Uda S, Kokaji T, Hirayama A, Soga T, Suzuki Y, Kuroda S, Kubota H. Comparison of hepatic responses to glucose perturbation between healthy and obese mice based on the edge type of network structures. Sci Rep 2023; 13:4758. [PMID: 36959243 PMCID: PMC10036622 DOI: 10.1038/s41598-023-31547-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/14/2022] [Accepted: 03/14/2023] [Indexed: 03/25/2023]
Abstract
Interactions between various molecular species in biological phenomena give rise to numerous networks. The investigation of these networks, including their statistical and biochemical interactions, supports a deeper understanding of biological phenomena. The clustering of nodes associated with molecular species and enrichment analysis is frequently applied to examine the biological significance of such network structures. However, these methods focus on delineating the function of a node. As such, in-depth investigations of the edges, which are the connections between the nodes, are rarely explored. In the current study, we aimed to investigate the functions of the edges rather than the nodes. To accomplish this, for each network, we categorized the edges and defined the edge type based on their biological annotations. Subsequently, we used the edge type to compare the network structures of the metabolome and transcriptome in the livers of healthy (wild-type) and obese (ob/ob) mice following oral glucose administration (OGTT). The findings demonstrate that the edge type can facilitate the characterization of the state of a network structure, thereby reducing the information available through datasets containing the OGTT response in the metabolome and transcriptome.
Affiliation(s)
- Yuki Ito
  - Division of Integrated Omics, Medical Research Center for High Depth Omics, Medical Institute of Bioregulation, Kyushu University, 3-1-1 Maidashi, Higashi-ku, Fukuoka, 812-8582, Japan
  - Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, University of Tokyo, 5-1-5 Kashiwanoha, Kashiwa, Chiba, 277-8562, Japan
- Shinsuke Uda
  - Division of Integrated Omics, Medical Research Center for High Depth Omics, Medical Institute of Bioregulation, Kyushu University, 3-1-1 Maidashi, Higashi-ku, Fukuoka, 812-8582, Japan
- Toshiya Kokaji
  - Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, University of Tokyo, 5-1-5 Kashiwanoha, Kashiwa, Chiba, 277-8562, Japan
  - Data Science Center, Nara Institute of Science and Technology, 8916-5, Takayamacho, Ikoma, Nara, 630-0192, Japan
- Akiyoshi Hirayama
  - Institute for Advanced Biosciences, Keio University, 246-2 Mizukami, Kakuganji, Tsuruoka, Yamagata, 997-0052, Japan
- Tomoyoshi Soga
  - Institute for Advanced Biosciences, Keio University, 246-2 Mizukami, Kakuganji, Tsuruoka, Yamagata, 997-0052, Japan
- Yutaka Suzuki
  - Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, University of Tokyo, 5-1-5 Kashiwanoha, Kashiwa, Chiba, 277-8562, Japan
- Shinya Kuroda
  - Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, University of Tokyo, 5-1-5 Kashiwanoha, Kashiwa, Chiba, 277-8562, Japan
  - Department of Biological Sciences, Graduate School of Science, University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo, 113-0033, Japan
  - Core Research for Evolutional Science and Technology (CREST), Japan Science and Technology Agency, Bunkyo-ku, Tokyo, 113-0033, Japan
- Hiroyuki Kubota
  - Division of Integrated Omics, Medical Research Center for High Depth Omics, Medical Institute of Bioregulation, Kyushu University, 3-1-1 Maidashi, Higashi-ku, Fukuoka, 812-8582, Japan
9
Shachaf LI, Roberts E, Cahan P, Xiao J. Gene regulation network inference using k-nearest neighbor-based mutual information estimation: revisiting an old DREAM. BMC Bioinformatics 2023; 24:84. [PMID: 36879188 PMCID: PMC9990267 DOI: 10.1186/s12859-022-05047-5] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 12/14/2021] [Accepted: 11/08/2022] [Indexed: 03/08/2023]
Abstract
BACKGROUND: A cell exhibits a variety of responses to internal and external cues. These responses are possible, in part, due to the presence of an elaborate gene regulatory network (GRN) in every single cell. In the past 20 years, many groups worked on reconstructing the topological structure of GRNs from large-scale gene expression data using a variety of inference algorithms. Insights gained about participating players in GRNs may ultimately lead to therapeutic benefits. Mutual information (MI) is a widely used metric within this inference/reconstruction pipeline as it can detect any correlation (linear and non-linear) between any number of variables (n-dimensions). However, the use of MI with continuous data (for example, normalized fluorescence intensity measurement of gene expression levels) is sensitive to data size, correlation strength and underlying distributions, and often requires laborious and, at times, ad hoc optimization.
RESULTS: In this work, we first show that estimating MI of a bi- and tri-variate Gaussian distribution using k-nearest neighbor (kNN) MI estimation results in significant error reduction as compared to commonly used methods based on fixed binning. Second, we demonstrate that implementing the MI-based kNN Kraskov-Stögbauer-Grassberger (KSG) algorithm leads to a significant improvement in GRN reconstruction for popular inference algorithms, such as Context Likelihood of Relatedness (CLR). Finally, through extensive in-silico benchmarking we show that a new inference algorithm CMIA (Conditional Mutual Information Augmentation), inspired by CLR, in combination with the KSG-MI estimator, outperforms commonly used methods.
CONCLUSIONS: Using three canonical datasets containing 15 synthetic networks, the newly developed method for GRN reconstruction, which combines CMIA and the KSG-MI estimator, achieves an improvement of 20-35% in precision-recall measures over the current gold standard in the field. This new method will enable researchers to discover new gene interactions or better choose gene candidates for experimental validations.
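The kNN-based Kraskov-Stögbauer-Grassberger estimator named in this abstract can be sketched for two scalar variables as follows. This is a minimal pure-Python version of the first KSG estimator, I(X;Y) ≈ ψ(k) + ψ(N) − ⟨ψ(n_x + 1) + ψ(n_y + 1)⟩, using the max-norm in the joint space; it is not the authors' code, it runs in O(N²) time, and the function names and the default k = 3 are assumptions made for illustration.

```python
import math
import random

EULER_GAMMA = 0.5772156649015329


def digamma_int(n):
    """Digamma at a positive integer: psi(n) = -gamma + sum_{j=1}^{n-1} 1/j."""
    return -EULER_GAMMA + sum(1.0 / j for j in range(1, n))


def ksg_mutual_information(xs, ys, k=3):
    """KSG (algorithm 1) estimate of I(X;Y) in nats for paired scalar samples."""
    n = len(xs)
    total = 0.0
    for i in range(n):
        # Max-norm distances from point i to every other point in (x, y) space.
        joint = sorted(
            max(abs(xs[i] - xs[j]), abs(ys[i] - ys[j]))
            for j in range(n) if j != i
        )
        eps = joint[k - 1]  # distance to the k-th nearest neighbour
        # Count marginal neighbours strictly closer than eps.
        nx = sum(1 for j in range(n) if j != i and abs(xs[i] - xs[j]) < eps)
        ny = sum(1 for j in range(n) if j != i and abs(ys[i] - ys[j]) < eps)
        total += digamma_int(nx + 1) + digamma_int(ny + 1)
    return digamma_int(k) + digamma_int(n) - total / n


if __name__ == "__main__":
    rng = random.Random(0)
    x = [rng.gauss(0, 1) for _ in range(300)]
    rho = 0.9
    y = [rho * a + math.sqrt(1 - rho**2) * rng.gauss(0, 1) for a in x]
    print("KSG estimate:", ksg_mutual_information(x, y))
    print("analytic value for a bivariate Gaussian:", -0.5 * math.log(1 - rho**2))
```

Unlike fixed binning, the neighbourhood size adapts to the local sample density, which is the property the abstract credits for the error reduction on Gaussian test cases.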
Affiliation(s)
- Lior I Shachaf
  - Department of Biophysics, Johns Hopkins University, 3400 N. Charles Street, Baltimore, MD, 21218, USA
- Elijah Roberts
  - Department of Biophysics, Johns Hopkins University, 3400 N. Charles Street, Baltimore, MD, 21218, USA
  - 10x Genomics, 6230 Stoneridge Mall Road, Pleasanton, CA, 94588-3260, USA
- Patrick Cahan
  - Department of Biomedical Engineering, Department of Molecular Biology and Genetics, Institute for Cell Engineering, Johns Hopkins School of Medicine, 733 N. Broadway, Baltimore, MD, 21205, USA
- Jie Xiao
  - Department of Biophysics and Biophysical Chemistry, Johns Hopkins School of Medicine, 725 N. Wolfe Street, WBSB 708, Baltimore, MD, 21205, USA
10
Afshar S, Braun PR, Han S, Lin Y. A multimodal deep learning model to infer cell-type-specific functional gene networks. BMC Bioinformatics 2023; 24:47. [PMID: 36788477 PMCID: PMC9926713 DOI: 10.1186/s12859-023-05146-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/18/2022] [Accepted: 01/11/2023] [Indexed: 02/16/2023]
Abstract
BACKGROUND: Functional gene networks (FGNs) capture functional relationships among genes that vary across tissues and cell types. Construction of cell-type-specific FGNs enables the understanding of cell-type-specific functional gene relationships and insights into genetic mechanisms of human diseases in disease-relevant cell types. However, most existing FGNs were developed without consideration of specific cell types within tissues.
RESULTS: In this study, we created a multimodal deep learning model (MDLCN) to predict cell-type-specific FGNs in the human brain by integrating single-nuclei gene expression data with global protein interaction networks. We systematically evaluated the prediction performance of the MDLCN and showed its superior performance compared to two baseline models (boosting tree and convolutional neural network). Based on the predicted cell-type-specific FGNs, we observed that cell-type marker genes had a higher level of hubness than non-marker genes in their corresponding cell type. Furthermore, we showed that risk genes underlying autism and Alzheimer's disease were more strongly connected in disease-relevant cell types, supporting the cellular context of predicted cell-type-specific FGNs.
CONCLUSIONS: Our study proposes a powerful deep learning approach (MDLCN) to predict FGNs underlying a diverse set of cell types in human brain. The MDLCN model enhances prediction accuracy of cell-type-specific FGNs compared to single modality convolutional neural network (CNN) and boosting tree models, as shown by higher areas under both receiver operating characteristic (ROC) and precision-recall curves for different levels of independent test datasets. The predicted FGNs also show evidence for the cellular context and distinct topological features (i.e. higher hubness and topological score) of cell-type marker genes. Moreover, we observed stronger modularity among disease-associated risk genes in FGNs of disease-relevant cell types. For example, the strength of connectivity among autism risk genes was stronger in neurons, but risk genes underlying Alzheimer's disease were more connected in microglia.
Affiliation(s)
- Shiva Afshar
  - Department of Industrial Engineering, University of Houston, Houston, TX 77204, USA
- Patricia R. Braun
  - Department of Psychiatry and Behavioral Sciences, Johns Hopkins School of Medicine, Baltimore, MD 21287, USA
- Shizhong Han
  - Department of Psychiatry and Behavioral Sciences, Johns Hopkins School of Medicine, Baltimore, MD 21287, USA
  - Lieber Institute for Brain Development, Baltimore, MD 21205, USA
- Ying Lin
  - Department of Industrial Engineering, University of Houston, Houston, TX 77204, USA
11
Altay G, Zapardiel-Gonzalo J, Peters B. RNA-seq preprocessing and sample size considerations for gene network inference. bioRxiv: the preprint server for biology 2023:2023.01.02.522518. [PMID: 36711979 PMCID: PMC9881880 DOI: 10.1101/2023.01.02.522518] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/18/2023]
Abstract
Background: Gene network inference (GNI) methods have the potential to reveal functional relationships between different genes and their products. Most GNI algorithms have been developed for microarray gene expression datasets, and their application to RNA-seq data is relatively recent. As the characteristics of RNA-seq data differ from those of microarray data, it remains unanswered which preprocessing methods should be applied to RNA-seq data prior to GNI to attain optimal performance, and what sample size is required to obtain reliable GNI estimates from RNA-seq data.
Results: We ran 9144 analyses on 7 different RNA-seq datasets to evaluate 300 different preprocessing combinations that include data transformations, normalizations and association estimators. We found that there was no single best-performing preprocessing combination but that there were several good ones. The performance varied widely across datasets, which emphasizes the importance of choosing an appropriate preprocessing configuration before GNI. Two preprocessing combinations appeared promising in general: first, log2 TPM (transcripts per million) with variance-stabilizing transformation (VST) and the Pearson correlation coefficient (PCC) as association estimator; second, raw RNA-seq count data with PCC. Along with these two, we also identified 18 other good preprocessing combinations. Any of these combinations might perform best on a given dataset. Therefore, the GNI performance of these approaches should be measured on any new dataset to select the best-performing one for it. In terms of the required biological sample size of RNA-seq data, we found that between 30 and 85 samples were required to generate reliable GNI estimates.
Conclusions: This study provides practical recommendations on default choices for data preprocessing prior to GNI analysis of RNA-seq data to obtain optimal performance results.
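One of the preprocessing combinations highlighted in this abstract, log2 TPM followed by a Pearson correlation estimate, can be sketched as below. This is a simplified illustration, not the authors' pipeline: the variance-stabilizing transformation is omitted, the pseudocount of 1 inside the logarithm is an assumption, and the function names and toy numbers are hypothetical.

```python
import math


def tpm(counts, gene_lengths):
    """Transcripts per million for one sample.

    Each count is divided by its gene length, then the length-normalised
    rates are rescaled so that they sum to one million.
    """
    rates = [c / length for c, length in zip(counts, gene_lengths)]
    total = sum(rates)
    return [r / total * 1e6 for r in rates]


def log2_tpm(counts, gene_lengths, pseudocount=1.0):
    """log2(TPM + pseudocount); the pseudocount avoids log2(0)."""
    return [math.log2(v + pseudocount) for v in tpm(counts, gene_lengths)]


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences.

    Raises ZeroDivisionError if either sequence has zero variance.
    """
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


if __name__ == "__main__":
    gene_lengths = [1.0, 2.0, 0.5]  # kilobases, toy values
    counts_by_sample = [            # rows: samples, columns: genes
        [100, 400, 10],
        [200, 700, 40],
        [50, 250, 5],
        [300, 900, 80],
    ]
    expr = [log2_tpm(sample, gene_lengths) for sample in counts_by_sample]
    gene0 = [row[0] for row in expr]
    gene1 = [row[1] for row in expr]
    print("PCC(gene 0, gene 1):", pearson(gene0, gene1))
```

Because TPM rescales each sample to a fixed total, it is compositional: an increase in one gene lowers the TPM of the others, which is one reason the abstract recommends measuring performance on each new dataset rather than assuming a single best configuration.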
Affiliation(s)
- Gökmen Altay
  - La Jolla Institute for Immunology, 9420 Athena Circle, La Jolla, CA 92037, USA
- Bjoern Peters
  - La Jolla Institute for Immunology, 9420 Athena Circle, La Jolla, CA 92037, USA
12
Tu M, Zeng J, Zhang J, Fan G, Song G. Unleashing the power within short-read RNA-seq for plant research: Beyond differential expression analysis and toward regulomics. Frontiers in Plant Science 2022; 13:1038109. [PMID: 36570898 PMCID: PMC9773216 DOI: 10.3389/fpls.2022.1038109] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/06/2022] [Accepted: 11/21/2022] [Indexed: 06/17/2023]
Abstract
RNA-seq has become a state-of-the-art technique for transcriptomic studies. Advances in both RNA-seq techniques and the corresponding analysis tools and pipelines have shaped our understanding of almost every aspect of plant science to an unprecedented degree. Notably, the integration of large amounts of RNA-seq data with other omic data sets in model plants and major crop species has facilitated plant regulomics, while RNA-seq analysis is still used primarily for differential expression analysis in many less-studied plant species. To unleash the analytical power of RNA-seq in plant species, especially less-studied species and biomass crops, we summarize recent achievements of RNA-seq analysis in the major plant species and representative tools in four types of application: (1) transcriptome assembly, (2) construction of expression atlases, (3) network analysis, and (4) structural alteration. We emphasize the importance of expression atlases, coexpression networks and predictions of gene regulatory relationships in moving plant transcriptomes toward regulomics, an omic view of genome-wide transcription regulation. We highlight what can be achieved in plant research with RNA-seq by introducing a list of representative RNA-seq analysis tools and resources that were developed for certain minor species or are suitable for analysis without species limitation. In summary, we provide an updated digest of RNA-seq tools, resources and their diverse applications for plant research, and our perspective on the power and challenges of short-read RNA-seq analysis from a regulomic point of view. Full utilization of these RNA-seq resources will advance plant omic research, especially in less-studied species.
Affiliation(s)
- Min Tu
  - School of Chemical and Environmental Engineering, Wuhan Polytechnic University, Wuhan, China
- Jian Zeng
  - Guangdong Provincial Key Laboratory of Utilization and Conservation of Food and Medicinal Resources in Northern Region, Shaoguan University, Shaoguan, Guangdong, China
- Juntao Zhang
  - School of Chemical and Environmental Engineering, Wuhan Polytechnic University, Wuhan, China
- Guozhi Fan
  - School of Chemical and Environmental Engineering, Wuhan Polytechnic University, Wuhan, China
- Guangsen Song
  - School of Chemical and Environmental Engineering, Wuhan Polytechnic University, Wuhan, China
|
13
|
Lei J, Cai Z, He X, Zheng W, Liu J. An approach of gene regulatory network construction using mixed entropy optimizing context-related likelihood mutual information. Bioinformatics 2022; 39:6808612. [PMID: 36342190 PMCID: PMC9805593 DOI: 10.1093/bioinformatics/btac717] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/25/2022] [Revised: 09/18/2022] [Accepted: 11/04/2022] [Indexed: 11/09/2022]
Abstract
MOTIVATION The question of how to construct gene regulatory networks has long been a focus of biological research. Mutual information can be used to measure nonlinear relationships, and it has been widely used in the construction of gene regulatory networks. However, this method cannot measure indirect regulatory relationships under the influence of multiple genes, which reduces the accuracy of inferring gene regulatory networks. APPROACH This work proposes a method for constructing gene regulatory networks based on mixed entropy optimizing context-related likelihood mutual information (MEOMI). First, two entropy estimators were combined to calculate the mutual information between genes. Then, distribution optimization was performed using a context-related likelihood algorithm to eliminate some indirect regulatory relationships and obtain the initial gene regulatory network. To capture the complex interactions between genes and eliminate redundant edges in the network, the initial gene regulatory network was further optimized by calculating the conditional mutual inclusive information (CMI2) between gene pairs under the influence of multiple genes. The network was iteratively updated to reduce the overestimation of direct regulatory intensity by mutual information. RESULTS The experimental results show that the MEOMI method performed better than several other gene network construction methods on DREAM challenge simulated datasets (DREAM3 and DREAM5), three real Escherichia coli datasets (E. coli SOS pathway network, E. coli SOS DNA repair network and E. coli community network) and two human datasets. AVAILABILITY AND IMPLEMENTATION Source code and dataset are available at https://github.com/Dalei-Dalei/MEOMI/ and http://122.205.95.139/MEOMI/. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
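The first step named in this abstract, estimating the mutual information between two genes, can be illustrated with a simple plug-in estimator on binned expression values. This is only a minimal sketch: the equal-width binning, the `n_bins` parameter and all function names are assumptions made for illustration, and it does not reproduce MEOMI's mixed entropy estimator, its context-related likelihood step, or CMI2.

```python
import math
from collections import Counter


def discretize(values, n_bins=3):
    """Equal-width binning of a list of floats into integer bin labels."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / n_bins
    # Clamp the maximum value into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def entropy(labels):
    """Plug-in (maximum likelihood) Shannon entropy in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def mutual_information(x, y, n_bins=3):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), estimated from binned samples."""
    xb, yb = discretize(x, n_bins), discretize(y, n_bins)
    return entropy(xb) + entropy(yb) - entropy(list(zip(xb, yb)))


# Hypothetical expression profiles for two genes across six samples.
gene_a = [0.1, 0.2, 0.9, 1.0, 0.5, 0.6]
gene_b = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]  # constant, so it carries no information
print(mutual_information(gene_a, gene_a))
print(mutual_information(gene_a, gene_b))
```

Plug-in estimates of this kind are biased for small samples, which is one reason methods in this area combine or correct entropy estimators.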
Affiliation(s)
- Jimeng Lei
  - National Key Laboratory of Crop Genetic Improvement, Huazhong Agricultural University, Wuhan 430070, China
  - Key Laboratory of Smart Farming for Agricultural Animals, Huazhong Agricultural University, Wuhan 430070, China
  - College of Informatics, Huazhong Agricultural University, Wuhan 430070, China
- Zongheng Cai
  - National Key Laboratory of Crop Genetic Improvement, Huazhong Agricultural University, Wuhan 430070, China
  - Key Laboratory of Smart Farming for Agricultural Animals, Huazhong Agricultural University, Wuhan 430070, China
  - College of Informatics, Huazhong Agricultural University, Wuhan 430070, China
- Xinyi He
  - College of Informatics, Huazhong Agricultural University, Wuhan 430070, China
- Wanting Zheng
  - College of Informatics, Huazhong Agricultural University, Wuhan 430070, China
|
14
|
Functional Network: A Novel Framework for Interpretability of Deep Neural Networks. Neurocomputing 2022. [DOI: 10.1016/j.neucom.2022.11.035] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/23/2022]
|
15
|
Feng H, Zheng R, Wang J, Wu FX, Li M. NIMCE: A Gene Regulatory Network Inference Approach Based on Multi Time Delays Causal Entropy. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2022; 19:1042-1049. [PMID: 33035155 DOI: 10.1109/tcbb.2020.3029846] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 06/11/2023]
Abstract
Gene regulatory networks (GRNs) are involved in various biological processes, such as the cell cycle, differentiation and apoptosis. The large amount of existing expression data, especially time-series expression data, provides an opportunity to infer GRNs by computational methods. These data can reveal the dynamics of gene expression and imply the regulatory relationships among genes. However, identifying indirect regulatory links is still a major challenge, as most studies treat time points as independent observations and ignore the influence of time delays. In this study, we propose a GRN inference method based on information-theoretic measures, called NIMCE. NIMCE uses transfer entropy to measure the regulatory links between each pair of genes, then applies causation entropy to filter indirect relationships. In addition, NIMCE applies multiple time delays to identify indirect regulatory relationships from candidate genes. Experiments on simulated and colorectal cancer data show that NIMCE outperforms other competing methods. All data and code used in this study are publicly available at https://github.com/CSUBioGroup/NIMCE.
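Transfer entropy, the pairwise measure this abstract refers to, can be sketched for two discrete time series as follows. This is a plug-in estimate with a history length of one and a single `delay` parameter; the function name, the history length and the toy series are illustrative assumptions, and the sketch does not implement NIMCE's causation-entropy filtering or its handling of multiple delays.

```python
import math
from collections import Counter


def transfer_entropy(source, target, delay=1):
    """Plug-in transfer entropy (bits) from `source` to `target`.

    Compares how well target[t] is predicted from (target[t-1], source[t-delay])
    versus from target[t-1] alone. Both series must be discrete and equally long.
    """
    start = max(1, delay)
    triples = [(target[t], target[t - 1], source[t - delay])
               for t in range(start, len(target))]
    n = len(triples)
    c_full = Counter(triples)                               # (next, prev, lagged source)
    c_cond = Counter((yp, xs) for _, yp, xs in triples)     # (prev, lagged source)
    c_pair = Counter((yn, yp) for yn, yp, _ in triples)     # (next, prev)
    c_prev = Counter(yp for _, yp, _ in triples)            # prev
    te = 0.0
    for (yn, yp, xs), c in c_full.items():
        p_with_source = c / c_cond[(yp, xs)]
        p_without_source = c_pair[(yn, yp)] / c_prev[yp]
        te += (c / n) * math.log2(p_with_source / p_without_source)
    return te


# Toy data: a binary de Bruijn sequence as the regulator, and a target that
# copies the regulator with a one-step delay.
regulator = [int(ch) for ch in "0000100110101111"] * 4
target = [regulator[-1]] + regulator[:-1]
print(transfer_entropy(regulator, target))  # large: the regulator drives the target
print(transfer_entropy(target, regulator))  # near zero: no information flows back
```

The asymmetry between the two directions is what makes transfer entropy usable for orienting edges, which symmetric measures such as mutual information cannot do.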
|
16
|
Karkowska R, Urjasz S. Linear and Nonlinear Effects in Connectedness Structure: Comparison between European Stock Markets. Entropy (Basel, Switzerland) 2022; 24:303. [PMID: 35205597 PMCID: PMC8870905 DOI: 10.3390/e24020303] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 02/10/2022] [Revised: 02/17/2022] [Accepted: 02/18/2022] [Indexed: 12/04/2022]
Abstract
The purpose of this research is to compare the risk transfer structure in Central and Eastern European and Western European stock markets during the 2007-2009 financial crisis and the COVID-19 pandemic. Similar to the global financial crisis (GFC), the spread of coronavirus (COVID-19) created a significant level of risk, causing investors to suffer losses in a very short period of time. We use a variety of methods, including nonstandard ones such as mutual information and transfer entropy. The results that we obtained indicate that there are significant nonlinear correlations in the capital markets that can be practically applied to investment portfolio optimization. From an investor's perspective, our findings suggest that in the wake of a global crisis and pandemic outbreak, the benefits of diversification will be limited by the transfer of funds between developed and developing country markets. Our study provides insight into risk transfer theory in developed and emerging markets as well as a cutting-edge methodology designed for analyzing the connectedness of markets. We contribute to the studies that have examined the responses of different stock markets to different turbulences. The study confirms that specific market effects can still play a significant role because of the interconnection of different sectors of the global economy.
Affiliation(s)
- Renata Karkowska
  - Faculty of Management, University of Warsaw, Szturmowa Street 1/3, 02-678 Warsaw, Poland
- Szczepan Urjasz
  - Faculty of Management, University of Warsaw, Szturmowa Street 1/3, 02-678 Warsaw, Poland
|
17
|
Identifying large scale interaction atlases using probabilistic graphs and external knowledge. J Clin Transl Sci 2022; 6:e27. [PMID: 35321220 PMCID: PMC8922291 DOI: 10.1017/cts.2022.18] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/27/2021] [Revised: 12/29/2021] [Accepted: 02/07/2022] [Indexed: 11/17/2022]
Abstract
Introduction: Reconstruction of gene interaction networks from experimental data provides a deep understanding of the underlying biological mechanisms. The noisy nature of the data and the large size of the network make this a very challenging task. Complex approaches handle the stochastic nature of the data but can only do this for small networks; simpler, linear models generate large networks but with less reliability. Methods: We propose a divide-and-conquer approach using probabilistic graph representations and external knowledge. We cluster the experimental data and learn an interaction network for each cluster, which are merged using the interaction network for the representative genes selected for each cluster. Results: We generated an interaction atlas for 337 human pathways yielding a network of 11,454 genes with 17,777 edges. Simulated gene expression data from this atlas formed the basis for reconstruction. Based on the area under the curve of the precision-recall curve, the proposed approach outperformed the baseline (random classifier) by ∼15-fold and conventional methods by ∼5–17-fold. The performance of the proposed workflow is significantly linked to the accuracy of the clustering step that tries to identify the modularity of the underlying biological mechanisms. Conclusions: We provide an interaction atlas generation workflow optimizing the algorithm/parameter selection. The proposed approach integrates external knowledge in the reconstruction of the interactome using probabilistic graphs. Network characterization and understanding long-range effects in interaction atlases provide means for comparative analysis with implications in biomarker discovery and therapeutic approaches. The proposed workflow is freely available at http://otulab.unl.edu/atlas.
|
18
|
Du Q, Campbell MT, Yu H, Liu K, Walia H, Zhang Q, Zhang C. Gene Co-expression Network Analysis and Linking Modules to Phenotyping Response in Plants. Methods Mol Biol 2022; 2539:261-268. [PMID: 35895209 DOI: 10.1007/978-1-0716-2537-8_20] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/15/2023]
Abstract
Environmental factors, including different stresses, can have an impact on the expression of genes and subsequently on the phenotype and development of plants. Since a large number of genes are involved in the response to environmental perturbation, identifying groups of co-expressed genes is meaningful. Gene co-expression network models can be used for the exploration, interpretation, and identification of genes responding to environmental changes. Once a gene co-expression network is constructed, one can determine gene modules and the association of gene modules with the phenotypic response. To link modules to phenotype, one approach is to find the correlated eigengenes of given modules or to integrate all eigengenes in a regularized linear model. This manuscript describes the method, from construction of the co-expression network and module discovery, through association between modules and phenotypic data, to annotation and visualization.
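The first two stages named in this abstract, building a co-expression network and finding modules in it, can be sketched in a few lines. The following is a deliberately simplified stand-in rather than the chapter's protocol: it assumes a hard threshold on the absolute Pearson correlation and treats connected components as modules, whereas established workflows typically use soft thresholding and hierarchical clustering. The gene names, the `threshold` value and the function names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equally long profiles (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def coexpression_edges(expr, threshold=0.8):
    """Connect every gene pair whose |r| reaches the threshold."""
    genes = sorted(expr)
    edges = []
    for i, g in enumerate(genes):
        for h in genes[i + 1:]:
            r = pearson(expr[g], expr[h])
            if abs(r) >= threshold:
                edges.append((g, h, r))
    return edges


def modules(expr, edges):
    """Connected components of the network, found with a small union-find."""
    parent = {g: g for g in expr}

    def find(g):
        while parent[g] != g:
            parent[g] = parent[parent[g]]  # path halving
            g = parent[g]
        return g

    for g, h, _ in edges:
        parent[find(g)] = find(h)
    groups = {}
    for g in expr:
        groups.setdefault(find(g), []).append(g)
    return sorted(sorted(m) for m in groups.values())


# Hypothetical expression profiles across five conditions.
expr = {
    "geneA": [1, 2, 3, 4, 5],
    "geneB": [2, 4, 6, 8, 10],
    "geneC": [5, 4, 3, 2, 1],
    "geneD": [1, 3, 1, 3, 1],
}
print(modules(expr, coexpression_edges(expr)))
```

Each resulting module could then be summarized by a single representative profile (an eigengene) and correlated with a phenotype, which is the linking step the abstract describes.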
Affiliation(s)
- Qian Du
  - School of Biological Sciences, Center for Plant Science and Innovation, University of Nebraska, Lincoln, NE, USA
- Malachy T Campbell
  - Department of Agronomy and Horticulture, Center for Plant Science and Innovation, University of Nebraska, Lincoln, NE, USA
- Huihui Yu
  - School of Biological Sciences, Center for Plant Science and Innovation, University of Nebraska, Lincoln, NE, USA
- Kan Liu
  - School of Biological Sciences, Center for Plant Science and Innovation, University of Nebraska, Lincoln, NE, USA
- Harkamal Walia
  - Department of Agronomy and Horticulture, Center for Plant Science and Innovation, University of Nebraska, Lincoln, NE, USA
- Qi Zhang
  - Department of Mathematics and Statistics, College of Engineering and Physical Sciences (CEPS), University of New Hampshire, Durham, NH, USA
- Chi Zhang
  - School of Biological Sciences, Center for Plant Science and Innovation, University of Nebraska, Lincoln, NE, USA
|
19
|
Shang J, Wang J, Sun Y, Li F, Liu JX, Zhang H. Multiscale part mutual information for quantifying nonlinear direct associations in networks. Bioinformatics 2021; 37:2920-2929. [PMID: 33730153 DOI: 10.1093/bioinformatics/btab182] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/10/2020] [Revised: 02/15/2021] [Accepted: 03/15/2021] [Indexed: 02/02/2023]
Abstract
MOTIVATION For network-assisted analysis, which has become a popular method of data mining, network construction is a crucial task. Network construction relies on the accurate quantification of direct associations among variables. The existence of multiscale associations among variables presents several quantification challenges, especially when quantifying nonlinear direct interactions. RESULTS In this study, the multiscale part mutual information (MPMI), based on part mutual information (PMI) and nonlinear partial association (NPA), was developed for effectively quantifying nonlinear direct associations among variables in networks with multiscale associations. First, we defined the MPMI in theory and derived its five important properties. Second, an experiment in a three-node network was carried out to numerically estimate its quantification ability under two cases of strong associations. Third, experiments of the MPMI and comparisons with the PMI, NPA and conditional mutual information were performed on simulated datasets and on datasets from DREAM challenge project. Finally, the MPMI was applied to real datasets of glioblastoma and lung adenocarcinoma to validate its effectiveness. Results showed that the MPMI is an effective alternative measure for quantifying nonlinear direct associations in networks, especially those with multiscale associations. AVAILABILITY AND IMPLEMENTATION The source code of MPMI is available online at https://github.com/CDMB-lab/MPMI. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Junliang Shang
  - School of Computer Science, Qufu Normal University, Rizhao 276826, China
- Jing Wang
  - School of Computer Science, Qufu Normal University, Rizhao 276826, China
- Yan Sun
  - School of Computer Science, Qufu Normal University, Rizhao 276826, China
- Feng Li
  - School of Computer Science, Qufu Normal University, Rizhao 276826, China
- Jin-Xing Liu
  - School of Computer Science, Qufu Normal University, Rizhao 276826, China
- Honghai Zhang
  - College of Life Science, Qufu Normal University, Qufu 273165, China
|
20
|
Single-cell analysis reveals the pan-cancer invasiveness-associated transition of adipose-derived stromal cells into COL11A1-expressing cancer-associated fibroblasts. PLoS Comput Biol 2021; 17:e1009228. [PMID: 34283835 PMCID: PMC8323949 DOI: 10.1371/journal.pcbi.1009228] [Citation(s) in RCA: 25] [Impact Index Per Article: 8.3] [Received: 02/02/2021] [Revised: 07/30/2021] [Accepted: 06/30/2021] [Indexed: 01/01/2023]
Abstract
During the last ten years, many research results have been referring to a particular type of cancer-associated fibroblasts associated with poor prognosis, invasiveness, metastasis and resistance to therapy in multiple cancer types, characterized by a gene expression signature with prominent presence of genes COL11A1, THBS2 and INHBA. Identifying the underlying biological mechanisms responsible for their creation may facilitate the discovery of targets for potential pan-cancer therapeutics. Using a novel computational approach for single-cell gene expression data analysis identifying the dominant cell populations in a sequence of samples from patients at various stages, we conclude that these fibroblasts are produced by a pan-cancer cellular transition originating from a particular type of adipose-derived stromal cells naturally present in the stromal vascular fraction of normal adipose tissue, having a characteristic gene expression signature. Focusing on a rich pancreatic cancer dataset, we provide a detailed description of the continuous modification of the gene expression profiles of cells as they transition from APOD-expressing adipose-derived stromal cells to COL11A1-expressing cancer-associated fibroblasts, identifying the key genes that participate in this transition. These results also provide an explanation to the well-known fact that the adipose microenvironment contributes to cancer progression.
|
21
|
Singh U, Li J, Seetharam A, Wurtele ES. pyrpipe: a Python package for RNA-Seq workflows. NAR Genom Bioinform 2021; 3:lqab049. [PMID: 34085037 PMCID: PMC8168212 DOI: 10.1093/nargab/lqab049] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 01/27/2021] [Revised: 05/06/2021] [Accepted: 05/18/2021] [Indexed: 02/06/2023]
Abstract
The availability of terabytes of RNA-Seq data and the continuous emergence of new analysis tools enable unprecedented biological insight. There is a pressing need for a framework that allows fast, efficient, manageable, and reproducible RNA-Seq analysis. We have developed a Python package, pyrpipe, that enables straightforward development of flexible, reproducible and easy-to-debug computational pipelines purely in Python, in an object-oriented manner. pyrpipe provides access to popular RNA-Seq tools, within Python, via high-level APIs. Pipelines can be customized by integrating new Python code, third-party programs, or Python libraries. Users can create checkpoints in the pipeline or integrate pyrpipe into a workflow management system, thus allowing execution on multiple computing environments and enabling efficient resource management. pyrpipe produces detailed analysis and benchmark reports, which can be shared or included in publications. pyrpipe is implemented in Python and is compatible with Python versions 3.6 and higher. To illustrate the rich functionality of pyrpipe, we provide case studies using RNA-Seq data from GTEx, SARS-CoV-2-infected human cells, and Zea mays. All source code is freely available at https://github.com/urmi-21/pyrpipe; the package can be installed from the source, from PyPI (https://pypi.org/project/pyrpipe), or from bioconda (https://anaconda.org/bioconda/pyrpipe). Documentation is available at http://pyrpipe.rtfd.io.
Affiliation(s)
- Urminder Singh
  - Bioinformatics and Computational Biology Program, Iowa State University, Ames, IA 50014, USA
  - Center for Metabolic Biology, Iowa State University, Ames, IA 50014, USA
  - Department of Genetics Development and Cell Biology, Iowa State University, Ames, IA 50014, USA
- Jing Li
  - Center for Metabolic Biology, Iowa State University, Ames, IA 50014, USA
  - Department of Genetics Development and Cell Biology, Iowa State University, Ames, IA 50014, USA
- Arun Seetharam
  - Genome Informatics Facility, Iowa State University, Ames, IA 50014, USA
- Eve Syrkin Wurtele
  - Bioinformatics and Computational Biology Program, Iowa State University, Ames, IA 50014, USA
  - Center for Metabolic Biology, Iowa State University, Ames, IA 50014, USA
  - Department of Genetics Development and Cell Biology, Iowa State University, Ames, IA 50014, USA
|
22
|
African Americans and European Americans exhibit distinct gene expression patterns across tissues and tumors associated with immunologic functions and environmental exposures. Sci Rep 2021; 11:9905. [PMID: 33972602 PMCID: PMC8110974 DOI: 10.1038/s41598-021-89224-1] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 11/01/2020] [Accepted: 04/21/2021] [Indexed: 12/20/2022]
Abstract
The COVID-19 pandemic has affected African American populations disproportionately with respect to prevalence, and mortality. Expression profiles represent snapshots of combined genetic, socio-environmental (including socioeconomic and environmental factors), and physiological effects on the molecular phenotype. As such, they have potential to improve biological understanding of differences among populations, and provide therapeutic biomarkers and environmental mitigation strategies. Here, we undertook a large-scale assessment of patterns of gene expression between African Americans and European Americans, mining RNA-Seq data from 25 non-diseased and diseased (tumor) tissue-types. We observed the widespread enrichment of pathways implicated in COVID-19 and integral to inflammation and reactive oxygen stress. Chemokine CCL3L3 expression is up-regulated in African Americans. GSTM1, encoding a glutathione S-transferase that metabolizes reactive oxygen species and xenobiotics, is upregulated. The little-studied F8A2 gene is up to 40-fold more highly expressed in African Americans; F8A2 encodes HAP40 protein, which mediates endosome movement, potentially altering the cellular response to SARS-CoV-2. African American expression signatures, superimposed on single cell-RNA reference data, reveal increased number or activity of esophageal glandular cells and lung ACE2-positive basal keratinocytes. Our findings establish basal prognostic signatures that can be used to refine approaches to minimize risk of severe infection and improve precision treatment of COVID-19 for African Americans. To enable dissection of causes of divergent molecular phenotypes, we advocate routine inclusion of metadata on genomic and socio-environmental factors for human RNA-sequencing studies.
|
23
|
Contreras Rodríguez L, Madarro-Capó EJ, Legón-Pérez CM, Rojas O, Sosa-Gómez G. Selecting an Effective Entropy Estimator for Short Sequences of Bits and Bytes with Maximum Entropy. Entropy (Basel, Switzerland) 2021; 23:561. [PMID: 33946438 PMCID: PMC8147137 DOI: 10.3390/e23050561] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 04/05/2021] [Revised: 04/26/2021] [Accepted: 04/28/2021] [Indexed: 11/22/2022]
Abstract
Entropy makes it possible to measure the uncertainty about an information source from the distribution of its output symbols. It is known that the maximum Shannon entropy of a discrete source of information is reached when its symbols follow a uniform distribution. In cryptography, these sources have great applications since they allow the highest security standards to be reached. In this work, the most effective estimator is selected to estimate entropy in short samples of bytes and bits with maximum entropy. For this, 18 estimators were compared. Results concerning the comparisons published in the literature between these estimators are discussed. The most suitable estimator is determined experimentally, based on its bias and mean square error for short samples of bytes and bits.
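Why short samples are hard for entropy estimation can be seen with two textbook estimators: the plug-in (maximum likelihood) estimator and the Miller-Madow bias correction, which adds (m - 1) / (2N) nats for m observed symbols in N samples. The sketch below is illustrative only; the abstract does not state which of the 18 compared estimators was selected, and the function names and the toy sample are assumptions.

```python
import math
from collections import Counter


def plug_in_entropy(symbols):
    """Maximum likelihood Shannon entropy in bits."""
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in Counter(symbols).values())


def miller_madow_entropy(symbols):
    """Plug-in entropy plus the Miller-Madow correction, converted to bits."""
    n = len(symbols)
    observed = len(set(symbols))
    return plug_in_entropy(symbols) + (observed - 1) / (2 * n * math.log(2))


# A short sample of 64 byte values. Even if the source is uniform over all 256
# byte values (8 bits of entropy), the plug-in estimate from 64 samples cannot
# exceed log2(64) = 6 bits, so it is biased downward.
sample = list(range(64))
print(plug_in_entropy(sample))
print(miller_madow_entropy(sample))
```

The correction moves the estimate upward but does not remove the bias entirely for such short samples, which motivates comparing estimators by both bias and mean square error.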
Affiliation(s)
- Lianet Contreras Rodríguez
  - Facultad de Matemática y Computación, Instituto de Criptografía, Universidad de la Habana, Habana 10400, Cuba
- Evaristo José Madarro-Capó
  - Facultad de Matemática y Computación, Instituto de Criptografía, Universidad de la Habana, Habana 10400, Cuba
- Carlos Miguel Legón-Pérez
  - Facultad de Matemática y Computación, Instituto de Criptografía, Universidad de la Habana, Habana 10400, Cuba
- Omar Rojas
  - Facultad de Ciencias Económicas y Empresariales, Universidad Panamericana, Álvaro del Portillo 49, Zapopan, Jalisco 45010, Mexico
- Guillermo Sosa-Gómez
  - Facultad de Ciencias Económicas y Empresariales, Universidad Panamericana, Álvaro del Portillo 49, Zapopan, Jalisco 45010, Mexico
|
24
|
Wang YXR, Li L, Li JJ, Huang H. Network Modeling in Biology: Statistical Methods for Gene and Brain Networks. Stat Sci 2021; 36:89-108. [PMID: 34305304 PMCID: PMC8296984 DOI: 10.1214/20-sts792] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 12/14/2022]
Abstract
The rise of network data in many different domains has offered researchers new insight into the problem of modeling complex systems and propelled the development of numerous innovative statistical methodologies and computational tools. In this paper, we primarily focus on two types of biological networks, gene networks and brain networks, where statistical network modeling has found both fruitful and challenging applications. Unlike other network examples such as social networks, where network edges can be directly observed, both gene and brain networks require careful estimation of edges using covariates as a first step. We provide a discussion of existing statistical and computational methods for edge estimation and subsequent statistical inference problems in these two types of biological networks.
Affiliation(s)
- Y X Rachel Wang
  - School of Mathematics and Statistics, University of Sydney, Australia
- Lexin Li
  - Department of Biostatistics and Epidemiology, School of Public Health, University of California, Berkeley
- Haiyan Huang
  - Department of Statistics, University of California, Berkeley
|
25
|
Pan-cancer driver copy number alterations identified by joint expression/CNA data analysis. Sci Rep 2020; 10:17199. [PMID: 33057153 PMCID: PMC7566486 DOI: 10.1038/s41598-020-74276-6] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 02/11/2020] [Accepted: 09/29/2020] [Indexed: 02/07/2023]
Abstract
Analysis of large gene expression datasets from biopsies of cancer patients can identify co-expression signatures representing particular biomolecular events in cancer. Some of these signatures involve genomically co-localized genes resulting from the presence of copy number alterations (CNAs), for which analysis of the expression of the underlying genes provides valuable information about their combined role as oncogenes or tumor suppressor genes. Here we focus on the discovery and interpretation of such signatures that are present in multiple cancer types due to driver amplifications and deletions in particular regions of the genome, based on a comprehensive analysis combining both gene expression and CNA data from The Cancer Genome Atlas.
|
26
|
Xia Y. Correlation and association analyses in microbiome study integrating multiomics in health and disease. Progress in Molecular Biology and Translational Science 2020; 171:309-491. [PMID: 32475527 DOI: 10.1016/bs.pmbts.2020.04.003] [Citation(s) in RCA: 37] [Impact Index Per Article: 9.3] [Indexed: 02/07/2023]
Abstract
Correlation and association analyses are among the most widely used statistical methods in research fields, including microbiome and integrative multiomics studies. Correlation and association have two implications: dependence and co-occurrence. Microbiome data are structured as a phylogenetic tree and have several unique characteristics, including high dimensionality, compositionality, sparsity with excess zeros, and heterogeneity. These unique characteristics cause several statistical issues when analyzing microbiome data and integrating multiomics data, such as large p and small n, dependency, overdispersion, and zero-inflation. In microbiome research, on the one hand, classic correlation and association methods are still applied in real studies and used for the development of new methods; on the other hand, new methods have been developed to target statistical issues arising from the unique characteristics of microbiome data. Here, we first provide a comprehensive view of classic and newly developed univariate correlation and association-based methods. We discuss the appropriateness and limitations of using classic methods and demonstrate how the newly developed methods mitigate the issues of microbiome data. Second, we emphasize that the concepts of correlation and association analyses have been shifted by the introduction of network analysis, microbe-metabolite interactions, functional analysis, etc. Third, we introduce multivariate correlation and association-based methods, which are organized by the categories of exploratory, interpretive, and discriminatory analyses and classification methods. Fourth, we focus on the hypothesis testing of univariate and multivariate regression-based association methods, including alpha and beta diversities-based, count-based, and relative abundance (or compositional)-based association analyses. We demonstrate the characteristics and limitations of each approach. Fifth, we introduce two specific microbiome-based methods: phylogenetic tree-based association analysis and testing for survival outcomes. Sixth, we provide an overall view of longitudinal methods in the analysis of microbiome and omics data, which cover standard, static, regression-based time series methods, principal trend analysis, and newly developed univariate overdispersed and zero-inflated as well as multivariate distance/kernel-based longitudinal models. Finally, we comment on current association analysis and the future direction of association analysis in microbiome and multiomics studies.
Affiliation(s)
- Yinglin Xia
  - Department of Medicine, University of Illinois at Chicago, Chicago, IL, United States
|
27
|
Uda S. Application of information theory in systems biology. Biophys Rev 2020; 12:377-384. [PMID: 32144740 PMCID: PMC7242537 DOI: 10.1007/s12551-020-00665-w] [Citation(s) in RCA: 21] [Impact Index Per Article: 5.3] [Received: 01/15/2020] [Accepted: 02/25/2020] [Indexed: 12/12/2022]
Abstract
Over recent years, new light has been shed on aspects of information processing in cells. The quantification of information, as described by Shannon’s information theory, is a basic and powerful tool that can be applied to various fields, such as communication, statistics, and computer science, as well as to information processing within cells. It has also been used to infer the network structure of molecular species. However, the difficulty of obtaining sufficient sample sizes and the computational burden associated with the high-dimensional data often encountered in biology can result in bottlenecks in the application of information theory to systems biology. This article provides an overview of the application of information theory to systems biology, discussing the associated bottlenecks and reviewing recent work.
Affiliation(s)
- Shinsuke Uda
- Division of Integrated Omics, Research Center for Transomics Medicine, Medical Institute of Bioregulation, Kyushu University, 3-1-1 Maidashi, Higashi-ku, Fukuoka, 812-8582, Japan.
|
28
|
Singh U, Hur M, Dorman K, Wurtele ES. MetaOmGraph: a workbench for interactive exploratory data analysis of large expression datasets. Nucleic Acids Res 2020; 48:e23. [PMID: 31956905 PMCID: PMC7039010 DOI: 10.1093/nar/gkz1209] [Citation(s) in RCA: 16] [Impact Index Per Article: 4.0] [Received: 10/28/2019] [Revised: 12/05/2019] [Accepted: 12/17/2019] [Indexed: 12/17/2022] Open
Abstract
The diverse and growing omics data in public domains provide researchers with tremendous opportunity to extract hidden, yet undiscovered, knowledge. However, the vast majority of archived data remain unused. Here, we present MetaOmGraph (MOG), a free, open-source, standalone software for exploratory analysis of massive datasets. Researchers, without coding, can interactively visualize and evaluate data in the context of its metadata, honing-in on groups of samples or genes based on attributes such as expression values, statistical associations, metadata terms and ontology annotations. Interaction with data is easy via interactive visualizations such as line charts, box plots, scatter plots, histograms and volcano plots. Statistical analyses include co-expression analysis, differential expression analysis and differential correlation analysis, with significance tests. Researchers can send data subsets to R for additional analyses. Multithreading and indexing enable efficient big data analysis. A researcher can create new MOG projects from any numerical data; or explore an existing MOG project. MOG projects, with history of explorations, can be saved and shared. We illustrate MOG by case studies of large curated datasets from human cancer RNA-Seq, where we identify novel putative biomarker genes in different tumors, and microarray and metabolomics data from Arabidopsis thaliana. MOG executable and code: http://metnetweb.gdcb.iastate.edu/ and https://github.com/urmi-21/MetaOmGraph/.
Affiliation(s)
- Urminder Singh
- Bioinformatics and Computational Biology Program, Iowa State University, Ames, IA 50011, USA
- Center for Metabolic Biology, Iowa State University, Ames, IA 50011, USA
- Department of Genetics Development and Cell Biology, Iowa State University, Ames, IA 50011, USA
| | - Manhoi Hur
- Center for Metabolic Biology, Iowa State University, Ames, IA 50011, USA
| | - Karin Dorman
- Bioinformatics and Computational Biology Program, Iowa State University, Ames, IA 50011, USA
- Department of Genetics Development and Cell Biology, Iowa State University, Ames, IA 50011, USA
- Department of Statistics, Iowa State University, Ames, IA 50011, USA
| | - Eve Syrkin Wurtele
- Bioinformatics and Computational Biology Program, Iowa State University, Ames, IA 50011, USA
- Center for Metabolic Biology, Iowa State University, Ames, IA 50011, USA
- Department of Genetics Development and Cell Biology, Iowa State University, Ames, IA 50011, USA
|
29
|
Li J, Lai Y, Zhang C, Zhang Q. TGCnA: temporal gene coexpression network analysis using a low-rank plus sparse framework. J Appl Stat 2019; 47:1064-1083. [PMID: 35706920 PMCID: PMC9041782 DOI: 10.1080/02664763.2019.1667311] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 09/04/2018] [Accepted: 09/09/2019] [Indexed: 10/26/2022]
Abstract
Various gene network models with distinct physical nature have been widely used in biological studies. For temporal transcriptomic studies, the current dynamic models either ignore the temporal variation in the network structure or fail to scale up to a large number of genes due to severe computational bottlenecks and sample size limitation. Although the correlation-based gene networks are computationally affordable, they have limitations after being applied to gene expression time-course data. We proposed Temporal Gene Coexpression Network Analysis (TGCnA) framework for the transcriptomic time-course data. The mathematical nature of TGCnA is the joint modeling of multiple covariance matrices across time points using a 'low-rank plus sparse' framework, in which the network similarity across time points is explicitly modeled in the low-rank component. We demonstrated the advantage of TGCnA in covariance matrix estimation and gene module discovery using both simulation data and real transcriptomic data. The code is available at https://github.com/QiZhangStat/TGCnA.
Affiliation(s)
- Jinyu Li
- Department of Statistics, University of Nebraska-Lincoln, Lincoln, NE, USA
| | - Yutong Lai
- Department of Statistics, University of Nebraska-Lincoln, Lincoln, NE, USA
| | - Chi Zhang
- School of Biological Sciences, University of Nebraska-Lincoln, Lincoln, NE, USA
| | - Qi Zhang
- Department of Statistics, University of Nebraska-Lincoln, Lincoln, NE, USA
|
30
|
Masnadi-Shirazi M, Maurya MR, Pao G, Ke E, Verma IM, Subramaniam S. Time varying causal network reconstruction of a mouse cell cycle. BMC Bioinformatics 2019; 20:294. [PMID: 31142274 PMCID: PMC6542064 DOI: 10.1186/s12859-019-2895-1] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 05/01/2019] [Accepted: 05/13/2019] [Indexed: 12/21/2022] Open
Abstract
Background Biochemical networks are often described through static or time-averaged measurements of the component macromolecules. Temporal variation in these components plays an important role in both describing the dynamical nature of the network as well as providing insights into causal mechanisms. Few methods exist, specifically for systems with many variables, for analyzing time series data to identify distinct temporal regimes and the corresponding time-varying causal networks and mechanisms. Results In this study, we use well-constructed temporal transcriptional measurements in a mammalian cell during a cell cycle, to identify dynamical networks and mechanisms describing the cell cycle. The methods we have used and developed in part deal with Granger causality, Vector Autoregression, Estimation Stability with Cross Validation and a nonparametric change point detection algorithm that enable estimating temporally evolving directed networks that provide a comprehensive picture of the crosstalk among different molecular components. We applied our approach to RNA-seq time-course data spanning nearly two cell cycles from Mouse Embryonic Fibroblast (MEF) primary cells. The change-point detection algorithm is able to extract precise information on the duration and timing of cell cycle phases. Using Least Absolute Shrinkage and Selection Operator (LASSO) and Estimation Stability with Cross Validation (ES-CV), we were able to, without any prior biological knowledge, extract information on the phase-specific causal interaction of cell cycle genes, as well as temporal interdependencies of biological mechanisms through a complete cell cycle. Conclusions The temporal dependence of cellular components we provide in our model goes beyond what is known in the literature. 
Furthermore, our inference of dynamic interplay of multiple intracellular mechanisms and their temporal dependence on one another can be used to predict time-varying cellular responses, and provide insight on the design of precise experiments for modulating the regulation of the cell cycle.
Affiliation(s)
- Maryam Masnadi-Shirazi
- Department of Electrical and Computer Engineering and Bioengineering, University of California San Diego, 9500 Gilman Dr, La Jolla, CA, 92093, USA
| | - Mano R Maurya
- Department of Bioengineering and San Diego Supercomputer center, University of California San Diego, 9500 Gilman Dr, La Jolla, CA, 92093, USA
| | - Gerald Pao
- Salk institute for Biological Studies, 10010 N Torrey Pines Rd, La Jolla, CA, 92037, USA
| | - Eugene Ke
- Salk institute for Biological Studies, 10010 N Torrey Pines Rd, La Jolla, CA, 92037, USA
| | - Inder M Verma
- Salk institute for Biological Studies, 10010 N Torrey Pines Rd, La Jolla, CA, 92037, USA
| | - Shankar Subramaniam
- Department of Bioengineering, Departments of Computer Science and Engineering, Cellular and Molecular Medicine, and the Graduate Program in Bioinformatics, University of California San Diego, 9500 Gilman Dr, La Jolla, CA, 92093, USA.
|
31
|
Larmuseau M, Verbeke LPC, Marchal K. Associating expression and genomic data using co-occurrence measures. Biol Direct 2019; 14:10. [PMID: 31072345 PMCID: PMC6507230 DOI: 10.1186/s13062-019-0240-2] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 10/16/2018] [Accepted: 04/10/2019] [Indexed: 12/11/2022] Open
Abstract
Recent technological evolutions have led to an exponential increase in data in all the omics fields. It is expected that integration of these different data sources will drastically enhance our knowledge of the biological mechanisms behind genomic diseases such as cancer. However, the integration of different omics data still remains a challenge. In this work we propose an intuitive workflow for the integrative analysis of expression, mutation and copy number data taken from the METABRIC study on breast cancer. First, we present evidence that the expression profile of many important breast cancer genes consists of two modes or ‘regimes’, which contain important clinical information. Then, we show how the co-occurrence of these expression regimes can be used as an association measure between genes and validate our findings on the TCGA-BRCA study. Finally, we demonstrate how these co-occurrence measures can also be applied to link expression regimes to genomic aberrations, providing a more complete, integrative view on breast cancer. As a case study, an integrative analysis of the identified MLPH-FOXA1 association is performed, illustrating that the obtained expression associations are intimately linked to the underlying genomic changes. Reviewers: This article was reviewed by Dirk Walther, Francisco Garcia and Isabel Nepomuceno.
Affiliation(s)
- Maarten Larmuseau
- Department of Information Technology, Ghent University - Imec, Technologiepark-Zwijnaarde 126, 9052, Ghent, Belgium
| | - Lieven P C Verbeke
- Department of Plant Biotechnology and Bioinformatics, Ghent University - Imec, Technologiepark-Zwijnaarde 126, 9052, Ghent, Belgium
| | - Kathleen Marchal
- Department of Plant Biotechnology and Bioinformatics, Ghent University - Imec, Technologiepark-Zwijnaarde 126, 9052, Ghent, Belgium.
|
32
|
Wang Y, Yang S, Zhao J, Du W, Liang Y, Wang C, Zhou F, Tian Y, Ma Q. Using Machine Learning to Measure Relatedness Between Genes: A Multi-Features Model. Sci Rep 2019; 9:4192. [PMID: 30862804 PMCID: PMC6414665 DOI: 10.1038/s41598-019-40780-7] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Received: 06/27/2018] [Accepted: 02/19/2019] [Indexed: 12/20/2022] Open
Abstract
Measuring conditional relatedness between a pair of genes is a fundamental technique and still a significant challenge in computational biology. Such relatedness can be assessed by gene expression similarities, but this approach suffers from high false discovery rates. Meanwhile, other types of features, e.g., prior-knowledge based similarities, are only viable for measuring global relatedness. In this paper, we propose a novel machine learning model, named Multi-Features Relatedness (MFR), for accurately measuring conditional relatedness between a pair of genes by incorporating expression similarities with prior-knowledge based similarities in an assessment criterion. MFR is used to predict gene-gene interactions extracted from the COXPRESdb, KEGG, HPRD, and TRRUST databases by 10-fold cross validation and test verification, and to identify gene-gene interactions collected from the GeneFriends and DIP databases for further verification. The results show that MFR achieves the highest area under curve (AUC) values for identifying gene-gene interactions in the development, test, and DIP datasets. Specifically, it obtains an average improvement of 1.1% in precision for detecting gene pairs with both high expression similarities and high prior-knowledge based similarities in all datasets, compared to other linear models and coexpression analysis methods. Regarding cancer gene network construction and gene function prediction, MFR also obtains results with more biological significance and higher average prediction accuracy than the other compared models and methods. A website of the MFR model and relevant datasets can be accessed from http://bmbl.sdstate.edu/MFR.
Affiliation(s)
- Yan Wang
- Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun, 130012, China
| | - Sen Yang
- Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun, 130012, China
| | - Jing Zhao
- Population Health Group, Sanford Research, Sioux Falls, SD, 57104, USA; Department of Internal Medicine, Sanford School of Medicine, University of South Dakota, Sioux Falls, SD, 57105, USA
| | - Wei Du
- Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun, 130012, China
| | - Yanchun Liang
- Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun, 130012, China; Zhuhai Laboratory of Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, Department of Computer Science and Technology, Zhuhai College of Jilin University, Zhuhai, 519041, China
| | - Cankun Wang
- Bioinformatics and Mathematical Biosciences Lab, Department of Agronomy, Horticulture, and Plant Science, Department of Mathematics and Statistics, South Dakota State University, Brookings, SD, 57006, USA
| | - Fengfeng Zhou
- Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun, 130012, China
| | - Yuan Tian
- Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun, 130012, China; School of Artificial Intelligence, Jilin University, Changchun, 130012, China.
| | - Qin Ma
- Bioinformatics and Mathematical Biosciences Lab, Department of Agronomy, Horticulture, and Plant Science, Department of Mathematics and Statistics, South Dakota State University, Brookings, SD, 57006, USA; Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH, 43210, USA.
|
33
|
Wang YXR, Liu K, Theusch E, Rotter JI, Medina MW, Waterman MS, Huang H, Stegle O. Generalized correlation measure using count statistics for gene expression data with ordered samples. Bioinformatics 2018; 34:617-624. [PMID: 29040382 DOI: 10.1093/bioinformatics/btx641] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Received: 03/13/2017] [Accepted: 10/11/2017] [Indexed: 12/22/2022] Open
Abstract
Motivation Capturing association patterns in gene expression levels under different conditions or time points is important for inferring gene regulatory interactions. In practice, temporal changes in gene expression may result in complex association patterns that require more sophisticated detection methods than simple correlation measures. For instance, the effect of regulation may lead to time-lagged associations and interactions local to a subset of samples. Furthermore, expression profiles of interest may not be aligned or directly comparable (e.g. gene expression profiles from two species). Results We propose a count statistic for measuring association between pairs of gene expression profiles consisting of ordered samples (e.g. time-course), where correlation may only exist locally in subsequences separated by a position shift. The statistic is simple and fast to compute, and we illustrate its use in two applications. In a cross-species comparison of developmental gene expression levels, we show our method not only measures association of gene expressions between the two species, but also provides alignment between different developmental stages. In the second application, we applied our statistic to expression profiles from two distinct phenotypic conditions, where the samples in each profile are ordered by the associated phenotypic values. The detected associations can be useful in building correspondence between gene association networks under different phenotypes. On the theoretical side, we provide asymptotic distributions of the statistic for different regions of the parameter space and test its power on simulated data. Availability and implementation The code used to perform the analysis is available as part of the Supplementary Material. Contact msw@usc.edu or hhuang@stat.berkeley.edu. Supplementary information Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Y X Rachel Wang
- School of Mathematics and Statistics, University of Sydney, NSW 2006, Australia
| | - Ke Liu
- Department of Statistics, University of California, Berkeley, CA 94720, USA
| | - Elizabeth Theusch
- Children's Hospital Oakland Research Institute, Oakland, CA 94609, USA
| | - Jerome I Rotter
- The Institute for Translational Genomics and Population Sciences, Departments of Pediatrics and Medicine, LABioMed at Harbor-UCLA Medical Center, Torrance, CA 90502, USA
| | - Marisa W Medina
- Children's Hospital Oakland Research Institute, Oakland, CA 94609, USA
| | - Michael S Waterman
- Molecular and Computational Biology, University of Southern California, CA 90089, USA
| | - Haiyan Huang
- Department of Statistics, University of California, Berkeley, CA 94720, USA
|
34
|
Evaluation of metabolite-microbe correlation detection methods. Anal Biochem 2018; 567:106-111. [PMID: 30557528 DOI: 10.1016/j.ab.2018.12.008] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.2] [Received: 07/05/2018] [Revised: 12/07/2018] [Accepted: 12/10/2018] [Indexed: 12/13/2022]
Abstract
Different correlation detection methods have been specifically designed for microbiome data analysis, taking into account the compositional data structure and different sequencing depths. With the rapid development of omics studies, there is increasing interest in discovering the biological associations between microbes and host metabolites. This raises the need to find proper statistical methods that facilitate correlation analysis across different omics studies. Here, we comprehensively evaluated six different correlation methods, i.e., Pearson correlation, Spearman correlation, Sparse Correlations for Compositional data (SparCC), Correlation inference for Compositional data through Lasso (CCLasso), Mutual Information Coefficient (MIC), and Cosine similarity, for detecting correlations between microbes and metabolites. Three simulated and two real-world data sets (from public databases and our lab) were used to examine the performance of each method regarding its specificity, sensitivity, similarity, accuracy, and stability with different sparsity. Our results indicate that although each method has its own pros and cons in different scenarios, Spearman correlation and MIC outperform the others in overall performance. Strategic guidance is also proposed for correlation analysis between microbes and metabolites.
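As a rough illustration of why a rank-based measure can behave differently from Pearson correlation, the following minimal Python sketch compares Pearson and Spearman on a monotonic but nonlinear relationship. The function names and toy values are hypothetical and are not taken from the study; the compositional methods (SparCC, CCLasso) and MIC are not shown.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


if __name__ == "__main__":
    # Hypothetical abundances: the metabolite grows exponentially with the microbe.
    microbe = [0, 1, 2, 3, 4, 5]
    metabolite = [math.exp(v) for v in microbe]
    print("Pearson :", pearson(microbe, metabolite))
    print("Spearman:", spearman(microbe, metabolite))
```

Because Spearman depends only on the ordering of the values, it treats any strictly increasing relationship as a perfect association, whereas Pearson is reduced by the curvature.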
|
35
|
Abbaszadeh O, Khanteymoori AR, Azarpeyvand A. Parallel Algorithms for Inferring Gene Regulatory Networks: A Review. Curr Genomics 2018; 19:603-614. [PMID: 30386172 PMCID: PMC6194435 DOI: 10.2174/1389202919666180601081718] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Received: 12/18/2017] [Revised: 02/20/2018] [Accepted: 05/22/2018] [Indexed: 11/22/2022] Open
Abstract
System biology problems such as whole-genome network construction from large-scale gene expression data are sophisticated and time-consuming. Therefore, sequential algorithms cannot obtain a solution in an acceptable amount of time. Today, by using massively parallel computing, it is possible to infer large-scale gene regulatory networks. Recently, establishing gene regulatory networks from large-scale datasets has drawn noticeable attention from researchers in the fields of parallel computing and system biology. In this paper, we attempt to provide a more detailed overview of the recent parallel algorithms for constructing gene regulatory networks. Firstly, the fundamentals of gene regulatory network inference and the challenges of large-scale datasets are given. Secondly, a detailed description of four parallel frameworks and libraries, CUDA, OpenMP, MPI, and Hadoop, is given. Thirdly, parallel algorithms are reviewed. Finally, some conclusions and guidelines for parallel reverse engineering are described.
Affiliation(s)
- Omid Abbaszadeh
- Department of Electrical and Computer Engineering, University of Zanjan, Zanjan, Iran
| | - Ali Reza Khanteymoori
- Department of Electrical and Computer Engineering, University of Zanjan, Zanjan, Iran
| | - Ali Azarpeyvand
- Department of Electrical and Computer Engineering, University of Zanjan, Zanjan, Iran
|
36
|
Abstract
Quantifying the dependence between two random variables is a fundamental issue in data analysis, and thus many measures have been proposed. Recent studies have focused on the renowned mutual information (MI) [Reshef DN, et al. (2011) Science 334:1518-1524]. However, "Unfortunately, reliably estimating mutual information from finite continuous data remains a significant and unresolved problem" [Kinney JB, Atwal GS (2014) Proc Natl Acad Sci USA 111:3354-3359]. In this paper, we examine the kernel estimation of MI and show that the bandwidths involved should be equalized. We consider a jackknife version of the kernel estimate with equalized bandwidth and allow the bandwidth to vary over an interval. We estimate the MI by the largest value among these kernel estimates and establish the associated theoretical underpinnings.
|
37
|
Ellwanger DC, Scheibinger M, Dumont RA, Barr-Gillespie PG, Heller S. Transcriptional Dynamics of Hair-Bundle Morphogenesis Revealed with CellTrails. Cell Rep 2018; 23:2901-2914.e13. [PMID: 29874578 PMCID: PMC6089258 DOI: 10.1016/j.celrep.2018.05.002] [Citation(s) in RCA: 32] [Impact Index Per Article: 5.3] [Received: 10/19/2017] [Revised: 03/19/2018] [Accepted: 05/01/2018] [Indexed: 11/30/2022] Open
Abstract
Protruding from the apical surface of inner ear sensory cells, hair bundles carry out mechanotransduction. Bundle growth involves sequential and overlapping cellular processes, which are concealed within gene expression profiles of individual cells. To dissect such processes, we developed CellTrails, a tool for uncovering, analyzing, and visualizing single-cell gene-expression dynamics. Utilizing quantitative gene-expression data for key bundle proteins from single cells of the developing chick utricle, we reconstructed de novo a bifurcating trajectory that spanned from progenitor cells to mature striolar and extrastriolar hair cells. Extraction and alignment of developmental trails and association of pseudotime with bundle length measurements linked expression dynamics of individual genes with bundle growth stages. Differential trail analysis revealed high-resolution dynamics of transcripts that control striolar and extrastriolar bundle development, including those that encode proteins that regulate [Ca2+]i or mediate crosslinking and lengthening of actin filaments.
Affiliation(s)
- Daniel C Ellwanger
- Department of Otolaryngology, Head & Neck Surgery and Institute for Stem Cell Biology and Regenerative Medicine, Stanford University School of Medicine, Stanford, CA 94305, USA
| | - Mirko Scheibinger
- Department of Otolaryngology, Head & Neck Surgery and Institute for Stem Cell Biology and Regenerative Medicine, Stanford University School of Medicine, Stanford, CA 94305, USA
| | - Rachel A Dumont
- Oregon Hearing Research Center and Vollum Institute, Oregon Health & Science University, Portland, OR 97239, USA
| | - Peter G Barr-Gillespie
- Oregon Hearing Research Center and Vollum Institute, Oregon Health & Science University, Portland, OR 97239, USA.
| | - Stefan Heller
- Department of Otolaryngology, Head & Neck Surgery and Institute for Stem Cell Biology and Regenerative Medicine, Stanford University School of Medicine, Stanford, CA 94305, USA.
|
38
|
Development of stock correlation networks using mutual information and financial big data. PLoS One 2018; 13:e0195941. [PMID: 29668715 PMCID: PMC5905993 DOI: 10.1371/journal.pone.0195941] [Citation(s) in RCA: 38] [Impact Index Per Article: 6.3] [Received: 12/21/2016] [Accepted: 03/18/2018] [Indexed: 11/19/2022] Open
Abstract
Stock correlation networks use stock price data to explore the relationship between different stocks listed in the stock market. Currently this relationship is predominantly measured by the Pearson correlation coefficient. However, financial data suggest that nonlinear relationships may exist in the stock prices of different shares. To address this issue, this work uses mutual information to characterize the nonlinear relationship between stocks. Using 280 stocks traded at the Shanghai Stock Exchange in China during the period of 2014-2016, we first compare the effectiveness of the correlation coefficient and mutual information for measuring stock relationships. Based on these two measures, we then develop two stock networks using the Minimum Spanning Tree method and study the topological properties of these networks, including degree, path length and the power-law distribution. The relationship network based on mutual information has a better distribution of the degree and larger value of the power-law distribution than those using the correlation coefficient. Numerical results show that mutual information is a more effective approach than the correlation coefficient to measure the stock relationship in a stock market that may undergo large fluctuations of stock prices.
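The general idea of building a tree-shaped network from mutual information can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' procedure: price moves are assumed to be already discretized into symbols (here 0 = down, 1 = up), mutual information is computed with a simple plug-in estimator, and the tree is a maximum spanning tree over mutual information, which is equivalent to a minimum spanning tree over any distance that decreases with mutual information. The abstract does not state the estimator or distance actually used, and all names and data below are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def mutual_information(x, y):
    """Plug-in mutual information (in nats) of two discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    # Sum p(a, b) * log(p(a, b) / (p(a) * p(b))) over observed pairs.
    return sum(
        (c / n) * math.log(c * n / (px[a] * py[b]))
        for (a, b), c in pxy.items()
    )


def spanning_tree_by_mi(series):
    """Kruskal's algorithm, taking edges in order of decreasing MI."""
    names = sorted(series)
    weights = {
        (u, v): mutual_information(series[u], series[v])
        for u, v in combinations(names, 2)
    }
    parent = {name: name for name in names}

    def find(v):
        # Union-find root lookup with path halving.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    tree = []
    for (u, v), _ in sorted(weights.items(), key=lambda kv: kv[1], reverse=True):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:  # the edge joins two separate components
            parent[root_u] = root_v
            tree.append((u, v))
    return tree


if __name__ == "__main__":
    # Hypothetical daily up/down moves for three stocks.
    moves = {
        "A": [0, 0, 1, 1, 0, 1, 0, 1],
        "B": [0, 0, 1, 1, 0, 1, 0, 1],
        "C": [0, 1, 0, 1, 0, 1, 0, 1],
    }
    print(spanning_tree_by_mi(moves))
```

For real price series, the choice of discretization and estimator strongly affects the mutual information values, so this sketch should be read only as a structural outline.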
|
39
|
Xiong W, Wang C, Zhang X, Yang Q, Shao R, Lai J, Du C. Highly interwoven communities of a gene regulatory network unveil topologically important genes for maize seed development. THE PLANT JOURNAL : FOR CELL AND MOLECULAR BIOLOGY 2017; 92:1143-1156. [PMID: 29072883 DOI: 10.1111/tpj.13750] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.4] [Received: 08/02/2017] [Revised: 10/10/2017] [Accepted: 10/17/2017] [Indexed: 06/07/2023]
Abstract
The complex interactions between transcription factors (TFs) and their target genes in a spatially and temporally specific manner are crucial to all cellular processes. Reconstruction of gene regulatory networks (GRNs) from gene expression profiles can help to decipher TF-gene regulations in a variety of contexts; however, the inevitable prediction errors of GRNs hinder optimal data mining of RNA-Seq transcriptome profiles. Here we perform an integrative study of Zea mays (maize) seed development in order to identify key genes in a complex developmental process. First, we reverse engineered a GRN from 78 maize seed transcriptome profiles. Then, we studied collective gene interaction patterns and uncovered highly interwoven network communities as the building blocks of the GRN. One community, composed of mostly unknown genes interacting with opaque2, brittle endosperm1 and shrunken2, contributes to seed phenotypes. Another community, composed mostly of genes expressed in the basal endosperm transfer layer, is responsible for nutrient transport. We further integrated our inferred GRN with gene expression patterns in different seed compartments and at various developmental stages and pathways. The integration facilitated a biological interpretation of the GRN. Our yeast one-hybrid assays verified six out of eight TF-promoter bindings in the reconstructed GRN. This study identified topologically important genes in interwoven network communities that may be crucial to maize seed development.
Affiliation(s)
- Wenwei Xiong
- College of Agronomy, Henan Agricultural University, Zhengzhou, 450002, China
- Department of Biology, Montclair State University, Montclair, NJ, 07043, USA
| | - Chunlei Wang
- National Maize Improvement Center, China Agricultural University, Beijing, 100083, China
| | - Xiangbo Zhang
- National Maize Improvement Center, China Agricultural University, Beijing, 100083, China
| | - Qinghua Yang
- College of Agronomy, Henan Agricultural University, Zhengzhou, 450002, China
| | - Ruixin Shao
- College of Agronomy, Henan Agricultural University, Zhengzhou, 450002, China
| | - Jinsheng Lai
- National Maize Improvement Center, China Agricultural University, Beijing, 100083, China
| | - Chunguang Du
- College of Agronomy, Henan Agricultural University, Zhengzhou, 450002, China
- Department of Biology, Montclair State University, Montclair, NJ, 07043, USA
|
40
|
Erdoğan C, Kurt Z, Diri B. Estimation of the proteomic cancer co-expression sub networks by using association estimators. PLoS One 2017; 12:e0188016. [PMID: 29145449 PMCID: PMC5690670 DOI: 10.1371/journal.pone.0188016] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 07/11/2017] [Accepted: 10/29/2017] [Indexed: 01/02/2023] Open
Abstract
In this study, association estimators, which strongly influence gene network inference methods and are used to determine molecular interactions, were examined within the co-expression network inference concept. Using proteomic data from five different cancer types, the hub genes/proteins within the disease-associated gene-gene/protein-protein interaction sub networks were identified. Proteomic data from various cancer types were collected from The Cancer Proteome Atlas (TCPA). Nine correlation- and mutual information (MI)-based association estimators that are commonly used in the literature were compared. As the gold standard to measure the association estimators’ performance, a multi-layer data integration platform on gene-disease associations (DisGeNET) and the Molecular Signatures Database (MSigDB) were used. Fisher's exact test was used to evaluate the performance of the association estimators by comparing the created co-expression networks with the disease-associated pathways. It was observed that the MI-based estimators provided more successful results than the Pearson and Spearman correlation approaches, which are used in the estimation of biological networks in the weighted correlation network analysis (WGCNA) package. In correlation-based methods, the best average success rate for five cancer types was 60%, while in MI-based methods the average success rate was 71% for the James-Stein Shrinkage (Shrink) estimator and 64% for the Schurmann-Grassberger (SG) estimator. Moreover, the hub genes and the inferred sub networks are presented for the consideration of researchers and experimentalists.
Affiliation(s)
- Cihat Erdoğan
  - Department of Computer Engineering, Namik Kemal University, Tekirdag, Turkey
- Zeyneb Kurt
  - Department of Integrative Biology and Physiology, University of California Los Angeles, Los Angeles, California, United States of America
  - Department of Computer Engineering, Yildiz Technical University, Istanbul, Turkey
- Banu Diri
  - Department of Computer Engineering, Yildiz Technical University, Istanbul, Turkey
|
41
|
Jokipii‐Lukkari S, Sundell D, Nilsson O, Hvidsten TR, Street NR, Tuominen H. NorWood: a gene expression resource for evo-devo studies of conifer wood development. THE NEW PHYTOLOGIST 2017; 216:482-494. [PMID: 28186632 PMCID: PMC6079643 DOI: 10.1111/nph.14458] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.7] [Received: 10/21/2016] [Accepted: 12/22/2016] [Indexed: 05/04/2023]
Abstract
The secondary xylem of conifers is composed mainly of tracheids that differ anatomically and chemically from angiosperm xylem cells. There is currently no high-spatial-resolution data available profiling gene expression during wood formation for any coniferous species, which limits insight into tracheid development. RNA-sequencing data from replicated, high-spatial-resolution section series throughout the cambial and woody tissues of Picea abies were used to generate the NorWood.conGenIE.org web resource, which facilitates exploration of the associated gene expression profiles and co-expression networks. Integration within PlantGenIE.org enabled a comparative regulomics analysis, revealing divergent co-expression networks between P. abies and the two angiosperm species Arabidopsis thaliana and Populus tremula for the secondary cell wall (SCW) master regulator NAC Class IIB transcription factors. The SCW cellulose synthase genes (CesAs) were located in the neighbourhoods of the NAC factors in A. thaliana and P. tremula, but not in P. abies. The NorWood co-expression network enabled identification of potential SCW CesA regulators in P. abies. The NorWood web resource represents a powerful community tool for generating evo-devo insights into the divergence of wood formation between angiosperms and gymnosperms and for advancing understanding of the regulation of wood development in P. abies.
Affiliation(s)
- Soile Jokipii‐Lukkari
  - Umeå Plant Science Centre, Department of Plant Physiology, Umeå University, SE‐901 87 Umeå, Sweden
  - Umeå Plant Science Centre, Department of Forest Genetics and Plant Physiology, Swedish University of Agricultural Sciences, SE‐901 84 Umeå, Sweden
- David Sundell
  - Umeå Plant Science Centre, Department of Plant Physiology, Umeå University, SE‐901 87 Umeå, Sweden
- Ove Nilsson
  - Umeå Plant Science Centre, Department of Forest Genetics and Plant Physiology, Swedish University of Agricultural Sciences, SE‐901 84 Umeå, Sweden
- Torgeir R. Hvidsten
  - Umeå Plant Science Centre, Department of Plant Physiology, Umeå University, SE‐901 87 Umeå, Sweden
  - Department of Chemistry, Biotechnology and Food Science, Norwegian University of Life Sciences, 1430 Ås, Norway
- Nathaniel R. Street
  - Umeå Plant Science Centre, Department of Plant Physiology, Umeå University, SE‐901 87 Umeå, Sweden
- Hannele Tuominen
  - Umeå Plant Science Centre, Department of Plant Physiology, Umeå University, SE‐901 87 Umeå, Sweden
|
42
|
Abstract
The goal of gene regulatory network (GRN) inference is to determine the interactions between genes given heterogeneous data capturing spatiotemporal gene expression. Since transcription underlies all cellular processes, the inference of GRNs is the first step in deciphering the determinants of the dynamics of biological systems. Here, we first describe the generic steps of the inference approaches that rely on similarity measures and group the similarity measures based on the computational methodology used. For each group of similarity measures, we not only review the existing approaches but also describe in detail the steps of the existing state-of-the-art algorithms.
Affiliation(s)
- Nooshin Omranian
  - Systems Biology and Mathematical Modeling Group, Max Planck Institute of Molecular Plant Physiology, Am Mühlenberg 1, Potsdam-Golm, 14476, Germany
- Zoran Nikoloski
  - Systems Biology and Mathematical Modeling Group, Max Planck Institute of Molecular Plant Physiology, Am Mühlenberg 1, Potsdam-Golm, 14476, Germany
|
43
|
Ray SS, Misra S. A supervised weighted similarity measure for gene expressions using biological knowledge. Gene 2016; 595:150-160. [PMID: 27688070 DOI: 10.1016/j.gene.2016.09.033] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 01/11/2016] [Revised: 08/18/2016] [Accepted: 09/22/2016] [Indexed: 11/17/2022]
Abstract
A supervised similarity measure for Saccharomyces cerevisiae gene expressions is developed which can capture gene similarity when multiple types of experimental conditions, such as cell cycle and heat shock, are available for all the genes. The measure is called Weighted Pearson correlation (WPC), where the weights are systematically determined for each type of experiment by maximizing the positive predictive value for gene pairs having Pearson correlation greater than 0.80. The positive predictive value is computed using the annotation information available from yeast GO-Slim process annotations in the Saccharomyces Genome Database (SGD). Genes are then clustered by the k-medoid algorithm using the newly computed WPC, and functions of 135 unclassified genes are predicted with a p-value cutoff of 10^-5 using Munich Information for Protein Sequences (MIPS) annotations. Out of these genes, functional categories of 55 genes are predicted with p-value cutoff greater than 10^-10 and reported in this investigation. The superiority of WPC over existing similarity measures such as Pearson correlation and Euclidean distance is demonstrated using positive predictive values (PPV) of gene pairs for different Saccharomyces cerevisiae data sets. The related code is available at http://www.sampa.droppages.com/WPC.html.
Affiliation(s)
- Shubhra Sankar Ray
  - Machine Intelligence Unit, Indian Statistical Institute, Kolkata 700108, India
  - Center for Soft Computing Research, Indian Statistical Institute, Kolkata 700108, India
- Sampa Misra
  - Machine Intelligence Unit, Indian Statistical Institute, Kolkata 700108, India
|
44
|
Discovering Genome-Wide Tag SNPs Based on the Mutual Information of the Variants. PLoS One 2016; 11:e0167994. [PMID: 27992465 PMCID: PMC5161470 DOI: 10.1371/journal.pone.0167994] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 07/12/2016] [Accepted: 11/23/2016] [Indexed: 01/01/2023]
Abstract
Exploring linkage disequilibrium (LD) patterns among single nucleotide polymorphism (SNP) sites can improve the accuracy and cost-effectiveness of genomic association studies, whereby representative (tag) SNPs are identified to sufficiently represent the genomic diversity in populations. There has been a considerable amount of effort in developing efficient algorithms to select tag SNPs from the growing large-scale data sets. Methods using the classical pairwise-LD and multi-locus LD measures have been proposed that aim to reduce the computational complexity and to increase the accuracy, respectively. The present work solves the tag SNP selection problem by efficiently balancing computational complexity and accuracy, and improves the coverage of genomic diversity in a cost-effective manner. The employed algorithm makes use of mutual information to explore the multi-locus association between SNPs and can handle different data types and conditions. Experiments with benchmark HapMap data sets show comparable or better performance against the state-of-the-art algorithms. In particular, as a novel application, genome-wide SNP tagging is performed on the 1000 Genomes Project data sets, producing a well-annotated database of tagging variants that capture the common genotype diversity in 2,504 samples from 26 human populations. Compared to conventional methods, the algorithm requires as input only the genotype (or haplotype) sequences, can scale up to genome-wide analyses, and produces accurate solutions with more information-rich output, providing an improved platform for researchers conducting subsequent association studies.
|
45
|
Chockalingam S, Aluru M, Aluru S. Microarray Data Processing Techniques for Genome-Scale Network Inference from Large Public Repositories. MICROARRAYS 2016; 5:microarrays5030023. [PMID: 27657141 PMCID: PMC5040970 DOI: 10.3390/microarrays5030023] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.1] [Received: 07/26/2016] [Revised: 09/06/2016] [Accepted: 09/13/2016] [Indexed: 11/16/2022]
Abstract
Pre-processing of microarray data is a well-studied problem. Furthermore, all popular platforms come with their own recommended best practices for differential analysis of genes. However, for genome-scale network inference using microarray data collected from large public repositories, these methods filter out a considerable number of genes. This is primarily due to the effects of aggregating a diverse array of experiments with different technical and biological scenarios. Here we introduce a pre-processing pipeline suitable for inferring genome-scale gene networks from large microarray datasets. We show that partitioning of the available microarray datasets according to biological relevance into tissue- and process-specific categories significantly extends the limits of downstream network construction. We demonstrate the effectiveness of our pre-processing pipeline by inferring genome-scale networks for the model plant Arabidopsis thaliana using two different construction methods and a collection of 11,760 Affymetrix ATH1 microarray chips. Our pre-processing pipeline and the datasets used in this paper are made available at http://alurulab.cc.gatech.edu/microarray-pp.
Affiliation(s)
- Sriram Chockalingam
  - Department of Computer Science and Engineering, Indian Institute of Technology Bombay, Mumbai 40076, India
- Maneesha Aluru
  - School of Biology, Georgia Institute of Technology, Atlanta, GA 30332, USA
- Srinivas Aluru
  - School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, GA 30332, USA
|
46
|
Rasanen OJ, Saarinen JP. Sequence Prediction With Sparse Distributed Hyperdimensional Coding Applied to the Analysis of Mobile Phone Use Patterns. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2016; 27:1878-1889. [PMID: 26285224 DOI: 10.1109/tnnls.2015.2462721] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Indexed: 06/04/2023]
Abstract
Modeling and prediction of temporal sequences is central to many signal processing and machine learning applications. Prediction based on sequence history is typically performed using parametric models, such as fixed-order Markov chains (n-grams), approximations of high-order Markov processes, such as mixed-order Markov models or mixtures of lagged bigram models, or with other machine learning techniques. This paper presents a method for sequence prediction based on sparse hyperdimensional coding of the sequence structure and describes how higher order temporal structures can be utilized in sparse coding in a balanced manner. The method is purely incremental, allowing real-time online learning and prediction with limited computational resources. Experiments with prediction of mobile phone use patterns, including the prediction of the next launched application, the next GPS location of the user, and the next artist played with the phone media player, reveal that the proposed method is able to capture the relevant variable-order structure from the sequences. In comparison with the n-grams and the mixed-order Markov models, the sparse hyperdimensional predictor clearly outperforms its peers in terms of unweighted average recall and achieves an equal level of weighted average recall as the mixed-order Markov chain but without the batch training of the mixed-order model.
|
47
|
Uncovering Driver DNA Methylation Events in Nonsmoking Early Stage Lung Adenocarcinoma. BIOMED RESEARCH INTERNATIONAL 2016; 2016:2090286. [PMID: 27610367 PMCID: PMC5005773 DOI: 10.1155/2016/2090286] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 05/28/2016] [Revised: 06/28/2016] [Accepted: 07/05/2016] [Indexed: 01/04/2023]
Abstract
As smoking rates decrease, proportionally more cases of lung adenocarcinoma occur in never-smokers, while aberrant DNA methylation has been suggested to contribute to the tumorigenesis of lung adenocarcinoma. It is extremely difficult to distinguish, among a large number of differentially methylated genes, which genes play key roles in tumorigenic processes via DNA methylation-mediated gene silencing. By integrating gene expression and DNA methylation data, a pipeline combined with differential network analysis is designed to uncover driver methylation genes and responsive modules, which demonstrate distinctive expression and network topology in tumors with aberrant DNA methylation. In total, 135 genes are recognized as candidate driver genes in early stage lung adenocarcinoma, and the 30 top-ranked genes are recognized as driver methylation genes. Functional annotation and the differential network analysis indicate the roles of the identified driver genes in tumorigenesis, while a literature study reveals significant correlations of the top 30 genes with early stage lung adenocarcinoma in never-smokers. The analysis pipeline can also be employed to identify driver epigenetic events for other cancers characterized by matched gene expression and DNA methylation data.
|
48
|
Chen Y, Cao D, Gao J, Yuan Z. Discovering Pair-wise Synergies in Microarray Data. Sci Rep 2016; 6:30672. [PMID: 27470995 PMCID: PMC4965793 DOI: 10.1038/srep30672] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Received: 02/04/2016] [Accepted: 07/07/2016] [Indexed: 01/01/2023]
Abstract
Informative gene selection can have important implications for the improvement of cancer diagnosis and the identification of new drug targets. Individual-gene-ranking methods ignore interactions between genes. Furthermore, popular pair-wise gene evaluation methods, e.g. TSP and TSG, are unable to discover pair-wise interactions. Several efforts to discover pair-wise synergy have been made based on the information approach, such as EMBP and FeatKNN. However, the methods employed to estimate mutual information, e.g. binarization, histogram-based and KNN estimators, depend on known data or domain characteristics. Recently, Reshef et al. proposed a novel maximal information coefficient (MIC) measure to capture a wide range of associations between two variables that has the property of generality. An extension from MIC(X; Y) to MIC(X1; X2; Y) is therefore desired. We developed an approximation algorithm for estimating MIC(X1; X2; Y) where Y is a discrete variable. MIC(X1; X2; Y) is employed to detect pair-wise synergy in simulation and cancer microarray data. The results indicate that MIC(X1; X2; Y) also has the property of generality. It can discover synergic genes that are undetectable by reference feature selection methods such as MIC(X; Y) and TSG. Synergic genes can distinguish different phenotypes. Finally, the biological relevance of these synergic genes is validated with GO annotation and the OUgene database.
Affiliation(s)
- Yuan Chen
  - Hunan Provincial Key Laboratory for Biology and Control of Plant Diseases and Insect Pests, Hunan Agricultural University, Changsha, Hunan, 410128, China
  - Hunan Provincial Key Laboratory for Germplasm Innovation and Utilization of Crop, Hunan Agricultural University, Changsha, Hunan, 410128, China
- Dan Cao
  - Orient Science & Technology College of Hunan Agricultural University, Changsha, Hunan, 410128, China
- Jun Gao
  - College of Resources & Environment, Hunan Agricultural University, Changsha, Hunan, 410128, China
  - Department of Biochemistry and Molecular Biology, University of Arkansas for Medical Sciences, Little Rock, Arkansas, 72205, USA
- Zheming Yuan
  - Hunan Provincial Key Laboratory for Biology and Control of Plant Diseases and Insect Pests, Hunan Agricultural University, Changsha, Hunan, 410128, China
  - Hunan Provincial Key Laboratory for Germplasm Innovation and Utilization of Crop, Hunan Agricultural University, Changsha, Hunan, 410128, China
|
49
|
Hu Y, Zhao H. CCor: A whole genome network-based similarity measure between two genes. Biometrics 2016; 72:1216-1225. [PMID: 26953524 DOI: 10.1111/biom.12508] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 07/01/2015] [Revised: 01/01/2016] [Accepted: 02/01/2016] [Indexed: 12/29/2022]
Abstract
Measuring the similarity between genes is often the starting point for building gene regulatory networks. Most similarity measures used in practice consider only pairwise information, with a few also considering network structure. Although the theoretical properties of pairwise measures are well understood in the statistics literature, little is known about the statistical properties of similarity measures based on network structure. In this article, we consider a new whole genome network-based similarity measure, called CCor, that makes use of information from all the genes in the network. We derive a concentration inequality for CCor and compare it with the commonly used Pearson correlation coefficient for inferring network modules. Both theoretical analysis and a real data example demonstrate the advantages of CCor over existing measures for inferring gene modules.
Affiliation(s)
- Yiming Hu
  - Department of Biostatistics, Yale School of Public Health, New Haven, Connecticut, U.S.A.
- Hongyu Zhao
  - Department of Biostatistics, Yale School of Public Health, New Haven, Connecticut, U.S.A.
  - Program of Computational Biology and Bioinformatics, Yale University, New Haven, Connecticut, U.S.A.
|
50
|
Discovering gene re-ranking efficiency and conserved gene-gene relationships derived from gene co-expression network analysis on breast cancer data. Sci Rep 2016; 6:20518. [PMID: 26892392 PMCID: PMC4759568 DOI: 10.1038/srep20518] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.6] [Received: 10/16/2015] [Accepted: 01/05/2016] [Indexed: 12/18/2022]
Abstract
Systemic approaches are essential in the discovery of disease-specific genes, offering a different perspective and new tools for the analysis of several types of molecular relationships, such as gene co-expression or protein-protein interactions. However, due to a lack of experimental information, this analysis is not fully applicable. The aim of this study is to reveal the multi-potent contribution of statistical network inference methods in highlighting significant genes and interactions. We have investigated the ability of statistical co-expression networks to highlight and prioritize genes for breast cancer subtypes and stages in terms of: (i) classification efficiency, (ii) gene network pattern conservation, (iii) indication of involved molecular mechanisms and (iv) systems level momentum to drug repurposing pipelines. We have found that statistical network inference methods are advantageous in gene prioritization, are capable of contributing to meaningful network signature discovery, give insights regarding the disease-related mechanisms and boost drug discovery pipelines from a systems point of view.
|