1
Saha E, Fanfani V, Mandros P, Ben Guebila M, Fischer J, Shutta KH, DeMeo DL, Lopes-Ramos CM, Quackenbush J. Bayesian inference of sample-specific coexpression networks. Genome Res 2024; 34:1397-1410. [PMID: 39134413 PMCID: PMC11529861 DOI: 10.1101/gr.279117.124] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/15/2024] [Accepted: 07/31/2024] [Indexed: 08/28/2024]
Abstract
Gene regulatory networks (GRNs) are effective tools for inferring complex interactions between molecules that regulate biological processes and hence can provide insights into drivers of biological systems. Inferring coexpression networks is a critical element of GRN inference, as the correlation between expression patterns may indicate that genes are coregulated by common factors. However, methods that estimate coexpression networks generally derive an aggregate network representing the mean regulatory properties of the population and so fail to fully capture population heterogeneity. Bayesian optimized networks obtained by assimilating omic data (BONOBO) is a scalable Bayesian model for deriving individual sample-specific coexpression matrices that recognizes variations in molecular interactions across individuals. For each sample, BONOBO assumes a Gaussian distribution on the log-transformed centered gene expression and a conjugate prior distribution on the sample-specific coexpression matrix constructed from all other samples in the data. Combining the sample-specific gene coexpression with the prior distribution, BONOBO yields a closed-form solution for the posterior distribution of the sample-specific coexpression matrices, thus allowing the analysis of large data sets. We demonstrate BONOBO's utility in several contexts, including analyzing gene regulation in yeast transcription factor knockout studies, the prognostic significance of miRNA-mRNA interaction in human breast cancer subtypes, and sex differences in gene regulation within human thyroid tissue. We find that BONOBO outperforms other methods that have been used for sample-specific coexpression network inference and provides insight into individual differences in the drivers of biological processes.
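The conjugate update described in this abstract can be illustrated with a minimal sketch. It assumes an inverse-Wishart prior (the standard conjugate prior for the covariance of a zero-mean Gaussian) whose mean is the covariance of all other samples; the abstract does not name the prior family. The function name `sample_specific_covariances` and the fixed degrees-of-freedom parameter `nu` are hypothetical choices for illustration, not the BONOBO implementation.

```python
import numpy as np

def sample_specific_covariances(X, nu=None):
    """Posterior-mean covariance per sample under an assumed inverse-Wishart prior.

    X  : array of shape (n_samples, n_genes), log-transformed expression.
    nu : prior degrees of freedom (hypothetical fixed hyperparameter, > n_genes + 1).
    """
    X = np.asarray(X, dtype=float)
    n, g = X.shape
    if n < 3:
        raise ValueError("need at least 3 samples for a leave-one-out covariance")
    if nu is None:
        nu = g + 2  # illustrative default, not taken from the paper
    if nu <= g + 1:
        raise ValueError("nu must exceed n_genes + 1 for the prior mean to exist")

    Xc = X - X.mean(axis=0)  # center each gene
    out = np.empty((n, g, g))
    for i in range(n):
        # Prior mean: covariance of all other samples.
        others = np.delete(Xc, i, axis=0)
        S_loo = np.atleast_2d(np.cov(others, rowvar=False))
        # Inverse-Wishart scale chosen so that the prior mean equals S_loo.
        psi = (nu - g - 1) * S_loo
        # One Gaussian observation x updates IW(nu, psi) to IW(nu + 1, psi + x x^T);
        # the posterior mean is (psi + x x^T) / (nu + 1 - g - 1).
        x = Xc[i]
        out[i] = (psi + np.outer(x, x)) / (nu - g)
    return out
```

Under these assumptions each estimate is a convex combination of the leave-one-out covariance and the sample's own outer product, with weight `1 / (nu - g)` on the sample; a larger `nu` pulls every estimate toward the population.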
Affiliation(s)
- Enakshi Saha
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, Massachusetts 02115, USA
- Viola Fanfani
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, Massachusetts 02115, USA
- Panagiotis Mandros
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, Massachusetts 02115, USA
- Marouen Ben Guebila
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, Massachusetts 02115, USA
- Jonas Fischer
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, Massachusetts 02115, USA
- Katherine H Shutta
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, Massachusetts 02115, USA
  - Channing Division of Network Medicine, Brigham and Women's Hospital, Boston, Massachusetts 02115, USA
- Dawn L DeMeo
  - Channing Division of Network Medicine, Brigham and Women's Hospital, Boston, Massachusetts 02115, USA
  - Department of Medicine, Harvard Medical School, Boston, Massachusetts 02115, USA
- Camila M Lopes-Ramos
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, Massachusetts 02115, USA
  - Channing Division of Network Medicine, Brigham and Women's Hospital, Boston, Massachusetts 02115, USA
  - Department of Medicine, Harvard Medical School, Boston, Massachusetts 02115, USA
- John Quackenbush
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, Massachusetts 02115, USA
  - Channing Division of Network Medicine, Brigham and Women's Hospital, Boston, Massachusetts 02115, USA
  - Department of Data Science, Dana-Farber Cancer Institute, Boston, Massachusetts 02215, USA
2
Diaz JEL, Barcessat V, Bahamon C, Hecht C, Das TK, Cagan RL. Functional exploration of copy number alterations in a Drosophila model of triple-negative breast cancer. Dis Model Mech 2024; 17:dmm050191. [PMID: 38721669 PMCID: PMC11247506 DOI: 10.1242/dmm.050191] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/14/2023] [Accepted: 04/30/2024] [Indexed: 07/04/2024] Open
Abstract
Accounting for 10-20% of breast cancer cases, triple-negative breast cancer (TNBC) is associated with a disproportionate number of breast cancer deaths. One challenge in studying TNBC is its genomic profile: with the exception of TP53 loss, most breast cancer tumors are characterized by a high number of copy number alterations (CNAs), making modeling the disease in whole animals challenging. We computationally analyzed 186 CNA regions previously identified in breast cancer tumors to rank genes within each region by likelihood of acting as a tumor driver. We then used a Drosophila p53-Myc TNBC model to identify 48 genes as functional drivers. To demonstrate the utility of this functional database, we established six 3-hit models; altering candidate genes led to increased aspects of transformation as well as resistance to the chemotherapeutic drug fluorouracil. Our work provides a functional database of CNA-associated TNBC drivers, and a template for an integrated computational/whole-animal approach to identify functional drivers of transformation and drug resistance within CNAs in other tumor types.
Affiliation(s)
- Jennifer E L Diaz
  - Department of Cell, Development, and Regenerative Biology, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
  - Internal Medicine, UCLA David Geffen School of Medicine, CA 90095, USA
- Vanessa Barcessat
  - Department of Cell, Development, and Regenerative Biology, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
  - Precision Immunology Institute, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
- Christian Bahamon
  - Department of Cell, Development, and Regenerative Biology, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
- Chana Hecht
  - Department of Cell, Development, and Regenerative Biology, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
- Tirtha K Das
  - Department of Cell, Development, and Regenerative Biology, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
- Ross L Cagan
  - Department of Cell, Development, and Regenerative Biology, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
  - School of Cancer Sciences and Wolfson Wohl Cancer Research Centre, University of Glasgow, Glasgow G61 1BD, UK
3
Sibilio P, Conte F, Huang Y, Castaldi PJ, Hersh CP, DeMeo DL, Silverman EK, Paci P. Correlation-based network integration of lung RNA sequencing and DNA methylation data in chronic obstructive pulmonary disease. Heliyon 2024; 10:e31301. [PMID: 38807864 PMCID: PMC11130701 DOI: 10.1016/j.heliyon.2024.e31301] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/02/2023] [Revised: 05/08/2024] [Accepted: 05/14/2024] [Indexed: 05/30/2024] Open
Abstract
Chronic Obstructive Pulmonary Disease (COPD) is a heterogeneous, chronic inflammatory process of the lungs and, like other complex diseases, is caused by both genetic and environmental factors. Detailed understanding of the molecular mechanisms of complex diseases requires the study of the interplay among different biomolecular layers, and thus the integration of different omics data types. In this study, we investigated COPD-associated molecular mechanisms through a correlation-based network integration of lung tissue RNA-seq and DNA methylation data of COPD cases (n = 446) and controls (n = 346) derived from the Lung Tissue Research Consortium. First, we performed a SWIM-network based analysis to build separate correlation networks for RNA-seq and DNA methylation data for our case-control study population. Then, we developed a method to integrate the results into a coupled network of differentially expressed and differentially methylated genes to investigate their relationships across both molecular layers. The functional enrichment analysis of the nodes of the coupled network revealed a strikingly significant enrichment in Immune System components, both innate and adaptive, as well as immune-system component communication (interleukin and cytokine-cytokine signaling). Our analysis allowed us to reveal novel putative COPD-associated genes and to analyze their relationships, both at the transcriptomics and epigenomics levels, thus contributing to an improved understanding of COPD pathogenesis.
Affiliation(s)
- Pasquale Sibilio
  - Department of Computer, Control and Management Engineering, Sapienza University of Rome, Rome, Italy
  - Institute for Systems Analysis and Computer Science "Antonio Ruberti", National Research Council, Rome, Italy
- Federica Conte
  - Institute for Systems Analysis and Computer Science "Antonio Ruberti", National Research Council, Rome, Italy
- Yichen Huang
  - Channing Division of Network Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, MA, USA
- Peter J Castaldi
  - Channing Division of Network Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, MA, USA
- Craig P Hersh
  - Channing Division of Network Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, MA, USA
- Dawn L DeMeo
  - Channing Division of Network Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, MA, USA
- Edwin K Silverman
  - Channing Division of Network Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, MA, USA
- Paola Paci
  - Department of Computer, Control and Management Engineering, Sapienza University of Rome, Rome, Italy
  - Institute for Systems Analysis and Computer Science "Antonio Ruberti", National Research Council, Rome, Italy
  - Karolinska Institutet, 17177, Stockholm, Sweden
4
Saha E, Fanfani V, Mandros P, Ben-Guebila M, Fischer J, Hoff-Shutta K, Glass K, DeMeo DL, Lopes-Ramos C, Quackenbush J. Bayesian Optimized sample-specific Networks Obtained By Omics data (BONOBO). bioRxiv 2023:2023.11.16.567119. [PMID: 38014256 PMCID: PMC10680741 DOI: 10.1101/2023.11.16.567119] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/29/2023]
Abstract
Gene regulatory networks (GRNs) are effective tools for inferring complex interactions between molecules that regulate biological processes and hence can provide insights into drivers of biological systems. Inferring co-expression networks is a critical element of GRN inference as the correlation between expression patterns may indicate that genes are coregulated by common factors. However, methods that estimate co-expression networks generally derive an aggregate network representing the mean regulatory properties of the population and so fail to fully capture population heterogeneity. To address these concerns, we introduce BONOBO (Bayesian Optimized Networks Obtained By assimilating Omics data), a scalable Bayesian model for deriving individual sample-specific co-expression networks by recognizing variations in molecular interactions across individuals. For every sample, BONOBO assumes a Gaussian distribution on the log-transformed centered gene expression and a conjugate prior distribution on the sample-specific co-expression matrix constructed from all other samples in the data. Combining the sample-specific gene expression with the prior distribution, BONOBO yields a closed-form solution for the posterior distribution of the sample-specific co-expression matrices, thus making the method extremely scalable. We demonstrate the utility of BONOBO in several contexts, including analyzing gene regulation in yeast transcription factor knockout studies, prognostic significance of miRNA-mRNA interaction in human breast cancer subtypes, and sex differences in gene regulation within human thyroid tissue. We find that BONOBO outperforms other sample-specific co-expression network inference methods and provides insight into individual differences in the drivers of biological processes.
Affiliation(s)
- Enakshi Saha
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA
- Viola Fanfani
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA
- Panagiotis Mandros
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA
- Marouen Ben-Guebila
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA
- Jonas Fischer
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA
- Katherine Hoff-Shutta
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA
  - Channing Division of Network Medicine, Brigham and Women’s Hospital, Boston, MA, USA
- Kimberly Glass
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA
  - Channing Division of Network Medicine, Brigham and Women’s Hospital, Boston, MA, USA
  - Department of Medicine, Harvard Medical School, Boston, MA, USA
- Dawn Lisa DeMeo
  - Channing Division of Network Medicine, Brigham and Women’s Hospital, Boston, MA, USA
  - Department of Medicine, Harvard Medical School, Boston, MA, USA
- Camila Lopes-Ramos
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA
  - Channing Division of Network Medicine, Brigham and Women’s Hospital, Boston, MA, USA
  - Department of Medicine, Harvard Medical School, Boston, MA, USA
- John Quackenbush
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA
  - Channing Division of Network Medicine, Brigham and Women’s Hospital, Boston, MA, USA
  - Department of Data Science, Dana-Farber Cancer Institute, Boston, MA, USA
5
Chen X, Han M, Li Y, Li X, Zhang J, Zhu Y. Identification of functional gene modules by integrating multi-omics data and known molecular interactions. Front Genet 2023; 14:1082032. [PMID: 36760999 PMCID: PMC9902936 DOI: 10.3389/fgene.2023.1082032] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/27/2022] [Accepted: 01/11/2023] [Indexed: 01/25/2023] Open
Abstract
Multi-omics data integration has emerged as a promising approach to identify patient subgroups. However, in terms of grouping genes (or gene products) into co-expression modules, data integration methods suffer from two main drawbacks. First, most existing methods only consider genes or samples measured in all of the datasets. Second, known molecular interactions (e.g., transcriptional regulatory interactions, protein-protein interactions and biological pathways) cannot be utilized to assist in module detection. Herein, we present a novel data integration framework, Correlation-based Local Approximation of Membership (CLAM), which provides two methodological innovations to address these limitations: 1) constructing a trans-omics neighborhood matrix by integrating multi-omics datasets and known molecular interactions, and 2) using a local approximation procedure to define gene modules from the matrix. Applying CLAM to human colorectal cancer (CRC) and mouse B-cell differentiation multi-omics data obtained from The Cancer Genome Atlas (TCGA), Clinical Proteomics Tumor Analysis Consortium (CPTAC), Gene Expression Omnibus (GEO) and ProteomeXchange database, we demonstrated its superior ability to recover biologically relevant modules and gene ontology (GO) terms. Further investigation of the CRC modules revealed numerous transcription factors and KEGG pathways that played crucial roles in CRC progression. Module-based survival analysis constructed four survival-related networks in which pairwise gene correlations were significantly correlated with CRC patient survival. Overall, the series of evaluations demonstrated the great potential of CLAM for identifying modular biomarkers for complex diseases.
We implemented CLAM as a user-friendly application available at https://github.com/free1234hm/CLAM.
Affiliation(s)
- Xiaoqing Chen
  - Basic Medical School, Anhui Medical University, Hefei, China
  - National Center for Protein Sciences (Beijing), Beijing Proteome Research Center, Beijing Institute of Lifeomics, Beijing, China
- Mingfei Han
  - National Center for Protein Sciences (Beijing), Beijing Proteome Research Center, Beijing Institute of Lifeomics, Beijing, China
- Yingxing Li
  - Central Research Laboratory, Peking Union Medical College Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China
- Xiao Li
  - National Center for Protein Sciences (Beijing), Beijing Proteome Research Center, Beijing Institute of Lifeomics, Beijing, China
- Jiaqi Zhang
  - National Center for Protein Sciences (Beijing), Beijing Proteome Research Center, Beijing Institute of Lifeomics, Beijing, China
- Yunping Zhu
  - Basic Medical School, Anhui Medical University, Hefei, China
  - National Center for Protein Sciences (Beijing), Beijing Proteome Research Center, Beijing Institute of Lifeomics, Beijing, China
  - *Correspondence: Yunping Zhu,
6
Vahabi N, Michailidis G. Unsupervised Multi-Omics Data Integration Methods: A Comprehensive Review. Front Genet 2022; 13:854752. [PMID: 35391796 PMCID: PMC8981526 DOI: 10.3389/fgene.2022.854752] [Citation(s) in RCA: 44] [Impact Index Per Article: 14.7] [Received: 01/14/2022] [Accepted: 02/28/2022] [Indexed: 12/26/2022] Open
Abstract
Through the developments of Omics technologies and dissemination of large-scale datasets, such as those from The Cancer Genome Atlas, Alzheimer’s Disease Neuroimaging Initiative, and Genotype-Tissue Expression, it is becoming increasingly possible to study complex biological processes and disease mechanisms more holistically. However, to obtain a comprehensive view of these complex systems, it is crucial to integrate data across various Omics modalities, and also leverage external knowledge available in biological databases. This review aims to provide an overview of multi-Omics data integration methods with different statistical approaches, focusing on unsupervised learning tasks, including disease onset prediction, biomarker discovery, disease subtyping, module discovery, and network/pathway analysis. We also briefly review feature selection methods, multi-Omics data sets, and resources/tools that constitute critical components for carrying out the integration.
Affiliation(s)
- Nasim Vahabi
  - Informatics Institute, University of Florida, Gainesville, FL, United States
- George Michailidis
  - Informatics Institute, University of Florida, Gainesville, FL, United States
7
Akhavan-Safar M, Teimourpour B, Nowzari-Dalini A. A network-based method for detecting cancer driver gene in transcriptional regulatory networks using the structure analysis of weighted regulatory interactions. Curr Bioinform 2022. [DOI: 10.2174/1574893617666220127094224] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/22/2022]
Abstract
Background:
The identification of genes that instigate cell anomalies and cause cancer in humans is an important field in oncology research. Abnormalities in these genes are transferred to other genes in the cell, disrupting its normal functionality. Such genes are known as cancer driver genes (CDGs). Various methods have been proposed for predicting CDGs, most of which are based on genomic data and computational methods. Some novel bioinformatic approaches have been developed.
Objective:
In this article, we propose a network-based algorithm, SalsaDriver (Stochastic approach for link-structure analysis to driver detection), which can calculate the receiving and influencing power of each gene using the stochastic analysis of regulatory interaction structures in gene regulatory networks.
Method:
First, regulatory networks related to breast, colon, and lung cancers were constructed using gene expression data and a list of regulatory interactions, the weights of which were then calculated using biological and topological features of the network. After that, the weighted regulatory interactions were used in the structure analysis of interactions achieved using two separate Markov chains on the bipartite graph taken from the main graph of the gene network and implementing the stochastic approach for link-structure analysis. The proposed algorithm categorizes higher-ranked genes as driver genes.
Results:
The proposed algorithm was compared with 24 other computational and network tools based on the F-measure value and the number of detected CDGs. The results were validated using four valid databases. The findings of this study show that SalsaDriver outperforms other methods and can identify a significant number of driver genes not identified using other methods.
Conclusion:
The SalsaDriver network-based approach is suitable for predicting CDGs and can be used as a complementary method along with other computational tools.
Affiliation(s)
- Mostafa Akhavan-Safar
  - Department of Computer and Information Technology Engineering, Payame Noor University (PNU), P.O. Box, 19395-4697, Tehran, Iran
  - Department of Information Technology Engineering, School of Systems and Industrial Engineering, Tarbiat Modares University (TMU), Tehran, Iran
- Babak Teimourpour
  - Department of Information Technology Engineering, School of Systems and Industrial Engineering, Tarbiat Modares University (TMU), Tehran, Iran
- Abbas Nowzari-Dalini
  - Department of Computer Science, School of Mathematics, Statistics, and Computer Science, University of Tehran, Tehran, Iran
8
Privitera AP, Barresi V, Condorelli DF. Aberrations of Chromosomes 1 and 16 in Breast Cancer: A Framework for Cooperation of Transcriptionally Dysregulated Genes. Cancers (Basel) 2021; 13:1585. [PMID: 33808143 PMCID: PMC8037453 DOI: 10.3390/cancers13071585] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 02/17/2021] [Revised: 03/21/2021] [Accepted: 03/24/2021] [Indexed: 12/13/2022] Open
Abstract
Derivative chromosome der(1;16), isochromosome 1q, and deleted 16q (producing arm-level 1q-gain and/or 16q-loss) are recurrent cytogenetic abnormalities in breast cancer, but their exact role in determining the malignant phenotype is still largely unknown. We exploited The Cancer Genome Atlas (TCGA) data to generate and analyze groups of breast invasive carcinomas, called 1,16-chromogroups, that are characterized by a pattern of arm-level somatic copy number aberrations congruent with known cytogenetic aberrations of chromosome 1 and 16. Substantial differences were found among 1,16-chromogroups in terms of other chromosomal aberrations, aneuploidy scores, transcriptomic data, single-point mutations, histotypes, and molecular subtypes. Breast cancers with a co-occurrence of 1q-gain and 16q-loss can be distinguished in a "low aneuploidy score" group, congruent to der(1;16), and a "high aneuploidy score" group, congruent to the co-occurrence of isochromosome 1q and deleted 16q. Another three groups are formed by cancers showing separately 1q-gain or 16q-loss or no aberrations of 1q and 16q. Transcriptome comparisons among the 1,16-chromogroups, integrated with functional pathway analysis, suggested the cooperation of overexpressed 1q genes and underexpressed 16q genes in the genesis of both ductal and lobular carcinomas, thus highlighting the putative role of genes encoding gamma-secretase subunits (APH1A, PSEN2, and NCSTN) and Wnt enhanceosome components (BCL9 and PYGO2) in 1q, and the glycoprotein E-cadherin (CDH1), the E3 ubiquitin-protein ligase WWP2, the deubiquitinating enzyme CYLD, and the transcription factor CBFB in 16q. The analysis of 1,16-chromogroups is a strategy with far-reaching implications for the selection of cancer cell models and novel experimental therapies.
Affiliation(s)
- Vincenza Barresi
  - Department of Biomedical and Biotechnological Sciences, Section of Medical Biochemistry, University of Catania, Via S. Sofia 89-97, 95123 Catania, Italy
- Daniele Filippo Condorelli
  - Department of Biomedical and Biotechnological Sciences, Section of Medical Biochemistry, University of Catania, Via S. Sofia 89-97, 95123 Catania, Italy
9
Qin G, Liu Z, Xie L. Multiple Omics Data Integration. Systems Medicine 2021. [DOI: 10.1016/b978-0-12-801238-3.11508-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/25/2022] Open
10
GenHITS: A network science approach to driver gene detection in human regulatory network using gene's influence evaluation. J Biomed Inform 2020; 114:103661. [PMID: 33326867 DOI: 10.1016/j.jbi.2020.103661] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 03/06/2020] [Revised: 12/09/2020] [Accepted: 12/10/2020] [Indexed: 12/14/2022]
Abstract
Cancer is a potentially fatal disease in which cells grow and reproduce uncontrollably, beyond the control of the cell's regulatory mechanisms. In this disease, some genes initiate abnormalities and then transmit them to other genes through protein interactions. These genes are known as cancer driver genes (CDGs). Several methods have been developed for identifying CDGs, most of them computational methods that use the concept of mutation for prediction. In this research, a method is proposed for identifying CDGs in the transcription regulatory network using the concept of influence diffusion, by modifying the Hyperlink-Induced Topic Search (HITS) algorithm based on the diffusion concept. Given the type of these networks and the way abnormalities progress in cells and cancerous tumors form, high-influence genes are the most likely candidates for driver genes, so influence diffusion is a reasonable basis for identifying them. Recently, a method was proposed to detect CDGs using the concept of influence maximization. One of the challenges in these networks is determining the strength of the regulatory interactions between genes, and we also propose a novel method, based on the concept of diffusion, to calculate the weights of regulatory interactions. The performance of the proposed method was compared with seventeen other computational and network tools, using three cancer types as benchmarks: breast invasive carcinoma (BRCA), colon adenocarcinoma (COAD), and lung squamous cell carcinoma (LUSC). To determine the accuracy of the drivers detected by each method, the CGC (Cancer Gene Census) and Mut-driver gene lists were used as gold standards. The results show that GenHITS performs better than most of the other computational and network methods. It also identifies genes that none of the other methods have identified.
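For orientation, the unmodified Hyperlink-Induced Topic Search iteration that this abstract builds on can be sketched as follows. This is plain HITS on a weighted directed network, not the diffusion-based modification or edge-weighting scheme of GenHITS, which the abstract does not specify. The function name `hits_scores` and the convention that `A[i, j]` holds the weight of the regulatory edge from gene `i` to gene `j` are assumptions for illustration.

```python
import numpy as np

def hits_scores(A, iters=100, tol=1e-10):
    """Plain HITS by power iteration on a weighted adjacency matrix.

    A[i, j] is the (assumed) weight of the edge from gene i to gene j.
    Returns (hubs, authorities), each scaled to unit Euclidean norm.
    """
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    hubs = np.ones(n)
    auth = np.ones(n)
    for _ in range(iters):
        # Authority of a gene: summed hub scores of the genes pointing to it.
        new_auth = A.T @ hubs
        norm = np.linalg.norm(new_auth)
        if norm > 0:
            new_auth = new_auth / norm
        # Hub score of a gene: summed authority of the genes it points to.
        new_hubs = A @ new_auth
        norm = np.linalg.norm(new_hubs)
        if norm > 0:
            new_hubs = new_hubs / norm
        converged = (np.abs(new_hubs - hubs).max() < tol
                     and np.abs(new_auth - auth).max() < tol)
        hubs, auth = new_hubs, new_auth
        if converged:
            break
    return hubs, auth
```

In a regulatory reading, a high hub score marks a gene that points to many well-targeted genes (an influencer), while a high authority score marks a gene that receives input from strong regulators.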
11
Akhavan-Safar M, Teimourpour B. KatzDriver: A network based method to cancer causal genes discovery in gene regulatory network. Biosystems 2020; 201:104326. [PMID: 33309969 DOI: 10.1016/j.biosystems.2020.104326] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 04/20/2020] [Revised: 12/08/2020] [Accepted: 12/08/2020] [Indexed: 02/07/2023]
Abstract
One of the important problems in oncology is finding the genes that perturb cell functionality and cause cancer. These genes, namely cancer driver genes (CDGs), when mutated, lead to the activation of abnormal proteins. This abnormality is passed on to other genes by protein-protein interactions, which can cause cells to multiply uncontrollably and become cancerous. Many methods have therefore been introduced to predict this group of genes. Most of these methods are computational and identify CDGs based on mutations and genomic data. In this study, we propose KatzDriver, a network-based approach for detecting CDGs. This method calculates the relative impact of each gene on the spread of abnormality in the gene regulatory network. In this approach, we first construct the studied networks using gene expression and regulatory interaction data. Then, by combining the topological and biological data, the weights of edges (regulatory interactions) and nodes (genes) are calculated. Afterward, based on the Katz approach, the receiving and broadcasting powers of each gene are calculated to find its relative impact. Finally, the genes with the highest relative impact ranks are selected as potential cancer drivers. The proposed approach was compared with 18 existing computational and network-based methods in terms of F-measure and the number of predicted cancer driver genes. The results show that the proposed algorithm performs better than most of the other methods. KatzDriver is also able to detect a significant number of unique driver genes compared to other computational and network-based methods.
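The receiving and broadcasting powers mentioned in this abstract can be illustrated with standard Katz centrality computed on incoming and outgoing edges, respectively. This is a minimal sketch under stated assumptions: the function name `katz_powers`, the convention that `A[i, j]` is the weight of the edge from gene `i` to gene `j`, and the default attenuation factor are hypothetical, and the abstract does not say how KatzDriver weights nodes and edges or combines the two powers into a relative impact.

```python
import numpy as np

def katz_powers(A, alpha=None, beta=1.0):
    """Katz centrality on out-edges (broadcasting) and in-edges (receiving).

    A[i, j] is the (assumed) weight of the edge from gene i to gene j.
    alpha must be smaller than 1 / spectral_radius(A) for the series to converge.
    """
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    rho = np.abs(np.linalg.eigvals(A)).max()  # spectral radius
    if alpha is None:
        # Illustrative default: half of the convergence bound.
        alpha = 0.5 / rho if rho > 1e-12 else 0.5
    elif rho > 1e-12 and alpha >= 1.0 / rho:
        raise ValueError("alpha must be below 1 / spectral radius")

    identity = np.eye(n)
    base = beta * np.ones(n)
    # Receiving power solves x = alpha * A^T x + beta (walks arriving at a gene).
    receiving = np.linalg.solve(identity - alpha * A.T, base)
    # Broadcasting power solves y = alpha * A y + beta (walks leaving a gene).
    broadcasting = np.linalg.solve(identity - alpha * A, base)
    return broadcasting, receiving
```

For a three-gene chain 0 → 1 → 2 with `alpha = 0.5`, the gene at the head of the chain has the largest broadcasting power and the gene at the tail has the largest receiving power.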
Affiliation(s)
- Mostafa Akhavan-Safar
  - Information Technology Engineering Department, School of Systems and Industrial Engineering, Tarbiat Modares University (TMU), Chamran/Al-e-Ahmad Highways Intersection, Tehran, P.O. Box 14115-111, Iran.
- Babak Teimourpour
  - Information Technology Engineering Department, School of Systems and Industrial Engineering, Tarbiat Modares University (TMU), Chamran/Al-e-Ahmad Highways Intersection, Tehran, P.O. Box 14115-111, Iran.
12
García-Cortés D, de Anda-Jáuregui G, Fresno C, Hernández-Lemus E, Espinal-Enríquez J. Gene Co-expression Is Distance-Dependent in Breast Cancer. Front Oncol 2020; 10:1232. [PMID: 32850369 PMCID: PMC7396632 DOI: 10.3389/fonc.2020.01232] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Received: 02/27/2020] [Accepted: 06/16/2020] [Indexed: 12/15/2022] Open
Abstract
Breast carcinomas are characterized by anomalous gene regulatory programs. As is well-known, gene expression programs are able to shape phenotypes. Hence, the understanding of gene co-expression may shed light on the underlying mechanisms behind the transcriptional regulatory programs affecting tumor development and evolution. For instance, in breast cancer, there is a clear loss of inter-chromosomal (trans-) co-expression, compared with healthy tissue. At the same time cis- (intra-chromosomal) interactions are favored in breast tumors. In order to have a deeper understanding of regulatory phenomena in cancer, here, we constructed Gene Co-expression Networks by using TCGA-derived RNA-seq whole-genome samples corresponding to the four breast cancer molecular subtypes, as well as healthy tissue. We quantify the cis-/trans- co-expression imbalance in all phenotypes. Additionally, we measured the association between co-expression and physical distance between genes, and characterized the ratio of intra/inter-cytoband interactions per phenotype. We confirmed loss of trans- co-expression in all molecular subtypes. We also observed that gene cis- co-expression decays abruptly with distance in all tumors in contrast with healthy tissue. We observed co-expressed gene hotspots, that tend to be connected at cytoband regions, and coincide accurately with already known copy number altered regions, such as Chr17q12, or Chr8q24.3 for all subtypes. Our methodology recovered different alterations already reported for specific breast cancer subtypes, showing how co-expression network approaches might help to capture distinct events that modify the cell regulatory program.
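The cis-/trans- imbalance described in this abstract amounts to counting how many co-expression edges connect genes on the same chromosome. The sketch below is illustrative only: the function names `cis_trans_counts` and `cis_fraction`, the edge-list input, and the gene-to-chromosome dictionary are hypothetical, and the abstract does not describe how the authors thresholded or normalized their networks.

```python
from collections import Counter

def cis_trans_counts(edges, chrom_of):
    """Count intra-chromosomal (cis) and inter-chromosomal (trans) edges.

    edges    : iterable of (gene_a, gene_b) pairs from a co-expression network.
    chrom_of : dict mapping each gene to its chromosome label.
    """
    counts = Counter({"cis": 0, "trans": 0})
    for gene_a, gene_b in edges:
        kind = "cis" if chrom_of[gene_a] == chrom_of[gene_b] else "trans"
        counts[kind] += 1
    return counts

def cis_fraction(edges, chrom_of):
    """Fraction of edges that are cis; 0.0 for an empty network."""
    counts = cis_trans_counts(edges, chrom_of)
    total = counts["cis"] + counts["trans"]
    return counts["cis"] / total if total else 0.0
```

Comparing this fraction between tumor and healthy networks built with the same edge-selection rule gives one simple summary of the loss of trans- co-expression.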
Affiliation(s)
- Diana García-Cortés
  - Computational Genomics Division, National Institute of Genomic Medicine, Mexico City, Mexico
  - Programa de Doctorado en Ciencias Biomédicas, Universidad Nacional Autónoma de México, Mexico City, Mexico
- Cristóbal Fresno
  - Computational Genomics Division, National Institute of Genomic Medicine, Mexico City, Mexico
- Enrique Hernández-Lemus
  - Computational Genomics Division, National Institute of Genomic Medicine, Mexico City, Mexico
  - Centro de Ciencias de la Complejidad, Universidad Nacional Autónoma de México, Mexico City, Mexico
- Jesús Espinal-Enríquez
  - Computational Genomics Division, National Institute of Genomic Medicine, Mexico City, Mexico
  - Centro de Ciencias de la Complejidad, Universidad Nacional Autónoma de México, Mexico City, Mexico

13
Oh M, Park S, Kim S, Chae H. Machine learning-based analysis of multi-omics data on the cloud for investigating gene regulations. Brief Bioinform 2020; 22:66-76. [PMID: 32227074 DOI: 10.1093/bib/bbaa032] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.6] [Received: 11/21/2019] [Revised: 02/05/2020] [Accepted: 02/25/2020] [Indexed: 02/06/2023]
Abstract
Gene expressions are subtly regulated by quantifiable measures of genetic molecules such as interaction with other genes, methylation, mutations, transcription factor and histone modifications. Integrative analysis of multi-omics data can help scientists understand the condition or patient-specific gene regulation mechanisms. However, analysis of multi-omics data is challenging since it requires not only the analysis of multiple omics data sets but also mining complex relations among different genetic molecules by using state-of-the-art machine learning methods. In addition, analysis of multi-omics data needs quite large computing infrastructure. Moreover, interpretation of the analysis results requires collaboration among many scientists, often requiring reperforming analysis from different perspectives. Many of the aforementioned technical issues can be nicely handled when machine learning tools are deployed on the cloud. In this survey article, we first survey machine learning methods that can be used for gene regulation study, and we categorize them according to five different goals: gene regulatory subnetwork discovery, disease subtype analysis, survival analysis, clinical prediction and visualization. We also summarize the methods in terms of multi-omics input types. Then, we explain why the cloud is potentially a good solution for the analysis of multi-omics data, followed by a survey of two state-of-the-art cloud systems, Galaxy and BioVLAB. Finally, we discuss important issues when the cloud is used for the analysis of multi-omics data for the gene regulation study.
Affiliation(s)
- Minsik Oh
  - Department of Computer Science and Engineering, Seoul National University, Seoul, 08826, Korea
- Sungjoon Park
  - Department of Computer Science and Engineering, Seoul National University, Seoul, 08826, Korea
- Sun Kim
  - Department of Computer Science and Engineering, Seoul National University, Seoul, 08826, Korea
  - Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, 08826, Korea
  - Bioinformatics Institute, Seoul National University, Seoul, 08826, Korea
- Heejoon Chae
  - Division of Computer Science, Sookmyung Women's University, Seoul, 04310, Korea

14
Mallik S, Zhao Z. Graph- and rule-based learning algorithms: a comprehensive review of their applications for cancer type classification and prognosis using genomic data. Brief Bioinform 2020; 21:368-394. [PMID: 30649169 PMCID: PMC7373185 DOI: 10.1093/bib/bby120] [Citation(s) in RCA: 23] [Impact Index Per Article: 4.6] [Received: 07/27/2018] [Revised: 10/26/2018] [Accepted: 11/21/2018] [Indexed: 12/20/2022]
Abstract
Cancer is well recognized as a complex disease with dysregulated molecular networks or modules. Graph- and rule-based analytics have been applied extensively for cancer classification as well as prognosis using large genomic and other data over the past decade. This article provides a comprehensive review of various graph- and rule-based machine learning algorithms that have been applied to numerous genomics data to determine the cancer-specific gene modules, identify gene signature-based classifiers and carry out other related objectives of potential therapeutic value. This review focuses mainly on the methodological design and features of these algorithms to facilitate the application of these graph- and rule-based analytical approaches for cancer classification and prognosis. Based on the type of data integration, we divided all the algorithms into three categories: model-based integration, pre-processing integration and post-processing integration. Each category is further divided into four sub-categories (supervised, unsupervised, semi-supervised and survival-driven learning analyses) based on learning style. Therefore, a total of 11 categories of methods are summarized with their inputs, objectives and description, advantages and potential limitations. Next, we briefly demonstrate well-known and most recently developed algorithms for each sub-category along with salient information, such as data profiles, statistical or feature selection methods and outputs. Finally, we summarize the appropriate use and efficiency of all categories of graph- and rule mining-based learning methods when input data and specific objective are given. This review aims to help readers to select and use the appropriate algorithms for cancer classification and prognosis study.
Affiliation(s)
- Saurav Mallik
  - Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center, Houston
- Zhongming Zhao
  - Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center, Houston

15
Duan A, Kong L, An T, Zhou H, Yu C, Li Y. Star-PAP regulates tumor protein D52 through modulating miR-449a/34a in breast cancer. Biol Open 2019; 8:bio.045914. [PMID: 31649118 PMCID: PMC6899025 DOI: 10.1242/bio.045914] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 01/10/2023]
Abstract
Tumor protein D52 (TPD52) is an oncogene amplified and overexpressed in various cancers. Tumor-suppressive microRNA-449a and microRNA-34a (miR-449a/34a) were recently reported to inhibit breast cancer cell migration and invasion via targeting TPD52. However, the upstream events are not clearly defined. Star-PAP is a non-canonical poly(A) polymerase which could regulate the expression of many miRNAs and mRNAs, but its biological functions are not well elucidated. The present study aimed to explore the regulative roles of Star-PAP in miR-449a/34a and TPD52 expression in breast cancer. We observed a negative correlation between the expression of TPD52 and Star-PAP in breast cancer. Overexpression of Star-PAP inhibited TPD52 expression, while endogenous Star-PAP knockdown led to increased TPD52. Furthermore, RNA immunoprecipitation assay suggested that Star-PAP could not bind to TPD52, independent of the 3′-end processing. RNA pull-down assay showed that Star-PAP could bind to the 3′ region of miR-449a. In line with these results, blunted cell proliferation or cell apoptosis caused by Star-PAP was rescued by overexpression of TPD52 or downregulation of miR-449a/34a. Our findings identified that Star-PAP regulates TPD52 by modulating miR-449a/34a, which may be an important molecular mechanism underlying the tumorigenesis of breast cancer and provide a rational therapeutic target for breast cancer treatment. Summary: Star-PAP is an important regulator of miR-449a/34a and was first identified indirectly regulating TPD52 via modulating miR-449a/34a. Furthermore, the Star-PAP-miR-449a/34a-TPD52 axis is involved in proliferation and apoptosis of breast cancer cells.
Affiliation(s)
- Aizhu Duan
  - State Key Laboratory of Phytochemistry and Plant Resources in West China, Kunming Institute of Botany, Chinese Academy of Sciences, Kunming, 650201, P.R. China
  - University of Chinese Academy of Sciences, Beijing, 100049, P.R. China
- Lingmei Kong
  - State Key Laboratory of Phytochemistry and Plant Resources in West China, Kunming Institute of Botany, Chinese Academy of Sciences, Kunming, 650201, P.R. China
- Tao An
  - State Key Laboratory of Phytochemistry and Plant Resources in West China, Kunming Institute of Botany, Chinese Academy of Sciences, Kunming, 650201, P.R. China
- Hongyu Zhou
  - State Key Laboratory of Phytochemistry and Plant Resources in West China, Kunming Institute of Botany, Chinese Academy of Sciences, Kunming, 650201, P.R. China
- Chunlei Yu
  - Institute of Materia Medica, School of Pharmacy, North Sichuan Medical College, Nanchong, Sichuan, 637100, P.R. China
- Yan Li
  - State Key Laboratory of Phytochemistry and Plant Resources in West China, Kunming Institute of Botany, Chinese Academy of Sciences, Kunming, 650201, P.R. China

16
Kim Y, Bismeijer T, Zwart W, Wessels LFA, Vis DJ. Genomic data integration by WON-PARAFAC identifies interpretable factors for predicting drug-sensitivity in vivo. Nat Commun 2019; 10:5034. [PMID: 31695042 PMCID: PMC6834616 DOI: 10.1038/s41467-019-13027-2] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Received: 07/26/2018] [Accepted: 10/10/2019] [Indexed: 01/20/2023]
Abstract
Integrative analyses that summarize and link molecular data to treatment sensitivity are crucial to capture the biological complexity which is essential to further precision medicine. We introduce Weighted Orthogonal Nonnegative parallel factor analysis (WON-PARAFAC), a data integration method that identifies sparse and interpretable factors. WON-PARAFAC summarizes the GDSC1000 cell line compendium in 130 factors. We interpret the factors based on their association with recurrent molecular alterations, pathway enrichment, cancer type, and drug-response. Crucially, the cell line derived factors capture the majority of the relevant biological variation in Patient-Derived Xenograft (PDX) models, strongly suggesting our factors capture invariant and generalizable aspects of cancer biology. Furthermore, drug response in cell lines is better and more consistently translated to PDXs using factor-based predictors as compared to raw feature-based predictors. WON-PARAFAC efficiently summarizes and integrates multiway high-dimensional genomic data and enhances translatability of drug response prediction from cell lines to patient-derived xenografts.
Affiliation(s)
- Yongsoo Kim
  - Division of Oncogenomics, Oncode Institute, The Netherlands Cancer Institute, Amsterdam, The Netherlands
  - Division of Molecular Carcinogenesis, Oncode Institute, The Netherlands Cancer Institute, Amsterdam, The Netherlands
  - Department of Pathology, VU University Medical Center, Amsterdam, The Netherlands
- Tycho Bismeijer
  - Division of Molecular Carcinogenesis, Oncode Institute, The Netherlands Cancer Institute, Amsterdam, The Netherlands
- Wilbert Zwart
  - Division of Oncogenomics, Oncode Institute, The Netherlands Cancer Institute, Amsterdam, The Netherlands
  - Department of Biomedical Engineering, Eindhoven University of Technology, Eindhoven, The Netherlands
- Lodewyk F A Wessels
  - Division of Molecular Carcinogenesis, Oncode Institute, The Netherlands Cancer Institute, Amsterdam, The Netherlands
  - Faculty of EEMCS, Delft University of Technology, Delft, The Netherlands
- Daniel J Vis
  - Division of Molecular Carcinogenesis, Oncode Institute, The Netherlands Cancer Institute, Amsterdam, The Netherlands

17
Rahimi M, Teimourpour B, Marashi SA. Cancer driver gene discovery in transcriptional regulatory networks using influence maximization approach. Comput Biol Med 2019; 114:103362. [DOI: 10.1016/j.compbiomed.2019.103362] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 01/25/2019] [Revised: 07/08/2019] [Accepted: 07/16/2019] [Indexed: 01/24/2023]

18
Rappoport N, Shamir R. Multi-omic and multi-view clustering algorithms: review and cancer benchmark. Nucleic Acids Res 2019; 46:10546-10562. [PMID: 30295871 PMCID: PMC6237755 DOI: 10.1093/nar/gky889] [Citation(s) in RCA: 238] [Impact Index Per Article: 39.7] [Received: 07/18/2018] [Accepted: 09/20/2018] [Indexed: 12/18/2022]
Abstract
Recent high throughput experimental methods have been used to collect large biomedical omics datasets. Clustering of single omic datasets has proven invaluable for biological and medical research. The decreasing cost and development of additional high throughput methods now enable measurement of multi-omic data. Clustering multi-omic data has the potential to reveal further systems-level insights, but raises computational and biological challenges. Here, we review algorithms for multi-omics clustering, and discuss key issues in applying these algorithms. Our review covers methods developed specifically for omic data as well as generic multi-view methods developed in the machine learning community for joint clustering of multiple data types. In addition, using cancer data from TCGA, we perform an extensive benchmark spanning ten different cancer types, providing the first systematic comparison of leading multi-omics and multi-view clustering algorithms. The results highlight key issues regarding the use of single- versus multi-omics, the choice of clustering strategy, the power of generic multi-view methods and the use of approximated p-values for gauging solution quality. Due to the growing use of multi-omics data, we expect these issues to be important for future progress in the field.
Affiliation(s)
- Nimrod Rappoport
  - Blavatnik School of Computer Science, Tel Aviv University, Tel Aviv, Israel
- Ron Shamir
  - Blavatnik School of Computer Science, Tel Aviv University, Tel Aviv, Israel

19
Integration of protein interaction and gene co-expression information for identification of melanoma candidate genes. Melanoma Res 2019; 29:126-133. [PMID: 30451788 DOI: 10.1097/cmr.0000000000000525] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Indexed: 12/15/2022]
Abstract
Cutaneous melanoma is an aggressive form of skin cancer that causes death worldwide. Although much has been learned about the molecular basis of melanoma genesis and progression, there is also increasing appreciation for the continuing discovery of melanoma genes to improve the genetic understanding of this malignancy. In the present study, melanoma candidate genes were identified by analysis of the common network from cancer type-specific RNA-Seq co-expression data and protein-protein interaction profiles. Then, an integrated network containing the known melanoma-related genes represented as seed genes and the putative genes represented as linker genes was generated using the subnetwork extraction algorithm. According to the network topology property of the putative genes, we selected seven key genes (CREB1, XPO1, SP3, TNFRSF1B, CD40LG, UBR1, and ZNF484) as candidate genes of melanoma. Subsequent analysis showed that six of these genes are melanoma-associated genes and one (ZNF484) is a cancer-associated gene on the basis of the existing literature. A signature comprising these seven key genes was developed and an overall survival analysis of 461 cutaneous melanoma cases was carried out. This seven-gene signature can accurately determine the risk profile for cutaneous melanoma tumors (log-rank P=3.27E-05) and be validated on an independent clinical cohort (log-rank P=0.028). The presented seven genes might serve as candidates for studying the molecular mechanisms and help improve the prognostic risk assessment, which have clinical implications for melanoma patients.

20
Athreya A, Iyer R, Neavin D, Wang L, Weinshilboum R, Kaddurah-Daouk R, Rush J, Frye M, Bobo W. Augmentation of Physician Assessments with Multi-Omics Enhances Predictability of Drug Response: A Case Study of Major Depressive Disorder. IEEE Comput Intell Mag 2018; 13:20-31. [PMID: 30467458 DOI: 10.1109/mci.2018.2840660] [Citation(s) in RCA: 28] [Impact Index Per Article: 4.0] [Indexed: 01/05/2023]
Abstract
This work proposes a "learning-augmented clinical assessment" workflow to sequentially augment physician assessments of patients' symptoms and their socio-demographic measures with heterogeneous biological measures to accurately predict treatment outcomes using machine learning. Across many psychiatric illnesses, ranging from major depressive disorder to schizophrenia, symptom severity assessments are subjective and do not include biological measures, making predictability in eventual treatment outcomes a challenge. Using data from the Mayo Clinic PGRN-AMPS SSRI trial as a case study, this work demonstrates a significant improvement in the prediction accuracy for antidepressant treatment outcomes in patients with major depressive disorder from 35% to 80% individualized by patient, compared to using only a physician's assessment as the predictors. This improvement is achieved through an iterative overlay of biological measures, starting with metabolites (blood measures modulated by drug action) associated with symptom severity, and then adding in genes associated with metabolomic concentrations. Hence, therapeutic efficacy for a new patient can be assessed prior to treatment, using prediction models that take as inputs, selected biological measures and physician's assessments of depression severity. Of broader significance extending beyond psychiatry, the approach presented in this work can potentially be applied to predicting treatment outcomes for other medical conditions, such as migraine headaches or rheumatoid arthritis, for which patients are treated according to subject-reported assessments of symptom severity.
Affiliation(s)
- Arjun Athreya
  - Department of Electrical and Computer Engineering, Univ. of Illinois at Urbana-Champaign, IL, USA
- Ravishankar Iyer
  - Department of Electrical and Computer Engineering, Univ. of Illinois at Urbana-Champaign, IL, USA
- Drew Neavin
  - Department of Molecular Pharmacology and Experimental Therapeutics, Mayo Clinic, MN, USA
- Liewei Wang
  - Department of Molecular Pharmacology and Experimental Therapeutics, Mayo Clinic, MN, USA
- Richard Weinshilboum
  - Department of Molecular Pharmacology and Experimental Therapeutics, Mayo Clinic, MN, USA
- John Rush
  - Department of Psychiatry and Behavioral Sciences, Duke University, NC, USA
- Mark Frye
  - Department of Psychiatry and Psychology, Mayo Clinic, MN, USA
- William Bobo
  - Department of Psychiatry and Psychology, Mayo Clinic, FL, USA

21
Misra BB, Langefeld CD, Olivier M, Cox LA. Integrated Omics: Tools, Advances, and Future Approaches. J Mol Endocrinol 2018; 62:JME-18-0055. [PMID: 30006342 DOI: 10.1530/jme-18-0055] [Citation(s) in RCA: 228] [Impact Index Per Article: 32.6] [Received: 02/24/2018] [Revised: 07/02/2018] [Accepted: 07/12/2018] [Indexed: 12/13/2022]
Abstract
With the rapid adoption of high-throughput omic approaches to analyze biological samples such as genomics, transcriptomics, proteomics, and metabolomics, each analysis can generate tera- to peta-byte sized data files on a daily basis. These data file sizes, together with differences in nomenclature among these data types, make the integration of these multi-dimensional omics data into biologically meaningful context challenging. Variously named as integrated omics, multi-omics, poly-omics, trans-omics, pan-omics, or shortened to just 'omics', the challenges include differences in data cleaning, normalization, biomolecule identification, data dimensionality reduction, biological contextualization, statistical validation, data storage and handling, sharing, and data archiving. The ultimate goal is towards the holistic realization of a 'systems biology' understanding of the biological question in hand. Commonly used approaches in these efforts are currently limited by the 3 i's - integration, interpretation, and insights. Post integration, these very large datasets aim to yield unprecedented views of cellular systems at exquisite resolution for transformative insights into processes, events, and diseases through various computational and informatics frameworks. With the continued reduction in costs and processing time for sample analyses, and increasing types of omics datasets generated such as glycomics, lipidomics, microbiomics, and phenomics, an increasing number of scientists in this interdisciplinary domain of bioinformatics face these challenges. We discuss recent approaches, existing tools, and potential caveats in the integration of omics datasets for development of standardized analytical pipelines that could be adopted by the global omics research community.
Affiliation(s)
- Biswapriya B Misra
  - B Misra, Internal Medicine, Wake Forest University School of Medicine, Winston-Salem, United States
- Carl D Langefeld
  - C Langefeld, Biostatistical Sciences, Wake Forest University School of Medicine, Winston-Salem, United States
- Michael Olivier
  - M Olivier, Internal Medicine, Wake Forest University School of Medicine, Winston-Salem, United States
- Laura A Cox
  - L Cox, Internal Medicine, Wake Forest University School of Medicine, Winston-Salem, United States

22
Cava C, Bertoli G, Colaprico A, Olsen C, Bontempi G, Castiglioni I. Integration of multiple networks and pathways identifies cancer driver genes in pan-cancer analysis. BMC Genomics 2018; 19:25. [PMID: 29304754 PMCID: PMC5756345 DOI: 10.1186/s12864-017-4423-x] [Citation(s) in RCA: 28] [Impact Index Per Article: 4.0] [Received: 05/18/2017] [Accepted: 12/27/2017] [Indexed: 12/12/2022]
Abstract
BACKGROUND: Modern high-throughput genomic technologies represent a comprehensive hallmark of molecular changes in pan-cancer studies. Although different cancer gene signatures have been revealed, the mechanism of tumourigenesis has yet to be completely understood. Pathways and networks are important tools to explain the role of genes in functional genomic studies. However, few methods consider the functional non-equal roles of genes in pathways and the complex gene-gene interactions in a network.
RESULTS: We present a novel method in pan-cancer analysis that identifies de-regulated genes with a functional role by integrating pathway and network data. A pan-cancer analysis of 7158 tumour/normal samples from 16 cancer types identified 895 genes with a central role in pathways and de-regulated in cancer. Comparing our approach with 15 current tools that identify cancer driver genes, we found that 35.6% of the 895 genes identified by our method have been found as cancer driver genes with at least 2/15 tools. Finally, we applied a machine learning algorithm on 16 independent GEO cancer datasets to validate the diagnostic role of cancer driver genes for each cancer. We obtained a list of the top-ten cancer driver genes for each cancer considered in this study.
CONCLUSIONS: Our analysis 1) confirmed that there are several known cancer driver genes in common among different types of cancer, 2) highlighted that cancer driver genes are able to regulate crucial pathways.
Affiliation(s)
- Claudia Cava
  - Institute of Molecular Bioimaging and Physiology, National Research Council (IBFM-CNR), Via F. Cervi 93, 20090 Milan, Segrate-Milan Italy
- Gloria Bertoli
  - Institute of Molecular Bioimaging and Physiology, National Research Council (IBFM-CNR), Via F. Cervi 93, 20090 Milan, Segrate-Milan Italy
- Antonio Colaprico
  - Interuniversity Institute of Bioinformatics in Brussels (IB)2, 1050 Brussels, Belgium
  - Machine Learning Group (MLG), Department d’Informatique, Universite libre de Bruxelles (ULB), 1050 Brussels, Belgium
- Catharina Olsen
  - Interuniversity Institute of Bioinformatics in Brussels (IB)2, 1050 Brussels, Belgium
  - Machine Learning Group (MLG), Department d’Informatique, Universite libre de Bruxelles (ULB), 1050 Brussels, Belgium
- Gianluca Bontempi
  - Interuniversity Institute of Bioinformatics in Brussels (IB)2, 1050 Brussels, Belgium
  - Machine Learning Group (MLG), Department d’Informatique, Universite libre de Bruxelles (ULB), 1050 Brussels, Belgium
- Isabella Castiglioni
  - Institute of Molecular Bioimaging and Physiology, National Research Council (IBFM-CNR), Via F. Cervi 93, 20090 Milan, Segrate-Milan Italy

23
Yue Z, Li HT, Yang Y, Hussain S, Zheng CH, Xia J, Chen Y. Identification of breast cancer candidate genes using gene co-expression and protein-protein interaction information. Oncotarget 2017; 7:36092-36100. [PMID: 27150055 PMCID: PMC5094985 DOI: 10.18632/oncotarget.9132] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.4] [Received: 10/31/2015] [Accepted: 04/16/2016] [Indexed: 01/18/2023]
Abstract
Breast cancer (BC) is one of the most common malignancies that could threaten female health. As the molecular mechanism of BC has not yet been completely discovered, identification of related genes of this disease is an important area of research that could provide new insights into gene function as well as potential treatment targets. Here we used subnetwork extraction algorithms to identify novel BC related genes based on the known BC genes (seed genes), gene co-expression profiles and protein-protein interaction network. We computationally predicted seven key genes (EPHX2, GHRH, PPYR1, ALPP, KNG1, GSK3A and TRIT1) as putative genes of BC. Further analysis shows that six of these have been reported as breast cancer associated genes, and one (PPYR1) as cancer associated gene. Lastly, we developed an expression signature using these seven key genes which significantly stratified 1660 BC patients according to relapse free survival (hazard ratio [HR], 0.55; 95% confidence interval [CI], 0.46–0.65; Logrank p = 5.5e−13). The 7-genes signature could be established as a useful predictor of disease prognosis in BC patients. Overall, the identified seven genes might be useful prognostic and predictive molecular markers to predict the clinical outcome of BC patients.
Affiliation(s)
- Zhenyu Yue
  - School of Life Sciences, Anhui University, Hefei, Anhui 230601, China
  - Institute of Health Sciences, Anhui University, Hefei, Anhui 230601, China
- Hai-Tao Li
  - College of Electrical Engineering and Automation, Anhui University, Hefei, Anhui 230601, China
- Yabing Yang
  - School of Life Sciences, Anhui University, Hefei, Anhui 230601, China
- Sajid Hussain
  - School of Life Sciences, Anhui University, Hefei, Anhui 230601, China
- Chun-Hou Zheng
  - College of Electrical Engineering and Automation, Anhui University, Hefei, Anhui 230601, China
- Junfeng Xia
  - Institute of Health Sciences, Anhui University, Hefei, Anhui 230601, China
- Yan Chen
  - School of Life Sciences, Anhui University, Hefei, Anhui 230601, China

24
Cai L, Li Q, Du Y, Yun J, Xie Y, DeBerardinis RJ, Xiao G. Genomic regression analysis of coordinated expression. Nat Commun 2017; 8:2187. [PMID: 29259170 PMCID: PMC5736603 DOI: 10.1038/s41467-017-02181-0] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.8] [Received: 10/25/2016] [Accepted: 11/13/2017] [Indexed: 01/06/2023]
Abstract
Co-expression analysis is widely used to predict gene function and to identify functionally related gene sets. However, co-expression analysis using human cancer transcriptomic data is confounded by somatic copy number alterations (SCNA), which produce co-expression signatures based on physical proximity rather than biological function. To better understand gene-gene co-expression based on biological regulation but not SCNA, we describe a method termed "Genomic Regression Analysis of Coordinated Expression" (GRACE) to adjust for the effect of SCNA in co-expression analysis. The results from analyses of TCGA, CCLE, and NCI60 data sets show that GRACE can improve our understanding of how a transcriptional network is re-wired in cancer. A user-friendly web database populated with data sets from The Cancer Genome Atlas (TCGA) is provided to allow customized query.
Affiliation(s)
- Ling Cai
  - Children's Medical Center Research Institute at UT Southwestern Medical Center, 6000 Harry Hines Blvd, Dallas, TX, 75235, USA
  - Quantitative Biomedical Research Center at UT Southwestern Medical Center, 5323 Harry Hines Blvd, Dallas, TX, 75390, USA
- Qiwei Li
  - Quantitative Biomedical Research Center at UT Southwestern Medical Center, 5323 Harry Hines Blvd, Dallas, TX, 75390, USA
- Yi Du
  - Department of Bioinformatics at UT Southwestern Medical Center, 5323 Harry Hines Blvd, Dallas, TX, 75390, USA
- Jonghyun Yun
  - Department of Mathematics at University of Texas at Arlington, 411 S. Nedderman Drive, 478 Pickard Hall, Arlington, TX, 76019, USA
- Yang Xie
  - Quantitative Biomedical Research Center at UT Southwestern Medical Center, 5323 Harry Hines Blvd, Dallas, TX, 75390, USA
- Ralph J DeBerardinis
  - Children's Medical Center Research Institute at UT Southwestern Medical Center, 6000 Harry Hines Blvd, Dallas, TX, 75235, USA
- Guanghua Xiao
  - Quantitative Biomedical Research Center at UT Southwestern Medical Center, 5323 Harry Hines Blvd, Dallas, TX, 75390, USA

25
Zhang T, Wang X, Yue Z. Identification of candidate genes related to pancreatic cancer based on analysis of gene co-expression and protein-protein interaction network. Oncotarget 2017; 8:71105-71116. [PMID: 29050346 PMCID: PMC5642621 DOI: 10.18632/oncotarget.20537] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.3] [Received: 05/14/2017] [Accepted: 07/29/2017] [Indexed: 12/11/2022]
Abstract
Pancreatic cancer (PC) is one of the most common causes of cancer mortality worldwide. As the genetic mechanism of this complex disease has not been clearly uncovered, identification of PC-related genes is of great significance and could provide new insights into gene function as well as potential therapy targets. In this study, we performed an integrated network method to discover PC candidate genes based on known PC related genes. Utilizing the subnetwork extraction algorithm with gene co-expression profiles and protein-protein interaction data, we obtained the integrated network comprising the known PC related genes (denoted as seed genes) and the putative genes (denoted as linker genes). We then prioritized the linker genes based on their network information and inferred six key genes (KRT19, BARD1, MST1R, S100A14, LGALS1 and RNF168) as candidate genes of PC. Further analysis indicated that all of these genes have been reported as pancreatic cancer associated genes. Finally, we developed an expression signature using these six key genes which significantly stratified PC patients according to overall survival (Logrank p = 0.003) and was validated on an independent clinical cohort (Logrank p = 0.03). Overall, the identified six genes might offer helpful prognostic stratification information and be suitable to transfer to clinical use in PC patients.
Affiliation(s)
- Tiejun Zhang
- GMU-GIBH Joint School of Life Sciences, Guangzhou Medical University, Guangzhou, Guangdong 511436, China
- Xiaojuan Wang
- Institute of Health Sciences, School of Computer Science and Technology, Anhui University, Hefei, Anhui 230601, China
- Zhenyu Yue
- Institute of Health Sciences, School of Computer Science and Technology, Anhui University, Hefei, Anhui 230601, China
26
Bjørklund SS, Panda A, Kumar S, Seiler M, Robinson D, Gheeya J, Yao M, Alnæs GIG, Toppmeyer D, Riis M, Naume B, Børresen-Dale AL, Kristensen VN, Ganesan S, Bhanot G. Widespread alternative exon usage in clinically distinct subtypes of Invasive Ductal Carcinoma. Sci Rep 2017; 7:5568. [PMID: 28717182 PMCID: PMC5514065 DOI: 10.1038/s41598-017-05537-0] [Citation(s) in RCA: 32] [Impact Index Per Article: 4.0] [Received: 07/10/2014] [Accepted: 06/05/2017] [Indexed: 12/11/2022]
Abstract
Cancer cells can have different patterns of exon usage of individual genes when compared to normal tissue, suggesting that alternative splicing may play a role in shaping the tumor phenotype. The discovery and identification of gene variants has increased dramatically with the introduction of RNA-sequencing technology, which enables whole transcriptome analysis of known, as well as novel isoforms. Here we report alternative splicing and transcriptional events among subtypes of invasive ductal carcinoma in The Cancer Genome Atlas (TCGA) Breast Invasive Carcinoma (BRCA) cohort. Alternative exon usage was widespread, and although common events were shared among three subtypes, ER+ HER2−, ER− HER2−, and HER2+, many events on the exon level were subtype specific. Additional RNA-seq analysis was carried out in an independent cohort of 43 ER+ HER2− and ER− HER2− primary breast tumors, confirming many of the exon events identified in the TCGA cohort. Alternative splicing and transcriptional events detected in five genes, MYO6, EPB41L1, TPD52, IQCG, and ACOX2 were validated by qRT-PCR in a third cohort of 40 ER+ HER2− and ER− HER2− patients, showing that these events were truly subtype specific.
Affiliation(s)
- Sunniva Stordal Bjørklund
- Rutgers Cancer Institute of New Jersey, 195 Little Albany Street, New Brunswick, NJ, 08903, USA; Department of Cancer Genetics, Institute for Cancer Research, OUS Radiumhospitalet, Oslo, 0310, Norway; The K.G. Jebsen Center for Breast Cancer Research, Institute for Clinical Medicine, Faculty of Medicine, University of Oslo, P.O box 1171, Blindern, 0318, Oslo, Norway
- Anshuman Panda
- Rutgers Cancer Institute of New Jersey, 195 Little Albany Street, New Brunswick, NJ, 08903, USA; Department of Physics, Rutgers University, Piscataway, NJ, 08854, USA
- Surendra Kumar
- Department of Cancer Genetics, Institute for Cancer Research, OUS Radiumhospitalet, Oslo, 0310, Norway; The K.G. Jebsen Center for Breast Cancer Research, Institute for Clinical Medicine, Faculty of Medicine, University of Oslo, P.O box 1171, Blindern, 0318, Oslo, Norway; Department of Clinical Molecular Biology and Laboratory Science (EpiGen), Akershus University Hospital, Division of Medicine, 1476, Lørenskog, Norway
- Michael Seiler
- Rutgers Cancer Institute of New Jersey, 195 Little Albany Street, New Brunswick, NJ, 08903, USA; BioMaPS Institute, Rutgers University, Piscataway, NJ, 08854, USA
- Doug Robinson
- BioMaPS Institute, Rutgers University, Piscataway, NJ, 08854, USA
- Jinesh Gheeya
- Rutgers Cancer Institute of New Jersey, 195 Little Albany Street, New Brunswick, NJ, 08903, USA
- Ming Yao
- Rutgers Cancer Institute of New Jersey, 195 Little Albany Street, New Brunswick, NJ, 08903, USA
- Grethe I Grenaker Alnæs
- Department of Cancer Genetics, Institute for Cancer Research, OUS Radiumhospitalet, Oslo, 0310, Norway
- Deborah Toppmeyer
- Rutgers Cancer Institute of New Jersey, 195 Little Albany Street, New Brunswick, NJ, 08903, USA
- Margit Riis
- Department of Clinical Molecular Biology and Laboratory Science (EpiGen), Akershus University Hospital, Division of Medicine, 1476, Lørenskog, Norway; Department of Surgery, Akershus University Hospital, 1478, Lørenskog, Norway; Department of Breast and Endocrine Surgery, Oslo University Hospital, Ullevål, 0450, Oslo, Norway
- Bjørn Naume
- Department of Oncology, Oslo University Hospital, Radiumhospitalet, Oslo, Norway
- Anne-Lise Børresen-Dale
- Department of Cancer Genetics, Institute for Cancer Research, OUS Radiumhospitalet, Oslo, 0310, Norway; The K.G. Jebsen Center for Breast Cancer Research, Institute for Clinical Medicine, Faculty of Medicine, University of Oslo, P.O box 1171, Blindern, 0318, Oslo, Norway
- Vessela N Kristensen
- Department of Cancer Genetics, Institute for Cancer Research, OUS Radiumhospitalet, Oslo, 0310, Norway; The K.G. Jebsen Center for Breast Cancer Research, Institute for Clinical Medicine, Faculty of Medicine, University of Oslo, P.O box 1171, Blindern, 0318, Oslo, Norway; Department of Clinical Molecular Biology and Laboratory Science (EpiGen), Akershus University Hospital, Division of Medicine, 1476, Lørenskog, Norway
- Shridar Ganesan
- Rutgers Cancer Institute of New Jersey, 195 Little Albany Street, New Brunswick, NJ, 08903, USA
- Gyan Bhanot
- Rutgers Cancer Institute of New Jersey, 195 Little Albany Street, New Brunswick, NJ, 08903, USA; Department of Physics, Rutgers University, Piscataway, NJ, 08854, USA; Department of Molecular Biology & Biochemistry, Rutgers University, Piscataway, NJ, 08854, USA
27
Huang S, Chaudhary K, Garmire LX. More Is Better: Recent Progress in Multi-Omics Data Integration Methods. Front Genet 2017; 8:84. [PMID: 28670325 PMCID: PMC5472696 DOI: 10.3389/fgene.2017.00084] [Citation(s) in RCA: 396] [Impact Index Per Article: 49.5] [Received: 03/24/2017] [Accepted: 06/01/2017] [Indexed: 01/20/2023]
Abstract
Multi-omics data integration is one of the major challenges in the era of precision medicine. Considerable work has been done since the advent of high-throughput studies, which have enabled data access for downstream analyses. To improve clinical outcome prediction, a gamut of software tools has been developed. This review outlines the progress made in the field of multi-omics integration and the comprehensive tools developed so far. At the end of the review, we further discuss integration methods for predicting patient survival.
Affiliation(s)
- Sijia Huang
- Epidemiology Program, University of Hawaii Cancer Center, Honolulu, HI, United States; Molecular Biosciences and Bioengineering Graduate Program, University of Hawaii at Manoa, Honolulu, HI, United States
- Kumardeep Chaudhary
- Epidemiology Program, University of Hawaii Cancer Center, Honolulu, HI, United States
- Lana X Garmire
- Epidemiology Program, University of Hawaii Cancer Center, Honolulu, HI, United States; Molecular Biosciences and Bioengineering Graduate Program, University of Hawaii at Manoa, Honolulu, HI, United States; Department of Obstetrics, Gynecology, and Women's Health, John A. Burns School of Medicine, University of Hawaii at Manoa, Honolulu, HI, United States
28
Baur B, Bozdag S. ProcessDriver: A computational pipeline to identify copy number drivers and associated disrupted biological processes in cancer. Genomics 2017; 109:233-240. [PMID: 28438487 DOI: 10.1016/j.ygeno.2017.04.004] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 02/15/2017] [Revised: 04/19/2017] [Accepted: 04/20/2017] [Indexed: 12/12/2022]
Abstract
Copy number amplifications and deletions that are recurrent in cancer samples harbor genes that confer a fitness advantage to cancer tumor proliferation and survival. One important challenge in computational biology is to separate the causal (i.e., driver) genes from passenger genes in large, aberrated regions. Many previous studies focus on the genes within the aberration (i.e., cis genes), but do not utilize the genes that are outside of the aberrated region and dysregulated as a result of the aberration (i.e., trans genes). We propose a computational pipeline, called ProcessDriver, that prioritizes candidate drivers by relating cis genes to dysregulated trans genes and biological processes. ProcessDriver is based on the assumption that a driver cis gene should be closely associated with the dysregulated trans genes and biological processes, as opposed to previous studies that assume a driver cis gene should be the most correlated gene to the copy number of an aberrated region. We applied our method on breast, bladder and ovarian cancer data from the Cancer Genome Atlas database. Our results included previously known driver genes and cancer genes, as well as potentially novel driver genes. Additionally, many genes in the final set of drivers were linked to new tumor events after initial treatment using survival analysis. Our results highlight the importance of selecting driver genes based on their widespread downstream effects in trans.
Affiliation(s)
- Brittany Baur
- Department of Mathematics, Statistics and Computer Science, Marquette University, Milwaukee, WI, USA
- Serdar Bozdag
- Department of Mathematics, Statistics and Computer Science, Marquette University, Milwaukee, WI, USA.
29
Lai YP, Wang LB, Wang WA, Lai LC, Tsai MH, Lu TP, Chuang EY. iGC-an integrated analysis package of gene expression and copy number alteration. BMC Bioinformatics 2017; 18:35. [PMID: 28088185 PMCID: PMC5237550 DOI: 10.1186/s12859-016-1438-2] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.9] [Received: 11/18/2016] [Accepted: 12/17/2016] [Indexed: 12/15/2022]
Abstract
BACKGROUND: With the advancement in high-throughput technologies, researchers can simultaneously investigate gene expression and copy number alteration (CNA) data from individual patients at a lower cost. Traditional analysis methods analyze each type of data individually and integrate their results using Venn diagrams. Challenges arise, however, when the results are irreproducible and inconsistent across multiple platforms. To address these issues, one possible approach is to concurrently analyze both gene expression profiling and CNAs in the same individual. RESULTS: We have developed an open-source R/Bioconductor package (iGC). Multiple input formats are supported and users can define their own criteria for identifying differentially expressed genes driven by CNAs. The analysis of two real microarray datasets demonstrated that the CNA-driven genes identified by the iGC package showed significantly higher Pearson correlation coefficients with their gene expression levels and copy numbers than those genes located in a genomic region with CNA. Compared with the Venn diagram approach, the iGC package showed better performance. CONCLUSION: The iGC package is effective and useful for identifying CNA-driven genes. By simultaneously considering both comparative genomic and transcriptomic data, it can provide better understanding of biological and medical questions. The iGC package's source code and manual are freely available at https://www.bioconductor.org/packages/release/bioc/html/iGC.html.
Affiliation(s)
- Yi-Pin Lai
- Bioinformatics and Biostatistics Core, Center of Genomic Medicine, National Taiwan University, Taipei, Taiwan
- Liang-Bo Wang
- Bioinformatics and Biostatistics Core, Center of Genomic Medicine, National Taiwan University, Taipei, Taiwan; Graduate Institute of Biomedical Electronics and Bioinformatics, Department of Electrical Engineering, National Taiwan University, Taipei, Taiwan
- Wei-An Wang
- Bioinformatics and Biostatistics Core, Center of Genomic Medicine, National Taiwan University, Taipei, Taiwan
- Liang-Chuan Lai
- Bioinformatics and Biostatistics Core, Center of Genomic Medicine, National Taiwan University, Taipei, Taiwan; Graduate Institute of Physiology, National Taiwan University, Taipei, Taiwan
- Mong-Hsun Tsai
- Bioinformatics and Biostatistics Core, Center of Genomic Medicine, National Taiwan University, Taipei, Taiwan; Institute of Biotechnology, National Taiwan University, Taipei, Taiwan
- Tzu-Pin Lu
- Department of Public Health, Institute of Epidemiology and Preventive Medicine, National Taiwan University, Taipei, Taiwan
- Eric Y Chuang
- Bioinformatics and Biostatistics Core, Center of Genomic Medicine, National Taiwan University, Taipei, Taiwan; Graduate Institute of Biomedical Electronics and Bioinformatics, Department of Electrical Engineering, National Taiwan University, Taipei, Taiwan
30
Liu C, Jiang J, Gu J, Yu Z, Wang T, Lu H. High-dimensional omics data analysis using a variable screening protocol with prior knowledge integration (SKI). BMC Syst Biol 2016; 10:118. [PMID: 28155690 PMCID: PMC5260139 DOI: 10.1186/s12918-016-0358-0] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.3] [Indexed: 12/25/2022]
Abstract
BACKGROUND: High-throughput technology can generate thousands to millions of biomarker measurements in one experiment. However, results from high-throughput analysis are often barely reproducible due to small sample size. Different statistical methods have been proposed to tackle this "small n and large p" scenario; for example, different datasets can be pooled or integrated to improve reproducibility. However, the raw data are often either unavailable or hard to integrate due to different experimental conditions, so there is an emerging need to develop a method for "knowledge integration" in high-throughput data analysis. RESULTS: In this study, we proposed an integrative prescreening approach, SKI, for high-throughput data analysis. A new rank is generated based on two initial ranks: (1) a knowledge-based rank; and (2) a marginal-correlation-based rank. Our simulation shows that SKI outperforms methods without knowledge integration, achieving a higher true positive rate for the same number of variables selected. We also applied our method in a drug response study and found its performance to be better than that of regular screening methods. CONCLUSION: The proposed method provides an effective way to integrate knowledge for high-throughput analysis. It can easily be implemented with our provided R package named SKI.
Affiliation(s)
- Cong Liu
- Department of Bioengineering, University of Illinois at Chicago, Chicago, IL, USA; SJTU-Yale Joint Center for Biostatistics, Shanghai Jiao Tong University, Shanghai, China
- Jianping Jiang
- SJTU-Yale Joint Center for Biostatistics, Shanghai Jiao Tong University, Shanghai, China; Department of Bioinformatics and Biostatistics, College of Life Science, Shanghai Jiao Tong University, Shanghai, China
- Jianlei Gu
- SJTU-Yale Joint Center for Biostatistics, Shanghai Jiao Tong University, Shanghai, China; Department of Bioinformatics and Biostatistics, College of Life Science, Shanghai Jiao Tong University, Shanghai, China; Center for Biomedical Informatics, Shanghai Children's Hospital, Shanghai, China
- Zhangsheng Yu
- SJTU-Yale Joint Center for Biostatistics, Shanghai Jiao Tong University, Shanghai, China; Department of Bioinformatics and Biostatistics, College of Life Science, Shanghai Jiao Tong University, Shanghai, China
- Tao Wang
- SJTU-Yale Joint Center for Biostatistics, Shanghai Jiao Tong University, Shanghai, China; Department of Bioinformatics and Biostatistics, College of Life Science, Shanghai Jiao Tong University, Shanghai, China
- Hui Lu
- Department of Bioengineering, University of Illinois at Chicago, Chicago, IL, USA; SJTU-Yale Joint Center for Biostatistics, Shanghai Jiao Tong University, Shanghai, China; Department of Bioinformatics and Biostatistics, College of Life Science, Shanghai Jiao Tong University, Shanghai, China; Center for Biomedical Informatics, Shanghai Children's Hospital, Shanghai, China
31
van Essen TH, van Pelt SI, Bronkhorst IHG, Versluis M, Némati F, Laurent C, Luyten GPM, van Hall T, van den Elsen PJ, van der Velden PA, Decaudin D, Jager MJ. Upregulation of HLA Expression in Primary Uveal Melanoma by Infiltrating Leukocytes. PLoS One 2016; 11:e0164292. [PMID: 27764126 PMCID: PMC5072555 DOI: 10.1371/journal.pone.0164292] [Citation(s) in RCA: 58] [Impact Index Per Article: 6.4] [Received: 02/20/2016] [Accepted: 09/22/2016] [Indexed: 12/22/2022]
Abstract
Introduction: Uveal melanomas (UM) with an inflammatory phenotype, characterized by infiltrating leukocytes and increased human leukocyte antigen (HLA) expression, carry an increased risk of death due to metastases. These tumors should be ideal for T-cell based therapies, yet it is not clear why prognostically unfavorable tumors have high HLA expression. We set out to determine whether the level of HLA molecules in UM is associated with other genetic factors, HLA transcriptional regulators, or microenvironmental factors. Methods: Twenty-eight enucleated UM were used to study HLA class I and II expression and several regulators of HLA by immunohistochemistry, PCR microarray, qPCR and chromosome SNP-array. Fresh tumor samples of eight primary UM and four metastases were compared to their corresponding xenografts in SCID mice, using a PCR microarray and SNP array. Results: Increased expression levels of HLA class I and II showed no dosage effect of chromosome 6p, but, as expected, were associated with monosomy of chromosome 3. Increased HLA class I and II protein levels were positively associated with their gene expression and with raised levels of the peptide-loading gene TAP1 and the HLA transcriptional regulators IRF1, IRF8, CIITA, and NLRC5, revealing a higher transcriptional activity in prognostically bad tumors. Implantation of fresh human tumor samples into SCID mice led to a loss of infiltrating leukocytes and to decreased expression of HLA class I and II genes and their regulators. Conclusion: Our data provide evidence for a properly functioning HLA regulatory system in UM, offering a target for T-cell based therapies.
Affiliation(s)
- Sake I van Pelt
- Department of Medical Statistics, LUMC, Leiden, the Netherlands
- Mieke Versluis
- Department of Ophthalmology, LUMC, Leiden, the Netherlands
- Fariba Némati
- Laboratory of Preclinical Investigation, Translational Research Department, Institut Curie, Paris, France
- Cécile Laurent
- Laboratory of Preclinical Investigation, Translational Research Department, Institut Curie, Paris, France
- Peter J van den Elsen
- Department of Immunohematology and Blood Transfusion, LUMC, Leiden, the Netherlands; Department of Pathology, VU University Medical Center, Amsterdam, the Netherlands
- Didier Decaudin
- Laboratory of Preclinical Investigation, Translational Research Department, Institut Curie, Paris, France; Department of Clinical Hematology, Institut Curie, Paris, France
32
Bersanelli M, Mosca E, Remondini D, Giampieri E, Sala C, Castellani G, Milanesi L. Methods for the integration of multi-omics data: mathematical aspects. BMC Bioinformatics 2016; 17 Suppl 2:15. [PMID: 26821531 PMCID: PMC4959355 DOI: 10.1186/s12859-015-0857-9] [Citation(s) in RCA: 237] [Impact Index Per Article: 26.3] [Indexed: 11/10/2022]
Abstract
BACKGROUND: Methods for the integrative analysis of multi-omics data are required to draw a more complete and accurate picture of the dynamics of molecular systems. The complexity of biological systems, the technological limits, the large number of biological variables and the relatively low number of biological samples make the analysis of multi-omics datasets a non-trivial problem. RESULTS AND CONCLUSIONS: We review the most advanced strategies for integrating multi-omics datasets, focusing on mathematical and methodological aspects.
Affiliation(s)
- Matteo Bersanelli
- Department of Physics and Astronomy, Universita' di Bologna, Via B. Pichat 6/2, Bologna, 40127, Italy; Institute of Biomedical Technologies - CNR, Via Fratelli Cervi 93, Segrate MI, 20090, Italy
- Ettore Mosca
- Institute of Biomedical Technologies - CNR, Via Fratelli Cervi 93, Segrate MI, 20090, Italy
- Daniel Remondini
- Department of Physics and Astronomy, Universita' di Bologna, Via B. Pichat 6/2, Bologna, 40127, Italy
- Enrico Giampieri
- Department of Physics and Astronomy, Universita' di Bologna, Via B. Pichat 6/2, Bologna, 40127, Italy
- Claudia Sala
- Department of Physics and Astronomy, Universita' di Bologna, Via B. Pichat 6/2, Bologna, 40127, Italy
- Gastone Castellani
- Department of Physics and Astronomy, Universita' di Bologna, Via B. Pichat 6/2, Bologna, 40127, Italy
- Luciano Milanesi
- Institute of Biomedical Technologies - CNR, Via Fratelli Cervi 93, Segrate MI, 20090, Italy
33
Systematic analysis of somatic mutations impacting gene expression in 12 tumour types. Nat Commun 2015; 6:8554. [PMID: 26436532 PMCID: PMC4600750 DOI: 10.1038/ncomms9554] [Citation(s) in RCA: 83] [Impact Index Per Article: 8.3] [Received: 11/07/2014] [Accepted: 09/04/2015] [Indexed: 12/27/2022]
Abstract
We present a novel hierarchical Bayes statistical model, xseq, to systematically quantify the impact of somatic mutations on expression profiles. We establish the theoretical framework and robust inference characteristics of the method using computational benchmarking. We then use xseq to analyse thousands of tumour data sets available through The Cancer Genome Atlas, to systematically quantify somatic mutations impacting expression profiles. We identify 30 novel cis-effect tumour suppressor gene candidates, enriched in loss-of-function mutations and biallelic inactivation. Analysis of trans-effects of mutations and copy number alterations with xseq identifies mutations in 150 genes impacting expression networks, with 89 novel predictions. We reveal two important novel characteristics of mutation impact on expression: (1) patients harbouring known driver mutations exhibit different downstream gene expression consequences; (2) expression patterns for some mutations are stable across tumour types. These results have critical implications for identification and interpretation of mutations with consequent impact on transcription in cancer. Assessing functional impact of mutations in cancer on gene expression can improve our understanding of cancer biology and may identify potential therapeutic targets. Here, Ding et al. describe a novel statistical model named xseq for a systematic survey of how mutations impact transcriptome landscapes across 12 different tumour types.
34
Haakensen VD, Steinfeld I, Saldova R, Shehni AA, Kifer I, Naume B, Rudd PM, Børresen-Dale AL, Yakhini Z. Serum N-glycan analysis in breast cancer patients--Relation to tumour biology and clinical outcome. Mol Oncol 2015; 10:59-72. [PMID: 26321095 DOI: 10.1016/j.molonc.2015.08.002] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.2] [Received: 12/19/2014] [Revised: 08/02/2015] [Accepted: 08/03/2015] [Indexed: 12/13/2022]
Abstract
Glycosylation and related processes play important roles in cancer development and progression, including metastasis. Several studies have shown that N-glycans have potential diagnostic value as cancer serum biomarkers. We have explored the significance of the abundance of particular serum N-glycan structures as important features of breast tumour biology by studying the serum glycome and tumour transcriptome (mRNA and miRNA) of 104 breast cancer patients. Integration of these types of molecular data allows us to study the relationship between serum glycans and transcripts representing functional pathways, such as metabolic pathways or DNA damage response. We identified triantennary trigalactosylated trisialylated glycans in serum as being associated with lower levels of tumour transcripts involved in focal adhesion and integrin-mediated cell adhesion. These glycan structures were also linked to poor prognosis in patients with ER-negative tumours. High abundance of simple monoantennary glycan structures was associated with increased survival, particularly in the basal-like subgroup. The presence of circulating tumour cells was found to be significantly associated with several serum glycome structures, such as bi- and triantennary, di- and trigalactosylated, and di- and trisialylated glycans. The link between tumour miRNA expression levels and N-glycan production is also examined.
Affiliation(s)
- Vilde D Haakensen
- Department of Genetics, Institute for Cancer Research, Oslo University Hospital, The Norwegian Radium Hospital, Oslo, Norway; The K.G. Jebsen Center for Breast Cancer Research, Institute for Clinical Medicine, Faculty of Medicine, University of Oslo, Oslo, Norway
- Israel Steinfeld
- Department of Computer Science, Technion, Haifa, Israel; Agilent Laboratories, Agilent Technologies, Tel-Aviv, Israel
- Radka Saldova
- NIBRT GlycoScience Group, National Institute for Bioprocessing Research and Training, Fosters Avenue, Mount Merrion, Blackrock, Dublin 4, Ireland
- Akram Asadi Shehni
- NIBRT GlycoScience Group, National Institute for Bioprocessing Research and Training, Fosters Avenue, Mount Merrion, Blackrock, Dublin 4, Ireland
- Ilona Kifer
- Agilent Laboratories, Agilent Technologies, Tel-Aviv, Israel
- Bjørn Naume
- Department of Oncology, Oslo University Hospital, The Norwegian Radium Hospital, Oslo, Norway
- Pauline M Rudd
- NIBRT GlycoScience Group, National Institute for Bioprocessing Research and Training, Fosters Avenue, Mount Merrion, Blackrock, Dublin 4, Ireland
- Anne-Lise Børresen-Dale
- Department of Genetics, Institute for Cancer Research, Oslo University Hospital, The Norwegian Radium Hospital, Oslo, Norway; The K.G. Jebsen Center for Breast Cancer Research, Institute for Clinical Medicine, Faculty of Medicine, University of Oslo, Oslo, Norway
- Zohar Yakhini
- Department of Computer Science, Technion, Haifa, Israel; Agilent Laboratories, Agilent Technologies, Tel-Aviv, Israel
35
Modelska A, Quattrone A, Re A. Molecular portraits: the evolution of the concept of transcriptome-based cancer signatures. Brief Bioinform 2015; 16:1000-7. [PMID: 25832647 PMCID: PMC4652618 DOI: 10.1093/bib/bbv013] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.8] [Received: 12/18/2014] [Indexed: 12/13/2022]
Abstract
Cancer results from dysregulation of multiple steps of gene expression programs. We review how transcriptome profiling has been widely explored for cancer classification and biomarker discovery but resulted in limited clinical impact. Therefore, we discuss alternative and complementary omics approaches.
36
Long non-coding RNAs differentially expressed between normal versus primary breast tumor tissues disclose converse changes to breast cancer-related protein-coding genes. PLoS One 2014; 9:e106076. [PMID: 25264628 PMCID: PMC4180073 DOI: 10.1371/journal.pone.0106076] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.6] [Received: 10/11/2013] [Accepted: 07/29/2014] [Indexed: 12/04/2022]
Abstract
Breast cancer, the second leading cause of cancer death in women, is a highly heterogeneous disease, characterized by distinct genomic and transcriptomic profiles. Transcriptome analyses prevalently assessed protein-coding genes; however, the majority of the mammalian genome is expressed in numerous non-coding transcripts. Emerging evidence supports that many of these non-coding RNAs are specifically expressed during development, tumorigenesis, and metastasis. The focus of this study was to investigate the expression features and molecular characteristics of long non-coding RNAs (lncRNAs) in breast cancer. We investigated 26 breast tumor and 5 normal tissue samples utilizing a custom expression microarray containing probes for mRNAs as well as novel and previously identified lncRNAs. We identified more than 19,000 unique regions significantly differentially expressed between normal and breast tumor tissue; half of these regions were non-coding without any evidence for functional open reading frames or sequence similarity to known proteins. The identified non-coding regions were primarily located in introns (53%) or in the intergenic space (33%), frequently orientated in antisense direction to protein-coding genes (14%), and commonly distributed at promoter, transcription factor binding, or enhancer sites. Analyzing the most diverse mRNA breast cancer subtypes, Basal-like versus Luminal A and B, resulted in 3,025 significantly differentially expressed unique loci, including 682 (23%) for non-coding transcripts. A notable number of differentially expressed protein-coding genes displayed non-synonymous expression changes compared to their nearest differentially expressed lncRNA, including an antisense lncRNA strongly anticorrelated to the mRNA coding for histone deacetylase 3 (HDAC3), which was investigated in more detail. Previously identified chromatin-associated lncRNAs (CARs) were predominantly downregulated in breast tumor samples, including CARs located in the protein-coding genes for CALD1, FTX, and HNRNPH1. In conclusion, a number of differentially expressed lncRNAs have been identified with relation to cancer-related protein-coding genes.
37
Tumor protein D52 (TPD52) and cancer-oncogene understudy or understudied oncogene? Tumour Biol 2014; 35:7369-82. [PMID: 24798974 DOI: 10.1007/s13277-014-2006-x] [Citation(s) in RCA: 37] [Impact Index Per Article: 3.4] [Received: 02/18/2014] [Accepted: 04/22/2014] [Indexed: 12/16/2022]
Abstract
The Tumor protein D52 (TPD52) gene was identified nearly 20 years ago through its overexpression in human cancer, and a substantial body of data now strongly supports TPD52 representing a gene amplification target at chromosome 8q21.13. This review updates progress toward understanding the significance of TPD52 overexpression and targeting, both in tumors known to be characterized by TPD52 overexpression/amplification, and those where TPD52 overexpression/amplification has been recently or variably reported. We highlight recent findings supporting microRNA regulation of TPD52 expression in experimental systems and describe progress toward deciphering TPD52's cellular functions, particularly in cancer cells. Finally, we provide an overview of TPD52's potential as a cancer biomarker and immunotherapeutic target. These combined studies highlight the potential value of genes such as TPD52, which are overexpressed in many cancer types, but have been relatively understudied.
|
38
|
Kristensen VN, Lingjærde OC, Russnes HG, Vollan HKM, Frigessi A, Børresen-Dale AL. Principles and methods of integrative genomic analyses in cancer. Nat Rev Cancer 2014; 14:299-313. [PMID: 24759209 DOI: 10.1038/nrc3721] [Citation(s) in RCA: 246] [Impact Index Per Article: 22.4] [Indexed: 12/15/2022]
Abstract
Combined analyses of molecular data, such as DNA copy-number alteration, mRNA and protein expression, point to biological functions and molecular pathways being deregulated in multiple cancers. Genomic, metabolomic and clinical data from various solid cancers and model systems are emerging and can be used to identify novel patient subgroups for tailored therapy and monitoring. The integrative genomics methodologies that are used to interpret these data require expertise in different disciplines, such as biology, medicine, mathematics, statistics and bioinformatics, and they can seem daunting. The objectives, methods and computational tools of integrative genomics that are available to date are reviewed here, as is their implementation in cancer research.
Affiliation(s)
- Vessela N Kristensen
  [1] Department of Genetics, Institute for Cancer Research, Oslo University Hospital, The Norwegian Radium Hospital, Montebello, 0310 Oslo, Norway. [2] K.G. Jebsen Centre for Breast Cancer Research, Institute for Clinical Medicine, Faculty of Medicine, University of Oslo, 0313 Oslo, Norway. [3] Department of Clinical Molecular Oncology, Division of Medicine, Akershus University Hospital, 1478 Ahus, Norway
- Ole Christian Lingjærde
  [1] K.G. Jebsen Centre for Breast Cancer Research, Institute for Clinical Medicine, Faculty of Medicine, University of Oslo, 0313 Oslo, Norway. [2] Division for Biomedical Informatics, Department of Computer Science, University of Oslo, 0316 Oslo, Norway
- Hege G Russnes
  [1] Department of Genetics, Institute for Cancer Research, Oslo University Hospital, The Norwegian Radium Hospital, Montebello, 0310 Oslo, Norway. [2] K.G. Jebsen Centre for Breast Cancer Research, Institute for Clinical Medicine, Faculty of Medicine, University of Oslo, 0313 Oslo, Norway. [3] Department of Pathology, Oslo University Hospital, 0450 Oslo, Norway
- Hans Kristian M Vollan
  [1] Department of Genetics, Institute for Cancer Research, Oslo University Hospital, The Norwegian Radium Hospital, Montebello, 0310 Oslo, Norway. [2] K.G. Jebsen Centre for Breast Cancer Research, Institute for Clinical Medicine, Faculty of Medicine, University of Oslo, 0313 Oslo, Norway. [3] Department of Oncology, Division of Cancer, Surgery and Transplantation, Oslo University Hospital, 0450 Oslo, Norway
- Arnoldo Frigessi
  [1] Statistics for Innovation, Norwegian Computing Center, 0314 Oslo, Norway. [2] Department of Biostatistics, Institute of Basic Medical Sciences, University of Oslo, PO Box 1122 Blindern, 0317 Oslo, Norway
- Anne-Lise Børresen-Dale
  [1] Department of Genetics, Institute for Cancer Research, Oslo University Hospital, The Norwegian Radium Hospital, Montebello, 0310 Oslo, Norway. [2] K.G. Jebsen Centre for Breast Cancer Research, Institute for Clinical Medicine, Faculty of Medicine, University of Oslo, 0313 Oslo, Norway
|
39
|
Leibovich L, Yakhini Z. Mutual enrichment in ranked lists and the statistical assessment of position weight matrix motifs. Algorithms Mol Biol 2014; 9:11. [PMID: 24708618 PMCID: PMC4021615 DOI: 10.1186/1748-7188-9-11] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Received: 12/01/2013] [Accepted: 03/30/2014] [Indexed: 11/18/2022]
Abstract
Background: Statistics in ranked lists is useful in analysing molecular biology measurement data, such as differential expression, resulting in ranked lists of genes, or ChIP-Seq, which yields ranked lists of genomic sequences. State-of-the-art methods study fixed motifs in ranked lists of sequences. More flexible models such as position weight matrix (PWM) motifs are more challenging in this context, partially because it is not clear how to avoid the use of arbitrary thresholds. Results: To assess the enrichment of a PWM motif in a ranked list we use a second ranking on the same set of elements induced by the PWM. Possible orders of one ranked list relative to another can be modelled as permutations. Due to sample space complexity, it is difficult to accurately characterize tail distributions in the group of permutations. In this paper we develop tight upper bounds on tail distributions of the size of the intersection of the top parts of two uniformly and independently drawn permutations. We further demonstrate advantages of this approach using our software implementation, mmHG-Finder, which is publicly available, to study PWM motifs in several datasets. In addition to validating known motifs, we found GC-rich strings to be enriched amongst the promoter sequences of long non-coding RNAs that are specifically expressed in thyroid and prostate tissue samples and observed a statistical association with tissue specific CpG hypo-methylation. Conclusions: We develop tight bounds that can be calculated in polynomial time. We demonstrate utility of mutual enrichment in motif search and assess performance for synthetic and biological datasets. We suggest that thyroid and prostate-specific long non-coding RNAs are regulated by transcription factors that bind GC-rich sequences, such as EGR1, SP1 and E2F3. We further suggest that this regulation is associated with DNA hypo-methylation.
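To make the quantity in this abstract concrete, the following is a minimal illustrative sketch, not the mmHG-Finder implementation. It assumes the standard fact that, for fixed cutoffs n1 and n2, the size of the intersection of the top n1 elements of one uniformly random permutation with the top n2 elements of another follows a hypergeometric distribution. The function names `hypergeom_tail` and `min_tail_over_cutoffs` are hypothetical. The returned minimum is a raw score, not a p-value: correcting for the search over all cutoff pairs is what the paper's upper bounds address, and those bounds are not reproduced here.

```python
from bisect import bisect_left, insort
from math import comb


def hypergeom_tail(N, n1, n2, b):
    """P(X >= b), where X is the overlap between the top n1 items of one
    ranking and the top n2 items of an independent, uniformly random
    ranking of the same N items (hypergeometric tail)."""
    total = comb(N, n2)
    hits = sum(comb(n1, k) * comb(N - n1, n2 - k)
               for k in range(b, min(n1, n2) + 1))
    return hits / total


def min_tail_over_cutoffs(rank1, rank2):
    """Scan every cutoff pair (n1, n2) and return the smallest
    hypergeometric tail together with the cutoffs that achieve it.
    Brute force, intended only for short lists."""
    N = len(rank1)
    pos2 = {item: i for i, item in enumerate(rank2)}
    best_p, best_n1, best_n2 = 1.0, 0, 0
    seen = []  # sorted rank2 positions of the top-n1 items of rank1
    for n1 in range(1, N + 1):
        insort(seen, pos2[rank1[n1 - 1]])
        for n2 in range(1, N + 1):
            # Observed overlap: top-n1 items of rank1 that also fall
            # within the top n2 of rank2.
            b = bisect_left(seen, n2)
            p = hypergeom_tail(N, n1, n2, b)
            if p < best_p:
                best_p, best_n1, best_n2 = p, n1, n2
    return best_p, best_n1, best_n2
```

For example, calling `min_tail_over_cutoffs(genes_by_expression, genes_by_pwm_score)` on two orderings of the same gene set (both variable names are placeholders) returns the most extreme overlap found and the cutoffs at which it occurs, without requiring a threshold to be fixed in advance.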
|