1
|
Scala G, Ferraro L, Brandi A, Guo Y, Majello B, Ceccarelli M. MoNETA: MultiOmics Network Embedding for SubType Analysis. NAR Genom Bioinform 2024; 6:lqae141. [PMID: 39416887 PMCID: PMC11482636 DOI: 10.1093/nargab/lqae141] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/25/2024] [Revised: 07/19/2024] [Accepted: 10/04/2024] [Indexed: 10/19/2024] Open
Abstract
Cells are complex systems whose behavior emerges from a huge number of reactions taking place within and among different molecular districts. The availability of bulk and single-cell omics data fueled the creation of multi-omics systems biology models capturing the dynamics within and between omics layers. Powerful modeling strategies are needed to cope with the increased amount of data to be interrogated and the relative research questions. Here, we present MultiOmics Network Embedding for SubType Analysis (MoNETA) for fast and scalable identification of relevant multi-omics relationships between biological entities at the bulk and single-cells level. We apply MoNETA to show how glioma subtypes previously described naturally emerge with our approach. We also show how MoNETA can be used to identify cell types in five multi-omic single-cell datasets.
Collapse
Affiliation(s)
- Giovanni Scala
- Department of Biology, University of Naples ‘Federico II’, 80128 Naples, Italy
| | - Luigi Ferraro
- Sylvester Comprehensive Cancer Center, University of Miami, 33136, Miami, USA
| | - Aurora Brandi
- Department of Biology, University of Naples ‘Federico II’, 80128 Naples, Italy
| | - Yan Guo
- Sylvester Comprehensive Cancer Center, University of Miami, 33136, Miami, USA
| | - Barbara Majello
- Department of Biology, University of Naples ‘Federico II’, 80128 Naples, Italy
| | - Michele Ceccarelli
- Sylvester Comprehensive Cancer Center, University of Miami, 33136, Miami, USA
| |
Collapse
|
2
|
Das S, Srivastava DK. ioSearch: An approach for identifying interacting multiomics biomarkers using a novel algorithm with application on breast cancer data sets. Genet Epidemiol 2023; 47:600-616. [PMID: 37795815 DOI: 10.1002/gepi.22536] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/20/2023] [Revised: 08/04/2023] [Accepted: 09/18/2023] [Indexed: 10/06/2023]
Abstract
Identification of biomarkers by integrating multiple omics together is important because complex diseases occur due to an intricate interplay of various genetic materials. Traditional single-omics association tests neither explore this crucial interomics dependence nor identify moderately weak signals due to the multiple-testing burden. Conversely, multiomics data integration imparts complementary information but suffers from an increased multiple-testing burden, data diversity inherent with different omics features, high-dimensionality, and so forth. Most of the available methods address subtype classification using dimension-reduction techniques to circumvent the sample size issue but interacting multiomics biomarker identification methods are unavailable. We propose a two-step model that first investigates phenotype-omics association using logistic regression. Then, selects disease-associated omics using sparse principal components which explores the interrelationship of multiple variables from two omics in a multivariate multiple regression framework. On the basis of this model, we developed a multiomics biomarker identification algorithm, interacting omics search (ioSearch), that jointly tests the effect of multiple omics with disease and between-omics associations by using pathway information that subsequently reduces the multiple-testing burden. Further, inference in terms of p values potentially makes it an easily interpretable biomarker identification tool. Extensive simulation demonstrates ioSearch as statistically powerful with a controlled Type-I error rate. Its application to publicly available breast cancer data sets identified relevant omics features in important pathways.
Collapse
Affiliation(s)
- Sarmistha Das
- Department of Biostatistics, St. Jude Children's Research Hospital, Memphis, Tennessee, USA
| | - Deo Kumar Srivastava
- Department of Biostatistics, St. Jude Children's Research Hospital, Memphis, Tennessee, USA
| |
Collapse
|
3
|
Unlu Yazici M, Marron JS, Bakir-Gungor B, Zou F, Yousef M. Invention of 3Mint for feature grouping and scoring in multi-omics. Front Genet 2023; 14:1093326. [PMID: 37007972 PMCID: PMC10050723 DOI: 10.3389/fgene.2023.1093326] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/08/2022] [Accepted: 02/27/2023] [Indexed: 03/17/2023] Open
Abstract
Advanced genomic and molecular profiling technologies accelerated the enlightenment of the regulatory mechanisms behind cancer development and progression, and the targeted therapies in patients. Along this line, intense studies with immense amounts of biological information have boosted the discovery of molecular biomarkers. Cancer is one of the leading causes of death around the world in recent years. Elucidation of genomic and epigenetic factors in Breast Cancer (BRCA) can provide a roadmap to uncover the disease mechanisms. Accordingly, unraveling the possible systematic connections between-omics data types and their contribution to BRCA tumor progression is crucial. In this study, we have developed a novel machine learning (ML) based integrative approach for multi-omics data analysis. This integrative approach combines information from gene expression (mRNA), microRNA (miRNA) and methylation data. Due to the complexity of cancer, this integrated data is expected to improve the prediction, diagnosis and treatment of disease through patterns only available from the 3-way interactions between these 3-omics datasets. In addition, the proposed method bridges the interpretation gap between the disease mechanisms that drive onset and progression. Our fundamental contribution is the 3 Multi-omics integrative tool (3Mint). This tool aims to perform grouping and scoring of groups using biological knowledge. Another major goal is improved gene selection via detection of novel groups of cross-omics biomarkers. Performance of 3Mint is assessed using different metrics. Our computational performance evaluations showed that the 3Mint classifies the BRCA molecular subtypes with lower number of genes when compared to the miRcorrNet tool which uses miRNA and mRNA gene expression profiles in terms of similar performance metrics (95% Accuracy). The incorporation of methylation data in 3Mint yields a much more focused analysis. The 3Mint tool and all other supplementary files are available at https://github.com/malikyousef/3Mint/.
Collapse
Affiliation(s)
- Miray Unlu Yazici
- Department of Bioengineering, Abdullah Gül University, Kayseri, Türkiye
| | - J. S. Marron
- Department of Statistics and Operations Research, University of North Carolina, Chapel Hill, NC, United States
| | - Burcu Bakir-Gungor
- Department of Bioengineering, Abdullah Gül University, Kayseri, Türkiye
- Department of Computer Engineering, Abdullah Gul University, Kayseri, Türkiye
| | - Fei Zou
- Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC, United States
- Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, NC, United States
| | - Malik Yousef
- Department of Information Systems, Zefat Academic College, Zefat, Israel
- Galilee Digital Health Research Center, Zefat Academic College, Zefat, Israel
- *Correspondence: Malik Yousef,
| |
Collapse
|
4
|
Niranjan V, Uttarkar A, Kaul A, Varghese M. A Machine Learning-Based Approach Using Multi-omics Data to Predict Metabolic Pathways. Methods Mol Biol 2023; 2553:441-452. [PMID: 36227554 DOI: 10.1007/978-1-0716-2617-7_19] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 06/16/2023]
Abstract
The integrative method approaches are continuously evolving to provide accurate insights from the data that is received through experimentation on various biological systems. Multi-omics data can be integrated with predictive machine learning algorithms in order to provide results with high accuracy. This protocol chapter defines the steps required for the ML-multi-omics integration methods that are applied on biological datasets for its analysis and the visual interpretation of the results thus obtained.
Collapse
Affiliation(s)
- Vidya Niranjan
- Department of Biotechnology, R V College of Engineering, Mysuru Road, Kengeri, Bengaluru, India.
| | - Akshay Uttarkar
- Department of Biotechnology, R V College of Engineering, Mysuru Road, Kengeri, Bengaluru, India
| | - Aakaanksha Kaul
- Department of Biotechnology, R V College of Engineering, Mysuru Road, Kengeri, Bengaluru, India
| | - Maryanne Varghese
- Department of Biotechnology, R V College of Engineering, Mysuru Road, Kengeri, Bengaluru, India
| |
Collapse
|
5
|
Mokou M, Narayanasamy S, Stroggilos R, Balaur IA, Vlahou A, Mischak H, Frantzi M. A Drug Repurposing Pipeline Based on Bladder Cancer Integrated Proteotranscriptomics Signatures. Methods Mol Biol 2023; 2684:59-99. [PMID: 37410228 DOI: 10.1007/978-1-0716-3291-8_4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 07/07/2023]
Abstract
Delivering better care for patients with bladder cancer (BC) necessitates the development of novel therapeutic strategies that address both the high disease heterogeneity and the limitations of the current therapeutic modalities, such as drug low efficacy and patient resistance acquisition. Drug repurposing is a cost-effective strategy that targets the reuse of existing drugs for new therapeutic purposes. Such a strategy could open new avenues toward more effective BC treatment. BC patients' multi-omics signatures can be used to guide the investigation of existing drugs that show an effective therapeutic potential through drug repurposing. In this book chapter, we present an integrated multilayer approach that includes cross-omics analyses from publicly available transcriptomics and proteomics data derived from BC tissues and cell lines that were investigated for the development of disease-specific signatures. These signatures are subsequently used as input for a signature-based repurposing approach using the Connectivity Map (CMap) tool. We further explain the steps that may be followed to identify and select existing drugs of increased potential for repurposing in BC patients.
Collapse
Affiliation(s)
- Marika Mokou
- Department of Biomarker Research, Mosaiques Diagnostics, Hannover, Germany.
| | - Shaman Narayanasamy
- Luxembourg Centre for Systems Biomedicine (LCSB), University of Luxembourg, Esch-sur-Alzette, Luxembourg
| | - Rafael Stroggilos
- Systems Biology Center, Biomedical Research Foundation, Academy of Athens, Athens, Greece
| | - Irina-Afrodita Balaur
- Luxembourg Centre for Systems Biomedicine (LCSB), University of Luxembourg, Esch-sur-Alzette, Luxembourg
| | - Antonia Vlahou
- Systems Biology Center, Biomedical Research Foundation, Academy of Athens, Athens, Greece
| | - Harald Mischak
- Department of Biomarker Research, Mosaiques Diagnostics, Hannover, Germany
- Institute of Cardiovascular and Medical Sciences, University of Glasgow, Glasgow, UK
| | - Maria Frantzi
- Department of Biomarker Research, Mosaiques Diagnostics, Hannover, Germany
| |
Collapse
|
6
|
Ng HM, Jiang B, Wong KY. Penalized estimation of a class of single-index varying-coefficient models for integrative genomic analysis. Biom J 2023; 65:e2100139. [PMID: 35837982 DOI: 10.1002/bimj.202100139] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/07/2021] [Revised: 04/15/2022] [Accepted: 05/27/2022] [Indexed: 01/17/2023]
Abstract
Recent technological advances have made it possible to collect high-dimensional genomic data along with clinical data on a large number of subjects. In the studies of chronic diseases such as cancer, it is of great interest to integrate clinical and genomic data to build a comprehensive understanding of the disease mechanisms. Despite extensive studies on integrative analysis, it remains an ongoing challenge to model the interaction effects between clinical and genomic variables, due to high dimensionality of the data and heterogeneity across data types. In this paper, we propose an integrative approach that models interaction effects using a single-index varying-coefficient model, where the effects of genomic features can be modified by clinical variables. We propose a penalized approach for separate selection of main and interaction effects. Notably, the proposed methods can be applied to right-censored survival outcomes based on a Cox proportional hazards model. We demonstrate the advantages of the proposed methods through extensive simulation studies and provide applications to a motivating cancer genomic study.
Collapse
Affiliation(s)
- Hoi Min Ng
- Department of Applied Mathematics, The Hong Kong Polytechnic University, Hong Kong
| | - Binyan Jiang
- Department of Applied Mathematics, The Hong Kong Polytechnic University, Hong Kong
| | - Kin Yau Wong
- Department of Applied Mathematics, The Hong Kong Polytechnic University, Hong Kong
| |
Collapse
|
7
|
Li W, Shao C, Zhou H, Du H, Chen H, Wan H, He Y. Multi-omics research strategies in ischemic stroke: A multidimensional perspective. Ageing Res Rev 2022; 81:101730. [PMID: 36087702 DOI: 10.1016/j.arr.2022.101730] [Citation(s) in RCA: 19] [Impact Index Per Article: 9.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/19/2022] [Revised: 08/23/2022] [Accepted: 09/03/2022] [Indexed: 01/31/2023]
Abstract
Ischemic stroke (IS) is a multifactorial and heterogeneous neurological disorder with high rate of death and long-term impairment. Despite years of studies, there are still no stroke biomarkers for clinical practice, and the molecular mechanisms of stroke remain largely unclear. The high-throughput omics approach provides new avenues for discovering biomarkers of IS and explaining its pathological mechanisms. However, single-omics approaches only provide a limited understanding of the biological pathways of diseases. The integration of multiple omics data means the simultaneous analysis of thousands of genes, RNAs, proteins and metabolites, revealing networks of interactions between multiple molecular levels. Integrated analysis of multi-omics approaches will provide helpful insights into stroke pathogenesis, therapeutic target identification and biomarker discovery. Here, we consider advances in genomics, transcriptomics, proteomics and metabolomics and outline their use in discovering the biomarkers and pathological mechanisms of IS. We then delineate strategies for achieving integration at the multi-omics level and discuss how integrative omics and systems biology can contribute to our understanding and management of IS.
Collapse
Affiliation(s)
- Wentao Li
- School of Pharmaceutical Sciences, Zhejiang Chinese Medical University, Hangzhou 310053, China.
| | - Chongyu Shao
- School of Life Sciences, Zhejiang Chinese Medical University, Hangzhou 310053, China.
| | - Huifen Zhou
- School of Life Sciences, Zhejiang Chinese Medical University, Hangzhou 310053, China.
| | - Haixia Du
- School of Life Sciences, Zhejiang Chinese Medical University, Hangzhou 310053, China.
| | - Haiyang Chen
- School of Pharmaceutical Sciences, Zhejiang Chinese Medical University, Hangzhou 310053, China.
| | - Haitong Wan
- School of Life Sciences, Zhejiang Chinese Medical University, Hangzhou 310053, China.
| | - Yu He
- School of Pharmaceutical Sciences, Zhejiang Chinese Medical University, Hangzhou 310053, China.
| |
Collapse
|
8
|
Hesami M, Alizadeh M, Jones AMP, Torkamaneh D. Machine learning: its challenges and opportunities in plant system biology. Appl Microbiol Biotechnol 2022; 106:3507-3530. [PMID: 35575915 DOI: 10.1007/s00253-022-11963-6] [Citation(s) in RCA: 18] [Impact Index Per Article: 9.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/14/2022] [Revised: 03/14/2022] [Accepted: 05/07/2022] [Indexed: 12/25/2022]
Abstract
Sequencing technologies are evolving at a rapid pace, enabling the generation of massive amounts of data in multiple dimensions (e.g., genomics, epigenomics, transcriptomic, metabolomics, proteomics, and single-cell omics) in plants. To provide comprehensive insights into the complexity of plant biological systems, it is important to integrate different omics datasets. Although recent advances in computational analytical pipelines have enabled efficient and high-quality exploration and exploitation of single omics data, the integration of multidimensional, heterogenous, and large datasets (i.e., multi-omics) remains a challenge. In this regard, machine learning (ML) offers promising approaches to integrate large datasets and to recognize fine-grained patterns and relationships. Nevertheless, they require rigorous optimizations to process multi-omics-derived datasets. In this review, we discuss the main concepts of machine learning as well as the key challenges and solutions related to the big data derived from plant system biology. We also provide in-depth insight into the principles of data integration using ML, as well as challenges and opportunities in different contexts including multi-omics, single-cell omics, protein function, and protein-protein interaction. KEY POINTS: • The key challenges and solutions related to the big data derived from plant system biology have been highlighted. • Different methods of data integration have been discussed. • Challenges and opportunities of the application of machine learning in plant system biology have been highlighted and discussed.
Collapse
Affiliation(s)
- Mohsen Hesami
- Department of Plant Agriculture, University of Guelph, Guelph, ON, N1G 2W1, Canada
| | - Milad Alizadeh
- Department of Botany, University of British Columbia, Vancouver, BC, V6T 1Z4, Canada
| | | | - Davoud Torkamaneh
- Département de Phytologie, Université Laval, Québec City, QC, G1V 0A6, Canada. .,Institut de Biologie Intégrative Et Des Systèmes (IBIS), Université Laval, Québec City, QC, G1V 0A6, Canada.
| |
Collapse
|
9
|
Hwangbo S, Lee S, Lee S, Hwang H, Kim I, Park T. Kernel-based hierarchical structural component models for pathway analysis. Bioinformatics 2022; 38:3078-3086. [PMID: 35460238 DOI: 10.1093/bioinformatics/btac276] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/20/2021] [Revised: 04/08/2022] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION Pathway analyses have led to more insight into the underlying biological functions related to the phenotype of interest in various types of omics data. Pathway-based statistical approaches have been actively developed, but most of them do not consider correlations among pathways. Because it is well known that there are quite a few biomarkers that overlap between pathways, these approaches may provide misleading results. In addition, most pathway-based approaches tend to assume that biomarkers within a pathway have linear associations with the phenotype of interest, even though the relationships are more complex. RESULTS To model complex effects including nonlinear effects, we propose a new approach, Hierarchical structural CoMponent analysis using Kernel (HisCoM-Kernel). The proposed method models nonlinear associations between biomarkers and phenotype by extending the kernel machine regression and analyzes entire pathways simultaneously by using the biomarker-pathway hierarchical structure. HisCoM-Kernel is a flexible model that can be applied to various omics data. It was successfully applied to three omics datasets generated by different technologies. Our simulation studies showed that HisCoM-Kernel provided higher statistical power than other existing pathway-based methods in all datasets. The application of HisCoM-Kernel to three types of omics dataset showed its superior performance compared to existing methods in identifying more biologically meaningful pathways, including those reported in previous studies. AVAILABILITY AND IMPLEMENTATION Freely available at http://statgen.snu.ac.kr/software/HisCom-Kernel/. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Suhyun Hwangbo
- Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, 151-747, Korea.,Department of Genomic Medicine, Seoul National University Hospital, Seoul, 03080, Korea
| | - Sungyoung Lee
- Department of Genomic Medicine, Seoul National University Hospital, Seoul, 03080, Korea
| | - Seungyeoun Lee
- Department of Mathematics and Statistics, Sejong University, Sejong, 05006, Korea
| | - Heungsun Hwang
- Department of Psychology, McGill University, Montreal, QC, H3A 1B1, Canada
| | - Inyoung Kim
- Department of Statistics, Virginia Tech, Blacksburg, Virginia, 24060, U.S.A
| | - Taesung Park
- Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, 151-747, Korea.,Department of Statistics, Seoul National University, Seoul, 151-747, Korea
| |
Collapse
|
10
|
Taguchi YH, Turki T. Novel feature selection method via kernel tensor decomposition for improved multi-omics data analysis. BMC Med Genomics 2022; 15:37. [PMID: 35209912 PMCID: PMC8876179 DOI: 10.1186/s12920-022-01181-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/07/2021] [Accepted: 02/11/2022] [Indexed: 11/27/2022] Open
Abstract
Background Feature selection of multi-omics data analysis remains challenging owing to the size of omics datasets, comprising approximately \documentclass[12pt]{minimal}
\usepackage{amsmath}
\usepackage{wasysym}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsbsy}
\usepackage{mathrsfs}
\usepackage{upgreek}
\setlength{\oddsidemargin}{-69pt}
\begin{document}$$10^2$$\end{document}102–\documentclass[12pt]{minimal}
\usepackage{amsmath}
\usepackage{wasysym}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsbsy}
\usepackage{mathrsfs}
\usepackage{upgreek}
\setlength{\oddsidemargin}{-69pt}
\begin{document}$$10^5$$\end{document}105 features. In particular, appropriate methods to weight individual omics datasets are unclear, and the approach adopted has substantial consequences for feature selection. In this study, we extended a recently proposed kernel tensor decomposition (KTD)-based unsupervised feature extraction (FE) method to integrate multi-omics datasets obtained from common samples in a weight-free manner. Method KTD-based unsupervised FE was reformatted as the collection of kernelized tensors sharing common samples, which was applied to synthetic and real datasets. Results The proposed advanced KTD-based unsupervised FE method showed comparative performance to that of the previously proposed KTD method, as well as tensor decomposition-based unsupervised FE, but required reduced memory and central processing unit time. Moreover, this advanced KTD method, specifically designed for multi-omics analysis, attributes P values to features, which is rare for existing multi-omics–oriented methods. Conclusions The sample R code is available at https://github.com/tagtag/MultiR/.
Collapse
Affiliation(s)
- Y-H Taguchi
- Department of Physics, Chuo University, 1-13-27 Kasuga, Bunkyo-ku, Tokyo, 112-8551, Japan.
| | - Turki Turki
- Department of Computer Science, King Abdulaziz University, Jeddah, 21589, Saudi Arabia
| |
Collapse
|
11
|
Synergistic Effects of Different Levels of Genomic Data for the Staging of Lung Adenocarcinoma: An Illustrative Study. Genes (Basel) 2021; 12:genes12121872. [PMID: 34946821 PMCID: PMC8700916 DOI: 10.3390/genes12121872] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/26/2021] [Revised: 11/18/2021] [Accepted: 11/24/2021] [Indexed: 11/17/2022] Open
Abstract
Lung adenocarcinoma (LUAD) is a common and very lethal cancer. Accurate staging is a prerequisite for its effective diagnosis and treatment. Therefore, improving the accuracy of the stage prediction of LUAD patients is of great clinical relevance. Previous works have mainly focused on single genomic data information or a small number of different omics data types concurrently for generating predictive models. A few of them have considered multi-omics data from genome to proteome. We used a publicly available dataset to illustrate the potential of multi-omics data for stage prediction in LUAD. In particular, we investigated the roles of the specific omics data types in the prediction process. We used a self-developed method, Omics-MKL, for stage prediction that combines an existing feature ranking technique Minimum Redundancy and Maximum Relevance (mRMR), which avoids redundancy among the selected features, and multiple kernel learning (MKL), applying different kernels for different omics data types. Each of the considered omics data types individually provided useful prediction results. Moreover, using multi-omics data delivered notably better results than using single-omics data. Gene expression and methylation information seem to play vital roles in the staging of LUAD. The Omics-MKL method retained 70 features after the selection process. Of these, 21 (30%) were methylation features and 34 (48.57%) were gene expression features. Moreover, 18 (25.71%) of the selected features are known to be related to LUAD, and 29 (41.43%) to lung cancer in general. Using multi-omics data from genome to proteome for predicting the stage of LUAD seems promising because each omics data type may improve the accuracy of the predictions. Here, methylation and gene expression data may play particularly important roles.
Collapse
|
12
|
Carpenter CM, Zhang W, Gillenwater L, Severn C, Ghosh T, Bowler R, Kechris K, Ghosh D. PaIRKAT: A pathway integrated regression-based kernel association test with applications to metabolomics and COPD phenotypes. PLoS Comput Biol 2021; 17:e1008986. [PMID: 34679079 PMCID: PMC8565741 DOI: 10.1371/journal.pcbi.1008986] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/23/2021] [Revised: 11/03/2021] [Accepted: 10/13/2021] [Indexed: 02/02/2023] Open
Abstract
High-throughput data such as metabolomics, genomics, transcriptomics, and proteomics have become familiar data types within the "-omics" family. For this work, we focus on subsets that interact with one another and represent these "pathways" as graphs. Observed pathways often have disjoint components, i.e., nodes or sets of nodes (metabolites, etc.) not connected to any other within the pathway, which notably lessens testing power. In this paper we propose the Pathway Integrated Regression-based Kernel Association Test (PaIRKAT), a new kernel machine regression method for incorporating known pathway information into the semi-parametric kernel regression framework. This work extends previous kernel machine approaches. This paper also contributes an application of a graph kernel regularization method for overcoming disconnected pathways. By incorporating a regularized or "smoothed" graph into a score test, PaIRKAT can provide more powerful tests for associations between biological pathways and phenotypes of interest and will be helpful in identifying novel pathways for targeted clinical research. We evaluate this method through several simulation studies and an application to real metabolomics data from the COPDGene study. Our simulation studies illustrate the robustness of this method to incorrect and incomplete pathway knowledge, and the real data analysis shows meaningful improvements of testing power in pathways. PaIRKAT was developed for application to metabolomic pathway data, but the techniques are easily generalizable to other data sources with a graph-like structure.
Collapse
Affiliation(s)
- Charlie M. Carpenter
- Department of Biostatistics and Informatics, University of Colorado Denver, Anschutz Medical campus, Denver, Colorado, United States of America
| | - Weiming Zhang
- Syneos Health, Morrisville, North Carolina, United States of America
| | - Lucas Gillenwater
- Computational Bioscience Program, University of Colorado Denver, Anschutz medical campus, Denver, Colorado, United States of America
| | - Cameron Severn
- Department of Biostatistics and Informatics, University of Colorado Denver, Anschutz Medical campus, Denver, Colorado, United States of America
| | - Tusharkanti Ghosh
- Department of Biostatistics and Informatics, University of Colorado Denver, Anschutz Medical campus, Denver, Colorado, United States of America
| | - Russell Bowler
- Department of Medicine, National Jewish Health, Denver; University of Colorado Denver, Anschutz Medical Campus, Denver, Colorado, United States of America
| | - Katerina Kechris
- Department of Biostatistics and Informatics, University of Colorado Denver, Anschutz Medical campus, Denver, Colorado, United States of America
| | - Debashis Ghosh
- Department of Biostatistics and Informatics, University of Colorado Denver, Anschutz Medical campus, Denver, Colorado, United States of America
| |
Collapse
|
13
|
Liu JZ, Deng W, Lee J, Lin PID, Valeri L, Christiani DC, Bellinger DC, Wright RO, Mazumdar MM, Coull BA. A Cross-validated Ensemble Approach to Robust Hypothesis Testing of Continuous Nonlinear Interactions: Application to Nutrition-Environment Studies. J Am Stat Assoc 2021; 117:561-573. [PMID: 36310839 PMCID: PMC9611147 DOI: 10.1080/01621459.2021.1962889] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/21/2019] [Revised: 07/23/2021] [Accepted: 07/28/2021] [Indexed: 01/03/2023]
Abstract
Gene-environment and nutrition-environment studies often involve testing of high-dimensional interactions between two sets of variables, each having potentially complex nonlinear main effects on an outcome. Construction of a valid and powerful hypothesis test for such an interaction is challenging, due to the difficulty in constructing an efficient and unbiased estimator for the complex, nonlinear main effects. In this work we address this problem by proposing a Cross-validated Ensemble of Kernels (CVEK) that learns the space of appropriate functions for the main effects using a cross-validated ensemble approach. With a carefully chosen library of base kernels, CVEK flexibly estimates the form of the main-effect functions from the data, and encourages test power by guarding against over-fitting under the alternative. The method is motivated by a study on the interaction between metal exposures in utero and maternal nutrition on children's neurodevelopment in rural Bangladesh. The proposed tests identified evidence of an interaction between minerals and vitamins intake and arsenic and manganese exposures. Results suggest that the detrimental effects of these metals are most pronounced at low intake levels of the nutrients, suggesting nutritional interventions in pregnant women could mitigate the adverse impacts of in utero metal exposures on children's neurodevelopment.
Collapse
Affiliation(s)
- Jeremiah Zhe Liu
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA
| | - Wenying Deng
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA
| | - Jane Lee
- Department of Environmental Health, Harvard T.H. Chan School of Public Health, Boston, MA, USA
- Department of Neurology, Boston Children’s Hospital, Boston, MA, USA
| | - Pi-i Debby Lin
- Department of Environmental Health, Harvard T.H. Chan School of Public Health, Boston, MA, USA
| | - Linda Valeri
- Department of Biostatistics, Columbia Mailman School of Public Health, New York, New York, USA
| | - David C. Christiani
- Department of Environmental Health, Harvard T.H. Chan School of Public Health, Boston, MA, USA
- Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, USA
| | - David C. Bellinger
- Department of Environmental Health, Harvard T.H. Chan School of Public Health, Boston, MA, USA
- Department of Neurology, Boston Children’s Hospital, Boston, MA, USA
| | - Robert O. Wright
- Department of Environmental Medicine and Public Health, Icahn School of Medicine, New York, NY, USA
| | - Maitreyi M. Mazumdar
- Department of Environmental Health, Harvard T.H. Chan School of Public Health, Boston, MA, USA
- Department of Neurology, Boston Children’s Hospital, Boston, MA, USA
| | - Brent A. Coull
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA
- Department of Environmental Health, Harvard T.H. Chan School of Public Health, Boston, MA, USA
| |
Collapse
|
14
|
Liñares-Blanco J, Pazos A, Fernandez-Lozano C. Machine learning analysis of TCGA cancer data. PeerJ Comput Sci 2021; 7:e584. [PMID: 34322589 PMCID: PMC8293929 DOI: 10.7717/peerj-cs.584] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/03/2021] [Accepted: 05/17/2021] [Indexed: 06/13/2023]
Abstract
In recent years, machine learning (ML) researchers have changed their focus towards biological problems that are difficult to analyse with standard approaches. Large initiatives such as The Cancer Genome Atlas (TCGA) have allowed the use of omic data for the training of these algorithms. In order to study the state of the art, this review is provided to cover the main works that have used ML with TCGA data. Firstly, the principal discoveries made by the TCGA consortium are presented. Once these bases have been established, we begin with the main objective of this study, the identification and discussion of those works that have used the TCGA data for the training of different ML approaches. After a review of more than 100 different papers, it has been possible to make a classification according to following three pillars: the type of tumour, the type of algorithm and the predicted biological problem. One of the conclusions drawn in this work shows a high density of studies based on two major algorithms: Random Forest and Support Vector Machines. We also observe the rise in the use of deep artificial neural networks. It is worth emphasizing, the increase of integrative models of multi-omic data analysis. The different biological conditions are a consequence of molecular homeostasis, driven by both protein coding regions, regulatory elements and the surrounding environment. It is notable that a large number of works make use of genetic expression data, which has been found to be the preferred method by researchers when training the different models. The biological problems addressed have been classified into five types: prognosis prediction, tumour subtypes, microsatellite instability (MSI), immunological aspects and certain pathways of interest. A clear trend was detected in the prediction of these conditions according to the type of tumour. That is the reason for which a greater number of works have focused on the BRCA cohort, while specific works for survival, for example, were centred on the GBM cohort, due to its large number of events. Throughout this review, it will be possible to go in depth into the works and the methodologies used to study TCGA cancer data. Finally, it is intended that this work will serve as a basis for future research in this field of study.
Collapse
Affiliation(s)
- Jose Liñares-Blanco
- CITIC-Research Center of Information and Communication Technologies, University of A Coruna, A Coruña, Spain
- Department of Computer Science and Information Technologies, Faculty of Computer Science, University of A Coruna, A Coruña, Spain
| | - Alejandro Pazos
- CITIC-Research Center of Information and Communication Technologies, University of A Coruna, A Coruña, Spain
- Department of Computer Science and Information Technologies, Faculty of Computer Science, University of A Coruna, A Coruña, Spain
- Grupo de Redes de Neuronas Artificiales y Sistemas Adaptativos. Imagen Médica y Diagnóstico Radiológico (RNASA-IMEDIR). Complexo Hospitalario Universitario de A Coruña (CHUAC), SERGAS, Universidade da Coruña, Instituto de Investigación Biomédica de A Coruña (INIBIC), A Coruña, Spain
| | - Carlos Fernandez-Lozano
- CITIC-Research Center of Information and Communication Technologies, University of A Coruna, A Coruña, Spain
- Department of Computer Science and Information Technologies, Faculty of Computer Science, University of A Coruna, A Coruña, Spain
- Grupo de Redes de Neuronas Artificiales y Sistemas Adaptativos. Imagen Médica y Diagnóstico Radiológico (RNASA-IMEDIR). Complexo Hospitalario Universitario de A Coruña (CHUAC), SERGAS, Universidade da Coruña, Instituto de Investigación Biomédica de A Coruña (INIBIC), A Coruña, Spain
| |
Collapse
|
15
|
Reel PS, Reel S, Pearson E, Trucco E, Jefferson E. Using machine learning approaches for multi-omics data analysis: A review. Biotechnol Adv 2021; 49:107739. [PMID: 33794304 DOI: 10.1016/j.biotechadv.2021.107739] [Citation(s) in RCA: 284] [Impact Index Per Article: 94.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/15/2020] [Revised: 03/01/2021] [Accepted: 03/25/2021] [Indexed: 02/06/2023]
Abstract
With the development of modern high-throughput omic measurement platforms, it has become essential for biomedical studies to undertake an integrative (combined) approach to fully utilise these data to gain insights into biological systems. Data from various omics sources such as genetics, proteomics, and metabolomics can be integrated to unravel the intricate working of systems biology using machine learning-based predictive algorithms. Machine learning methods offer novel techniques to integrate and analyse the various omics data enabling the discovery of new biomarkers. These biomarkers have the potential to help in accurate disease prediction, patient stratification and delivery of precision medicine. This review paper explores different integrative machine learning methods which have been used to provide an in-depth understanding of biological systems during normal physiological functioning and in the presence of a disease. It provides insight and recommendations for interdisciplinary professionals who envisage employing machine learning skills in multi-omics studies.
Collapse
Affiliation(s)
- Parminder S Reel
- Division of Population Health and Genomics, School of Medicine, University of Dundee, Dundee, United Kingdom
| | - Smarti Reel
- Division of Population Health and Genomics, School of Medicine, University of Dundee, Dundee, United Kingdom
| | - Ewan Pearson
- Division of Population Health and Genomics, School of Medicine, University of Dundee, Dundee, United Kingdom
| | - Emanuele Trucco
- VAMPIRE project, Computing, School of Science and Engineering, University of Dundee, Dundee, United Kingdom
| | - Emily Jefferson
- Division of Population Health and Genomics, School of Medicine, University of Dundee, Dundee, United Kingdom.
| |
Collapse
|
16
|
Uzunangelov V, Wong CK, Stuart JM. Accurate cancer phenotype prediction with AKLIMATE, a stacked kernel learner integrating multimodal genomic data and pathway knowledge. PLoS Comput Biol 2021; 17:e1008878. [PMID: 33861732 PMCID: PMC8081343 DOI: 10.1371/journal.pcbi.1008878] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/23/2020] [Revised: 04/28/2021] [Accepted: 03/15/2021] [Indexed: 02/03/2023] Open
Abstract
Advancements in sequencing have led to the proliferation of multi-omic profiles of human cells under different conditions and perturbations. In addition, many databases have amassed information about pathways and gene "signatures"-patterns of gene expression associated with specific cellular and phenotypic contexts. An important current challenge in systems biology is to leverage such knowledge about gene coordination to maximize the predictive power and generalization of models applied to high-throughput datasets. However, few such integrative approaches exist that also provide interpretable results quantifying the importance of individual genes and pathways to model accuracy. We introduce AKLIMATE, a first kernel-based stacked learner that seamlessly incorporates multi-omics feature data with prior information in the form of pathways for either regression or classification tasks. AKLIMATE uses a novel multiple-kernel learning framework where individual kernels capture the prediction propensities recorded in random forests, each built from a specific pathway gene set that integrates all omics data for its member genes. AKLIMATE has comparable or improved performance relative to state-of-the-art methods on diverse phenotype learning tasks, including predicting microsatellite instability in endometrial and colorectal cancer, survival in breast cancer, and cell line response to gene knockdowns. We show how AKLIMATE is able to connect feature data across data platforms through their common pathways to identify examples of several known and novel contributors of cancer and synthetic lethality.
Collapse
Affiliation(s)
- Vladislav Uzunangelov
- Department of Biomolecular Engineering, University of California, Santa Cruz, California, United States of America
| | - Christopher K. Wong
- Department of Biomolecular Engineering, University of California, Santa Cruz, California, United States of America
| | - Joshua M. Stuart
- Department of Biomolecular Engineering, University of California, Santa Cruz, California, United States of America
- * E-mail:
| |
Collapse
|
17
|
Li Y, Ma L, Wu D, Chen G. Advances in bulk and single-cell multi-omics approaches for systems biology and precision medicine. Brief Bioinform 2021; 22:6189773. [PMID: 33778867 DOI: 10.1093/bib/bbab024] [Citation(s) in RCA: 29] [Impact Index Per Article: 9.7] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/11/2020] [Revised: 12/31/2020] [Accepted: 01/20/2021] [Indexed: 12/13/2022] Open
Abstract
Multi-omics allows the systematic understanding of the information flow across different omics layers, while single omics can mainly reflect one aspect of the biological system. The advancement of bulk and single-cell sequencing technologies and related computational methods for multi-omics largely facilitated the development of system biology and precision medicine. Single-cell approaches have the advantage of dissecting cellular dynamics and heterogeneity, whereas traditional bulk technologies are limited to individual/population-level investigation. In this review, we first summarize the technologies for producing bulk and single-cell multi-omics data. Then, we survey the computational approaches for integrative analysis of bulk and single-cell multimodal data, respectively. Moreover, the databases and data storage for multi-omics, as well as the tools for visualizing multimodal data are summarized. We also outline the integration between bulk and single-cell data, and discuss the applications of multi-omics in precision medicine. Finally, we present the challenges and perspectives for multi-omics development.
Collapse
Affiliation(s)
| | - Lu Ma
- China Normal University, China
| | | | | |
Collapse
|
18
|
Vlachavas EI, Bohn J, Ückert F, Nürnberg S. A Detailed Catalogue of Multi-Omics Methodologies for Identification of Putative Biomarkers and Causal Molecular Networks in Translational Cancer Research. Int J Mol Sci 2021; 22:2822. [PMID: 33802234 PMCID: PMC8000236 DOI: 10.3390/ijms22062822] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/31/2021] [Revised: 03/05/2021] [Accepted: 03/05/2021] [Indexed: 02/06/2023] Open
Abstract
Recent advances in sequencing and biotechnological methodologies have led to the generation of large volumes of molecular data of different omics layers, such as genomics, transcriptomics, proteomics and metabolomics. Integration of these data with clinical information provides new opportunities to discover how perturbations in biological processes lead to disease. Using data-driven approaches for the integration and interpretation of multi-omics data could stably identify links between structural and functional information and propose causal molecular networks with potential impact on cancer pathophysiology. This knowledge can then be used to improve disease diagnosis, prognosis, prevention, and therapy. This review will summarize and categorize the most current computational methodologies and tools for integration of distinct molecular layers in the context of translational cancer research and personalized therapy. Additionally, the bioinformatics tools Multi-Omics Factor Analysis (MOFA) and netDX will be tested using omics data from public cancer resources, to assess their overall robustness, provide reproducible workflows for gaining biological knowledge from multi-omics data, and to comprehensively understand the significantly perturbed biological entities in distinct cancer types. We show that the performed supervised and unsupervised analyses result in meaningful and novel findings.
Collapse
Affiliation(s)
- Efstathios Iason Vlachavas
- Medical Informatics for Translational Oncology, German Cancer Research Center (DKFZ), 69120 Heidelberg, Germany; (J.B.); (F.Ü.)
| | - Jonas Bohn
- Medical Informatics for Translational Oncology, German Cancer Research Center (DKFZ), 69120 Heidelberg, Germany; (J.B.); (F.Ü.)
| | - Frank Ückert
- Medical Informatics for Translational Oncology, German Cancer Research Center (DKFZ), 69120 Heidelberg, Germany; (J.B.); (F.Ü.)
- Applied Medical Informatics, University Hospital Hamburg-Eppendorf, 20251 Hamburg, Germany
| | - Sylvia Nürnberg
- Medical Informatics for Translational Oncology, German Cancer Research Center (DKFZ), 69120 Heidelberg, Germany; (J.B.); (F.Ü.)
- Applied Medical Informatics, University Hospital Hamburg-Eppendorf, 20251 Hamburg, Germany
| |
Collapse
|
19
|
Chang SM, Yang M, Lu W, Huang YJ, Huang Y, Hung H, Miecznikowski JC, Lu TP, Tzeng JY. Gene-Set Integrative Analysis of Multi-Omics Data Using Tensor-based Association Test. Bioinformatics 2021; 37:2259-2265. [PMID: 33674827 PMCID: PMC8388036 DOI: 10.1093/bioinformatics/btab125] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/15/2020] [Revised: 12/30/2020] [Accepted: 02/24/2021] [Indexed: 11/12/2022] Open
Abstract
MOTIVATION Facilitated by technological advances and the decrease in costs, it is feasible to gather subject data from several omics platforms. Each platform assesses different molecular events, and the challenge lies in efficiently analyzing these data to discover novel disease genes or mechanisms. A common strategy is to regress the outcomes on all omics variables in a gene set. However, this approach suffers from problems associated with high-dimensional inference. RESULTS We introduce a tensor-based framework for variable-wise inference in multi-omics analysis. By accounting for the matrix structure of an individual's multi-omics data, the proposed tensor methods incorporate the relationship among omics effects, reduce the number of parameters, and boost the modeling efficiency. We derive the variable-specific tensor test and enhance computational efficiency of tensor modeling. Using simulations and data applications on the Cancer Cell Line Encyclopedia (CCLE), we demonstrate our method performs favorably over baseline methods and will be useful for gaining biological insights in multi-omics analysis. AVAILABILITY AND IMPLEMENTATION R function and instruction are available from the authors' website: https://www4.stat.ncsu.edu/∼jytzeng/Software/TR.omics/TRinstruction.pdf. SUPPLEMENTARY INFORMATION Supplementary materials are available at Bioinformatics online.
Collapse
Affiliation(s)
- Sheng-Mao Chang
- Department of Statistics, National Cheng Kung University, Tainan, Taiwan
| | - Meng Yang
- Department of Statistics, North Carolina State University, Raleigh NC, 27695, USA
| | - Wenbin Lu
- Department of Statistics, North Carolina State University, Raleigh NC, 27695, USA
| | - Yu-Jyun Huang
- Institute of Epidemiology and Preventive Medicine, National Taiwan University, Taipei, Taiwan
| | - Yueyang Huang
- Bioinformatics Research Center, North Carolina State University, Raleigh NC, 27695, USA
| | - Hung Hung
- Institute of Epidemiology and Preventive Medicine, National Taiwan University, Taipei, Taiwan
| | | | - Tzu-Pin Lu
- Institute of Epidemiology and Preventive Medicine, National Taiwan University, Taipei, Taiwan
| | - Jung-Ying Tzeng
- Department of Statistics, National Cheng Kung University, Tainan, Taiwan.,Department of Statistics, North Carolina State University, Raleigh NC, 27695, USA.,Institute of Epidemiology and Preventive Medicine, National Taiwan University, Taipei, Taiwan.,Bioinformatics Research Center, North Carolina State University, Raleigh NC, 27695, USA
| |
Collapse
|
20
|
Wani N, Raza K. MKL-GRNI: A parallel multiple kernel learning approach for supervised inference of large-scale gene regulatory networks. PeerJ Comput Sci 2021; 7:e363. [PMID: 33817013 PMCID: PMC7924726 DOI: 10.7717/peerj-cs.363] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/19/2020] [Accepted: 12/29/2020] [Indexed: 06/12/2023]
Abstract
High throughput multi-omics data generation coupled with heterogeneous genomic data fusion are defining new ways to build computational inference models. These models are scalable and can support very large genome sizes with the added advantage of exploiting additional biological knowledge from the integration framework. However, the limitation with such an arrangement is the huge computational cost involved when learning from very large datasets in a sequential execution environment. To overcome this issue, we present a multiple kernel learning (MKL) based gene regulatory network (GRN) inference approach wherein multiple heterogeneous datasets are fused using MKL paradigm. We formulate the GRN learning problem as a supervised classification problem, whereby genes regulated by a specific transcription factor are separated from other non-regulated genes. A parallel execution architecture is devised to learn a large scale GRN by decomposing the initial classification problem into a number of subproblems that run as multiple processes on a multi-processor machine. We evaluate the approach in terms of increased speedup and inference potential using genomic data from Escherichia coli, Saccharomyces cerevisiae and Homo sapiens. The results thus obtained demonstrate that the proposed method exhibits better classification accuracy and enhanced speedup compared to other state-of-the-art methods while learning large scale GRNs from multiple and heterogeneous datasets.
Collapse
Affiliation(s)
- Nisar Wani
- Govt. Degree College Baramulla, Jammu & Kashmir, India
| | - Khalid Raza
- Department of Computer Science, Jamia Millia Islamia, New Delhi, India
| |
Collapse
|
21
|
He Z, Zhang J, Yuan X, Zhang Y. Integrating Somatic Mutations for Breast Cancer Survival Prediction Using Machine Learning Methods. Front Genet 2021; 11:632901. [PMID: 33537063 PMCID: PMC7848170 DOI: 10.3389/fgene.2020.632901] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/24/2020] [Accepted: 12/30/2020] [Indexed: 12/13/2022] Open
Abstract
Breast cancer is the most common malignancy in women, and because it has a high mortality rate, it is urgent to develop computational methods to increase the accuracy of breast cancer survival predictive models. Although multi-omics data such as gene expression have been extensively used in recent studies, the accurate prognosis of breast cancer remains a challenge. Somatic mutations are another important and promising data source for studying cancer development, and its effect on the prognosis of breast cancer remains to be further explored. Meanwhile, these omics datasets are high-dimensional and redundant. Therefore, we adopted multiple kernel learning (MKL) to efficiently integrate somatic mutation to currently molecular data including gene expression, copy number variation (CNV), methylation, and protein expression data for the prediction of breast cancer survival. Before integration, the maximum relevance minimum redundancy (mRMR) feature selection method was utilized to select features that present high relevance to survival and low redundancy among themselves for each type of data. The experimental results demonstrated that the proposed method achieved the most optimal performance and there was a remarkable improvement in the prediction performance when somatic mutations were included, indicating that somatic mutations are critical for improving breast cancer survival predictions. Moreover, mRMR was superior to other feature selection methods used in previous studies. Furthermore, MKL outperformed the other traditional classifiers in multi-omics data integration. Our analysis indicated that through employing promising omics data such as somatic mutations and harnessing the power of proper feature selection methods and effective integration frameworks, the breast cancer survival predictive accuracy can be further increased, thereby providing a more optimal clinical diagnosis and more effective treatment for breast cancer patients.
Collapse
Affiliation(s)
- Zongzhen He
- School of Computer Science and Technology, Xidian University, Xi'an, China
| | - Junying Zhang
- School of Computer Science and Technology, Xidian University, Xi'an, China
| | - Xiguo Yuan
- School of Computer Science and Technology, Xidian University, Xi'an, China
| | - Yuanyuan Zhang
- School of Information and Control Engineering, Qingdao University of Technology, Qingdao, China
| |
Collapse
|
22
|
A survey on single and multi omics data mining methods in cancer data classification. J Biomed Inform 2020; 107:103466. [DOI: 10.1016/j.jbi.2020.103466] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/11/2019] [Revised: 05/01/2020] [Accepted: 05/31/2020] [Indexed: 01/09/2023]
|
23
|
Carballal A, Fernandez-Lozano C, Rodriguez-Fernandez N, Santos I, Romero J. Comparison of Outlier-Tolerant Models for Measuring Visual Complexity. ENTROPY 2020; 22:e22040488. [PMID: 33286263 PMCID: PMC7516971 DOI: 10.3390/e22040488] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 02/20/2020] [Revised: 04/18/2020] [Accepted: 04/23/2020] [Indexed: 11/18/2022]
Abstract
Providing the visual complexity of an image in terms of impact or aesthetic preference can be of great applicability in areas such as psychology or marketing. To this end, certain areas such as Computer Vision have focused on identifying features and computational models that allow for satisfactory results. This paper studies the application of recent ML models using input images evaluated by humans and characterized by features related to visual complexity. According to the experiments carried out, it was confirmed that one of these methods, Correlation by Genetic Search (CGS), based on the search for minimum sets of features that maximize the correlation of the model with respect to the input data, predicted human ratings of image visual complexity better than any other model referenced to date in terms of correlation, RMSE or minimum number of features required by the model. In addition, the variability of these terms were studied eliminating images considered as outliers in previous studies, observing the robustness of the method when selecting the most important variables to make the prediction.
Collapse
Affiliation(s)
- Adrian Carballal
- CITIC-Research Center of Information and Communication Technologies, University of A Coruña, 15071 A Coruña, Spain; (C.F.-L.); (N.R.-F.); (I.S.); (J.R.)
- Department of Computer Science and Information Technologies, Faculty of Computer Science, University of A Coruña, Campus Elviña s/n, 15071 A Coruña, Spain
- Correspondence:
| | - Carlos Fernandez-Lozano
- CITIC-Research Center of Information and Communication Technologies, University of A Coruña, 15071 A Coruña, Spain; (C.F.-L.); (N.R.-F.); (I.S.); (J.R.)
- Department of Computer Science and Information Technologies, Faculty of Computer Science, University of A Coruña, Campus Elviña s/n, 15071 A Coruña, Spain
| | - Nereida Rodriguez-Fernandez
- CITIC-Research Center of Information and Communication Technologies, University of A Coruña, 15071 A Coruña, Spain; (C.F.-L.); (N.R.-F.); (I.S.); (J.R.)
- Department of Computer Science and Information Technologies, Faculty of Communication Science, University of A Coruña, Campus Elviña s/n, 15071 A Coruña, Spain
| | - Iria Santos
- CITIC-Research Center of Information and Communication Technologies, University of A Coruña, 15071 A Coruña, Spain; (C.F.-L.); (N.R.-F.); (I.S.); (J.R.)
- Department of Computer Science and Information Technologies, Faculty of Communication Science, University of A Coruña, Campus Elviña s/n, 15071 A Coruña, Spain
| | - Juan Romero
- CITIC-Research Center of Information and Communication Technologies, University of A Coruña, 15071 A Coruña, Spain; (C.F.-L.); (N.R.-F.); (I.S.); (J.R.)
- Department of Computer Science and Information Technologies, Faculty of Communication Science, University of A Coruña, Campus Elviña s/n, 15071 A Coruña, Spain
| |
Collapse
|
24
|
Mallik S, Zhao Z. Graph- and rule-based learning algorithms: a comprehensive review of their applications for cancer type classification and prognosis using genomic data. Brief Bioinform 2020; 21:368-394. [PMID: 30649169 PMCID: PMC7373185 DOI: 10.1093/bib/bby120] [Citation(s) in RCA: 23] [Impact Index Per Article: 5.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/27/2018] [Revised: 10/26/2018] [Accepted: 11/21/2018] [Indexed: 12/20/2022] Open
Abstract
Cancer is well recognized as a complex disease with dysregulated molecular networks or modules. Graph- and rule-based analytics have been applied extensively for cancer classification as well as prognosis using large genomic and other data over the past decade. This article provides a comprehensive review of various graph- and rule-based machine learning algorithms that have been applied to numerous genomics data to determine the cancer-specific gene modules, identify gene signature-based classifiers and carry out other related objectives of potential therapeutic value. This review focuses mainly on the methodological design and features of these algorithms to facilitate the application of these graph- and rule-based analytical approaches for cancer classification and prognosis. Based on the type of data integration, we divided all the algorithms into three categories: model-based integration, pre-processing integration and post-processing integration. Each category is further divided into four sub-categories (supervised, unsupervised, semi-supervised and survival-driven learning analyses) based on learning style. Therefore, a total of 11 categories of methods are summarized with their inputs, objectives and description, advantages and potential limitations. Next, we briefly demonstrate well-known and most recently developed algorithms for each sub-category along with salient information, such as data profiles, statistical or feature selection methods and outputs. Finally, we summarize the appropriate use and efficiency of all categories of graph- and rule mining-based learning methods when input data and specific objective are given. This review aims to help readers to select and use the appropriate algorithms for cancer classification and prognosis study.
Collapse
Affiliation(s)
- Saurav Mallik
- Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center, Houston
| | - Zhongming Zhao
- Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center, Houston
| |
Collapse
|
25
|
Singh NP, Vinod PK. Integrative analysis of DNA methylation and gene expression in papillary renal cell carcinoma. Mol Genet Genomics 2020; 295:807-824. [PMID: 32185457 DOI: 10.1007/s00438-020-01664-y] [Citation(s) in RCA: 23] [Impact Index Per Article: 5.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/01/2019] [Accepted: 03/03/2020] [Indexed: 12/18/2022]
Abstract
Patterns of DNA methylation are significantly altered in cancers. Interpreting the functional consequences of DNA methylation requires the integration of multiple forms of data. The recent advancement in the next-generation sequencing can help to decode this relationship and in biomarker discovery. In this study, we investigated the methylation patterns of papillary renal cell carcinoma (PRCC) and its relationship with the gene expression using The Cancer Genome Atlas (TCGA) multi-omics data. We found that the promoter and body of tumor suppressor genes, microRNAs and gene clusters and families, including cadherins, protocadherins, claudins and collagens, are hypermethylated in PRCC. Hypomethylated genes in PRCC are associated with the immune function. The gene expression of several novel candidate genes, including interleukin receptor IL17RE and immune checkpoint genes HHLA2, SIRPA and HAVCR2, shows a significant correlation with DNA methylation. We also developed machine learning models using features extracted from single and multi-omics data to distinguish early and late stages of PRCC. A comparative study of different feature selection algorithms, predictive models, data integration techniques and representations of methylation data was performed. Integration of both gene expression and DNA methylation features improved the performance of models in distinguishing tumor stages. In summary, our study identifies PRCC driver genes and proposes predictive models based on both DNA methylation and gene expression. These results on PRCC will aid in targeted experiments and provide a strategy to improve the classification accuracy of tumor stages.
Collapse
Affiliation(s)
- Noor Pratap Singh
- Center for Computational Natural Sciences and Bioinformatics, IIIT Hyderabad, Hyderabad, 500032, India
| | - P K Vinod
- Center for Computational Natural Sciences and Bioinformatics, IIIT Hyderabad, Hyderabad, 500032, India.
| |
Collapse
|
26
|
Subramanian I, Verma S, Kumar S, Jere A, Anamika K. Multi-omics Data Integration, Interpretation, and Its Application. Bioinform Biol Insights 2020; 14:1177932219899051. [PMID: 32076369 PMCID: PMC7003173 DOI: 10.1177/1177932219899051] [Citation(s) in RCA: 566] [Impact Index Per Article: 141.5] [Reference Citation Analysis] [Abstract] [Key Words] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/07/2019] [Accepted: 11/09/2019] [Indexed: 12/22/2022] Open
Abstract
To study complex biological processes holistically, it is imperative to take an integrative approach that combines multi-omics data to highlight the interrelationships of the involved biomolecules and their functions. With the advent of high-throughput techniques and availability of multi-omics data generated from a large set of samples, several promising tools and methods have been developed for data integration and interpretation. In this review, we collected the tools and methods that adopt integrative approach to analyze multiple omics data and summarized their ability to address applications such as disease subtyping, biomarker prediction, and deriving insights into the data. We provide the methodology, use-cases, and limitations of these tools; brief account of multi-omics data repositories and visualization portals; and challenges associated with multi-omics data integration.
Collapse
Affiliation(s)
| | | | | | - Abhay Jere
- Innovation Cell, Ministry of Human Resource Development, New Delhi, India
| | | |
Collapse
|
27
|
Sun S, Yu X, Sun F, Tang Y, Zhao J, Zeng T. Dynamically characterizing individual clinical change by the steady state of disease-associated pathway. BMC Bioinformatics 2019; 20:697. [PMID: 31874621 PMCID: PMC6929545 DOI: 10.1186/s12859-019-3271-x] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/23/2022] Open
Abstract
Background Along with the development of precision medicine, individual heterogeneity is attracting more and more attentions in clinical research and application. Although the biomolecular reaction seems to be some various when different individuals suffer a same disease (e.g. virus infection), the final pathogen outcomes of individuals always can be mainly described by two categories in clinics, i.e. symptomatic and asymptomatic. Thus, it is still a great challenge to characterize the individual specific intrinsic regulatory convergence during dynamic gene regulation and expression. Except for individual heterogeneity, the sampling time also increase the expression diversity, so that, the capture of similar steady biological state is a key to characterize individual dynamic biological processes. Results Assuming the similar biological functions (e.g. pathways) should be suitable to detect consistent functions rather than chaotic genes, we design and implement a new computational framework (ABP: Attractor analysis of Boolean network of Pathway). ABP aims to identify the dynamic phenotype associated pathways in a state-transition manner, using the network attractor to model and quantify the steady pathway states characterizing the final steady biological sate of individuals (e.g. normal or disease). By analyzing multiple temporal gene expression datasets of virus infections, ABP has shown its effectiveness on identifying key pathways associated with phenotype change; inferring the consensus functional cascade among key pathways; and grouping pathway activity states corresponding to disease states. Conclusions Collectively, ABP can detect key pathways and infer their consensus functional cascade during dynamical process (e.g. virus infection), and can also categorize individuals with disease state well, which is helpful for disease classification and prediction.
Collapse
Affiliation(s)
- Shaoyan Sun
- School of Mathematics and Statistics Science, Ludong University, Yantai, 264025, China.
| | - Xiangtian Yu
- Shanghai Jiao Tong University Affiliated Sixth People's Hospital, Shanghai, 200233, China.,Key Laboratory of Systems Biology, Institute of Biochemistry and Cell Biology, Chinese Academy Science, Shanghai, 200031, China
| | - Fengnan Sun
- Medical Laboratory, Yantaishan Hospital, Yantai, 264001, China
| | - Ying Tang
- Key Laboratory of Systems Biology, Institute of Biochemistry and Cell Biology, Chinese Academy Science, Shanghai, 200031, China
| | - Juan Zhao
- Key Laboratory of Systems Biology, Institute of Biochemistry and Cell Biology, Chinese Academy Science, Shanghai, 200031, China
| | - Tao Zeng
- Key Laboratory of Systems Biology, Institute of Biochemistry and Cell Biology, Chinese Academy Science, Shanghai, 200031, China. .,Shanghai Research Center for Brain Science and Brain-Inspired Intelligence, Shanghai, 201210, China.
| |
Collapse
|
28
|
Wilson CM, Li K, Yu X, Kuan PF, Wang X. Multiple-kernel learning for genomic data mining and prediction. BMC Bioinformatics 2019; 20:426. [PMID: 31416413 PMCID: PMC6694479 DOI: 10.1186/s12859-019-2992-1] [Citation(s) in RCA: 20] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/19/2019] [Accepted: 07/11/2019] [Indexed: 11/21/2022] Open
Abstract
BACKGROUND Advances in medical technology have allowed for customized prognosis, diagnosis, and treatment regimens that utilize multiple heterogeneous data sources. Multiple kernel learning (MKL) is well suited for the integration of multiple high throughput data sources. MKL remains to be under-utilized by genomic researchers partly due to the lack of unified guidelines for its use, and benchmark genomic datasets. RESULTS We provide three implementations of MKL in R. These methods are applied to simulated data to illustrate that MKL can select appropriate models. We also apply MKL to combine clinical information with miRNA gene expression data of ovarian cancer study into a single analysis. Lastly, we show that MKL can identify gene sets that are known to play a role in the prognostic prediction of 15 cancer types using gene expression data from The Cancer Genome Atlas, as well as, identify new gene sets for the future research. CONCLUSION Multiple kernel learning coupled with modern optimization techniques provides a promising learning tool for building predictive models based on multi-source genomic data. MKL also provides an automated scheme for kernel prioritization and parameter tuning. The methods used in the paper are implemented as an R package called RMKL package, which is freely available for download through CRAN at https://CRAN.R-project.org/package=RMKL .
Collapse
Affiliation(s)
- Christopher M. Wilson
- Department of Biostatistics and Bioinformatics at Moffitt Cancer Center, Tampa, FL USA
| | - Kaiqiao Li
- Department of Applied Mathematics and Statistics at Stony Brook University, Stony Brook, NY USA
| | - Xiaoqing Yu
- Department of Biostatistics and Bioinformatics at Moffitt Cancer Center, Tampa, FL USA
| | - Pei-Fen Kuan
- Department of Applied Mathematics and Statistics at Stony Brook University, Stony Brook, NY USA
| | - Xuefeng Wang
- Department of Biostatistics and Bioinformatics at Moffitt Cancer Center, Tampa, FL USA
| |
Collapse
|
29
|
Hornung R, Wright MN. Block Forests: random forests for blocks of clinical and omics covariate data. BMC Bioinformatics 2019; 20:358. [PMID: 31248362 PMCID: PMC6598279 DOI: 10.1186/s12859-019-2942-y] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/15/2019] [Accepted: 06/07/2019] [Indexed: 12/25/2022] Open
Abstract
Background In the last years more and more multi-omics data are becoming available, that is, data featuring measurements of several types of omics data for each patient. Using multi-omics data as covariate data in outcome prediction is both promising and challenging due to the complex structure of such data. Random forest is a prediction method known for its ability to render complex dependency patterns between the outcome and the covariates. Against this background we developed five candidate random forest variants tailored to multi-omics covariate data. These variants modify the split point selection of random forest to incorporate the block structure of multi-omics data and can be applied to any outcome type for which a random forest variant exists, such as categorical, continuous and survival outcomes. Using 20 publicly available multi-omics data sets with survival outcome we compared the prediction performances of the block forest variants with alternatives. We also considered the common special case of having clinical covariates and measurements of a single omics data type available. Results We identify one variant termed “block forest” that outperformed all other approaches in the comparison study. In particular, it performed significantly better than standard random survival forest (adjusted p-value: 0.027). The two best performing variants have in common that the block choice is randomized in the split point selection procedure. In the case of having clinical covariates and a single omics data type available, the improvements of the variants over random survival forest were larger than in the case of the multi-omics data. The degrees of improvements over random survival forest varied strongly across data sets. Moreover, considering all clinical covariates mandatorily improved the performance. This result should however be interpreted with caution, because the level of predictive information contained in clinical covariates depends on the specific application. Conclusions The new prediction method block forest for multi-omics data can significantly improve the prediction performance of random forest and outperformed alternatives in the comparison. Block forest is particularly effective for the special case of using clinical covariates in combination with measurements of a single omics data type. Electronic supplementary material The online version of this article (10.1186/s12859-019-2942-y) contains supplementary material, which is available to authorized users.
Collapse
Affiliation(s)
- Roman Hornung
- Institute for Medical Information Processing, Biometry and Epidemiology, University of Munich, Marchioninistr. 15, Munich, 81377, Germany.
| | - Marvin N Wright
- Leibniz Institute for Prevention Research and Epidemiology - BIPS, Achterstr. 30, Bremen, 28359, Germany.,Section of Biostatistics, Department of Public Health, University of Copenhagen, Øster Farimagsgade 5, Copenhagen, 1014, Denmark
| |
Collapse
|
30
|
Bravo-Merodio L, Williams JA, Gkoutos GV, Acharjee A. -Omics biomarker identification pipeline for translational medicine. J Transl Med 2019; 17:155. [PMID: 31088492 PMCID: PMC6518609 DOI: 10.1186/s12967-019-1912-5] [Citation(s) in RCA: 26] [Impact Index Per Article: 5.2] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/06/2018] [Accepted: 05/08/2019] [Indexed: 01/31/2023] Open
Abstract
BACKGROUND Translational medicine (TM) is an emerging domain that aims to facilitate medical or biological advances efficiently from the scientist to the clinician. Central to the TM vision is to narrow the gap between basic science and applied science in terms of time, cost and early diagnosis of the disease state. Biomarker identification is one of the main challenges within TM. The identification of disease biomarkers from -omics data will not only help the stratification of diverse patient cohorts but will also provide early diagnostic information which could improve patient management and potentially prevent adverse outcomes. However, biomarker identification needs to be robust and reproducible. Hence a robust unbiased computational framework that can help clinicians identify those biomarkers is necessary. METHODS We developed a pipeline (workflow) that includes two different supervised classification techniques based on regularization methods to identify biomarkers from -omics or other high dimension clinical datasets. The pipeline includes several important steps such as quality control and stability of selected biomarkers. The process takes input files (outcome and independent variables or -omics data) and pre-processes (normalization, missing values) them. After a random division of samples into training and test sets, Least Absolute Shrinkage and Selection Operator and Elastic Net feature selection methods are applied to identify the most important features representing potential biomarker candidates. The penalization parameters are optimised using 10-fold cross validation and the process undergoes 100 iterations and a combinatorial analysis to select the best performing multivariate model. An empirical unbiased assessment of their quality as biomarkers for clinical use is performed through a Receiver Operating Characteristic curve and its Area Under the Curve analysis on both permuted and real data for 1000 different randomized training and test sets. We validated this pipeline against previously published biomarkers. RESULTS We applied this pipeline to three different datasets with previously published biomarkers: lipidomics data by Acharjee et al. (Metabolomics 13:25, 2017) and transcriptomics data by Rajamani and Bhasin (Genome Med 8:38, 2016) and Mills et al. (Blood 114:1063-1072, 2009). Our results demonstrate that our method was able to identify both previously published biomarkers as well as new variables that add value to the published results. CONCLUSIONS We developed a robust pipeline to identify clinically relevant biomarkers that can be applied to different -omics datasets. Such identification reveals potentially novel drug targets and can be used as a part of a machine-learning based patient stratification framework in the translational medicine settings.
Collapse
Affiliation(s)
- Laura Bravo-Merodio
- College of Medical and Dental Sciences, Institute of Cancer and Genomic Sciences, Centre for Computational Biology, University of Birmingham, Birmingham, B15 2TT UK
- Institute of Translational Medicine, University Hospitals Birmingham NHS Foundation Trust, Birmingham, B15 2TT UK
| | - John A. Williams
- College of Medical and Dental Sciences, Institute of Cancer and Genomic Sciences, Centre for Computational Biology, University of Birmingham, Birmingham, B15 2TT UK
- Institute of Translational Medicine, University Hospitals Birmingham NHS Foundation Trust, Birmingham, B15 2TT UK
- Mammalian Genetics Unit, Medical Research Council Harwell Institute, Harwell Campus, Didcot, OX11 0RD UK
| | - Georgios V. Gkoutos
- College of Medical and Dental Sciences, Institute of Cancer and Genomic Sciences, Centre for Computational Biology, University of Birmingham, Birmingham, B15 2TT UK
- Institute of Translational Medicine, University Hospitals Birmingham NHS Foundation Trust, Birmingham, B15 2TT UK
- MRC Health Data Research UK (HDR UK), London, UK
- NIHR Experimental Cancer Medicine Centre, Birmingham, B15 2TT UK
- NIHR Surgical Reconstruction and Microbiology Research Centre, Birmingham, B15 2TT UK
- NIHR Biomedical Research Centre, Birmingham, B15 2TT UK
| | - Animesh Acharjee
- College of Medical and Dental Sciences, Institute of Cancer and Genomic Sciences, Centre for Computational Biology, University of Birmingham, Birmingham, B15 2TT UK
- Institute of Translational Medicine, University Hospitals Birmingham NHS Foundation Trust, Birmingham, B15 2TT UK
- NIHR Surgical Reconstruction and Microbiology Research Centre, Birmingham, B15 2TT UK
| |
Collapse
|
31
|
Aouiche C, Chen B, Shang X. Predicting stage-specific cancer related genes and their dynamic modules by integrating multiple datasets. BMC Bioinformatics 2019; 20:194. [PMID: 31074385 PMCID: PMC6509867 DOI: 10.1186/s12859-019-2740-6] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/15/2022] Open
Abstract
BACKGROUND The mechanism of many complex diseases has not been detected accurately in terms of their stage evolution. Previous studies mainly focus on the identification of associations between genes and individual diseases, but less is known about their associations with specific disease stages. Exploring biological modules through different disease stages could provide valuable knowledge to genomic and clinical research. RESULTS In this study, we proposed a powerful and versatile framework to identify stage-specific cancer related genes and their dynamic modules by integrating multiple datasets. The discovered modules and their specific-signature genes were significantly enriched in many relevant known pathways. To further illustrate the dynamic evolution of these clinical-stages, a pathway network was built by taking individual pathways as vertices and the overlapping relationship between their annotated genes as edges. CONCLUSIONS The identified pathway network not only help us to understand the functional evolution of complex diseases, but also useful for clinical management to select the optimum treatment regimens and the appropriate drugs for patients.
Collapse
Affiliation(s)
- Chaima Aouiche
- School of Computer Science, Northwestern Polytechnical University, Xi'an, 710072, China.,Key Laboratory of Big Data Storage and Management, Northwestern Polytechnical University Ministry of Industry and Information Technology, Xi'an, China
| | - Bolin Chen
- School of Computer Science, Northwestern Polytechnical University, Xi'an, 710072, China. .,Key Laboratory of Big Data Storage and Management, Northwestern Polytechnical University Ministry of Industry and Information Technology, Xi'an, China.
| | - Xuequn Shang
- School of Computer Science, Northwestern Polytechnical University, Xi'an, 710072, China.,Key Laboratory of Big Data Storage and Management, Northwestern Polytechnical University Ministry of Industry and Information Technology, Xi'an, China
| |
Collapse
|
32
|
López de Maturana E, Alonso L, Alarcón P, Martín-Antoniano IA, Pineda S, Piorno L, Calle ML, Malats N. Challenges in the Integration of Omics and Non-Omics Data. Genes (Basel) 2019; 10:genes10030238. [PMID: 30897838 PMCID: PMC6471713 DOI: 10.3390/genes10030238] [Citation(s) in RCA: 50] [Impact Index Per Article: 10.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/02/2019] [Revised: 03/05/2019] [Accepted: 03/14/2019] [Indexed: 11/16/2022] Open
Abstract
Omics data integration is already a reality. However, few omics-based algorithms show enough predictive ability to be implemented into clinics or public health domains. Clinical/epidemiological data tend to explain most of the variation of health-related traits, and its joint modeling with omics data is crucial to increase the algorithm’s predictive ability. Only a small number of published studies performed a “real” integration of omics and non-omics (OnO) data, mainly to predict cancer outcomes. Challenges in OnO data integration regard the nature and heterogeneity of non-omics data, the possibility of integrating large-scale non-omics data with high-throughput omics data, the relationship between OnO data (i.e., ascertainment bias), the presence of interactions, the fairness of the models, and the presence of subphenotypes. These challenges demand the development and application of new analysis strategies to integrate OnO data. In this contribution we discuss different attempts of OnO data integration in clinical and epidemiological studies. Most of the reviewed papers considered only one type of omics data set, mainly RNA expression data. All selected papers incorporated non-omics data in a low-dimensionality fashion. The integrative strategies used in the identified papers adopted three modeling methods: Independent, conditional, and joint modeling. This review presents, discusses, and proposes integrative analytical strategies towards OnO data integration.
Collapse
Affiliation(s)
- Evangelina López de Maturana
- Genetic and Molecular Epidemiology Group, Spanish National Cancer Research Centre (CNIO), and CIBERONC, Melchor Fernández Almagro 3, 28029 Madrid, Spain.
| | - Lola Alonso
- Genetic and Molecular Epidemiology Group, Spanish National Cancer Research Centre (CNIO), and CIBERONC, Melchor Fernández Almagro 3, 28029 Madrid, Spain.
| | - Pablo Alarcón
- Genetic and Molecular Epidemiology Group, Spanish National Cancer Research Centre (CNIO), and CIBERONC, Melchor Fernández Almagro 3, 28029 Madrid, Spain.
| | - Isabel Adoración Martín-Antoniano
- Genetic and Molecular Epidemiology Group, Spanish National Cancer Research Centre (CNIO), and CIBERONC, Melchor Fernández Almagro 3, 28029 Madrid, Spain.
| | - Silvia Pineda
- Genetic and Molecular Epidemiology Group, Spanish National Cancer Research Centre (CNIO), and CIBERONC, Melchor Fernández Almagro 3, 28029 Madrid, Spain.
| | - Lucas Piorno
- Genetic and Molecular Epidemiology Group, Spanish National Cancer Research Centre (CNIO), and CIBERONC, Melchor Fernández Almagro 3, 28029 Madrid, Spain.
| | - M Luz Calle
- Biosciences Department, University of Vic-Central University of Catalonia, Carrer de la Laura 13, 08570 Vic, Spain.
| | - Núria Malats
- Genetic and Molecular Epidemiology Group, Spanish National Cancer Research Centre (CNIO), and CIBERONC, Melchor Fernández Almagro 3, 28029 Madrid, Spain.
| |
Collapse
|
33
|
Shao B, Bjaanæs MM, Helland Å, Schütte C, Conrad T. EMT network-based feature selection improves prognosis prediction in lung adenocarcinoma. PLoS One 2019; 14:e0204186. [PMID: 30703089 PMCID: PMC6354965 DOI: 10.1371/journal.pone.0204186] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/30/2018] [Accepted: 12/25/2018] [Indexed: 12/16/2022] Open
Abstract
Various feature selection algorithms have been proposed to identify cancer prognostic biomarkers. In recent years, however, their reproducibility is criticized. The performance of feature selection algorithms is shown to be affected by the datasets, underlying networks and evaluation metrics. One of the causes is the curse of dimensionality, which makes it hard to select the features that generalize well on independent data. Even the integration of biological networks does not mitigate this issue because the networks are large and many of their components are not relevant for the phenotype of interest. With the availability of multi-omics data, integrative approaches are being developed to build more robust predictive models. In this scenario, the higher data dimensions create greater challenges. We proposed a phenotype relevant network-based feature selection (PRNFS) framework and demonstrated its advantages in lung cancer prognosis prediction. We constructed cancer prognosis relevant networks based on epithelial mesenchymal transition (EMT) and integrated them with different types of omics data for feature selection. With less than 2.5% of the total dimensionality, we obtained EMT prognostic signatures that achieved remarkable prediction performance (average AUC values >0.8), very significant sample stratifications, and meaningful biological interpretations. In addition to finding EMT signatures from different omics data levels, we combined these single-omics signatures into multi-omics signatures, which improved sample stratifications significantly. Both single- and multi-omics EMT signatures were tested on independent multi-omics lung cancer datasets and significant sample stratifications were obtained.
Collapse
Affiliation(s)
- Borong Shao
- Zuse Institute Berlin, Berlin, Germany
- Dept of mathematics and computer science, Freie Universität Berlin, Berlin, Germany
- * E-mail:
| | - Maria Moksnes Bjaanæs
- Dept of Oncology, Oslo University Hospital, Oslo, Norway
- Dept of Cancer Genetics, Oslo University Hospital, Oslo, Norway
- Dept of Clinical Medicine, University of Oslo, Oslo, Norway
| | - Åslaug Helland
- Dept of Oncology, Oslo University Hospital, Oslo, Norway
- Dept of Cancer Genetics, Oslo University Hospital, Oslo, Norway
- Dept of Clinical Medicine, University of Oslo, Oslo, Norway
| | - Christof Schütte
- Zuse Institute Berlin, Berlin, Germany
- Dept of mathematics and computer science, Freie Universität Berlin, Berlin, Germany
| | - Tim Conrad
- Zuse Institute Berlin, Berlin, Germany
- Dept of mathematics and computer science, Freie Universität Berlin, Berlin, Germany
| |
Collapse
|
34
|
Graim K, Friedl V, Houlahan KE, Stuart JM. PLATYPUS: A Multiple-View Learning Predictive Framework for Cancer Drug Sensitivity Prediction. PACIFIC SYMPOSIUM ON BIOCOMPUTING. PACIFIC SYMPOSIUM ON BIOCOMPUTING 2019; 24:136-147. [PMID: 30864317 PMCID: PMC6417802] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 11/04/2022]
Abstract
Cancer is a complex collection of diseases that are to some degree unique to each patient. Precision oncology aims to identify the best drug treatment regime using molecular data on tumor samples. While omics-level data is becoming more widely available for tumor specimens, the datasets upon which computational learning methods can be trained vary in coverage from sample to sample and from data type to data type. Methods that can 'connect the dots' to leverage more of the information provided by these studies could offer major advantages for maximizing predictive potential. We introduce a multi-view machinelearning strategy called PLATYPUS that builds 'views' from multiple data sources that are all used as features for predicting patient outcomes. We show that a learning strategy that finds agreement across the views on unlabeled data increases the performance of the learning methods over any single view. We illustrate the power of the approach by deriving signatures for drug sensitivity in a large cancer cell line database. Code and additional information are available from the PLATYPUS website https://sysbiowiki.soe.ucsc.edu/platypus.
Collapse
Affiliation(s)
| | - Verena Friedl
- Dept. of Biomolecular Engineering, University of California, Santa Cruz, CA 95064, USA
| | | | - Joshua M. Stuart
- Dept. of Biomolecular Engineering, University of California, Santa Cruz, CA 95064, USA
| |
Collapse
|
35
|
Yang H, Cao H, He T, Wang T, Cui Y. Multilevel heterogeneous omics data integration with kernel fusion. Brief Bioinform 2018; 21:156-170. [PMID: 30496340 DOI: 10.1093/bib/bby115] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/15/2018] [Revised: 10/25/2018] [Accepted: 10/26/2018] [Indexed: 01/26/2023] Open
Abstract
High-throughput omics data are generated almost with no limit nowadays. It becomes increasingly important to integrate different omics data types to disentangle the molecular machinery of complex diseases with the hope for better disease prevention and treatment. Since the relationship among different omics data features are typically unknown, a supervised learning model assuming a particular distribution with a specific structure will not serve the purpose to capture the underlying complex relationship between multiple features and a disease phenotype. In this work, we briefly reviewed methods for kernel fusion (KF) based on support vector machine and kernel partial least squares (KPLS) algorithms. We then proposed a fused KPLS (fKPLS) model for disease classification and prediction with multilevel omics data. The fused kernel can deal with effect heterogeneity in which different omic data types may have different effect contribution to the trait of interest, with the purpose to improve the prediction performance. We proposed to optimize the kernel parameters and kernel weights with the genetic algorithm (GA). The proposed GA-fKPLS model can substantially improve disease classification performance by integrating multiple omics data types, demonstrated via extensive simulations and real data analysis. With properly defined fitness functions during GA optimization, the proposed KF method can be extended to other kernel-based analyses such as in kernel association analysis with common or rare variants.
Collapse
Affiliation(s)
- Haitao Yang
- Department of Epidemiology and Health Statistics, School of Public Health, and Hebei Province Key Laboratory of Environment and Human Health, Hebei Medical University, Shijiazhuang, PR China
| | - Hongyan Cao
- Division of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan, PR China
| | - Tao He
- Department of Mathematics, San Francisco State University, San Francisco, CA, USA
| | - Tong Wang
- Division of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan, PR China
| | - Yuehua Cui
- Division of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan, PR China.,Department of Statistics and Probability, Michigan State University, East Lansing, MI, USA
| |
Collapse
|
36
|
Murphy K, Murphy BT, Boyce S, Flynn L, Gilgunn S, O'Rourke CJ, Rooney C, Stöckmann H, Walsh AL, Finn S, O'Kennedy RJ, O'Leary J, Pennington SR, Perry AS, Rudd PM, Saldova R, Sheils O, Shields DC, Watson RW. Integrating biomarkers across omic platforms: an approach to improve stratification of patients with indolent and aggressive prostate cancer. Mol Oncol 2018; 12:1513-1525. [PMID: 29927052 PMCID: PMC6120220 DOI: 10.1002/1878-0261.12348] [Citation(s) in RCA: 39] [Impact Index Per Article: 6.5] [Reference Citation Analysis] [Abstract] [Key Words] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/05/2018] [Revised: 05/28/2018] [Accepted: 06/13/2018] [Indexed: 12/22/2022] Open
Abstract
Classifying indolent prostate cancer represents a significant clinical challenge. We investigated whether integrating data from different omic platforms could identify a biomarker panel with improved performance compared to individual platforms alone. DNA methylation, transcripts, protein and glycosylation biomarkers were assessed in a single cohort of patients treated by radical prostatectomy. Novel multiblock statistical data integration approaches were used to deal with missing data and modelled via stepwise multinomial logistic regression, or LASSO. After applying leave‐one‐out cross‐validation to each model, the probabilistic predictions of disease type for each individual panel were aggregated to improve prediction accuracy using all available information for a given patient. Through assessment of three performance parameters of area under the curve (AUC) values, calibration and decision curve analysis, the study identified an integrated biomarker panel which predicts disease type with a high level of accuracy, with Multi AUC value of 0.91 (0.89, 0.94) and Ordinal C‐Index (ORC) value of 0.94 (0.91, 0.96), which was significantly improved compared to the values for the clinical panel alone of 0.67 (0.62, 0.72) Multi AUC and 0.72 (0.67, 0.78) ORC. Biomarker integration across different omic platforms significantly improves prediction accuracy. We provide a novel multiplatform approach for the analysis, determination and performance assessment of novel panels which can be applied to other diseases. With further refinement and validation, this panel could form a tool to help inform appropriate treatment strategies impacting on patient outcome in early stage prostate cancer.
Collapse
Affiliation(s)
- Keefe Murphy
- UCD School of Mathematics and Statistics, University College Dublin, Ireland
| | - Brendan T Murphy
- UCD School of Mathematics and Statistics, University College Dublin, Ireland
| | - Susie Boyce
- UCD School of Mathematics and Statistics, University College Dublin, Ireland.,UCD School of Medicine, Conway Institute of Biomolecular and Biomedical Research, University College Dublin, Ireland
| | - Louise Flynn
- Department of Histopathology, Central Pathology Laboratory, Trinity College, St James Hospital, University of Dublin, Ireland
| | - Sarah Gilgunn
- School of Biotechnology, Dublin City University, Ireland.,Biomedical Diagnostics Institute, National Centre for Sensor Research, Dublin City University, Ireland
| | - Colm J O'Rourke
- Prostate Molecular Oncology, Institute of Molecular Medicine, Trinity College Dublin, St James Hospital, Dublin, Ireland
| | - Cathy Rooney
- UCD School of Medicine, Conway Institute of Biomolecular and Biomedical Research, University College Dublin, Ireland
| | - Henning Stöckmann
- NIBRT GlycoScience Group, National Institute for Bioprocessing Research and Training, Dublin, Ireland
| | - Anna L Walsh
- Prostate Molecular Oncology, Institute of Molecular Medicine, Trinity College Dublin, St James Hospital, Dublin, Ireland
| | - Stephen Finn
- Department of Histopathology, Central Pathology Laboratory, Trinity College, St James Hospital, University of Dublin, Ireland
| | - Richard J O'Kennedy
- School of Biotechnology, Dublin City University, Ireland.,Biomedical Diagnostics Institute, National Centre for Sensor Research, Dublin City University, Ireland
| | - John O'Leary
- Department of Histopathology, Central Pathology Laboratory, Trinity College, St James Hospital, University of Dublin, Ireland
| | - Stephen R Pennington
- UCD School of Medicine, Conway Institute of Biomolecular and Biomedical Research, University College Dublin, Ireland
| | - Antoinette S Perry
- Prostate Molecular Oncology, Institute of Molecular Medicine, Trinity College Dublin, St James Hospital, Dublin, Ireland.,Cancer Biology and Therapeutics Lab, UCD School of Biomolecular and Biomedical Science, University College Dublin, Ireland
| | - Pauline M Rudd
- NIBRT GlycoScience Group, National Institute for Bioprocessing Research and Training, Dublin, Ireland
| | - Radka Saldova
- NIBRT GlycoScience Group, National Institute for Bioprocessing Research and Training, Dublin, Ireland
| | - Orla Sheils
- Department of Histopathology, Central Pathology Laboratory, Trinity College, St James Hospital, University of Dublin, Ireland
| | - Denis C Shields
- UCD School of Medicine, Conway Institute of Biomolecular and Biomedical Research, University College Dublin, Ireland
| | - R William Watson
- UCD School of Medicine, Conway Institute of Biomolecular and Biomedical Research, University College Dublin, Ireland
| |
Collapse
|
37
|
Sinnott JA, Cai T. Pathway aggregation for survival prediction via multiple kernel learning. Stat Med 2018; 37:2501-2515. [PMID: 29664143 PMCID: PMC5994931 DOI: 10.1002/sim.7681] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/16/2016] [Revised: 03/10/2018] [Accepted: 03/20/2018] [Indexed: 01/05/2023]
Abstract
Attempts to predict prognosis in cancer patients using high-dimensional genomic data such as gene expression in tumor tissue can be made difficult by the large number of features and the potential complexity of the relationship between features and the outcome. Integrating prior biological knowledge into risk prediction with such data by grouping genomic features into pathways and networks reduces the dimensionality of the problem and could improve prediction accuracy. Additionally, such knowledge-based models may be more biologically grounded and interpretable. Prediction could potentially be further improved by allowing for complex nonlinear pathway effects. The kernel machine framework has been proposed as an effective approach for modeling the nonlinear and interactive effects of genes in pathways for both censored and noncensored outcomes. When multiple pathways are under consideration, one may efficiently select informative pathways and aggregate their signals via multiple kernel learning (MKL), which has been proposed for prediction of noncensored outcomes. In this paper, we propose MKL methods for censored survival outcomes. We derive our approach for a general survival modeling framework with a convex objective function and illustrate its application under the Cox proportional hazards and semiparametric accelerated failure time models. Numerical studies demonstrate that the proposed MKL-based prediction methods work well in finite sample and can potentially outperform models constructed assuming linear effects or ignoring the group knowledge. The methods are illustrated with an application to 2 cancer data sets.
Collapse
Affiliation(s)
| | - Tianxi Cai
- Department of Biostatistics, Harvard University, Boston, MA, USA
| |
Collapse
|
38
|
Dihazi H, Asif AR, Beißbarth T, Bohrer R, Feussner K, Feussner I, Jahn O, Lenz C, Majcherczyk A, Schmidt B, Schmitt K, Urlaub H, Valerius O. Integrative omics - from data to biology. Expert Rev Proteomics 2018; 15:463-466. [DOI: 10.1080/14789450.2018.1476143] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/01/2023]
Affiliation(s)
- Hassan Dihazi
- Göttingen Proteomics Forum (GPF), Göttingen, Germany
- Nephrology and Rheumatology, University Medical Center Göttingen, University of Göttingen, Göttingen, Germany
| | - Abdul R. Asif
- Göttingen Proteomics Forum (GPF), Göttingen, Germany
- Institute for Clinical Chemistry/UMG-Laborateries, University Medical Center Göttingen, University of Göttingen, Göttingen, Germany
| | - Tim Beißbarth
- Göttingen Proteomics Forum (GPF), Göttingen, Germany
- Department of Medical Statistics, University Medical Center Göttingen, University of Göttingen, Göttingen, Germany
| | - Rainer Bohrer
- Göttingen Proteomics Forum (GPF), Göttingen, Germany
- Gesellschaft für Wissenschaftlische Datenverarbeitung mbH, Göttingen, Germany
| | - Kirstin Feussner
- Göttingen Metabolomics and Lipidomics Platform (GMLP), Göttingen, Germany
- Department of Plant Biochemistry, Albrecht-von-Haller-Institute for Plant Sciences, University of Göttingen, Göttingen, Germany
| | - Ivo Feussner
- Göttingen Metabolomics and Lipidomics Platform (GMLP), Göttingen, Germany
- Department of Plant Biochemistry, Albrecht-von-Haller-Institute for Plant Sciences, University of Göttingen, Göttingen, Germany
| | - Olaf Jahn
- Göttingen Proteomics Forum (GPF), Göttingen, Germany
- Proteomics Group, Max Planck Institute of Experimental Medicine, Göttingen, Germany
| | - Christof Lenz
- Göttingen Proteomics Forum (GPF), Göttingen, Germany
- Institute for Clinical Chemistry/UMG-Laborateries, University Medical Center Göttingen, University of Göttingen, Göttingen, Germany
- Bioanalytical Mass Spectrometry, Max Planck Institute for Biophysical Chemistry, Göttingen, Germany
| | - Andrzej Majcherczyk
- Göttingen Proteomics Forum (GPF), Göttingen, Germany
- Büsgen-Institute, Section Molecular Wood Biotechnology and Technical Mycology, University of Göttingen, Göttingen, Germany
| | - Bernhard Schmidt
- Göttingen Proteomics Forum (GPF), Göttingen, Germany
- Department of Cellular Biochemistry, University Medical Center Göttingen, Göttingen, Germany
| | - Kerstin Schmitt
- Göttingen Proteomics Forum (GPF), Göttingen, Germany
- Institute for Microbiology and Genetics, University of Göttingen, Göttingen, Germany
| | - Henning Urlaub
- Göttingen Proteomics Forum (GPF), Göttingen, Germany
- Institute for Clinical Chemistry/UMG-Laborateries, University Medical Center Göttingen, University of Göttingen, Göttingen, Germany
- Bioanalytical Mass Spectrometry, Max Planck Institute for Biophysical Chemistry, Göttingen, Germany
| | - Oliver Valerius
- Göttingen Proteomics Forum (GPF), Göttingen, Germany
- Institute for Microbiology and Genetics, University of Göttingen, Göttingen, Germany
| |
Collapse
|
39
|
Noell G, Faner R, Agustí A. From systems biology to P4 medicine: applications in respiratory medicine. Eur Respir Rev 2018; 27:27/147/170110. [PMID: 29436404 PMCID: PMC9489012 DOI: 10.1183/16000617.0110-2017] [Citation(s) in RCA: 27] [Impact Index Per Article: 4.5] [Reference Citation Analysis] [Abstract] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/28/2017] [Accepted: 11/30/2017] [Indexed: 12/22/2022] Open
Abstract
Human health and disease are emergent properties of a complex, nonlinear, dynamic multilevel biological system: the human body. Systems biology is a comprehensive research strategy that has the potential to understand these emergent properties holistically. It stems from advancements in medical diagnostics, “omics” data and bioinformatic computing power. It paves the way forward towards “P4 medicine” (predictive, preventive, personalised and participatory), which seeks to better intervene preventively to preserve health or therapeutically to cure diseases. In this review, we: 1) discuss the principles of systems biology; 2) elaborate on how P4 medicine has the potential to shift healthcare from reactive medicine (treatment of illness) to predict and prevent illness, in a revolution that will be personalised in nature, probabilistic in essence and participatory driven; 3) review the current state of the art of network (systems) medicine in three prevalent respiratory diseases (chronic obstructive pulmonary disease, asthma and lung cancer); and 4) outline current challenges and future goals in the field. Systems biology and network medicine have the potential to transform medical research and practicehttp://ow.ly/r3jR30hf35x
Collapse
Affiliation(s)
- Guillaume Noell
- Institut d'Investigacions Biomediques August Pi i Sunyer (IDIBAPS), Barcelona, Spain.,CIBER Enfermedades Respiratorias (CIBERES), Barcelona, Spain
| | - Rosa Faner
- Institut d'Investigacions Biomediques August Pi i Sunyer (IDIBAPS), Barcelona, Spain.,CIBER Enfermedades Respiratorias (CIBERES), Barcelona, Spain
| | - Alvar Agustí
- Institut d'Investigacions Biomediques August Pi i Sunyer (IDIBAPS), Barcelona, Spain .,CIBER Enfermedades Respiratorias (CIBERES), Barcelona, Spain.,Respiratory Institute, Hospital Clinic, Universitat de Barcelona, Barcelona, Spain
| |
Collapse
|
40
|
Huang S, Chaudhary K, Garmire LX. More Is Better: Recent Progress in Multi-Omics Data Integration Methods. Front Genet 2017; 8:84. [PMID: 28670325 PMCID: PMC5472696 DOI: 10.3389/fgene.2017.00084] [Citation(s) in RCA: 396] [Impact Index Per Article: 56.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/24/2017] [Accepted: 06/01/2017] [Indexed: 01/20/2023] Open
Abstract
Multi-omics data integration is one of the major challenges in the era of precision medicine. Considerable work has been done with the advent of high-throughput studies, which have enabled the data access for downstream analyses. To improve the clinical outcome prediction, a gamut of software tools has been developed. This review outlines the progress done in the field of multi-omics integration and comprehensive tools developed so far in this field. Further, we discuss the integration methods to predict patient survival at the end of the review.
Collapse
Affiliation(s)
- Sijia Huang
- Epidemiology Program, University of Hawaii Cancer CenterHonolulu, HI, United States.,Molecular Biosciences and Bioengineering Graduate Program, University of Hawaii at ManoaHonolulu, HI, United States
| | - Kumardeep Chaudhary
- Epidemiology Program, University of Hawaii Cancer CenterHonolulu, HI, United States
| | - Lana X Garmire
- Epidemiology Program, University of Hawaii Cancer CenterHonolulu, HI, United States.,Molecular Biosciences and Bioengineering Graduate Program, University of Hawaii at ManoaHonolulu, HI, United States.,Department of Obstetrics, Gynecology, and Women's Health, John A. Burns School of Medicine, University of Hawaii at ManoaHonolulu, HI, United States
| |
Collapse
|
41
|
Du W, Cao Z, Song T, Li Y, Liang Y. A feature selection method based on multiple kernel learning with expression profiles of different types. BioData Min 2017; 10:4. [PMID: 28184251 PMCID: PMC5288949 DOI: 10.1186/s13040-017-0124-x] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/05/2015] [Accepted: 01/11/2017] [Indexed: 11/28/2022] Open
Abstract
Background With the development of high-throughput technology, the researchers can acquire large number of expression data with different types from several public databases. Because most of these data have small number of samples and hundreds or thousands features, how to extract informative features from expression data effectively and robustly using feature selection technique is challenging and crucial. So far, a mass of many feature selection approaches have been proposed and applied to analyse expression data of different types. However, most of these methods only are limited to measure the performances on one single type of expression data by accuracy or error rate of classification. Results In this article, we propose a hybrid feature selection method based on Multiple Kernel Learning (MKL) and evaluate the performance on expression datasets of different types. Firstly, the relevance between features and classifying samples is measured by using the optimizing function of MKL. In this step, an iterative gradient descent process is used to perform the optimization both on the parameters of Support Vector Machine (SVM) and kernel confidence. Then, a set of relevant features is selected by sorting the optimizing function of each feature. Furthermore, we apply an embedded scheme of forward selection to detect the compact feature subsets from the relevant feature set. Conclusions We not only compare the classification accuracy with other methods, but also compare the stability, similarity and consistency of different algorithms. The proposed method has a satisfactory capability of feature selection for analysing expression datasets of different types using different performance measurements. Electronic supplementary material The online version of this article (doi:10.1186/s13040-017-0124-x) contains supplementary material, which is available to authorized users.
Collapse
Affiliation(s)
- Wei Du
- College of Computer Science and Technology, Key Laboratory of Symbol Computation and Knowledge Engineering of the Ministry of Education, Jilin University, Changchun, 130012 China
| | - Zhongbo Cao
- College of Computer Science and Technology, Key Laboratory of Symbol Computation and Knowledge Engineering of the Ministry of Education, Jilin University, Changchun, 130012 China.,School of Management Science and Information Engineering, Jilin University of Finance and Economics, Changchun, 130012 China
| | - Tianci Song
- College of Computer Science and Technology, Key Laboratory of Symbol Computation and Knowledge Engineering of the Ministry of Education, Jilin University, Changchun, 130012 China
| | - Ying Li
- College of Computer Science and Technology, Key Laboratory of Symbol Computation and Knowledge Engineering of the Ministry of Education, Jilin University, Changchun, 130012 China
| | - Yanchun Liang
- College of Computer Science and Technology, Key Laboratory of Symbol Computation and Knowledge Engineering of the Ministry of Education, Jilin University, Changchun, 130012 China.,Zhuhai Laboratory of Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Zhuhai College of Jilin University, Zhuhai, 519041 China
| |
Collapse
|
42
|
Kim S, Oesterreich S, Kim S, Park Y, Tseng GC. Integrative clustering of multi-level omics data for disease subtype discovery using sequential double regularization. Biostatistics 2017; 18:165-179. [PMID: 27549122 PMCID: PMC5255053 DOI: 10.1093/biostatistics/kxw039] [Citation(s) in RCA: 26] [Impact Index Per Article: 3.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/05/2015] [Revised: 04/21/2016] [Accepted: 06/29/2016] [Indexed: 11/12/2022] Open
Abstract
With the rapid advances in technologies of microarray and massively parallel sequencing, data of multiple omics sources from a large patient cohort are now frequently seen in many consortium studies. Effective multi-level omics data integration has brought new statistical challenges. One important biological objective of such integrative analysis is to cluster patients in order to identify clinically relevant disease subtypes, which will form basis for tailored treatment and personalized medicine. Several methods have been proposed in the literature for this purpose, including the popular iCluster method used in many cancer applications. When clustering high-dimensional omics data, effective feature selection is critical for better clustering accuracy and biological interpretation. It is also common that a portion of "scattered samples" has patterns distinct from all major clusters and should not be assigned into any cluster as they may represent a rare disease subcategory or be in transition between disease subtypes. In this paper, we firstly propose to improve feature selection of the iCluster factor model by an overlapping sparse group lasso penalty on the omics features using prior knowledge of inter-omics regulatory flows. We then perform regularization over samples to allow clustering with scattered samples and generate tight clusters. The proposed group structured tight iCluster method will be evaluated by two real breast cancer examples and simulations to demonstrate its improved clustering accuracy, biological interpretation, and ability to generate coherent tight clusters.
Collapse
Affiliation(s)
- Sunghwan Kim
- Department of Biostatistics, University of Pittsburgh, 130 Desoto Street, Pittsburgh, PA 15261, USA and Department of Statistics, Korea University, Anamdong, Seoul 02841, South Korea
| | - Steffi Oesterreich
- Magee-Women's Research Institute, 204 Craft Avenue, Pittsburgh, PA 15213, USA
| | - Seyoung Kim
- School of Computer Science, Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, PA 15213, USA
| | - Yongseok Park
- Department of Biostatistics, University of Pittsburgh, 130 Desoto Street, Pittsburgh, PA 15261, USA ;
| | - George C Tseng
- Department of Biostatistics, University of Pittsburgh, 130 Desoto Street, Pittsburgh, PA 15261, USA ;
| |
Collapse
|
43
|
Gönen M. Integrating gene set analysis and nonlinear predictive modeling of disease phenotypes using a Bayesian multitask formulation. BMC Bioinformatics 2016; 17:0. [PMID: 28105911 PMCID: PMC5249028 DOI: 10.1186/s12859-016-1311-3] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Identifying molecular signatures of disease phenotypes is studied using two mainstream approaches: (i) Predictive modeling methods such as linear classification and regression algorithms are used to find signatures predictive of phenotypes from genomic data, which may not be robust due to limited sample size or highly correlated nature of genomic data. (ii) Gene set analysis methods are used to find gene sets on which phenotypes are linearly dependent by bringing prior biological knowledge into the analysis, which may not capture more complex nonlinear dependencies. Thus, formulating an integrated model of gene set analysis and nonlinear predictive modeling is of great practical importance. RESULTS In this study, we propose a Bayesian binary classification framework to integrate gene set analysis and nonlinear predictive modeling. We then generalize this formulation to multitask learning setting to model multiple related datasets conjointly. Our main novelty is the probabilistic nonlinear formulation that enables us to robustly capture nonlinear dependencies between genomic data and phenotype even with small sample sizes. We demonstrate the performance of our algorithms using repeated random subsampling validation experiments on two cancer and two tuberculosis datasets by predicting important disease phenotypes from genome-wide gene expression data. CONCLUSIONS We are able to obtain comparable or even better predictive performance than a baseline Bayesian nonlinear algorithm and to identify sparse sets of relevant genes and gene sets on all datasets. We also show that our multitask learning formulation enables us to further improve the generalization performance and to better understand biological processes behind disease phenotypes.
Collapse
Affiliation(s)
- Mehmet Gönen
- Department of Industrial Engineering, Koç University, İstanbul, 34450, Turkey.
| |
Collapse
|
44
|
Youssef I, Clarke R, Shih IM, Wang Y, Yu G. Biologically inspired survival analysis based on integrating gene expression as mediator with genomic variants. Comput Biol Med 2016; 77:231-9. [PMID: 27619193 DOI: 10.1016/j.compbiomed.2016.08.020] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/23/2016] [Revised: 08/29/2016] [Accepted: 08/30/2016] [Indexed: 12/20/2022]
Abstract
Accurately linking cancer molecular profiling with survival can lead to improvements in the clinical management of cancer. However, existing survival analysis relies on statistical evidence from a single level of data, without paying much attention to the integration of interacting multi-level data and the underlying biology. Advances in genomic techniques provide unprecedented power of characterizing the cancer tissue in a more complete manner than before, offering the opportunity to design biologically informed and integrative approaches for survival data analysis. Human cancer is characterized by somatic copy number alternation and unique gene expression profiles. However, it remains largely unclear how to integrate the gene expression and genetic variant data to achieve a better prediction of patient survival and an improved understanding of disease progression. Consistent with the biological hierarchy from DNA to RNA, we prioritize each survival-relevant feature with two separate scores, predictive and mechanistic. For mRNA expression levels, predictive features are those mRNAs whose variation in expression levels is associated with survival outcome, and mechanistic features are those mRNAs whose variation in expression levels is associated with genomic variants. Further, we simultaneously integrate information from both the predictive model and the mechanistic model through our new approach, GEMPS (Gene Expression as a Mediator for Predicting Survival). Applied on two cancer types (ovarian and glioblastoma multiforme), our method achieved better prediction power (p-value: 6.18E-03-5.15E-11) than peer methods (GE.CNAs and GE.CNAs. Lasso). Gene set enrichment analysis confirms that the genes utilized for the final survival analysis are biologically important and relevant.
Collapse
Affiliation(s)
- Ibrahim Youssef
- Bradley Department of Electrical and Computer Engineering, Virginia Polytechnic Institute and State University, Arlington, VA 22203, USA; Biomedical Engineering Department, Cairo University, Giza 12613, Egypt
| | - Robert Clarke
- Lombardi Comprehensive Cancer Center, Georgetown University, Washington DC 20057, USA
| | - Ie-Ming Shih
- Departments of Gynecology and Obstetrics, Pathology, and Oncology, Johns Hopkins University, Baltimore, MD 21231, USA
| | - Yue Wang
- Bradley Department of Electrical and Computer Engineering, Virginia Polytechnic Institute and State University, Arlington, VA 22203, USA
| | - Guoqiang Yu
- Bradley Department of Electrical and Computer Engineering, Virginia Polytechnic Institute and State University, Arlington, VA 22203, USA.
| |
Collapse
|
45
|
Fernandez-Lozano C, Seoane JA, Gestal M, Gaunt TR, Dorado J, Pazos A, Campbell C. Texture analysis in gel electrophoresis images using an integrative kernel-based approach. Sci Rep 2016; 6:19256. [PMID: 26758643 PMCID: PMC4713050 DOI: 10.1038/srep19256] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/11/2015] [Accepted: 12/07/2015] [Indexed: 01/08/2023] Open
Abstract
Texture information could be used in proteomics to improve the quality of the image analysis of proteins separated on a gel. In order to evaluate the best technique to identify relevant textures, we use several different kernel-based machine learning techniques to classify proteins in 2-DE images into spot and noise. We evaluate the classification accuracy of each of these techniques with proteins extracted from ten 2-DE images of different types of tissues and different experimental conditions. We found that the best classification model was FSMKL, a data integration method using multiple kernel learning, which achieved AUROC values above 95% while using a reduced number of features. This technique allows us to increment the interpretability of the complex combinations of textures and to weight the importance of each particular feature in the final model. In particular the Inverse Difference Moment exhibited the highest discriminating power. A higher value can be associated with an homogeneous structure as this feature describes the homogeneity; the larger the value, the more symmetric. The final model is performed by the combination of different groups of textural features. Here we demonstrated the feasibility of combining different groups of textures in 2-DE image analysis for spot detection.
Collapse
Affiliation(s)
- Carlos Fernandez-Lozano
- Information and Communication Technologies Department, Faculty of Computer Science, University of A Coruna, A Coruna, 15071, Spain
| | - Jose A Seoane
- Bristol Genetic Epidemiology Laboratories, School of Social and Community Medicine, University of Bristol, Bristol BS82BN, UK.,Stanford Cancer Institute, Stanford School of Medicine, Stanford University, Stanford, 94305, USA
| | - Marcos Gestal
- Information and Communication Technologies Department, Faculty of Computer Science, University of A Coruna, A Coruna, 15071, Spain
| | - Tom R Gaunt
- MRC Integrative Epidemiology Unit, School of Social and Community Medicine, University of Bristol, Bristol BS82BN, UK
| | - Julian Dorado
- Information and Communication Technologies Department, Faculty of Computer Science, University of A Coruna, A Coruna, 15071, Spain
| | - Alejandro Pazos
- Information and Communication Technologies Department, Faculty of Computer Science, University of A Coruna, A Coruna, 15071, Spain.,Instituto de Investigacion Biomedica de A Coruña (INIBIC), Complexo Hospitalario Universitario de A Coruña (CHUAC), A Coruña, 15006, Spain
| | - Colin Campbell
- Intelligent Systems Laboratory, University of Bristol, Bristol BS81UB, UK
| |
Collapse
|
46
|
Iteratively refining breast cancer intrinsic subtypes in the METABRIC dataset. BioData Min 2016; 9:2. [PMID: 26770261 PMCID: PMC4712506 DOI: 10.1186/s13040-015-0078-9] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/16/2015] [Accepted: 12/25/2015] [Indexed: 01/28/2023] Open
Abstract
BACKGROUND Multi-gene lists and single sample predictor models have been currently used to reduce the multidimensional complexity of breast cancers, and to identify intrinsic subtypes. The perceived inability of some models to deal with the challenges of processing high-dimensional data, however, limits the accurate characterisation of these subtypes. Towards the development of robust strategies, we designed an iterative approach to consistently discriminate intrinsic subtypes and improve class prediction in the METABRIC dataset. FINDINGS In this study, we employed the CM1 score to identify the most discriminative probes for each group, and an ensemble learning technique to assess the ability of these probes on assigning subtype labels using 24 different classifiers. Our analysis is comprised of an iterative computation of these methods and statistical measures performed on a set of over 2000 samples. The refined labels assigned using this iterative approach revealed to be more consistent and in better agreement with clinicopathological markers and patients' overall survival than those originally provided by the PAM50 method. CONCLUSIONS The assignment of intrinsic subtypes has a significant impact in translational research for both understanding and managing breast cancer. The refined labelling, therefore, provides more accurate and reliable information by improving the source of fundamental science prior to clinical applications in medicine.
Collapse
|
47
|
Song M, Jiang Z. Inferring Association between Compound and Pathway with an Improved Ensemble Learning Method. Mol Inform 2015; 34:753-60. [DOI: 10.1002/minf.201500033] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/17/2015] [Accepted: 07/03/2015] [Indexed: 12/20/2022]
|
48
|
Munteanu CR, Pimenta AC, Fernandez-Lozano C, Melo A, Cordeiro MNDS, Moreira IS. Solvent accessible surface area-based hot-spot detection methods for protein-protein and protein-nucleic acid interfaces. J Chem Inf Model 2015; 55:1077-86. [PMID: 25845030 DOI: 10.1021/ci500760m] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/07/2023]
Abstract
Due to the importance of hot-spots (HS) detection and the efficiency of computational methodologies, several HS detecting approaches have been developed. The current paper presents new models to predict HS for protein-protein and protein-nucleic acid interactions with better statistics compared with the ones currently reported in literature. These models are based on solvent accessible surface area (SASA) and genetic conservation features subjected to simple Bayes networks (protein-protein systems) and a more complex multi-objective genetic algorithm-support vector machine algorithms (protein-nucleic acid systems). The best models for these interactions have been implemented in two free Web tools.
Collapse
Affiliation(s)
- Cristian R Munteanu
- †Information and Communication Technologies Department, Computer Science Faculty, University of A Coruna, Campus de Elviña s/n, 15071 A Coruña, Spain
| | - António C Pimenta
- ‡REQUIMTE/Departamento de Química e Bioquímica, Faculdade de Ciências da Universidade do Porto, Rua do Campo Alegre s/n, 4169-007 Porto, Portugal
| | - Carlos Fernandez-Lozano
- †Information and Communication Technologies Department, Computer Science Faculty, University of A Coruna, Campus de Elviña s/n, 15071 A Coruña, Spain
| | - André Melo
- ‡REQUIMTE/Departamento de Química e Bioquímica, Faculdade de Ciências da Universidade do Porto, Rua do Campo Alegre s/n, 4169-007 Porto, Portugal
| | - Maria N D S Cordeiro
- ‡REQUIMTE/Departamento de Química e Bioquímica, Faculdade de Ciências da Universidade do Porto, Rua do Campo Alegre s/n, 4169-007 Porto, Portugal
| | - Irina S Moreira
- ‡REQUIMTE/Departamento de Química e Bioquímica, Faculdade de Ciências da Universidade do Porto, Rua do Campo Alegre s/n, 4169-007 Porto, Portugal.,§CNC-Center for Neuroscience and Cell Biology, Universidade de Coimbra, Rua Larga, FMUC, Polo I, 1°andar, 3004-517 Coimbra, Portugal
| |
Collapse
|
49
|
Wang X, Xing EP, Schaid DJ. Kernel methods for large-scale genomic data analysis. Brief Bioinform 2015; 16:183-92. [PMID: 25053743 PMCID: PMC4375394 DOI: 10.1093/bib/bbu024] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.9] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/31/2014] [Accepted: 05/20/2014] [Indexed: 11/12/2022] Open
Abstract
Machine learning, particularly kernel methods, has been demonstrated as a promising new tool to tackle the challenges imposed by today's explosive data growth in genomics. They provide a practical and principled approach to learning how a large number of genetic variants are associated with complex phenotypes, to help reveal the complexity in the relationship between the genetic markers and the outcome of interest. In this review, we highlight the potential key role it will have in modern genomic data processing, especially with regard to integration with classical methods for gene prioritizing, prediction and data fusion.
Collapse
|
50
|
Taskesen E, Babaei S, Reinders MMJ, de Ridder J. Integration of gene expression and DNA-methylation profiles improves molecular subtype classification in acute myeloid leukemia. BMC Bioinformatics 2015; 16 Suppl 4:S5. [PMID: 25734246 PMCID: PMC4347619 DOI: 10.1186/1471-2105-16-s4-s5] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/02/2022] Open
Abstract
Background Acute Myeloid Leukemia (AML) is characterized by various cytogenetic and molecular abnormalities. Detection of these abnormalities is important in the risk-classification of patients but requires laborious experimentation. Various studies showed that gene expression profiles (GEP), and the gene signatures derived from GEP, can be used for the prediction of subtypes in AML. Similarly, successful prediction was also achieved by exploiting DNA-methylation profiles (DMP). There are, however, no studies that compared classification accuracy and performance between GEP and DMP, neither are there studies that integrated both types of data to determine whether predictive power can be improved. Approach Here, we used 344 well-characterized AML samples for which both gene expression and DNA-methylation profiles are available. We created three different classification strategies including early, late and no integration of these datasets and used them to predict AML subtypes using a logistic regression model with Lasso regularization. Results We illustrate that both gene expression and DNA-methylation profiles contain distinct patterns that contribute to discriminating AML subtypes and that an integration strategy can exploit these patterns to achieve synergy between both data types. We show that concatenation of features from both data sets, i.e. early integration, improves the predictive power compared to classifiers trained on GEP or DMP alone. A more sophisticated strategy, i.e. the late integration strategy, employs a two-layer classifier which outperforms the early integration strategy. Conclusion We demonstrate that prediction of known cytogenetic and molecular abnormalities in AML can be further improved by integrating GEP and DMP profiles.
Collapse
|