1
Matsuoka T, Yashiro M. Bioinformatics Analysis and Validation of Potential Markers Associated with Prediction and Prognosis of Gastric Cancer. Int J Mol Sci 2024; 25:5880. [PMID: 38892067 PMCID: PMC11172243 DOI: 10.3390/ijms25115880] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/18/2024] [Revised: 05/23/2024] [Accepted: 05/25/2024] [Indexed: 06/21/2024]
Abstract
Gastric cancer (GC) is one of the most common cancers worldwide. Most patients are diagnosed at the progressive stage of the disease, and current anticancer drug advancements are still lacking. Therefore, it is crucial to find relevant biomarkers with the accurate prediction of prognoses and good predictive accuracy to select appropriate patients with GC. Recent advances in molecular profiling technologies, including genomics, epigenomics, transcriptomics, proteomics, and metabolomics, have enabled the approach of GC biology at multiple levels of omics interaction networks. Systemic biological analyses, such as computational inference of "big data" and advanced bioinformatic approaches, are emerging to identify the key molecular biomarkers of GC, which would benefit targeted therapies. This review summarizes the current status of how bioinformatics analysis contributes to biomarker discovery for prognosis and prediction of therapeutic efficacy in GC based on a search of the medical literature. We highlight emerging individual multi-omics datasets, such as genomics, epigenomics, transcriptomics, proteomics, and metabolomics, for validating putative markers. Finally, we discuss the current challenges and future perspectives to integrate multi-omics analysis for improving biomarker implementation. The practical integration of bioinformatics analysis and multi-omics datasets under complementary computational analysis is having a great impact on the search for predictive and prognostic biomarkers and may lead to an important revolution in treatment.
Affiliation(s)
- Tasuku Matsuoka
- Department of Molecular Oncology and Therapeutics, Osaka Metropolitan University Graduate School of Medicine, 1-4-3 Asahi-machi, Abeno-ku, Osaka 5458585, Japan;
- Institute of Medical Genetics, Osaka Metropolitan University, 1-4-3 Asahi-machi, Abeno-ku, Osaka 5458585, Japan
- Masakazu Yashiro
- Department of Molecular Oncology and Therapeutics, Osaka Metropolitan University Graduate School of Medicine, 1-4-3 Asahi-machi, Abeno-ku, Osaka 5458585, Japan;
- Institute of Medical Genetics, Osaka Metropolitan University, 1-4-3 Asahi-machi, Abeno-ku, Osaka 5458585, Japan
2
Zeng IS. Integrating omics atlas in health informatics system design-an opinion article. Front Digit Health 2024; 6:1374359. [PMID: 38784702 PMCID: PMC11111845 DOI: 10.3389/fdgth.2024.1374359] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/22/2024] [Accepted: 04/22/2024] [Indexed: 05/25/2024]
Affiliation(s)
- Irene Suilan Zeng
- Department of Biostatistics and Epidemiology, Auckland University of Technology, Auckland, New Zealand
- School of Clinical Science, Faculty of Health and Environmental Sciences, Auckland University of Technology, Auckland, New Zealand
3
Dall’Olio D, Sträng E, Turki AT, Tettero JM, Barbus M, Schulze-Rath R, Elicegui JM, Matteuzzi T, Merlotti A, Carota L, Sala C, Della Porta MG, Giampieri E, Hernández-Rivas JM, Bullinger L, Castellani G. Covering Hierarchical Dirichlet Mixture Models on binary data to enhance genomic stratifications in onco-hematology. PLoS Comput Biol 2024; 20:e1011299. [PMID: 38306404 PMCID: PMC10880984 DOI: 10.1371/journal.pcbi.1011299] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/26/2023] [Revised: 02/21/2024] [Accepted: 01/02/2024] [Indexed: 02/04/2024]
Abstract
Onco-hematological studies are increasingly adopting statistical mixture models to support the advancement of genomically driven classification systems for blood cancer. Targeting enhanced patient stratification based solely on molecular biology has attracted much interest and helps bring personalized medicine closer to reality. In onco-hematology, Hierarchical Dirichlet Mixture Models (HDMM) have become one of the preferred methods to cluster genomics data, which include the presence or absence of gene mutations and cytogenetic anomalies, into components. This work unfolds the standard workflow used in onco-hematology to improve patient stratification and proposes alternative approaches to characterize the components and to assign patients to them, as these are crucial tasks usually supported by a priori clinical knowledge. We propose (a) to compute the parameters of the multinomial components of the HDMM or (b) to estimate the parameters of the HDMM components as if they were Multivariate Fisher's Non-Central Hypergeometric (MFNCH) distributions. Then, our approach to patient assignment to the HDMM components is designed to determine, for each patient, the most likely component. We show on simulated data that patient assignment using the MFNCH-based approach can be superior, if not comparable, to using the multinomial-based approach. Lastly, we illustrate on real Acute Myeloid Leukemia data how the MFNCH-based approach emerges as a good trade-off between the rigorous multinomial-based characterization of the HDMM components and the common refinement of them based on a priori clinical knowledge.
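The idea of assigning each patient to its most likely component can be illustrated with a minimal sketch. It assumes the multinomial characterization, option (a) above, with strictly positive category probabilities; it is not the authors' implementation, and the function name, mixture weights, and probabilities below are hypothetical values chosen only for illustration. Because the multinomial coefficient depends only on the patient's data, it is identical across components and can be dropped from the comparison.

```python
import math

def assign_component(x, weights, probs):
    """Return the index of the component with the highest posterior score.

    x       -- binary vector (1 = alteration present, 0 = absent)
    weights -- mixture weight of each component (hypothetical values)
    probs   -- per-component category probabilities, all assumed > 0
    """
    best_k, best_score = None, -math.inf
    for k, (w, p) in enumerate(zip(weights, probs)):
        # log(weight) + sum_j x_j * log(p_kj); the multinomial coefficient
        # is the same for every component, so it is omitted.
        score = math.log(w) + sum(xj * math.log(pj) for xj, pj in zip(x, p))
        if score > best_score:
            best_k, best_score = k, score
    return best_k

# Hypothetical two-component model over three alterations (illustration only).
weights = [0.6, 0.4]
probs = [[0.7, 0.2, 0.1],
         [0.1, 0.3, 0.6]]

print(assign_component([1, 0, 0], weights, probs))
print(assign_component([0, 1, 1], weights, probs))
```

Working in log space avoids numerical underflow when many alterations are present; the MFNCH-based variant described in the abstract would replace the scoring line with the corresponding hypergeometric likelihood.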
Affiliation(s)
- Daniele Dall’Olio
- IRCCS Istituto delle Scienze Neurologiche di Bologna, Bologna, Italia
- Eric Sträng
- Department of Hematology, Oncology and Cancer Immunology, Campus Virchow, Charité – Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany
- Amin T. Turki
- Department of Hematology and Stem Cell Transplantation, University Hospital Essen, Essen, Germany
- Department of Hematology and Oncology, Marienhospital University Hospital, Ruhr-University Bochum, Bochum, Germany
- Jesse M. Tettero
- Department of Hematology, Amsterdam UMC location Vrije Universiteit, Amsterdam, the Netherlands
- Javier Martinez Elicegui
- Molecular Genetics in Oncohematology, Institute of Biomedical Research of Salamanca, Salamanca, Spain
- Tommaso Matteuzzi
- Department of Physics and Astronomy, University of Florence, Sesto Fiorentino, Italy
- Alessandra Merlotti
- IRCCS Istituto delle Scienze Neurologiche di Bologna, Bologna, Italia
- Physics and Astronomy Department, University of Bologna, Bologna, Italy
- Luciana Carota
- Department of Medical and Surgical Sciences (DIMEC), University of Bologna, Bologna, Italy
- Claudia Sala
- Department of Medical and Surgical Sciences (DIMEC), University of Bologna, Bologna, Italy
- Matteo G. Della Porta
- Comprehensive Cancer Center, IRCCS Humanitas Clinical and Research Center and Department of Biomedical Sciences, Humanitas University, Milan, Italy
- Enrico Giampieri
- Department of Medical and Surgical Sciences (DIMEC), University of Bologna, Bologna, Italy
- Jesús María Hernández-Rivas
- Molecular Genetics in Oncohematology, Institute of Biomedical Research of Salamanca, Salamanca, Spain
- Hematology Department, University Hospital of Salamanca, Salamanca, Spain
- Cancer Research Center of Salamanca, Salamanca, Spain
- Lars Bullinger
- Department of Hematology, Oncology and Cancer Immunology, Campus Virchow, Charité – Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany
- Gastone Castellani
- Department of Medical and Surgical Sciences (DIMEC), University of Bologna, Bologna, Italy
4
Chetty A, Blekhman R. Multi-omic approaches for host-microbiome data integration. Gut Microbes 2024; 16:2297860. [PMID: 38166610 PMCID: PMC10766395 DOI: 10.1080/19490976.2023.2297860] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/19/2023] [Accepted: 12/18/2023] [Indexed: 01/05/2024]
Abstract
The gut microbiome interacts with the host through complex networks that affect physiology and health outcomes. It is becoming clear that these interactions can be measured across many different omics layers, including the genome, transcriptome, epigenome, metabolome, and proteome, among others. Multi-omic studies of the microbiome can provide insight into the mechanisms underlying host-microbe interactions. As more omics layers are considered, increasingly sophisticated statistical methods are required to integrate them. In this review, we provide an overview of approaches currently used to characterize multi-omic interactions between host and microbiome data. While a large number of studies have generated a deeper understanding of host-microbiome interactions, there is still a need for standardization across approaches. Furthermore, microbiome studies would also benefit from the collection and curation of large, publicly available multi-omics datasets.
Affiliation(s)
- Ashwin Chetty
- Committee on Genetics, Genomics and Systems Biology, The University of Chicago, Chicago, IL, USA
- Ran Blekhman
- Section of Genetic Medicine, Department of Medicine, The University of Chicago, Chicago, IL, USA
5
Chen C, Wang J, Pan D, Wang X, Xu Y, Yan J, Wang L, Yang X, Yang M, Liu G. Applications of multi-omics analysis in human diseases. MedComm (Beijing) 2023; 4:e315. [PMID: 37533767 PMCID: PMC10390758 DOI: 10.1002/mco2.315] [Citation(s) in RCA: 8] [Impact Index Per Article: 8.0] [Received: 02/22/2023] [Revised: 05/25/2023] [Accepted: 05/31/2023] [Indexed: 08/04/2023]
Abstract
Multi-omics usually refers to the combined application of multiple high-throughput screening technologies, represented by genomics, transcriptomics, single-cell transcriptomics, proteomics, metabolomics, spatial transcriptomics, and so on, which play a great role in promoting the study of human diseases. Most current reviews focus on describing the development of multi-omics technologies, data integration, and application to a particular disease; however, few of them provide a comprehensive and systematic introduction to multi-omics. This review outlines the existing technical categories of multi-omics and cautions for experimental design; focuses on integrated analysis methods for multi-omics, especially machine learning and deep learning approaches to multi-omics data integration and the corresponding tools; covers the application of multi-omics in medical research (e.g., cancer, neurodegenerative diseases, aging, and drug target discovery) as well as the corresponding open-source analysis tools and databases; and finally discusses the challenges and future directions of multi-omics integration and application in precision medicine. Single-cell multi-omics and spatial multi-omics, which have become important directions for future disease research with the development of high-throughput technologies and data integration algorithms, are also introduced in detail. This review will provide important guidance for researchers, especially those who are just entering multi-omics medical research.
Affiliation(s)
- Chongyang Chen
- Key Laboratory of Nuclear Medicine, Ministry of Health, Jiangsu Key Laboratory of Molecular Nuclear Medicine, Jiangsu Institute of Nuclear Medicine, Wuxi, China
- Co-innovation Center of Neurodegeneration, Nantong University, Nantong, China
- Jing Wang
- Shenzhen Key Laboratory of Modern Toxicology, Shenzhen Medical Key Discipline of Health Toxicology (2020–2024), Shenzhen Center for Disease Control and Prevention, Shenzhen, China
- Donghui Pan
- Key Laboratory of Nuclear Medicine, Ministry of Health, Jiangsu Key Laboratory of Molecular Nuclear Medicine, Jiangsu Institute of Nuclear Medicine, Wuxi, China
- Xinyu Wang
- Key Laboratory of Nuclear Medicine, Ministry of Health, Jiangsu Key Laboratory of Molecular Nuclear Medicine, Jiangsu Institute of Nuclear Medicine, Wuxi, China
- Yuping Xu
- Key Laboratory of Nuclear Medicine, Ministry of Health, Jiangsu Key Laboratory of Molecular Nuclear Medicine, Jiangsu Institute of Nuclear Medicine, Wuxi, China
- Junjie Yan
- Key Laboratory of Nuclear Medicine, Ministry of Health, Jiangsu Key Laboratory of Molecular Nuclear Medicine, Jiangsu Institute of Nuclear Medicine, Wuxi, China
- Lizhen Wang
- Key Laboratory of Nuclear Medicine, Ministry of Health, Jiangsu Key Laboratory of Molecular Nuclear Medicine, Jiangsu Institute of Nuclear Medicine, Wuxi, China
- Xifei Yang
- Shenzhen Key Laboratory of Modern Toxicology, Shenzhen Medical Key Discipline of Health Toxicology (2020–2024), Shenzhen Center for Disease Control and Prevention, Shenzhen, China
- Min Yang
- Key Laboratory of Nuclear Medicine, Ministry of Health, Jiangsu Key Laboratory of Molecular Nuclear Medicine, Jiangsu Institute of Nuclear Medicine, Wuxi, China
- Gong-Ping Liu
- Co-innovation Center of Neurodegeneration, Nantong University, Nantong, China
- Department of Pathophysiology, School of Basic Medicine, Key Laboratory of Ministry of Education of China and Hubei Province for Neurological Disorders, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China
6
Carrion J, Nandakumar R, Shi X, Gu H, Kim Y, Raskind WH, Peter B, Dinu V. A data-fusion approach to identifying developmental dyslexia from multi-omics datasets. bioRxiv 2023:2023.02.27.530280. [PMID: 36909570 PMCID: PMC10002702 DOI: 10.1101/2023.02.27.530280] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/06/2023]
Abstract
This exploratory study tested and validated the use of data fusion and machine learning techniques to probe high-throughput omics and clinical data with a goal of exploring the etiology of developmental dyslexia. Developmental dyslexia is the leading learning disability in school aged children affecting roughly 5-10% of the US population. The complex biological and neurological phenotype of this life altering disability complicates its diagnosis. Phenome, exome, and metabolome data was collected allowing us to fully explore this system from a behavioral, cellular, and molecular point of view. This study provides a proof of concept showing that data fusion and ensemble learning techniques can outperform traditional machine learning techniques when provided small and complex multi-omics and clinical datasets. Heterogenous stacking classifiers consisting of single-omic experts/models achieved an accuracy of 86%, F1 score of 0.89, and AUC value of 0.83. Ensemble methods also provided a ranked list of important features that suggests exome single nucleotide polymorphisms found in the thalamus and cerebellum could be potential biomarkers for developmental dyslexia and heavily influenced the classification of DD within our machine learning models.
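For readers less familiar with the metrics reported above, the following minimal sketch shows how accuracy and F1 score are derived from the counts of a binary confusion matrix. The function name and the counts are hypothetical and chosen purely for illustration; they are not data from the study and do not reproduce its reported values.

```python
def accuracy_and_f1(tp, fp, fn, tn):
    """Compute accuracy and F1 score from binary confusion-matrix counts."""
    total = tp + fp + fn + tn
    # Accuracy: fraction of all samples classified correctly.
    accuracy = (tp + tn) / total
    # F1 is the harmonic mean of precision and recall,
    # which simplifies to 2*TP / (2*TP + FP + FN).
    f1 = 2 * tp / (2 * tp + fp + fn)
    return accuracy, f1

# Hypothetical counts (illustration only, not taken from the study).
acc, f1 = accuracy_and_f1(tp=8, fp=2, fn=1, tn=3)
print(f"accuracy={acc:.3f}, F1={f1:.3f}")
```

F1 ignores true negatives, so on small, imbalanced cohorts it can differ noticeably from accuracy, which is one reason both are usually reported together with AUC.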
Affiliation(s)
- Jackson Carrion
- College of Health Solutions, Arizona State University, Phoenix, AZ 85004
- Rohit Nandakumar
- College of Health Solutions, Arizona State University, Phoenix, AZ 85004
- Xiaojian Shi
- College of Health Solutions, Arizona State University, Phoenix, AZ 85004
- Cellular and Molecular Physiology Department, Yale School of Medicine, New Haven, CT 06510
- Haiwei Gu
- College of Health Solutions, Arizona State University, Phoenix, AZ 85004
- Center for Translational Science, Florida International University, Port St. Lucie, FL 34987
- Yookyung Kim
- College of Health Solutions, Arizona State University, Phoenix, AZ 85004
- Wendy H Raskind
- Department of Medicine/Medical Genetics, University of Washington, Seattle, WA 98105
- Beate Peter
- College of Health Solutions, Arizona State University, Phoenix, AZ 85004
- Valentin Dinu
- College of Health Solutions, Arizona State University, Phoenix, AZ 85004
7
Niranjan V, Uttarkar A, Kaul A, Varghese M. A Machine Learning-Based Approach Using Multi-omics Data to Predict Metabolic Pathways. Methods Mol Biol 2023; 2553:441-452. [PMID: 36227554 DOI: 10.1007/978-1-0716-2617-7_19] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Indexed: 06/16/2023]
Abstract
The integrative method approaches are continuously evolving to provide accurate insights from the data that is received through experimentation on various biological systems. Multi-omics data can be integrated with predictive machine learning algorithms in order to provide results with high accuracy. This protocol chapter defines the steps required for the ML-multi-omics integration methods that are applied on biological datasets for its analysis and the visual interpretation of the results thus obtained.
Affiliation(s)
- Vidya Niranjan
- Department of Biotechnology, R V College of Engineering, Mysuru Road, Kengeri, Bengaluru, India
- Akshay Uttarkar
- Department of Biotechnology, R V College of Engineering, Mysuru Road, Kengeri, Bengaluru, India
- Aakaanksha Kaul
- Department of Biotechnology, R V College of Engineering, Mysuru Road, Kengeri, Bengaluru, India
- Maryanne Varghese
- Department of Biotechnology, R V College of Engineering, Mysuru Road, Kengeri, Bengaluru, India
8
Mokou M, Narayanasamy S, Stroggilos R, Balaur IA, Vlahou A, Mischak H, Frantzi M. A Drug Repurposing Pipeline Based on Bladder Cancer Integrated Proteotranscriptomics Signatures. Methods Mol Biol 2023; 2684:59-99. [PMID: 37410228 DOI: 10.1007/978-1-0716-3291-8_4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/07/2023]
Abstract
Delivering better care for patients with bladder cancer (BC) necessitates the development of novel therapeutic strategies that address both the high disease heterogeneity and the limitations of the current therapeutic modalities, such as drug low efficacy and patient resistance acquisition. Drug repurposing is a cost-effective strategy that targets the reuse of existing drugs for new therapeutic purposes. Such a strategy could open new avenues toward more effective BC treatment. BC patients' multi-omics signatures can be used to guide the investigation of existing drugs that show an effective therapeutic potential through drug repurposing. In this book chapter, we present an integrated multilayer approach that includes cross-omics analyses from publicly available transcriptomics and proteomics data derived from BC tissues and cell lines that were investigated for the development of disease-specific signatures. These signatures are subsequently used as input for a signature-based repurposing approach using the Connectivity Map (CMap) tool. We further explain the steps that may be followed to identify and select existing drugs of increased potential for repurposing in BC patients.
Affiliation(s)
- Marika Mokou
- Department of Biomarker Research, Mosaiques Diagnostics, Hannover, Germany
- Shaman Narayanasamy
- Luxembourg Centre for Systems Biomedicine (LCSB), University of Luxembourg, Esch-sur-Alzette, Luxembourg
- Rafael Stroggilos
- Systems Biology Center, Biomedical Research Foundation, Academy of Athens, Athens, Greece
- Irina-Afrodita Balaur
- Luxembourg Centre for Systems Biomedicine (LCSB), University of Luxembourg, Esch-sur-Alzette, Luxembourg
- Antonia Vlahou
- Systems Biology Center, Biomedical Research Foundation, Academy of Athens, Athens, Greece
- Harald Mischak
- Department of Biomarker Research, Mosaiques Diagnostics, Hannover, Germany
- Institute of Cardiovascular and Medical Sciences, University of Glasgow, Glasgow, UK
- Maria Frantzi
- Department of Biomarker Research, Mosaiques Diagnostics, Hannover, Germany
9
Li W, Shao C, Zhou H, Du H, Chen H, Wan H, He Y. Multi-omics research strategies in ischemic stroke: A multidimensional perspective. Ageing Res Rev 2022; 81:101730. [PMID: 36087702 DOI: 10.1016/j.arr.2022.101730] [Citation(s) in RCA: 16] [Impact Index Per Article: 8.0] [Received: 06/19/2022] [Revised: 08/23/2022] [Accepted: 09/03/2022] [Indexed: 01/31/2023]
Abstract
Ischemic stroke (IS) is a multifactorial and heterogeneous neurological disorder with high rate of death and long-term impairment. Despite years of studies, there are still no stroke biomarkers for clinical practice, and the molecular mechanisms of stroke remain largely unclear. The high-throughput omics approach provides new avenues for discovering biomarkers of IS and explaining its pathological mechanisms. However, single-omics approaches only provide a limited understanding of the biological pathways of diseases. The integration of multiple omics data means the simultaneous analysis of thousands of genes, RNAs, proteins and metabolites, revealing networks of interactions between multiple molecular levels. Integrated analysis of multi-omics approaches will provide helpful insights into stroke pathogenesis, therapeutic target identification and biomarker discovery. Here, we consider advances in genomics, transcriptomics, proteomics and metabolomics and outline their use in discovering the biomarkers and pathological mechanisms of IS. We then delineate strategies for achieving integration at the multi-omics level and discuss how integrative omics and systems biology can contribute to our understanding and management of IS.
Affiliation(s)
- Wentao Li
- School of Pharmaceutical Sciences, Zhejiang Chinese Medical University, Hangzhou 310053, China
- Chongyu Shao
- School of Life Sciences, Zhejiang Chinese Medical University, Hangzhou 310053, China
- Huifen Zhou
- School of Life Sciences, Zhejiang Chinese Medical University, Hangzhou 310053, China
- Haixia Du
- School of Life Sciences, Zhejiang Chinese Medical University, Hangzhou 310053, China
- Haiyang Chen
- School of Pharmaceutical Sciences, Zhejiang Chinese Medical University, Hangzhou 310053, China
- Haitong Wan
- School of Life Sciences, Zhejiang Chinese Medical University, Hangzhou 310053, China
- Yu He
- School of Pharmaceutical Sciences, Zhejiang Chinese Medical University, Hangzhou 310053, China
10
Hesami M, Alizadeh M, Jones AMP, Torkamaneh D. Machine learning: its challenges and opportunities in plant system biology. Appl Microbiol Biotechnol 2022; 106:3507-3530. [PMID: 35575915 DOI: 10.1007/s00253-022-11963-6] [Citation(s) in RCA: 18] [Impact Index Per Article: 9.0] [Received: 03/14/2022] [Revised: 03/14/2022] [Accepted: 05/07/2022] [Indexed: 12/25/2022]
Abstract
Sequencing technologies are evolving at a rapid pace, enabling the generation of massive amounts of data in multiple dimensions (e.g., genomics, epigenomics, transcriptomic, metabolomics, proteomics, and single-cell omics) in plants. To provide comprehensive insights into the complexity of plant biological systems, it is important to integrate different omics datasets. Although recent advances in computational analytical pipelines have enabled efficient and high-quality exploration and exploitation of single omics data, the integration of multidimensional, heterogenous, and large datasets (i.e., multi-omics) remains a challenge. In this regard, machine learning (ML) offers promising approaches to integrate large datasets and to recognize fine-grained patterns and relationships. Nevertheless, they require rigorous optimizations to process multi-omics-derived datasets. In this review, we discuss the main concepts of machine learning as well as the key challenges and solutions related to the big data derived from plant system biology. We also provide in-depth insight into the principles of data integration using ML, as well as challenges and opportunities in different contexts including multi-omics, single-cell omics, protein function, and protein-protein interaction. KEY POINTS: • The key challenges and solutions related to the big data derived from plant system biology have been highlighted. • Different methods of data integration have been discussed. • Challenges and opportunities of the application of machine learning in plant system biology have been highlighted and discussed.
Affiliation(s)
- Mohsen Hesami
- Department of Plant Agriculture, University of Guelph, Guelph, ON, N1G 2W1, Canada
- Milad Alizadeh
- Department of Botany, University of British Columbia, Vancouver, BC, V6T 1Z4, Canada
- Davoud Torkamaneh
- Département de Phytologie, Université Laval, Québec City, QC, G1V 0A6, Canada
- Institut de Biologie Intégrative Et Des Systèmes (IBIS), Université Laval, Québec City, QC, G1V 0A6, Canada
11
Zhang X, Zhou Z, Xu H, Liu CT. Integrative clustering methods for multi-omics data. Wiley Interdiscip Rev Comput Stat 2022; 14. [PMID: 35573155 PMCID: PMC9097984 DOI: 10.1002/wics.1553] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Indexed: 11/06/2022]
Abstract
Integrative analysis of multi-omics data has drawn much attention from the scientific community due to the technological advancements which have generated various omics data. Leveraging these multi-omics data potentially provides a more comprehensive view of the disease mechanism or biological processes. Integrative multi-omics clustering is an unsupervised integrative method specifically used to find coherent groups of samples or features by utilizing information across multi-omics data. It aims to better stratify diseases and to suggest biological mechanisms and potential targeted therapies for the diseases. However, applying integrative multi-omics clustering is both statistically and computationally challenging due to various reasons such as high dimensionality and heterogeneity. In this review, we summarized integrative multi-omics clustering methods into three general categories: concatenated clustering, clustering of clusters, and interactive clustering based on when and how the multi-omics data are processed for clustering. We further classified the methods into different approaches under each category based on the main statistical strategy used during clustering. In addition, we have provided recommended practices tailored to four real-life scenarios to help researchers to strategize their selection in integrative multi-omics clustering methods for their future studies.
Affiliation(s)
- Xiaoyu Zhang
- Department of Biostatistics, Boston University School of Public Health, Boston, Massachusetts, USA
- Zhenwei Zhou
- Department of Biostatistics, Boston University School of Public Health, Boston, Massachusetts, USA
- Hanfei Xu
- Department of Biostatistics, Boston University School of Public Health, Boston, Massachusetts, USA
- Ching-Ti Liu
- Department of Biostatistics, Boston University School of Public Health, Boston, Massachusetts, USA
12
Vahabi N, Michailidis G. Unsupervised Multi-Omics Data Integration Methods: A Comprehensive Review. Front Genet 2022; 13:854752. [PMID: 35391796 PMCID: PMC8981526 DOI: 10.3389/fgene.2022.854752] [Citation(s) in RCA: 32] [Impact Index Per Article: 16.0] [Received: 01/14/2022] [Accepted: 02/28/2022] [Indexed: 12/26/2022]
Abstract
Through the developments of Omics technologies and dissemination of large-scale datasets, such as those from The Cancer Genome Atlas, Alzheimer’s Disease Neuroimaging Initiative, and Genotype-Tissue Expression, it is becoming increasingly possible to study complex biological processes and disease mechanisms more holistically. However, to obtain a comprehensive view of these complex systems, it is crucial to integrate data across various Omics modalities, and also leverage external knowledge available in biological databases. This review aims to provide an overview of multi-Omics data integration methods with different statistical approaches, focusing on unsupervised learning tasks, including disease onset prediction, biomarker discovery, disease subtyping, module discovery, and network/pathway analysis. We also briefly review feature selection methods, multi-Omics data sets, and resources/tools that constitute critical components for carrying out the integration.
Affiliation(s)
- Nasim Vahabi
- Informatics Institute, University of Florida, Gainesville, FL, United States
- George Michailidis
- Informatics Institute, University of Florida, Gainesville, FL, United States
13
Nguyen H, Tran D, Tran B, Roy M, Cassell A, Dascalu S, Draghici S, Nguyen T. SMRT: Randomized Data Transformation for Cancer Subtyping and Big Data Analysis. Front Oncol 2021; 11:725133. [PMID: 34745946 PMCID: PMC8563705 DOI: 10.3389/fonc.2021.725133] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 06/14/2021] [Accepted: 09/28/2021] [Indexed: 12/25/2022]
Abstract
Cancer is an umbrella term that includes a range of disorders, from those that are fast-growing and lethal to indolent lesions with low or delayed potential for progression to death. The treatment options, as well as treatment success, are highly dependent on the correct subtyping of individual patients. With the advancement of high-throughput platforms, we have the opportunity to differentiate among cancer subtypes from a holistic perspective that takes into consideration phenomena at different molecular levels (mRNA, methylation, etc.). This demands powerful integrative methods to leverage large multi-omics datasets for a better subtyping. Here we introduce Subtyping Multi-omics using a Randomized Transformation (SMRT), a new method for multi-omics integration and cancer subtyping. SMRT offers the following advantages over existing approaches: (i) the scalable analysis pipeline allows researchers to integrate multi-omics data and analyze hundreds of thousands of samples in minutes, (ii) the ability to integrate data types with different numbers of patients, (iii) the ability to analyze un-matched data of different types, and (iv) the ability to offer users a convenient data analysis pipeline through a web application. We also improve the efficiency of our ensemble-based, perturbation clustering to support analysis on machines with memory constraints. In an extensive analysis, we compare SMRT with eight state-of-the-art subtyping methods using 37 TCGA and two METABRIC datasets comprising a total of almost 12,000 patient samples from 28 different types of cancer. We also performed a number of simulation studies. We demonstrate that SMRT outperforms other methods in identifying subtypes with significantly different survival profiles. In addition, SMRT is extremely fast, being able to analyze hundreds of thousands of samples in minutes. The web application is available at http://SMRT.tinnguyen-lab.com. 
The R package will be deposited to CRAN as part of our PINSPlus software suite.
Affiliation(s)
- Hung Nguyen
  - Department of Computer Science and Engineering, University of Nevada Reno, Reno, NV, United States
- Duc Tran
  - Department of Computer Science and Engineering, University of Nevada Reno, Reno, NV, United States
- Bang Tran
  - Department of Computer Science and Engineering, University of Nevada Reno, Reno, NV, United States
- Monikrishna Roy
  - Department of Computer Science and Engineering, University of Nevada Reno, Reno, NV, United States
- Adam Cassell
  - Department of Computer Science and Engineering, University of Nevada Reno, Reno, NV, United States
- Sergiu Dascalu
  - Department of Computer Science and Engineering, University of Nevada Reno, Reno, NV, United States
- Sorin Draghici
  - Department of Computer Science, Wayne State University, Detroit, MI, United States
- Tin Nguyen
  - Department of Computer Science and Engineering, University of Nevada Reno, Reno, NV, United States
|
14
|
Duan R, Gao L, Gao Y, Hu Y, Xu H, Huang M, Song K, Wang H, Dong Y, Jiang C, Zhang C, Jia S. Evaluation and comparison of multi-omics data integration methods for cancer subtyping. PLoS Comput Biol 2021; 17:e1009224. [PMID: 34383739 PMCID: PMC8384175 DOI: 10.1371/journal.pcbi.1009224] [Citation(s) in RCA: 34] [Impact Index Per Article: 11.3] [Received: 02/27/2021] [Revised: 08/24/2021] [Accepted: 06/28/2021] [Indexed: 11/18/2022] Open
Abstract
Computational integrative analysis has become a significant approach in the data-driven exploration of biological problems. Many integration methods for cancer subtyping have been proposed, but evaluating these methods has become a complicated problem due to the lack of gold standards. Moreover, questions of practical importance remain to be addressed regarding the impact of selecting appropriate data types and combinations on the performance of integrative studies. Here, we constructed three classes of benchmarking datasets of nine cancers in TCGA by considering all the eleven combinations of four multi-omics data types. Using these datasets, we conducted a comprehensive evaluation of ten representative integration methods for cancer subtyping in terms of accuracy measured by combining both clustering accuracy and clinical significance, robustness, and computational efficiency. We subsequently investigated the influence of different omics data on cancer subtyping and the effectiveness of their combinations. Refuting the widely held intuition that incorporating more types of omics data always produces better results, our analyses showed that there are situations where integrating more omics data negatively impacts the performance of integration methods. Our analyses also suggested several effective combinations for most cancers under our studies, which may be of particular interest to researchers in omics data analysis. Cancer is one of the most heterogeneous diseases, characterized by diverse morphological, phenotypic, and genomic profiles between tumors and their subtypes. Identifying cancer subtypes can help patients receive precise treatments. With the development of high-throughput technologies, genomics, epigenomics, and transcriptomics data have been generated for large cancer patient cohorts. It is believed that the more omics data we use, the more accurate identification of cancer subtypes. 
To examine this assumption, we first constructed three classes of benchmarking datasets to conduct a comprehensive evaluation and comparison of ten representative multi-omics data integration methods for cancer subtyping by considering their accuracy, robustness, and computational efficiency. Then, we investigated the influence of different omics data and their various combinations on the effectiveness of cancer subtyping. Our analyses showed that there are situations where integrating more omics data negatively impacts the performance of integration methods. We hope that our work may help researchers choose a proper method and an effective data combination when identifying cancer subtypes using data integration methods.
Affiliation(s)
- Ran Duan
  - School of Computer Science and Technology, Xidian University, Xi’an, China
- Lin Gao
  - School of Computer Science and Technology, Xidian University, Xi’an, China
- Yong Gao
  - Department of Computer Science, The University of British Columbia Okanagan, Kelowna, British Columbia, Canada
- Yuxuan Hu
  - School of Computer Science and Technology, Xidian University, Xi’an, China
- Han Xu
  - School of Computer Science and Technology, Xidian University, Xi’an, China
- Mingfeng Huang
  - School of Computer Science and Technology, Xidian University, Xi’an, China
- Kuo Song
  - School of Computer Science and Technology, Xidian University, Xi’an, China
- Hongda Wang
  - School of Computer Science and Technology, Xidian University, Xi’an, China
- Yongqiang Dong
  - School of Computer Science and Technology, Xidian University, Xi’an, China
- Chaoqun Jiang
  - School of Computer Science and Technology, Xidian University, Xi’an, China
- Chenxing Zhang
  - School of Computer Science and Technology, Xidian University, Xi’an, China
- Songwei Jia
  - School of Computer Science and Technology, Xidian University, Xi’an, China
|
15
|
Heo YJ, Hwa C, Lee GH, Park JM, An JY. Integrative Multi-Omics Approaches in Cancer Research: From Biological Networks to Clinical Subtypes. Mol Cells 2021; 44:433-443. [PMID: 34238766 PMCID: PMC8334347 DOI: 10.14348/molcells.2021.0042] [Citation(s) in RCA: 56] [Impact Index Per Article: 18.7] [Received: 02/20/2021] [Revised: 04/09/2021] [Accepted: 05/12/2021] [Indexed: 11/27/2022] Open
Abstract
Multi-omics approaches are novel frameworks that integrate multiple omics datasets generated from the same patients to better understand the molecular and clinical features of cancers. A wide range of emerging omics and multi-view clustering algorithms now provide unprecedented opportunities to further classify cancers into subtypes, improve the survival prediction and therapeutic outcome of these subtypes, and understand key pathophysiological processes through different molecular layers. In this review, we overview the concept and rationale of multi-omics approaches in cancer research. We also introduce recent advances in the development of multi-omics algorithms and integration methods for multiple-layered datasets from cancer patients. Finally, we summarize the latest findings from large-scale multi-omics studies of various cancers and their implications for patient subtyping and drug development.
Affiliation(s)
- Yong Jin Heo
  - School of Biosystem and Biomedical Science, College of Health Science, Korea University, Seoul 02841, Korea
  - Department of Integrated Biomedical and Life Science, Korea University, Seoul 02841, Korea
- Chanwoong Hwa
  - School of Biosystem and Biomedical Science, College of Health Science, Korea University, Seoul 02841, Korea
- Gang-Hee Lee
  - School of Biosystem and Biomedical Science, College of Health Science, Korea University, Seoul 02841, Korea
- Jae-Min Park
  - School of Biosystem and Biomedical Science, College of Health Science, Korea University, Seoul 02841, Korea
- Joon-Yong An
  - School of Biosystem and Biomedical Science, College of Health Science, Korea University, Seoul 02841, Korea
  - Department of Integrated Biomedical and Life Science, Korea University, Seoul 02841, Korea
|
16
|
Ding J, Blencowe M, Nghiem T, Ha SM, Chen YW, Li G, Yang X. Mergeomics 2.0: a web server for multi-omics data integration to elucidate disease networks and predict therapeutics. Nucleic Acids Res 2021; 49:W375-W387. [PMID: 34048577 PMCID: PMC8262738 DOI: 10.1093/nar/gkab405] [Citation(s) in RCA: 37] [Impact Index Per Article: 12.3] [Received: 02/28/2021] [Revised: 04/28/2021] [Accepted: 05/02/2021] [Indexed: 12/13/2022] Open
Abstract
The Mergeomics web server is a flexible online tool for multi-omics data integration to derive biological pathways, networks, and key drivers important to disease pathogenesis and is based on the open source Mergeomics R package. The web server takes summary statistics of multi-omics disease association studies (GWAS, EWAS, TWAS, PWAS, etc.) as input and features four functions: Marker Dependency Filtering (MDF) to correct for known dependency between omics markers, Marker Set Enrichment Analysis (MSEA) to detect disease relevant biological processes, Meta-MSEA to examine the consistency of biological processes informed by various omics datasets, and Key Driver Analysis (KDA) to identify essential regulators of disease-associated pathways and networks. The web server has been extensively updated and streamlined in version 2.0 including an overhauled user interface, improved tutorials and results interpretation for each analytical step, inclusion of numerous disease GWAS, functional genomics datasets, and molecular networks to allow for comprehensive omics integrations, increased functionality to decrease user workload, and increased flexibility to cater to user-specific needs. Finally, we have incorporated our newly developed drug repositioning pipeline PharmOmics for prediction of potential drugs targeting disease processes that were identified by Mergeomics. Mergeomics is freely accessible at http://mergeomics.research.idre.ucla.edu and does not require login.
Affiliation(s)
- Jessica Ding
  - Department of Integrative Biology and Physiology, University of California, Los Angeles, 610 Charles E. Young Drive East, Los Angeles, CA 90095, USA
  - Interdepartmental Program of Molecular, Cellular and Integrative Physiology, University of California, Los Angeles, 610 Charles E. Young Drive East, Los Angeles, CA 90095, USA
- Montgomery Blencowe
  - Department of Integrative Biology and Physiology, University of California, Los Angeles, 610 Charles E. Young Drive East, Los Angeles, CA 90095, USA
  - Interdepartmental Program of Molecular, Cellular and Integrative Physiology, University of California, Los Angeles, 610 Charles E. Young Drive East, Los Angeles, CA 90095, USA
- Thien Nghiem
  - Department of Integrative Biology and Physiology, University of California, Los Angeles, 610 Charles E. Young Drive East, Los Angeles, CA 90095, USA
- Sung-min Ha
  - Department of Integrative Biology and Physiology, University of California, Los Angeles, 610 Charles E. Young Drive East, Los Angeles, CA 90095, USA
- Yen-Wei Chen
  - Department of Integrative Biology and Physiology, University of California, Los Angeles, 610 Charles E. Young Drive East, Los Angeles, CA 90095, USA
  - Interdepartmental Program of Molecular Toxicology, University of California, Los Angeles, 610 Charles E. Young Drive East, Los Angeles, CA 90095, USA
- Gaoyan Li
  - Department of Integrative Biology and Physiology, University of California, Los Angeles, 610 Charles E. Young Drive East, Los Angeles, CA 90095, USA
  - Interdepartmental Program of Molecular, Cellular and Integrative Physiology, University of California, Los Angeles, 610 Charles E. Young Drive East, Los Angeles, CA 90095, USA
- Xia Yang
  - Department of Integrative Biology and Physiology, University of California, Los Angeles, 610 Charles E. Young Drive East, Los Angeles, CA 90095, USA
  - Interdepartmental Program of Molecular, Cellular and Integrative Physiology, University of California, Los Angeles, 610 Charles E. Young Drive East, Los Angeles, CA 90095, USA
  - Interdepartmental Program of Molecular Toxicology, University of California, Los Angeles, 610 Charles E. Young Drive East, Los Angeles, CA 90095, USA
  - Interdepartmental Program of Bioinformatics, University of California, Los Angeles, 610 Charles E. Young Drive East, Los Angeles, CA 90095, USA
  - Institute for Quantitative and Computational Biosciences, University of California, Los Angeles, 610 Charles E. Young Drive East, Los Angeles, CA 90095, USA
|
17
|
Reel PS, Reel S, Pearson E, Trucco E, Jefferson E. Using machine learning approaches for multi-omics data analysis: A review. Biotechnol Adv 2021; 49:107739. [PMID: 33794304 DOI: 10.1016/j.biotechadv.2021.107739] [Citation(s) in RCA: 243] [Impact Index Per Article: 81.0] [Received: 12/15/2020] [Revised: 03/01/2021] [Accepted: 03/25/2021] [Indexed: 02/06/2023]
Abstract
With the development of modern high-throughput omic measurement platforms, it has become essential for biomedical studies to undertake an integrative (combined) approach to fully utilise these data to gain insights into biological systems. Data from various omics sources such as genetics, proteomics, and metabolomics can be integrated to unravel the intricate working of systems biology using machine learning-based predictive algorithms. Machine learning methods offer novel techniques to integrate and analyse the various omics data enabling the discovery of new biomarkers. These biomarkers have the potential to help in accurate disease prediction, patient stratification and delivery of precision medicine. This review paper explores different integrative machine learning methods which have been used to provide an in-depth understanding of biological systems during normal physiological functioning and in the presence of a disease. It provides insight and recommendations for interdisciplinary professionals who envisage employing machine learning skills in multi-omics studies.
Affiliation(s)
- Parminder S Reel
  - Division of Population Health and Genomics, School of Medicine, University of Dundee, Dundee, United Kingdom
- Smarti Reel
  - Division of Population Health and Genomics, School of Medicine, University of Dundee, Dundee, United Kingdom
- Ewan Pearson
  - Division of Population Health and Genomics, School of Medicine, University of Dundee, Dundee, United Kingdom
- Emanuele Trucco
  - VAMPIRE project, Computing, School of Science and Engineering, University of Dundee, Dundee, United Kingdom
- Emily Jefferson
  - Division of Population Health and Genomics, School of Medicine, University of Dundee, Dundee, United Kingdom
|
18
|
Tian J, Zhao J, Zheng C. Clustering of cancer data based on Stiefel manifold for multiple views. BMC Bioinformatics 2021; 22:268. [PMID: 34034643 PMCID: PMC8152349 DOI: 10.1186/s12859-021-04195-4] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 01/24/2021] [Accepted: 05/12/2021] [Indexed: 12/23/2022] Open
Abstract
Background: In recent years, various sequencing techniques have been used to collect biomedical omics datasets. It is usually possible to obtain multiple types of omics data from a single patient sample. Clustering of omics data plays an indispensable role in biological and medical research, and it is helpful to reveal data structures from multiple collections. Nevertheless, clustering of omics data consists of many challenges. The primary challenges in omics data analysis come from high dimension of data and small size of sample. Therefore, it is difficult to find a suitable integration method for structural analysis of multiple datasets.
Results: In this paper, a multi-view clustering based on Stiefel manifold method (MCSM) is proposed. The MCSM method comprises three core steps. Firstly, we established a binary optimization model for the simultaneous clustering problem. Secondly, we solved the optimization problem by linear search algorithm based on Stiefel manifold. Finally, we integrated the clustering results obtained from three omics by using k-nearest neighbor method. We applied this approach to four cancer datasets on TCGA. The result shows that our method is superior to several state-of-art methods, which depends on the hypothesis that the underlying omics cluster class is the same.
Conclusion: Particularly, our approach has better performance than compared approaches when the underlying clusters are inconsistent. For patients with different subtypes, both consistent and differential clusters can be identified at the same time.
Affiliation(s)
- Jing Tian
  - College of Mathematics and System Sciences, Xinjiang University, Urumqi, China
- Jianping Zhao
  - College of Mathematics and System Sciences, Xinjiang University, Urumqi, China
- Chunhou Zheng
  - College of Mathematics and System Sciences, Xinjiang University, Urumqi, China
  - School of Computer Science and Technology, Anhui University, Hefei, China
|
19
|
Vatansever S, Schlessinger A, Wacker D, Kaniskan HÜ, Jin J, Zhou M, Zhang B. Artificial intelligence and machine learning-aided drug discovery in central nervous system diseases: State-of-the-arts and future directions. Med Res Rev 2021; 41:1427-1473. [PMID: 33295676 PMCID: PMC8043990 DOI: 10.1002/med.21764] [Citation(s) in RCA: 95] [Impact Index Per Article: 31.7] [Received: 07/23/2020] [Revised: 10/30/2020] [Accepted: 11/20/2020] [Indexed: 01/11/2023]
Abstract
Neurological disorders significantly outnumber diseases in other therapeutic areas. However, developing drugs for central nervous system (CNS) disorders remains the most challenging area in drug discovery, accompanied with the long timelines and high attrition rates. With the rapid growth of biomedical data enabled by advanced experimental technologies, artificial intelligence (AI) and machine learning (ML) have emerged as an indispensable tool to draw meaningful insights and improve decision making in drug discovery. Thanks to the advancements in AI and ML algorithms, now the AI/ML-driven solutions have an unprecedented potential to accelerate the process of CNS drug discovery with better success rate. In this review, we comprehensively summarize AI/ML-powered pharmaceutical discovery efforts and their implementations in the CNS area. After introducing the AI/ML models as well as the conceptualization and data preparation, we outline the applications of AI/ML technologies to several key procedures in drug discovery, including target identification, compound screening, hit/lead generation and optimization, drug response and synergy prediction, de novo drug design, and drug repurposing. We review the current state-of-the-art of AI/ML-guided CNS drug discovery, focusing on blood-brain barrier permeability prediction and implementation into therapeutic discovery for neurological diseases. Finally, we discuss the major challenges and limitations of current approaches and possible future directions that may provide resolutions to these difficulties.
Affiliation(s)
- Sezen Vatansever
  - Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, New York, USA
  - Mount Sinai Center for Transformative Disease Modeling, Icahn School of Medicine at Mount Sinai, New York, New York, USA
  - Icahn Institute for Data Science and Genomic Technology, Icahn School of Medicine at Mount Sinai, New York, New York, USA
- Avner Schlessinger
  - Department of Pharmacological Sciences, Icahn School of Medicine at Mount Sinai, New York, New York, USA
  - Mount Sinai Center for Therapeutics Discovery, Icahn School of Medicine at Mount Sinai, New York, New York, USA
- Daniel Wacker
  - Department of Pharmacological Sciences, Icahn School of Medicine at Mount Sinai, New York, New York, USA
  - Mount Sinai Center for Therapeutics Discovery, Icahn School of Medicine at Mount Sinai, New York, New York, USA
  - Department of Neuroscience, Icahn School of Medicine at Mount Sinai, New York, New York, USA
- H. Ümit Kaniskan
  - Department of Pharmacological Sciences, Icahn School of Medicine at Mount Sinai, New York, New York, USA
  - Mount Sinai Center for Therapeutics Discovery, Icahn School of Medicine at Mount Sinai, New York, New York, USA
  - Department of Oncological Sciences, Tisch Cancer Institute, Icahn School of Medicine at Mount Sinai, New York, New York, USA
- Jian Jin
  - Department of Pharmacological Sciences, Icahn School of Medicine at Mount Sinai, New York, New York, USA
  - Mount Sinai Center for Therapeutics Discovery, Icahn School of Medicine at Mount Sinai, New York, New York, USA
  - Department of Oncological Sciences, Tisch Cancer Institute, Icahn School of Medicine at Mount Sinai, New York, New York, USA
- Ming-Ming Zhou
  - Department of Pharmacological Sciences, Icahn School of Medicine at Mount Sinai, New York, New York, USA
  - Department of Oncological Sciences, Tisch Cancer Institute, Icahn School of Medicine at Mount Sinai, New York, New York, USA
- Bin Zhang
  - Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, New York, USA
  - Mount Sinai Center for Transformative Disease Modeling, Icahn School of Medicine at Mount Sinai, New York, New York, USA
  - Icahn Institute for Data Science and Genomic Technology, Icahn School of Medicine at Mount Sinai, New York, New York, USA
  - Department of Pharmacological Sciences, Icahn School of Medicine at Mount Sinai, New York, New York, USA
|
20
|
Wang W, Zhang X, Dai DQ. DeFusion: a denoised network regularization framework for multi-omics integration. Brief Bioinform 2021; 22:6210063. [PMID: 33822879 DOI: 10.1093/bib/bbab057] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 10/31/2020] [Revised: 02/03/2021] [Accepted: 01/14/2020] [Indexed: 11/13/2022] Open
Abstract
With diverse types of omics data widely available, many computational methods have been recently developed to integrate these heterogeneous data, providing a comprehensive understanding of diseases and biological mechanisms. But most of them hardly take noise effects into account. Data-specific patterns unique to data types also make it challenging to uncover the consistent patterns and learn a compact representation of multi-omics data. Here we present a multi-omics integration method considering these issues. We explicitly model the error term in data reconstruction and simultaneously consider noise effects and data-specific patterns. We utilize a denoised network regularization in which we build a fused network using a denoising procedure to suppress noise effects and data-specific patterns. The error term collaborates with the denoised network regularization to capture data-specific patterns. We solve the optimization problem via an inexact alternating minimization algorithm. A comparative simulation study shows the method's superiority at discovering common patterns among data types at three noise levels. Transcriptomics-and-epigenomics integration, in seven cancer cohorts from The Cancer Genome Atlas, demonstrates that the learned integrative representation extracted in an unsupervised manner can depict survival information. Specially in liver hepatocellular carcinoma, the learned integrative representation attains average Harrell's C-index of 0.78 in 10 times 3-fold cross-validation for survival prediction, which far exceeds competing methods, and we discover an aggressive subtype in liver hepatocellular carcinoma with this latent representation, which is validated by an external dataset GSE14520. We also show that DeFusion is applicable to the integration of other omics types.
Affiliation(s)
- Weiwen Wang
  - Intelligent Data Center, School of Mathematics, Sun Yat-Sen University, Guangzhou, 510275, China
- Xiwen Zhang
  - Intelligent Data Center, School of Mathematics, Sun Yat-Sen University, Guangzhou, 510275, China
- Dao-Qing Dai
  - Intelligent Data Center, School of Mathematics, Sun Yat-Sen University, Guangzhou, 510275, China
|
21
|
Li Y, Ma L, Wu D, Chen G. Advances in bulk and single-cell multi-omics approaches for systems biology and precision medicine. Brief Bioinform 2021; 22:6189773. [PMID: 33778867 DOI: 10.1093/bib/bbab024] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Received: 07/11/2020] [Revised: 12/31/2020] [Accepted: 01/20/2021] [Indexed: 12/13/2022] Open
Abstract
Multi-omics allows the systematic understanding of the information flow across different omics layers, while single omics can mainly reflect one aspect of the biological system. The advancement of bulk and single-cell sequencing technologies and related computational methods for multi-omics has largely facilitated the development of systems biology and precision medicine. Single-cell approaches have the advantage of dissecting cellular dynamics and heterogeneity, whereas traditional bulk technologies are limited to individual/population-level investigation. In this review, we first summarize the technologies for producing bulk and single-cell multi-omics data. Then, we survey the computational approaches for integrative analysis of bulk and single-cell multimodal data, respectively. Moreover, the databases and data storage for multi-omics, as well as the tools for visualizing multimodal data, are summarized. We also outline the integration between bulk and single-cell data, and discuss the applications of multi-omics in precision medicine. Finally, we present the challenges and perspectives for multi-omics development.
Affiliation(s)
- Lu Ma
  - China Normal University, China
|
22
|
Jendoubi T. Approaches to Integrating Metabolomics and Multi-Omics Data: A Primer. Metabolites 2021; 11:184. [PMID: 33801081 PMCID: PMC8003953 DOI: 10.3390/metabo11030184] [Citation(s) in RCA: 36] [Impact Index Per Article: 12.0] [Received: 01/23/2021] [Revised: 03/17/2021] [Accepted: 03/18/2021] [Indexed: 12/14/2022] Open
Abstract
Metabolomics deals with multiple and complex chemical reactions within living organisms and how these are influenced by external or internal perturbations. It lies at the heart of omics profiling technologies not only as the underlying biochemical layer that reflects information expressed by the genome, the transcriptome and the proteome, but also as the closest layer to the phenome. The combination of metabolomics data with the information available from genomics, transcriptomics, and proteomics offers unprecedented possibilities to enhance current understanding of biological functions, elucidate their underlying mechanisms and uncover hidden associations between omics variables. As a result, a vast array of computational tools have been developed to assist with integrative analysis of metabolomics data with different omics. Here, we review and propose five criteria (hypothesis, data types, strategies, study design and study focus) to classify statistical multi-omics data integration approaches into state-of-the-art classes under which all existing statistical methods fall. The purpose of this review is to look at various aspects that lead the choice of the statistical integrative analysis pipeline in terms of the different classes. We will draw particular attention to metabolomics and genomics data to assist those new to this field in the choice of the integrative analysis pipeline.
Affiliation(s)
- Takoua Jendoubi
  - Department of Statistical Science, University College London, London WC1E 6BT, UK
|
23
|
Vlachavas EI, Bohn J, Ückert F, Nürnberg S. A Detailed Catalogue of Multi-Omics Methodologies for Identification of Putative Biomarkers and Causal Molecular Networks in Translational Cancer Research. Int J Mol Sci 2021; 22:2822. [PMID: 33802234 PMCID: PMC8000236 DOI: 10.3390/ijms22062822] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 01/31/2021] [Revised: 03/05/2021] [Accepted: 03/05/2021] [Indexed: 02/06/2023] Open
Abstract
Recent advances in sequencing and biotechnological methodologies have led to the generation of large volumes of molecular data of different omics layers, such as genomics, transcriptomics, proteomics and metabolomics. Integration of these data with clinical information provides new opportunities to discover how perturbations in biological processes lead to disease. Using data-driven approaches for the integration and interpretation of multi-omics data could stably identify links between structural and functional information and propose causal molecular networks with potential impact on cancer pathophysiology. This knowledge can then be used to improve disease diagnosis, prognosis, prevention, and therapy. This review will summarize and categorize the most current computational methodologies and tools for integration of distinct molecular layers in the context of translational cancer research and personalized therapy. Additionally, the bioinformatics tools Multi-Omics Factor Analysis (MOFA) and netDX will be tested using omics data from public cancer resources, to assess their overall robustness, provide reproducible workflows for gaining biological knowledge from multi-omics data, and to comprehensively understand the significantly perturbed biological entities in distinct cancer types. We show that the performed supervised and unsupervised analyses result in meaningful and novel findings.
Affiliation(s)
- Efstathios Iason Vlachavas
  - Medical Informatics for Translational Oncology, German Cancer Research Center (DKFZ), 69120 Heidelberg, Germany
- Jonas Bohn
  - Medical Informatics for Translational Oncology, German Cancer Research Center (DKFZ), 69120 Heidelberg, Germany
- Frank Ückert
  - Medical Informatics for Translational Oncology, German Cancer Research Center (DKFZ), 69120 Heidelberg, Germany
  - Applied Medical Informatics, University Hospital Hamburg-Eppendorf, 20251 Hamburg, Germany
- Sylvia Nürnberg
  - Medical Informatics for Translational Oncology, German Cancer Research Center (DKFZ), 69120 Heidelberg, Germany
  - Applied Medical Informatics, University Hospital Hamburg-Eppendorf, 20251 Hamburg, Germany
|
24
|
Qin G, Liu Z, Xie L. Multiple Omics Data Integration. SYSTEMS MEDICINE 2021. [DOI: 10.1016/b978-0-12-801238-3.11508-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/25/2022] Open
|
25
|
Kondylakis H, Axenie C, Kiran Bastola D, Katehakis DG, Kouroubali A, Kurz D, Larburu N, Macía I, Maguire R, Maramis C, Marias K, Morrow P, Muro N, Núñez-Benjumea FJ, Rampun A, Rivera-Romero O, Scotney B, Signorelli G, Wang H, Tsiknakis M, Zwiggelaar R. Status and Recommendations of Technological and Data-Driven Innovations in Cancer Care: Focus Group Study. J Med Internet Res 2020; 22:e22034. [PMID: 33320099 PMCID: PMC7772066 DOI: 10.2196/22034] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 07/01/2020] [Revised: 10/02/2020] [Accepted: 10/26/2020] [Indexed: 12/16/2022] Open
Abstract
BACKGROUND The status of the data-driven management of cancer care as well as the challenges, opportunities, and recommendations aimed at accelerating the rate of progress in this field are topics of great interest. Two international workshops, one conducted in June 2019 in Cordoba, Spain, and one in October 2019 in Athens, Greece, were organized by four Horizon 2020 (H2020) European Union (EU)-funded projects: BOUNCE, CATCH ITN, DESIREE, and MyPal. The issues covered included patient engagement, knowledge and data-driven decision support systems, patient journey, rehabilitation, personalized diagnosis, trust, assessment of guidelines, and interoperability of information and communication technology (ICT) platforms. A series of recommendations was provided as the complex landscape of data-driven technical innovation in cancer care was portrayed. OBJECTIVE This study aims to provide information on the current state of the art of technology and data-driven innovations for the management of cancer care through the work of four EU H2020-funded projects. METHODS Two international workshops on ICT in the management of cancer care were held, and several topics were identified through discussion among the participants. A focus group was formulated after the second workshop, in which the status of technological and data-driven cancer management as well as the challenges, opportunities, and recommendations in this area were collected and analyzed. RESULTS Technical and data-driven innovations provide promising tools for the management of cancer care. However, several challenges must be successfully addressed, such as patient engagement, interoperability of ICT-based systems, knowledge management, and trust. This paper analyzes these challenges, which can be opportunities for further research and practical implementation and can provide practical recommendations for future work. 
CONCLUSIONS Technology and data-driven innovations are becoming an integral part of cancer care management. In this process, specific challenges need to be addressed, such as increasing trust and engaging the whole stakeholder ecosystem, to fully benefit from these innovations.
Affiliation(s)
| | - Cristian Axenie
- Audi Konfuzius-Institut Ingolstadt Lab, Technische Hochschule Ingolstadt, Ingolstadt, Germany
| | - Dhundy Kiran Bastola
- School of Interdisciplinary Informatics, University of Nebraska, Omaha, NE, United States
| | | | | | - Daria Kurz
- Interdisziplinäres Brustzentrum, Helios Klinikum München West, Munich, Germany
| | - Nekane Larburu
- Vicomtech, Health Research Institute, San Sebastian, Spain
| | - Iván Macía
- Vicomtech, Health Research Institute, San Sebastian, Spain
| | - Roma Maguire
- University of Strathclyde, Glasgow, United Kingdom
| | - Christos Maramis
- eHealth Lab, Institute of Applied Biosciences - Centre for Research & Technology Hellas, Thessaloniki, Greece
| | | | - Philip Morrow
- School of Computing, Ulster University, Newtownabbey, United Kingdom
| | - Naiara Muro
- Vicomtech, Health Research Institute, San Sebastian, Spain
| | | | - Andrik Rampun
- Academic Unit of Radiology, Department of Infection, Immunity and Cardiovascular Disease, University of Sheffield, Sheffield, United Kingdom
| | | | - Bryan Scotney
- School of Computing, Ulster University, Newtownabbey, United Kingdom
| | | | - Hui Wang
- School of Computing and Engineering, University of West London, London, United Kingdom
| | | | - Reyer Zwiggelaar
- Department of Computer Science, Aberystwyth University, Aberystwyth, United Kingdom
| |
|
26
|
Labory J, Fierville M, Ait-El-Mkadem S, Bannwarth S, Paquis-Flucklinger V, Bottini S. Multi-Omics Approaches to Improve Mitochondrial Disease Diagnosis: Challenges, Advances, and Perspectives. Front Mol Biosci 2020; 7:590842. [PMID: 33240932 PMCID: PMC7667268 DOI: 10.3389/fmolb.2020.590842] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 08/03/2020] [Accepted: 10/14/2020] [Indexed: 01/06/2023]
Abstract
Mitochondrial diseases (MD) are rare disorders caused by deficiency of the mitochondrial respiratory chain, which provides energy in each cell. They are characterized by high clinical and genetic heterogeneity, and in most patients the responsible gene is unknown. Diagnosis is based on the identification of the causative gene, which allows genetic counseling, prenatal diagnosis, understanding of pathological mechanisms, and personalized therapeutic approaches. Despite the emergence of Next Generation Sequencing (NGS), more than one out of two patients still has no diagnosis because the responsible gene has not been identified. Technologies currently used for detecting causal variants (genetic alterations) are far from complete: they leave many variants of unknown significance (VUS) and rely mainly on whole exome sequencing, thus neglecting non-coding variants. The complexity of the human genome and its regulation at multiple levels has led biologists to develop several assays to interrogate the different aspects of biological processes. While single-omics investigation offers a peek at this complex system, the combination of different omics data allows the discovery of coherent signatures. To integrate data from different omics, computational biologists and bioinformaticians have developed several approaches and tools. However, it is difficult to know which is best suited to predicting diverse phenotypic outcomes. First attempts to use multi-omics approaches showed an improvement in diagnostic power. However, we are far from a complete understanding of MD and their diagnosis. After reviewing multi-omics algorithms developed in recent years, we propose a novel data-driven classification and discuss how multi-omics will change and improve the diagnosis of MD.
Given the growing use of multi-omics approaches in MD, we foresee that this work will help establish good practices for multi-omics data integration, improving the prediction of phenotypic outcomes and the diagnostic power for MD.
Affiliation(s)
- Justine Labory
- Université Côte d'Azur, Center of Modeling, Simulation and Interactions, Nice, France
| | - Morgane Fierville
- Université Côte d'Azur, Center of Modeling, Simulation and Interactions, Nice, France
| | - Samira Ait-El-Mkadem
- Université Côte d'Azur, Inserm U1081, CNRS UMR 7284, Institute for Research on Cancer and Aging, Nice (IRCAN), Centre hospitalier universitaire (CHU) de Nice, Nice, France
| | - Sylvie Bannwarth
- Université Côte d'Azur, Inserm U1081, CNRS UMR 7284, Institute for Research on Cancer and Aging, Nice (IRCAN), Centre hospitalier universitaire (CHU) de Nice, Nice, France
| | - Véronique Paquis-Flucklinger
- Université Côte d'Azur, Center of Modeling, Simulation and Interactions, Nice, France; Université Côte d'Azur, Inserm U1081, CNRS UMR 7284, Institute for Research on Cancer and Aging, Nice (IRCAN), Centre hospitalier universitaire (CHU) de Nice, Nice, France
| | - Silvia Bottini
- Université Côte d'Azur, Center of Modeling, Simulation and Interactions, Nice, France
| |
|
27
|
Abstract
In this chapter we discuss the past, present and future of clinical biomarker development. We explore the advent of new technologies, which is changing the way in which health, medicine and disease are understood. This review includes the identification of physicochemical assays, current regulations, the development and reproducibility of clinical trials, as well as the revolution of omics technologies and state-of-the-art integration and analysis approaches.
|
28
|
Rappoport N, Safra R, Shamir R. MONET: Multi-omic module discovery by omic selection. PLoS Comput Biol 2020; 16:e1008182. [PMID: 32931516 PMCID: PMC7518594 DOI: 10.1371/journal.pcbi.1008182] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 04/28/2020] [Revised: 09/25/2020] [Accepted: 07/22/2020] [Indexed: 01/19/2023]
Abstract
Recent advances in experimental biology allow creation of datasets where several genome-wide data types (called omics) are measured per sample. Integrative analysis of multi-omic datasets in general, and clustering of samples in such datasets specifically, can improve our understanding of biological processes and discover different disease subtypes. In this work we present MONET (Multi Omic clustering by Non-Exhaustive Types), which takes a unique approach to multi-omic clustering. MONET discovers modules of similar samples, such that each module is allowed to have a clustering structure for only a subset of the omics. This approach differs from most existing multi-omic clustering algorithms, which assume a common structure across all omics, and from several recent algorithms that model distinct cluster structures. We tested MONET extensively on simulated data, on an image dataset, and on ten multi-omic cancer datasets from TCGA. Our analysis shows that MONET compares favorably with other multi-omic clustering methods. We demonstrate MONET's biological and clinical relevance by analyzing its results for Ovarian Serous Cystadenocarcinoma. We also show that MONET is robust to missing data, can cluster genes in multi-omic datasets, and can reveal modules of cell types in single-cell multi-omic data. Our work shows that MONET is a valuable tool that can provide complementary results to those provided by existing algorithms for multi-omic analysis.
Affiliation(s)
- Nimrod Rappoport
- The Blavatnik School of Computer Science, Tel Aviv University, Tel Aviv, Israel
| | - Roy Safra
- The Blavatnik School of Computer Science, Tel Aviv University, Tel Aviv, Israel
| | - Ron Shamir
- The Blavatnik School of Computer Science, Tel Aviv University, Tel Aviv, Israel
| |
|
29
|
Wu MJ, Gao YL, Liu JX, Zheng CH, Wang J. Integrative Hypergraph Regularization Principal Component Analysis for Sample Clustering and Co-Expression Genes Network Analysis on Multi-Omics Data. IEEE J Biomed Health Inform 2020; 24:1823-1834. [DOI: 10.1109/jbhi.2019.2948456] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Indexed: 11/08/2022]
|
30
|
Ding H, Sharpnack M, Wang C, Huang K, Machiraju R. Integrative cancer patient stratification via subspace merging. Bioinformatics 2020; 35:1653-1659. [PMID: 30329022 DOI: 10.1093/bioinformatics/bty866] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Received: 11/18/2017] [Revised: 06/09/2018] [Accepted: 10/15/2018] [Indexed: 11/15/2022]
Abstract
MOTIVATION Technologies that generate high-throughput omics data are flourishing, creating enormous, publicly available repositories of multi-omics data. As many data repositories continue to grow, there is an urgent need for computational methods that can leverage these data to create comprehensive clusters of patients with a given disease. RESULTS Our proposed approach creates a patient-to-patient similarity graph for each data type as an intermediate representation of each omics data type and merges the graphs through subspace analysis on a Grassmann manifold. We hypothesize that this approach generates more informative clusters by preserving the complementary information from each level of omics data. We applied our approach to The Cancer Genome Atlas (TCGA) breast cancer dataset and show that by integrating gene expression, microRNA and DNA methylation data, our proposed method can produce clinically useful subtypes of breast cancer. We then investigate the molecular characteristics underlying these subtypes. We discover a highly expressed cluster of genes on chromosome 19p13 that strongly correlates with survival in TCGA breast cancer patients and validate these results in three additional breast cancer datasets. We also compare our approach with previous integrative clustering approaches and obtain comparable or superior results. AVAILABILITY AND IMPLEMENTATION https://github.com/michaelsharpnack/GrassmannCluster. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Hao Ding
- Department of Computer Science and Engineering, The Ohio State University, Columbus, OH, USA
| | - Michael Sharpnack
- Department of Biomedical Informatics, The Ohio State University, Columbus, OH, USA
| | - Chao Wang
- Department of Computer Science and Engineering, The Ohio State University, Columbus, OH, USA; Department of Biomedical Informatics, The Ohio State University, Columbus, OH, USA
| | - Kun Huang
- Department of Computer Science and Engineering, The Ohio State University, Columbus, OH, USA; Department of Biomedical Informatics, The Ohio State University, Columbus, OH, USA
| | - Raghu Machiraju
- Department of Computer Science and Engineering, The Ohio State University, Columbus, OH, USA; Department of Biomedical Informatics, The Ohio State University, Columbus, OH, USA
| |
|
31
|
Lemsara A, Ouadfel S, Fröhlich H. PathME: pathway based multi-modal sparse autoencoders for clustering of patient-level multi-omics data. BMC Bioinformatics 2020; 21:146. [PMID: 32299344 PMCID: PMC7161108 DOI: 10.1186/s12859-020-3465-2] [Citation(s) in RCA: 28] [Impact Index Per Article: 7.0] [Received: 11/13/2019] [Accepted: 03/23/2020] [Indexed: 02/08/2023]
Abstract
Background Recent years have witnessed an increasing interest in multi-omics data, because these data allow for better understanding complex diseases such as cancer on a molecular system level. In addition, multi-omics data increase the chance to robustly identify molecular patient sub-groups and hence open the door towards a better personalized treatment of diseases. Several methods have been proposed for unsupervised clustering of multi-omics data. However, a number of challenges remain, such as the magnitude of features and the large difference in dimensionality across different omics data sources. Results We propose a multi-modal sparse denoising autoencoder framework coupled with sparse non-negative matrix factorization to robustly cluster patients based on multi-omics data. The proposed model specifically leverages pathway information to effectively reduce the dimensionality of omics data into a pathway and patient specific score profile. In consequence, our method allows us to understand, which pathway is a feature of which particular patient cluster. Moreover, recently proposed machine learning techniques allow us to disentangle the specific impact of each individual omics feature on a pathway score. We applied our method to cluster patients in several cancer datasets using gene expression, miRNA expression, DNA methylation and CNVs, demonstrating the possibility to obtain biologically plausible disease subtypes characterized by specific molecular features. Comparison against several competing methods showed a competitive clustering performance. In addition, post-hoc analysis of somatic mutations and clinical data provided supporting evidence and interpretation of the identified clusters. Conclusions Our suggested multi-modal sparse denoising autoencoder approach allows for an effective and interpretable integration of multi-omics data on pathway level while addressing the high dimensional character of omics data. 
Patient specific pathway score profiles derived from our model allow for a robust identification of disease subgroups.
Affiliation(s)
- Amina Lemsara
- Computer Science Department, University of Constantine 2, 25016, Constantine, Algeria
| | - Salima Ouadfel
- Computer Science Department, University of Constantine 2, 25016, Constantine, Algeria
| | - Holger Fröhlich
- Bonn-Aachen International Center for IT, University of Bonn, 53115 Bonn, Germany; Fraunhofer Institute for Algorithms and Scientific Computing (SCAI), 53754 Sankt Augustin, Germany
| |
|
32
|
Oh M, Park S, Kim S, Chae H. Machine learning-based analysis of multi-omics data on the cloud for investigating gene regulations. Brief Bioinform 2020; 22:66-76. [PMID: 32227074 DOI: 10.1093/bib/bbaa032] [Citation(s) in RCA: 18] [Impact Index Per Article: 4.5] [Received: 11/21/2019] [Revised: 02/05/2020] [Accepted: 02/25/2020] [Indexed: 02/06/2023]
Abstract
Gene expression is subtly regulated by factors that can be measured quantitatively, such as interactions with other genes, methylation, mutations, transcription factors and histone modifications. Integrative analysis of multi-omics data can help scientists understand condition- or patient-specific gene regulation mechanisms. However, analysis of multi-omics data is challenging since it requires not only the analysis of multiple omics data sets but also mining complex relations among different genetic molecules by using state-of-the-art machine learning methods. In addition, analysis of multi-omics data requires substantial computing infrastructure. Moreover, interpretation of the analysis results requires collaboration among many scientists, often requiring the analysis to be repeated from different perspectives. Many of the aforementioned technical issues can be handled well when machine learning tools are deployed on the cloud. In this survey article, we first survey machine learning methods that can be used for gene regulation studies, and we categorize them according to five different goals: gene regulatory subnetwork discovery, disease subtype analysis, survival analysis, clinical prediction and visualization. We also summarize the methods in terms of multi-omics input types. Then, we explain why the cloud is potentially a good solution for the analysis of multi-omics data, followed by a survey of two state-of-the-art cloud systems, Galaxy and BioVLAB. Finally, we discuss important issues that arise when the cloud is used for the analysis of multi-omics data in gene regulation studies.
Affiliation(s)
- Minsik Oh
- Department of Computer Science and Engineering, Seoul National University, Seoul, 08826, Korea
| | - Sungjoon Park
- Department of Computer Science and Engineering, Seoul National University, Seoul, 08826, Korea
| | - Sun Kim
- Department of Computer Science and Engineering, Seoul National University, Seoul, 08826, Korea; Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, 08826, Korea; Bioinformatics Institute, Seoul National University, Seoul, 08826, Korea
| | - Heejoon Chae
- Division of Computer Science, Sookmyung Women's University, Seoul, 04310, Korea
| |
|
33
|
Mallik S, Zhao Z. Graph- and rule-based learning algorithms: a comprehensive review of their applications for cancer type classification and prognosis using genomic data. Brief Bioinform 2020; 21:368-394. [PMID: 30649169 PMCID: PMC7373185 DOI: 10.1093/bib/bby120] [Citation(s) in RCA: 23] [Impact Index Per Article: 5.8] [Received: 07/27/2018] [Revised: 10/26/2018] [Accepted: 11/21/2018] [Indexed: 12/20/2022]
Abstract
Cancer is well recognized as a complex disease with dysregulated molecular networks or modules. Graph- and rule-based analytics have been applied extensively for cancer classification as well as prognosis using large genomic and other data over the past decade. This article provides a comprehensive review of various graph- and rule-based machine learning algorithms that have been applied to numerous genomics data to determine the cancer-specific gene modules, identify gene signature-based classifiers and carry out other related objectives of potential therapeutic value. This review focuses mainly on the methodological design and features of these algorithms to facilitate the application of these graph- and rule-based analytical approaches for cancer classification and prognosis. Based on the type of data integration, we divided all the algorithms into three categories: model-based integration, pre-processing integration and post-processing integration. Each category is further divided into four sub-categories (supervised, unsupervised, semi-supervised and survival-driven learning analyses) based on learning style. Therefore, a total of 11 categories of methods are summarized with their inputs, objectives and description, advantages and potential limitations. Next, we briefly demonstrate well-known and most recently developed algorithms for each sub-category along with salient information, such as data profiles, statistical or feature selection methods and outputs. Finally, we summarize the appropriate use and efficiency of all categories of graph- and rule mining-based learning methods when input data and specific objective are given. This review aims to help readers to select and use the appropriate algorithms for cancer classification and prognosis study.
Affiliation(s)
- Saurav Mallik
- Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center, Houston
| | - Zhongming Zhao
- Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center, Houston
| |
|
34
|
Baldwin E, Han J, Luo W, Zhou J, An L, Liu J, Zhang HH, Li H. On fusion methods for knowledge discovery from multi-omics datasets. Comput Struct Biotechnol J 2020; 18:509-517. [PMID: 32206210 PMCID: PMC7078495 DOI: 10.1016/j.csbj.2020.02.011] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Received: 10/01/2019] [Revised: 01/25/2020] [Accepted: 02/19/2020] [Indexed: 12/22/2022]
Abstract
Recent years have witnessed the tendency of measuring a biological sample on multiple omics scales for a comprehensive understanding of how biological activities on varying levels are perturbed by genetic variants, environments, and their interactions. This new trend raises substantial challenges to data integration and fusion, of which the latter is a specific type of integration that applies a uniform method in a scalable manner, to solve biological problems which the multi-omics measurements target. Fusion-based analysis has advanced rapidly in the past decade, thanks to application drivers and theoretical breakthroughs in mathematics, statistics, and computer science. We will briefly address these methods from methodological and mathematical perspectives and categorize them into three types of approaches: data fusion (a narrowed definition as compared to the general data fusion concept), model fusion, and mixed fusion. We will demonstrate at least one typical example in each specific category to exemplify the characteristics, principles, and applications of the methods in general, as well as discuss the gaps and potential issues for future studies.
Affiliation(s)
- Edwin Baldwin
- Department of Biosystems Engineering, University of Arizona, United States
| | - Jiali Han
- Department of Systems and Industrial Engineering, University of Arizona, United States
| | - Wenting Luo
- Department of Biosystems Engineering, University of Arizona, United States
| | - Jin Zhou
- Department of Epidemiology and Biostatistics, University of Arizona, United States
| | - Lingling An
- Department of Biosystems Engineering, University of Arizona, United States; Department of Epidemiology and Biostatistics, University of Arizona, United States
| | - Jian Liu
- Department of Systems and Industrial Engineering, University of Arizona, United States
| | - Hao Helen Zhang
- Department of Mathematics, University of Arizona, United States
| | - Haiquan Li
- Department of Biosystems Engineering, University of Arizona, United States
| |
|
35
|
Subramanian I, Verma S, Kumar S, Jere A, Anamika K. Multi-omics Data Integration, Interpretation, and Its Application. Bioinform Biol Insights 2020; 14:1177932219899051. [PMID: 32076369 PMCID: PMC7003173 DOI: 10.1177/1177932219899051] [Citation(s) in RCA: 520] [Impact Index Per Article: 130.0] [Received: 11/07/2019] [Accepted: 11/09/2019] [Indexed: 12/22/2022]
Abstract
To study complex biological processes holistically, it is imperative to take an integrative approach that combines multi-omics data to highlight the interrelationships of the involved biomolecules and their functions. With the advent of high-throughput techniques and availability of multi-omics data generated from a large set of samples, several promising tools and methods have been developed for data integration and interpretation. In this review, we collected the tools and methods that adopt integrative approach to analyze multiple omics data and summarized their ability to address applications such as disease subtyping, biomarker prediction, and deriving insights into the data. We provide the methodology, use-cases, and limitations of these tools; brief account of multi-omics data repositories and visualization portals; and challenges associated with multi-omics data integration.
Affiliation(s)
| | | | | | - Abhay Jere
- Innovation Cell, Ministry of Human Resource Development, New Delhi, India
| | | |
|
36
|
Simidjievski N, Bodnar C, Tariq I, Scherer P, Andres Terre H, Shams Z, Jamnik M, Liò P. Variational Autoencoders for Cancer Data Integration: Design Principles and Computational Practice. Front Genet 2019; 10:1205. [PMID: 31921281 PMCID: PMC6917668 DOI: 10.3389/fgene.2019.01205] [Citation(s) in RCA: 50] [Impact Index Per Article: 10.0] [Received: 07/29/2019] [Accepted: 10/31/2019] [Indexed: 12/27/2022]
Abstract
International initiatives such as the Molecular Taxonomy of Breast Cancer International Consortium are collecting multiple data sets at different genome-scales with the aim to identify novel cancer bio-markers and predict patient survival. To analyze such data, several machine learning, bioinformatics, and statistical methods have been applied, among them neural networks such as autoencoders. Although these models provide a good statistical learning framework to analyze multi-omic and/or clinical data, there is a distinct lack of work on how to integrate diverse patient data and identify the optimal design best suited to the available data. In this paper, we investigate several autoencoder architectures that integrate a variety of cancer patient data types (e.g., multi-omics and clinical data). We perform extensive analyses of these approaches and provide a clear methodological and computational framework for designing systems that enable clinicians to investigate cancer traits and translate the results into clinical applications. We demonstrate how these networks can be designed, built, and, in particular, applied to tasks of integrative analyses of heterogeneous breast cancer data. The results show that these approaches yield relevant data representations that, in turn, lead to accurate and stable diagnosis.
Affiliation(s)
- Nikola Simidjievski
- Department of Computer Science and Technology, University of Cambridge, Cambridge, United Kingdom
| | - Cristian Bodnar
- Department of Computer Science and Technology, University of Cambridge, Cambridge, United Kingdom
| | - Ifrah Tariq
- Department of Computer Science and Technology, University of Cambridge, Cambridge, United Kingdom; Department of Biological Engineering, Massachusetts Institute of Technology, Cambridge, MA, United States
| | - Paul Scherer
- Department of Computer Science and Technology, University of Cambridge, Cambridge, United Kingdom
| | - Helena Andres Terre
- Department of Computer Science and Technology, University of Cambridge, Cambridge, United Kingdom
| | - Zohreh Shams
- Department of Computer Science and Technology, University of Cambridge, Cambridge, United Kingdom
| | - Mateja Jamnik
- Department of Computer Science and Technology, University of Cambridge, Cambridge, United Kingdom
| | - Pietro Liò
- Department of Computer Science and Technology, University of Cambridge, Cambridge, United Kingdom
| |
|
37
|
Duan R, Gao L, Xu H, Song K, Hu Y, Wang H, Dong Y, Zhang C, Jia S. CEPICS: A Comparison and Evaluation Platform for Integration Methods in Cancer Subtyping. Front Genet 2019; 10:966. [PMID: 31649733 PMCID: PMC6792302 DOI: 10.3389/fgene.2019.00966] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 07/27/2019] [Accepted: 09/10/2019] [Indexed: 11/17/2022]
Abstract
Cancer subtypes can improve our understanding of cancer, and suggest more precise treatment for patients. Multi-omics molecular data can characterize cancers at different levels. Up to now, many computational methods that integrate multi-omics data for cancer subtyping have been proposed. However, there are no consistent criteria to evaluate the integration methods due to the lack of gold standards (e.g., the number of subtypes in a specific cancer). Since comprehensive evaluation and comparison between different methods serves as a useful tool or guideline for users to select an optimal method for their own purpose, we develop a scalable platform, CEPICS, for comprehensively evaluating and comparing multi-omics data integration methods in cancer subtyping. Given a user-specified maximum number of subtypes, k-max, CEPICS provides (1) cancer subtyping results using up to five built-in state-of-the-art integration methods under the number of subtypes from two to k-max, (2) a report including the evaluation of each user-selected method and comparisons across them using clustering performance metrics and clinical survival analysis, and (3) an overall analysis of subtyping results by different methods representing a robust cancer subtype prediction for samples. Furthermore, users can upload subtyping results of their own methods to compare with the built-in methods. CEPICS is implemented as an R package and is freely available at https://github.com/GaoLabXDU/CEPICS.
Affiliation(s)
- Ran Duan
- School of Computer Science and Technology, Xidian University, Xi'an, China
| | - Lin Gao
- School of Computer Science and Technology, Xidian University, Xi'an, China
| | - Han Xu
- School of Computer Science and Technology, Xidian University, Xi'an, China
| | - Kuo Song
- School of Computer Science and Technology, Xidian University, Xi'an, China
| | - Yuxuan Hu
- School of Computer Science and Technology, Xidian University, Xi'an, China
| | - Hongda Wang
- School of Computer Science and Technology, Xidian University, Xi'an, China
| | - Yongqiang Dong
- School of Computer Science and Technology, Xidian University, Xi'an, China
| | - Chenxing Zhang
- School of Computer Science and Technology, Xidian University, Xi'an, China
| | - Songwei Jia
- School of Computer Science and Technology, Xidian University, Xi'an, China
| |
|
38
|
Zitnik M, Nguyen F, Wang B, Leskovec J, Goldenberg A, Hoffman MM. Machine Learning for Integrating Data in Biology and Medicine: Principles, Practice, and Opportunities. Information Fusion 2019; 50:71-91. [PMID: 30467459 PMCID: PMC6242341 DOI: 10.1016/j.inffus.2018.09.012] [Citation(s) in RCA: 215] [Impact Index Per Article: 43.0] [Indexed: 05/10/2023]
Abstract
New technologies have enabled the investigation of biology and human health at an unprecedented scale and in multiple dimensions. These dimensions include myriad properties describing genome, epigenome, transcriptome, microbiome, phenotype, and lifestyle. No single data type, however, can capture the complexity of all the factors relevant to understanding a phenomenon such as a disease. Integrative methods that combine data from multiple technologies have thus emerged as critical statistical and computational approaches. The key challenge in developing such approaches is the identification of effective models to provide a comprehensive and relevant systems view. An ideal method can answer a biological or medical question, identifying important features and predicting outcomes, by harnessing heterogeneous data across several dimensions of biological variation. In this Review, we describe the principles of data integration and discuss current methods and available implementations. We provide examples of successful data integration in biology and medicine. Finally, we discuss current challenges in biomedical integrative methods and our perspective on the future development of the field.
Affiliation(s)
- Marinka Zitnik
- Department of Computer Science, Stanford University, Stanford, CA, USA
| | - Francis Nguyen
- Department of Medical Biophysics, University of Toronto, Toronto, ON, Canada
- Princess Margaret Cancer Centre, Toronto, ON, Canada
| | - Bo Wang
- Hikvision Research Institute, Santa Clara, CA, USA
| | - Jure Leskovec
- Department of Computer Science, Stanford University, Stanford, CA, USA
- Chan Zuckerberg Biohub, San Francisco, CA, USA
| | - Anna Goldenberg
- Genetics & Genome Biology, SickKids Research Institute, Toronto, ON, Canada
- Department of Computer Science, University of Toronto, Toronto, ON, Canada
- Vector Institute, Toronto, ON, Canada
| | - Michael M. Hoffman
- Department of Medical Biophysics, University of Toronto, Toronto, ON, Canada
- Princess Margaret Cancer Centre, Toronto, ON, Canada
- Department of Computer Science, University of Toronto, Toronto, ON, Canada
- Vector Institute, Toronto, ON, Canada
| |
|
39
|
Wani N, Raza K. Integrative approaches to reconstruct regulatory networks from multi-omics data: A review of state-of-the-art methods. Comput Biol Chem 2019; 83:107120. [PMID: 31499298 DOI: 10.1016/j.compbiolchem.2019.107120] [Citation(s) in RCA: 28] [Impact Index Per Article: 5.6] [Received: 08/03/2018] [Revised: 02/22/2019] [Accepted: 08/27/2019] [Indexed: 02/06/2023]
Abstract
Data generation using high throughput technologies has led to the accumulation of diverse types of molecular data. These data have different types (discrete, real, string, etc.) and occur in various formats and sizes. Datasets including gene expression, miRNA expression, protein-DNA binding data (ChIP-Seq/ChIP-chip), mutation data (copy number variation, single nucleotide polymorphisms), annotations, interactions, and association data are some of the commonly used biological datasets to study various cellular mechanisms of living organisms. Each of them provides a unique, complementary and partly independent view of the genome and hence embeds essential information about the regulatory mechanisms of genes and their products. Therefore, integrating these data and inferring regulatory interactions from them offers a system-level biological insight for predicting gene functions and their phenotypic outcomes. To study genome functionality through regulatory networks, different methods have been proposed for collective mining of information from an integrated dataset. We survey here integration methods that reconstruct regulatory networks using state-of-the-art techniques to handle multi-omics (i.e., genomic, transcriptomic, proteomic) and other biological datasets.
Affiliation(s)
- Nisar Wani
- Govt. Degree College Baramulla, J & K, India; Department of Computer Science, Jamia Millia Islamia, New Delhi, India
| | - Khalid Raza
- Department of Computer Science, Jamia Millia Islamia, New Delhi, India.
| |
|
40
|
Abstract
Autoimmune rheumatic diseases pose many problems that have, in general, already been solved in the field of cancer. The heterogeneity of each disease, the clinical similarities and differences between different autoimmune rheumatic diseases and the large number of patients that remain without a diagnosis underline the need to reclassify these diseases via new approaches. Knowledge about the molecular basis of systemic autoimmune diseases, along with the availability of bioinformatics tools capable of handling and integrating large volumes of various types of molecular data at once, offer the possibility of reclassifying these diseases. A new taxonomy could lead to the discovery of new biomarkers for patient stratification and prognosis. Most importantly, this taxonomy might enable important changes in clinical trial design to reach the expected outcomes or the design of molecularly targeted therapies. In this Review, we discuss the basis for a new molecular taxonomy for autoimmune rheumatic diseases. We highlight the evidence surrounding the idea that these diseases share molecular features related to their pathogenesis and development and discuss previous attempts to classify these diseases. We evaluate the tools available to analyse and combine different types of molecular data. Finally, we introduce PRECISESADS, a project aimed at reclassifying the systemic autoimmune diseases.
|
41
|
Koh HWL, Fermin D, Vogel C, Choi KP, Ewing RM, Choi H. iOmicsPASS: network-based integration of multiomics data for predictive subnetwork discovery. NPJ Syst Biol Appl 2019; 5:22. [PMID: 31312515 PMCID: PMC6616462 DOI: 10.1038/s41540-019-0099-y] [Citation(s) in RCA: 64] [Impact Index Per Article: 12.8] [Received: 11/01/2018] [Accepted: 06/14/2019] [Indexed: 12/15/2022] Open
Abstract
Computational tools for multiomics data integration have usually been designed for unsupervised detection of multiomics features explaining large phenotypic variations. To achieve this, some approaches extract latent signals in heterogeneous data sets from a joint statistical error model, while others use biological networks to propagate differential expression signals and find consensus signatures. However, few approaches directly consider molecular interaction as a data feature, the essential linker between different omics data sets. The increasing availability of genome-scale interactome data connecting different molecular levels motivates a new class of methods to extract interactive signals from multiomics data. Here we developed iOmicsPASS, a tool to search for predictive subnetworks consisting of molecular interactions within and between related omics data types in a supervised analysis setting. Based on user-provided network data and relevant omics data sets, iOmicsPASS computes a score for each molecular interaction, and applies a modified nearest shrunken centroid algorithm to the scores to select densely connected subnetworks that can accurately predict each phenotypic group. iOmicsPASS detects a sparse set of predictive molecular interactions without loss of prediction accuracy compared to alternative methods, and the selected network signature immediately provides a mechanistic interpretation of the multiomics profile representing each sample group. Extensive simulation studies demonstrate a clear benefit of interaction-level modeling. iOmicsPASS analysis of TCGA/CPTAC breast cancer data also highlights a new transcriptional regulatory network underlying the basal-like subtype as positive protein markers, a result not seen through analysis of individual omics data.
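The two steps named in this abstract, scoring each molecular interaction and then shrinking group centroids of those scores, can be illustrated with a small sketch. It is not the iOmicsPASS implementation: the edge score below (the sum of the two endpoint values) is an assumed placeholder, and the shrinkage is the plain soft-thresholding step of a nearest shrunken centroid classifier without the per-feature standardization or the modifications the authors describe. All function names are hypothetical.

```python
from statistics import mean


def soft_threshold(value, delta):
    """Shrink value toward zero by delta; anything within delta of zero becomes 0."""
    if value > delta:
        return value - delta
    if value < -delta:
        return value + delta
    return 0.0


def edge_scores(sample, edges):
    """Score each interaction as the sum of its endpoint values (assumed rule)."""
    return {(u, v): sample[u] + sample[v] for u, v in edges}


def shrunken_centroids(scored_samples, groups, delta):
    """Per-group centroid offsets from the overall centroid, soft-thresholded."""
    edges = list(scored_samples[0])
    overall = {e: mean(s[e] for s in scored_samples) for e in edges}
    result = {}
    for group in set(groups):
        members = [s for s, lab in zip(scored_samples, groups) if lab == group]
        result[group] = {
            e: soft_threshold(mean(s[e] for s in members) - overall[e], delta)
            for e in edges
        }
    return result


def selected_edges(centroids):
    """Interactions whose offset survives shrinkage in at least one group."""
    return {e for offsets in centroids.values() for e, d in offsets.items() if d != 0.0}


# Toy z-scored profiles: molecules A and B differ between groups, C and D do not.
edges = [("A", "B"), ("C", "D")]
samples = [
    {"A": 2.0, "B": 2.0, "C": 0.1, "D": 0.1},
    {"A": 2.0, "B": 2.0, "C": 0.0, "D": 0.0},
    {"A": -2.0, "B": -2.0, "C": 0.0, "D": 0.0},
    {"A": -2.0, "B": -2.0, "C": 0.1, "D": 0.1},
]
groups = ["basal", "basal", "other", "other"]
scored = [edge_scores(s, edges) for s in samples]
print(selected_edges(shrunken_centroids(scored, groups, delta=1.0)))
```

Raising `delta` removes more interactions from the signature, which is how this family of methods trades sparsity against accuracy.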
Affiliation(s)
- Hiromi W. L. Koh
- Department of Medicine, Yong Loo Lin School of Medicine, National University of Singapore, Singapore, Singapore
- Saw Swee Hock School of Public Health, National University of Singapore, Singapore, Singapore
| | - Damian Fermin
- University of Michigan Medical School, Ann Arbor, MI USA
| | - Christine Vogel
- Center for Genomics and Systems Biology, Department of Biology, New York University, New York, NY 10003 USA
| | - Kwok Pui Choi
- Department of Statistics and Applied Probability, National University of Singapore, Singapore, Singapore
| | - Rob M. Ewing
- School of Biological Sciences, University of Southampton, Southampton, UK
| | - Hyungwon Choi
- Department of Medicine, Yong Loo Lin School of Medicine, National University of Singapore, Singapore, Singapore
- Saw Swee Hock School of Public Health, National University of Singapore, Singapore, Singapore
- Institute of Molecular and Cell Biology, Agency for Science, Technology and Research, Singapore, Singapore
| |
|
42
|
Rappoport N, Shamir R. Multi-omic and multi-view clustering algorithms: review and cancer benchmark. Nucleic Acids Res 2019; 46:10546-10562. [PMID: 30295871 PMCID: PMC6237755 DOI: 10.1093/nar/gky889] [Citation(s) in RCA: 224] [Impact Index Per Article: 44.8] [Received: 07/18/2018] [Accepted: 09/20/2018] [Indexed: 12/18/2022] Open
Abstract
Recent high throughput experimental methods have been used to collect large biomedical omics datasets. Clustering of single omic datasets has proven invaluable for biological and medical research. The decreasing cost and development of additional high throughput methods now enable measurement of multi-omic data. Clustering multi-omic data has the potential to reveal further systems-level insights, but raises computational and biological challenges. Here, we review algorithms for multi-omics clustering, and discuss key issues in applying these algorithms. Our review covers methods developed specifically for omic data as well as generic multi-view methods developed in the machine learning community for joint clustering of multiple data types. In addition, using cancer data from TCGA, we perform an extensive benchmark spanning ten different cancer types, providing the first systematic comparison of leading multi-omics and multi-view clustering algorithms. The results highlight key issues regarding the use of single- versus multi-omics, the choice of clustering strategy, the power of generic multi-view methods and the use of approximated p-values for gauging solution quality. Due to the growing use of multi-omics data, we expect these issues to be important for future progress in the field.
Affiliation(s)
- Nimrod Rappoport
- Blavatnik School of Computer Science, Tel Aviv University, Tel Aviv, Israel
| | - Ron Shamir
- Blavatnik School of Computer Science, Tel Aviv University, Tel Aviv, Israel
| |
|
43
|
Chauvel C, Novoloaca A, Veyre P, Reynier F, Becker J. Evaluation of integrative clustering methods for the analysis of multi-omics data. Brief Bioinform 2019; 21:541-552. [DOI: 10.1093/bib/bbz015] [Citation(s) in RCA: 30] [Impact Index Per Article: 6.0] [Received: 09/26/2018] [Revised: 01/12/2019] [Accepted: 01/16/2019] [Indexed: 12/20/2022] Open
Abstract
Recent advances in sequencing, mass spectrometry and cytometry technologies have enabled researchers to collect large-scale omics data from the same set of biological samples. The joint analysis of multiple omics offers the opportunity to uncover coordinated cellular processes acting across different omic layers. In this work, we present a thorough comparison of a selection of recent integrative clustering approaches, including Bayesian (BCC and MDI) and matrix factorization approaches (iCluster, moCluster, JIVE and iNMF). Based on simulations, the methods were evaluated on their sensitivity and their ability to recover both the correct number of clusters and the simulated clustering at the common and data-specific levels. Standard non-integrative approaches were also included to quantify the added value of integrative methods. For most matrix factorization methods and one Bayesian approach (BCC), the shared and specific structures were successfully recovered with high and moderate accuracy, respectively. An opposite behavior was observed on non-integrative approaches, i.e. high performances on specific structures only. Finally, we applied the methods on the Cancer Genome Atlas breast cancer data set to check whether results based on experimental data were consistent with those obtained in the simulations.
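Judging whether a method recovers the simulated clustering requires a score that ignores how cluster labels are numbered. The abstract does not say which measure the authors used; the adjusted Rand index is one common choice and is assumed here purely for illustration. The function name and toy labels are hypothetical.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(truth, predicted):
    """Chance-corrected agreement between two partitions (1.0 = identical)."""
    if len(truth) != len(predicted):
        raise ValueError("label lists must have the same length")
    n = len(truth)
    if n < 2:
        raise ValueError("need at least two samples")
    # Pairs of samples placed together in both partitions, in truth, in predicted.
    together_both = sum(comb(c, 2) for c in Counter(zip(truth, predicted)).values())
    together_truth = sum(comb(c, 2) for c in Counter(truth).values())
    together_pred = sum(comb(c, 2) for c in Counter(predicted).values())
    expected = together_truth * together_pred / comb(n, 2)
    maximum = (together_truth + together_pred) / 2
    if maximum == expected:  # degenerate case, e.g. both partitions are trivial
        return 1.0
    return (together_both - expected) / (maximum - expected)


simulated = [0, 0, 0, 1, 1, 1]
recovered = [1, 1, 1, 0, 0, 0]  # same grouping, different label names
scrambled = [0, 1, 0, 1, 0, 1]
print(adjusted_rand_index(simulated, recovered))
print(adjusted_rand_index(simulated, scrambled))
```

Because the index compares pairs of samples rather than label values, a method that finds the right groups under different names still scores as a perfect recovery.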
Affiliation(s)
- Cécile Chauvel
- BIOASTER Research Institute, avenue Tony Garnier, Lyon, France
| | | | - Pierre Veyre
- BIOASTER Research Institute, avenue Tony Garnier, Lyon, France
| | | | - Jérémie Becker
- BIOASTER Research Institute, avenue Tony Garnier, Lyon, France
| |
|
44
|
Wang Y, Yu G, Domeniconi C, Wang J, Zhang X, Guo M. Selective Matrix Factorization for Multi-relational Data Fusion. Database Systems for Advanced Applications 2019. [DOI: 10.1007/978-3-030-18576-3_19] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Indexed: 02/05/2023]
|
45
|
Bismeijer T, Canisius S, Wessels LFA. Molecular characterization of breast and lung tumors by integration of multiple data types with functional sparse-factor analysis. PLoS Comput Biol 2018; 14:e1006520. [PMID: 30379847 PMCID: PMC6231682 DOI: 10.1371/journal.pcbi.1006520] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Received: 11/16/2017] [Revised: 11/12/2018] [Accepted: 09/19/2018] [Indexed: 11/26/2022] Open
Abstract
Effective cancer treatment is crucially dependent on the identification of the biological processes that drive a tumor. However, multiple processes may be active simultaneously in a tumor. Clustering is inherently unsuitable to this task as it assigns a tumor to a single cluster. In addition, the wide availability of multiple data types per tumor provides the opportunity to profile the processes driving a tumor more comprehensively. Here we introduce Functional Sparse-Factor Analysis (funcSFA) to address these challenges. FuncSFA integrates multiple data types to define a lower dimensional space capturing the relevant variation. A tailor-made module associates biological processes with these factors. FuncSFA is inspired by iCluster, which we improve in several key aspects. First, we increase the convergence efficiency significantly, allowing the analysis of multiple molecular datasets that have not been pre-matched to contain only concordant features. Second, FuncSFA does not assign tumors to discrete clusters, but identifies the dominant driver processes active in each tumor. This is achieved by a regression of the factors on the RNA expression data followed by a functional enrichment analysis and manual curation step. We apply FuncSFA to the TCGA breast and lung datasets. We identify EMT and Immune processes common to both cancer types. In the breast cancer dataset we recover the known intrinsic subtypes and identify additional processes. These include immune infiltration and EMT, and processes driven by copy number gains on the 8q chromosome arm. In lung cancer we recover the major types (adenocarcinoma and squamous cell carcinoma) and processes active in both of these types. These include EMT, two immune processes, and the activity of the NFE2L2 transcription factor. We validate the breast cancer findings on the METABRIC set and demonstrate the translatability of the TCGA breast cancer factors to METABRIC. 
In summary, FuncSFA is a robust method to perform discovery of key driver processes in a collection of tumors through unsupervised integration of multiple molecular data types and functional annotation.
In order to select effective cancer treatment, we need to determine which biological processes are active in a tumor. To this end, tumors have been quantified by high dimensional molecular measurements such as RNA sequencing and DNA copy number profiling. In order to support decision making, these measurements need to be condensed into interpretable summaries. Such summaries can be made interpretable by connecting them to biological processes. Biological process activity is continuous and multiple biological processes are taking place in a single tumor. Therefore, the biological processes associated with a tumor are misrepresented by clustering, which tries to put every tumor in a single cluster. In the method introduced in this paper (funcSFA), molecular measurements are summarized into a small number of factors. A factor is a continuous value per tumor that aims to represent the activity of a biological process. When applied to breast and lung cancer, funcSFA identifies factors covering well known biology of these tumor types. FuncSFA also finds novel factors covering biology whose importance is not yet widely recognized in these tumor types. Some of the factors suggest treatment opportunities that can be further investigated in cell lines and mice.
Affiliation(s)
- Tycho Bismeijer
- Division of Molecular Carcinogenesis, Oncode Institute, The Netherlands Cancer Institute, Amsterdam, The Netherlands
| | - Sander Canisius
- Division of Molecular Carcinogenesis, Oncode Institute, The Netherlands Cancer Institute, Amsterdam, The Netherlands
- Division of Molecular Pathology, The Netherlands Cancer Institute, Amsterdam, The Netherlands
| | - Lodewyk F. A. Wessels
- Division of Molecular Carcinogenesis, Oncode Institute, The Netherlands Cancer Institute, Amsterdam, The Netherlands
- Faculty of EEMCS, Delft University of Technology, Delft, The Netherlands
| |
|
46
|
Misra BB, Langefeld CD, Olivier M, Cox LA. Integrated Omics: Tools, Advances, and Future Approaches. J Mol Endocrinol 2018; 62:JME-18-0055. [PMID: 30006342 DOI: 10.1530/jme-18-0055] [Citation(s) in RCA: 206] [Impact Index Per Article: 34.3] [Received: 02/24/2018] [Revised: 07/02/2018] [Accepted: 07/12/2018] [Indexed: 12/13/2022]
Abstract
With the rapid adoption of high-throughput omic approaches to analyze biological samples such as genomics, transcriptomics, proteomics, and metabolomics, each analysis can generate tera- to peta-byte sized data files on a daily basis. These data file sizes, together with differences in nomenclature among these data types, make the integration of these multi-dimensional omics data into biologically meaningful context challenging. Variously named as integrated omics, multi-omics, poly-omics, trans-omics, pan-omics, or shortened to just 'omics', the challenges include differences in data cleaning, normalization, biomolecule identification, data dimensionality reduction, biological contextualization, statistical validation, data storage and handling, sharing, and data archiving. The ultimate goal is towards the holistic realization of a 'systems biology' understanding of the biological question in hand. Commonly used approaches in these efforts are currently limited by the 3 i's - integration, interpretation, and insights. Post integration, these very large datasets aim to yield unprecedented views of cellular systems at exquisite resolution for transformative insights into processes, events, and diseases through various computational and informatics frameworks. With the continued reduction in costs and processing time for sample analyses, and increasing types of omics datasets generated such as glycomics, lipidomics, microbiomics, and phenomics, an increasing number of scientists in this interdisciplinary domain of bioinformatics face these challenges. We discuss recent approaches, existing tools, and potential caveats in the integration of omics datasets for development of standardized analytical pipelines that could be adopted by the global omics research community.
Affiliation(s)
- Biswapriya B Misra
- B Misra, Internal Medicine, Wake Forest University School of Medicine, Winston-Salem, United States
| | - Carl D Langefeld
- C Langefeld, Biostatistical Sciences, Wake Forest University School of Medicine, Winston-Salem, United States
| | - Michael Olivier
- M Olivier, Internal Medicine, Wake Forest University School of Medicine, Winston-Salem, United States
| | - Laura A Cox
- L Cox, Internal Medicine, Wake Forest University School of Medicine, Winston-Salem, United States
| |
|
47
|
De Meulder B, Lefaudeux D, Bansal AT, Mazein A, Chaiboonchoe A, Ahmed H, Balaur I, Saqi M, Pellet J, Ballereau S, Lemonnier N, Sun K, Pandis I, Yang X, Batuwitage M, Kretsos K, van Eyll J, Bedding A, Davison T, Dodson P, Larminie C, Postle A, Corfield J, Djukanovic R, Chung KF, Adcock IM, Guo YK, Sterk PJ, Manta A, Rowe A, Baribaud F, Auffray C. A computational framework for complex disease stratification from multiple large-scale datasets. BMC Syst Biol 2018; 12:60. [PMID: 29843806 PMCID: PMC5975674 DOI: 10.1186/s12918-018-0556-z] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.2] [Received: 07/20/2017] [Accepted: 02/21/2018] [Indexed: 01/05/2023]
Abstract
BACKGROUND Multilevel data integration is becoming a major area of research in systems biology. Within this area, multi-'omics datasets on complex diseases are becoming more readily available and there is a need to set standards and good practices for integrated analysis of biological, clinical and environmental data. We present a framework to plan and generate single and multi-'omics signatures of disease states. METHODS The framework is divided into four major steps: dataset subsetting, feature filtering, 'omics-based clustering and biomarker identification. RESULTS We illustrate the usefulness of this framework by identifying potential patient clusters based on integrated multi-'omics signatures in a publicly available ovarian cystadenocarcinoma dataset. The analysis generated a higher number of stable and clinically relevant clusters than previously reported, and enabled the generation of predictive models of patient outcomes. CONCLUSIONS This framework will help health researchers plan and perform multi-'omics big data analyses to generate hypotheses and make sense of their rich, diverse and ever growing datasets, to enable implementation of translational P4 medicine.
Affiliation(s)
- Bertrand De Meulder
- European Institute for Systems Biology and Medicine, CNRS-ENS-UCBL, EISBM, 50 Avenue Tony Garnier, 69007, Lyon, France.
| | - Diane Lefaudeux
- European Institute for Systems Biology and Medicine, CNRS-ENS-UCBL, EISBM, 50 Avenue Tony Garnier, 69007, Lyon, France
| | - Aruna T Bansal
- Acclarogen Ltd, St John's Innovation Centre, Cambridge, CB4 0WS, UK
| | - Alexander Mazein
- European Institute for Systems Biology and Medicine, CNRS-ENS-UCBL, EISBM, 50 Avenue Tony Garnier, 69007, Lyon, France
| | - Amphun Chaiboonchoe
- European Institute for Systems Biology and Medicine, CNRS-ENS-UCBL, EISBM, 50 Avenue Tony Garnier, 69007, Lyon, France
| | - Hassan Ahmed
- European Institute for Systems Biology and Medicine, CNRS-ENS-UCBL, EISBM, 50 Avenue Tony Garnier, 69007, Lyon, France
| | - Irina Balaur
- European Institute for Systems Biology and Medicine, CNRS-ENS-UCBL, EISBM, 50 Avenue Tony Garnier, 69007, Lyon, France
| | - Mansoor Saqi
- European Institute for Systems Biology and Medicine, CNRS-ENS-UCBL, EISBM, 50 Avenue Tony Garnier, 69007, Lyon, France
| | - Johann Pellet
- European Institute for Systems Biology and Medicine, CNRS-ENS-UCBL, EISBM, 50 Avenue Tony Garnier, 69007, Lyon, France
| | - Stéphane Ballereau
- European Institute for Systems Biology and Medicine, CNRS-ENS-UCBL, EISBM, 50 Avenue Tony Garnier, 69007, Lyon, France
| | - Nathanaël Lemonnier
- European Institute for Systems Biology and Medicine, CNRS-ENS-UCBL, EISBM, 50 Avenue Tony Garnier, 69007, Lyon, France
| | - Kai Sun
- Data Science Institute, Imperial College, London, SW7 2AZ, UK
| | - Ioannis Pandis
- Data Science Institute, Imperial College, London, SW7 2AZ, UK; Janssen Research and Development Ltd, High Wycombe, HP12 4DP, UK
| | - Xian Yang
- Data Science Institute, Imperial College, London, SW7 2AZ, UK
| | | | | | | | | | - Timothy Davison
- Janssen Research and Development Ltd, High Wycombe, HP12 4DP, UK
| | - Paul Dodson
- AstraZeneca Ltd, Alderley Park, Macclesfield, SK10 4TG, UK
| | | | - Anthony Postle
- Faculty of Medicine, University of Southampton, Southampton, SO17 1BJ, UK
| | - Julie Corfield
- AstraZeneca R & D, 43150, Mölndal, Sweden; Arateva R & D Ltd, Nottingham, NG1 1GF, UK
| | - Ratko Djukanovic
- Faculty of Medicine, University of Southampton, Southampton, SO17 1BJ, UK
| | - Kian Fan Chung
- National Heart and Lung Institute, Imperial College London, London, SW3 6LY, UK
| | - Ian M Adcock
- National Heart and Lung Institute, Imperial College London, London, SW3 6LY, UK
| | - Yi-Ke Guo
- Data Science Institute, Imperial College, London, SW7 2AZ, UK
| | - Peter J Sterk
- Department of Respiratory Medicine, Academic Medical Centre, University of Amsterdam, Amsterdam, AZ1105, The Netherlands
| | - Alexander Manta
- Research Informatics, Roche Diagnostics GmbH, 82008, Unterhaching, Germany
| | - Anthony Rowe
- Janssen Research and Development Ltd, High Wycombe, HP12 4DP, UK
| | | | - Charles Auffray
- European Institute for Systems Biology and Medicine, CNRS-ENS-UCBL, EISBM, 50 Avenue Tony Garnier, 69007, Lyon, France.
| | | |
|
48
|
Noell G, Faner R, Agustí A. From systems biology to P4 medicine: applications in respiratory medicine. Eur Respir Rev 2018; 27:27/147/170110. [PMID: 29436404 PMCID: PMC9489012 DOI: 10.1183/16000617.0110-2017] [Citation(s) in RCA: 27] [Impact Index Per Article: 4.5] [Received: 09/28/2017] [Accepted: 11/30/2017] [Indexed: 12/22/2022] Open
Abstract
Human health and disease are emergent properties of a complex, nonlinear, dynamic multilevel biological system: the human body. Systems biology is a comprehensive research strategy that has the potential to understand these emergent properties holistically. It stems from advancements in medical diagnostics, “omics” data and bioinformatic computing power. It paves the way forward towards “P4 medicine” (predictive, preventive, personalised and participatory), which seeks to better intervene preventively to preserve health or therapeutically to cure diseases. In this review, we: 1) discuss the principles of systems biology; 2) elaborate on how P4 medicine has the potential to shift healthcare from reactive medicine (treatment of illness) to predict and prevent illness, in a revolution that will be personalised in nature, probabilistic in essence and participatory driven; 3) review the current state of the art of network (systems) medicine in three prevalent respiratory diseases (chronic obstructive pulmonary disease, asthma and lung cancer); and 4) outline current challenges and future goals in the field. Systems biology and network medicine have the potential to transform medical research and practice. http://ow.ly/r3jR30hf35x
Affiliation(s)
- Guillaume Noell
- Institut d'Investigacions Biomediques August Pi i Sunyer (IDIBAPS), Barcelona, Spain; CIBER Enfermedades Respiratorias (CIBERES), Barcelona, Spain
| | - Rosa Faner
- Institut d'Investigacions Biomediques August Pi i Sunyer (IDIBAPS), Barcelona, Spain; CIBER Enfermedades Respiratorias (CIBERES), Barcelona, Spain
| | - Alvar Agustí
- Institut d'Investigacions Biomediques August Pi i Sunyer (IDIBAPS), Barcelona, Spain; CIBER Enfermedades Respiratorias (CIBERES), Barcelona, Spain; Respiratory Institute, Hospital Clinic, Universitat de Barcelona, Barcelona, Spain
| |
|
49
|
Abstract
The diversity and sheer volume of omics data have brought biological and biomedical research and its applications into a big data era, much as happened in wider society a decade ago. These data pose a new challenge: moving from horizontal data ensembles (e.g., similar types of data collected from different labs or companies) to vertical data ensembles (e.g., different types of data collected for a group of persons with matched information). This shift requires integrative analysis in biology and biomedicine and calls for the urgent development of data integration methods to address the change from population-guided to individual-guided investigations. Data integration is an effective concept for solving complex problems or understanding complicated systems. Several benchmark studies have revealed the heterogeneity and trade-offs that exist in the analysis of omics data. Integrative analysis can combine and investigate many datasets in a cost-effective, reproducible way. Current integration approaches for biological data have two modes: a "bottom-up integration" mode with follow-up manual integration, and a "top-down integration" mode with follow-up in silico integration. This paper first summarizes combinatory analysis approaches to suggest candidate protocols for biological experiment design in integrative genomics studies, and then surveys data fusion approaches to guide the development of computational models for detecting biological significance; these approaches have also provided new data resources and analysis tools to support precision medicine based on big biomedical data. Finally, problems and future directions for the integrative analysis of omics big data are highlighted.
Affiliation(s)
- Xiang-Tian Yu
- Key Laboratory of Systems Biology, Institute of Biochemistry and Cell Biology, Chinese Academy of Sciences, Shanghai, China
| | - Tao Zeng
- Key Laboratory of Systems Biology, Institute of Biochemistry and Cell Biology, Chinese Academy of Sciences, Shanghai, China.
| |
|
50
|
Le Van T, van Leeuwen M, Carolina Fierro A, De Maeyer D, Van den Eynden J, Verbeke L, De Raedt L, Marchal K, Nijssen S. Simultaneous discovery of cancer subtypes and subtype features by molecular data integration. Bioinformatics 2017; 32:i445-i454. [PMID: 27587661 DOI: 10.1093/bioinformatics/btw434] [Citation(s) in RCA: 22] [Impact Index Per Article: 3.1] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION Subtyping cancer is key to an improved and more personalized prognosis/treatment. The increasing availability of tumor related molecular data provides the opportunity to identify molecular subtypes in a data-driven way. Molecular subtypes are defined as groups of samples that have a similar molecular mechanism at the origin of the carcinogenesis. The molecular mechanisms are reflected by subtype-specific mutational and expression features. Data-driven subtyping is a complex problem as subtyping and identifying the molecular mechanisms that drive carcinogenesis are confounded problems. Many current integrative subtyping methods use global mutational and/or expression tumor profiles to group tumor samples in subtypes but do not explicitly extract the subtype-specific features. We therefore present a method that solves both tasks of subtyping and identification of subtype-specific features simultaneously. To this end, our method integrates mutational and expression data while taking into account the clonal properties of carcinogenesis. Key to our method is a formalization of the problem as a rank matrix factorization of ranked data that approaches the subtyping problem as multi-view bi-clustering. RESULTS We introduce a novel integrative framework to identify subtypes by combining mutational and expression features. The incomparable measurement data is integrated by transformation into ranked data and subtypes are defined as multi-view bi-clusters. We formalize the model using rank matrix factorization, resulting in the SRF algorithm. Experiments on simulated data and the TCGA breast cancer data demonstrate that SRF is able to capture subtle differences that existing methods may miss. AVAILABILITY AND IMPLEMENTATION The implementation is available at: https://github.com/rankmatrixfactorisation/SRF CONTACT: kathleen.marchal@intec.ugent.be, siegfried.nijssen@cs.kuleuven.be SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
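The rank transformation mentioned in this abstract, which puts measurements on incomparable scales onto a shared ordinal scale, can be shown with a short sketch. It covers only that preprocessing idea and is not the SRF algorithm from the linked repository; ranking within each row and averaging tied ranks are assumptions made for the example, and the function names are hypothetical.

```python
def rank_transform(values):
    """Replace values by their ranks (1 = smallest); ties share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average = (pos + end) / 2 + 1
        for i in order[pos:end + 1]:
            ranks[i] = average
        pos = end + 1
    return ranks


def rank_rows(matrix):
    """Rank-transform every row of a matrix independently."""
    return [rank_transform(row) for row in matrix]


# Continuous expression values and binary mutation calls end up on the same scale.
expression = [[2.5, 9.1, 0.3], [120.0, 4.0, 33.0]]
mutation = [[0, 1, 0], [1, 1, 0]]
print(rank_rows(expression))
print(rank_rows(mutation))
```

After the transformation both matrices hold values between 1 and the row length, so a single factorization can treat them as two views of the same samples.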
Affiliation(s)
- Thanh Le Van
- Department of Computer Science, KULeuven, Leuven, Belgium
| | - Matthijs van Leeuwen
- Leiden Institute for Advanced Computer Science, Universiteit Leiden, Leiden, The Netherlands
| | - Ana Carolina Fierro
- Department of Information Technology, iMinds, Ghent University, Gent, Belgium, Bioinformatics Institute Ghent, 9052 Gent, Belgium, Department of Plant Biotechnology and Bioinformatics, Ghent University, Gent, Belgium
| | - Dries De Maeyer
- Department of Information Technology, iMinds, Ghent University, Gent, Belgium, Bioinformatics Institute Ghent, 9052 Gent, Belgium, Department of Plant Biotechnology and Bioinformatics, Ghent University, Gent, Belgium
| | - Jimmy Van den Eynden
- Department of Medical Biochemistry and Cell Biology, Institute of Biomedicine, University of Gothenburg, Gothenburg, Sweden
| | - Lieven Verbeke
- Department of Information Technology, iMinds, Ghent University, Gent, Belgium, Bioinformatics Institute Ghent, 9052 Gent, Belgium, Department of Plant Biotechnology and Bioinformatics, Ghent University, Gent, Belgium
| | - Luc De Raedt
- Department of Computer Science, KULeuven, Leuven, Belgium
| | - Kathleen Marchal
- Department of Information Technology, iMinds, Ghent University, Gent, Belgium, Bioinformatics Institute Ghent, 9052 Gent, Belgium, Department of Plant Biotechnology and Bioinformatics, Ghent University, Gent, Belgium, Department of Genetics, University of Pretoria, Hatfield Campus, Pretoria 0028, South Africa
| | - Siegfried Nijssen
- Department of Computer Science, KULeuven, Leuven, Belgium, Leiden Institute for Advanced Computer Science, Universiteit Leiden, Leiden, The Netherlands
| |
|