1
|
Mansoor S, Hamid S, Tuan TT, Park JE, Chung YS. Advance computational tools for multiomics data learning. Biotechnol Adv 2024; 77:108447. [PMID: 39251098 DOI: 10.1016/j.biotechadv.2024.108447] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/19/2024] [Revised: 09/01/2024] [Accepted: 09/05/2024] [Indexed: 09/11/2024]
Abstract
The burgeoning field of bioinformatics has seen a surge in computational tools tailored for omics data analysis driven by the heterogeneous and high-dimensional nature of omics data. In biomedical and plant science research multi-omics data has become pivotal for predictive analytics in the era of big data necessitating sophisticated computational methodologies. This review explores a diverse array of computational approaches which play crucial role in processing, normalizing, integrating, and analyzing omics data. Notable methods such similarity-based methods, network-based approaches, correlation-based methods, Bayesian methods, fusion-based methods and multivariate techniques among others are discussed in detail, each offering unique functionalities to address the complexities of multi-omics data. Furthermore, this review underscores the significance of computational tools in advancing our understanding of data and their transformative impact on research.
Collapse
Affiliation(s)
- Sheikh Mansoor
- Department of Plant Resources and Environment, Jeju National University, 63243, Republic of Korea
| | - Saira Hamid
- Watson Crick Centre for Molecular Medicine, Islamic University of Science and Technology, Awantipora, Pulwama, J&K, India
| | - Thai Thanh Tuan
- Department of Plant Resources and Environment, Jeju National University, 63243, Republic of Korea; Multimedia Communications Laboratory, University of Information Technology, Ho Chi Minh city 70000, Vietnam; Multimedia Communications Laboratory, Vietnam National University, Ho Chi Minh city 70000, Vietnam
| | - Jong-Eun Park
- Department of Animal Biotechnology, College of Applied Life Science, Jeju National University, Jeju, Jeju-do, Republic of Korea.
| | - Yong Suk Chung
- Department of Plant Resources and Environment, Jeju National University, 63243, Republic of Korea.
| |
Collapse
|
2
|
Goodrich JA, Wang H, Jia Q, Stratakis N, Zhao Y, Maitre L, Bustamante M, Vafeiadi M, Aung M, Andrušaitytė S, Basagana X, Farzan SF, Heude B, Keun H, McConnell R, Yang TC, Siskos AP, Urquiza J, Valvi D, Varo N, Småstuen Haug L, Oftedal BM, Gražulevičienė R, Philippat C, Wright J, Vrijheid M, Chatzi L, Conti DV. Integrating Multi-Omics with environmental data for precision health: A novel analytic framework and case study on prenatal mercury induced childhood fatty liver disease. ENVIRONMENT INTERNATIONAL 2024; 190:108930. [PMID: 39128376 DOI: 10.1016/j.envint.2024.108930] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 11/16/2023] [Revised: 06/24/2024] [Accepted: 07/31/2024] [Indexed: 08/13/2024]
Abstract
BACKGROUND Precision Health aims to revolutionize disease prevention by leveraging information across multiple omic datasets (multi-omics). However, existing methods generally do not consider personalized environmental risk factors (e.g., environmental pollutants). OBJECTIVE To develop and apply a precision health framework which combines multiomic integration (including early, intermediate, and late integration, representing sequential stages at which omics layers are combined for modeling) with mediation approaches (including high-dimensional mediation to identify biomarkers, mediation with latent factors to identify pathways, and integrated/quasi-mediation to identify high-risk subpopulations) to identify novel biomarkers of prenatal mercury induced metabolic dysfunction-associated fatty liver disease (MAFLD), elucidate molecular pathways linking prenatal mercury with MAFLD in children, and identify high-risk children based on integrated exposure and multiomics data. METHODS This prospective cohort study used data from 420 mother-child pairs from the Human Early Life Exposome (HELIX) project. Mercury concentrations were determined in maternal or cord blood from pregnancy. Cytokeratin 18 (CK-18; a MAFLD biomarker) and five omics layers (DNA Methylation, gene transcription, microRNA, proteins, and metabolites) were measured in blood in childhood (age 6-10 years). RESULTS Each standard deviation increase in prenatal mercury was associated with a 0.11 [95% confidence interval: 0.02-0.21] standard deviation increase in CK-18. High dimensional mediation analysis identified 10 biomarkers linking prenatal mercury and CK-18, including six CpG sites and four transcripts. Mediation with latent factors identified molecular pathways linking mercury and MAFLD, including altered cytokine signaling and hepatic stellate cell activation. Integrated/quasi-mediation identified high risk subgroups of children based on unique combinations of exposure levels, omics profiles (driven by epigenetic markers), and MAFLD. CONCLUSIONS Prenatal mercury exposure is associated with elevated liver enzymes in childhood, likely through alterations in DNA methylation and gene expression. Our analytic framework can be applied across many different fields and serve as a resource to help guide future precision health investigations.
Collapse
Affiliation(s)
- Jesse A Goodrich
- Department of Population and Public Health Sciences, University of Southern California, Los Angeles, CA, United States.
| | - Hongxu Wang
- Department of Population and Public Health Sciences, University of Southern California, Los Angeles, CA, United States
| | - Qiran Jia
- Department of Population and Public Health Sciences, University of Southern California, Los Angeles, CA, United States
| | - Nikos Stratakis
- Barcelona Institute for Global Health (ISGlobal), Barcelona, Spain; Universitat Pompeu Fabra (UPF), Barcelona, Spain; CIBER Epidemiología y Salud Pública (CIBERESP), Spain
| | - Yinqi Zhao
- Department of Population and Public Health Sciences, University of Southern California, Los Angeles, CA, United States
| | - Léa Maitre
- Barcelona Institute for Global Health (ISGlobal), Barcelona, Spain; Universitat Pompeu Fabra (UPF), Barcelona, Spain; CIBER Epidemiología y Salud Pública (CIBERESP), Spain
| | - Mariona Bustamante
- Barcelona Institute for Global Health (ISGlobal), Barcelona, Spain; Universitat Pompeu Fabra (UPF), Barcelona, Spain; CIBER Epidemiología y Salud Pública (CIBERESP), Spain
| | - Marina Vafeiadi
- Department of Social Medicine Faculty of Medicine, University of Crete, Heraklion, Greece
| | - Max Aung
- Department of Population and Public Health Sciences, University of Southern California, Los Angeles, CA, United States
| | - Sandra Andrušaitytė
- Department of Environmental Sciences, Vytauto Didžiojo Universitetas, Kaunas, Lithuania
| | - Xavier Basagana
- Barcelona Institute for Global Health (ISGlobal), Barcelona, Spain; Universitat Pompeu Fabra (UPF), Barcelona, Spain; CIBER Epidemiología y Salud Pública (CIBERESP), Spain
| | - Shohreh F Farzan
- Department of Population and Public Health Sciences, University of Southern California, Los Angeles, CA, United States
| | - Barbara Heude
- Université de Paris Cité, Institut National de la Santé et de la Recherche Médicale (INSERM), National Research Institute for Agriculture, Food and Environment, Centre of Research in Epidemiology and Statistics, Paris, France
| | - Hector Keun
- Department of Surgery & Cancer and Department of Metabolism Digestion & Reproduction Imperial College London, London, United Kingdom
| | - Rob McConnell
- Department of Population and Public Health Sciences, University of Southern California, Los Angeles, CA, United States
| | - Tiffany C Yang
- Bradford Institute for Health Research, Bradford Teaching Hospitals NHS Foundation Trust, Bradford, United Kingdom
| | - Alexandros P Siskos
- Department of Surgery & Cancer and Department of Metabolism Digestion & Reproduction Imperial College London, London, United Kingdom
| | - Jose Urquiza
- Barcelona Institute for Global Health (ISGlobal), Barcelona, Spain; Universitat Pompeu Fabra (UPF), Barcelona, Spain; CIBER Epidemiología y Salud Pública (CIBERESP), Spain
| | - Damaskini Valvi
- Department of Environmental Medicine and Public Health, Icahn School of Medicine at Mount Sinai, New York, NY, United States
| | - Nerea Varo
- Laboratory of Biochemistry, University Clinic of Navarra, Pamplona, Spain
| | | | | | - Regina Gražulevičienė
- Department of Environmental Sciences, Vytauto Didžiojo Universitetas, Kaunas, Lithuania
| | - Claire Philippat
- University Grenoble Alpes, Institut National de la Santé et de la Recherche Médicale (INSERM) U 1209, CNRS UMR 5309, Team of Environmental Epidemiology Applied to Development and Respiratory Health, Institute for Advanced Biosciences, 38000 Grenoble, France
| | - John Wright
- Bradford Institute for Health Research, Bradford Teaching Hospitals NHS Foundation Trust, Bradford, United Kingdom
| | - Martine Vrijheid
- Barcelona Institute for Global Health (ISGlobal), Barcelona, Spain; Universitat Pompeu Fabra (UPF), Barcelona, Spain; CIBER Epidemiología y Salud Pública (CIBERESP), Spain
| | - Leda Chatzi
- Department of Population and Public Health Sciences, University of Southern California, Los Angeles, CA, United States
| | - David V Conti
- Department of Population and Public Health Sciences, University of Southern California, Los Angeles, CA, United States
| |
Collapse
|
3
|
Swapna LS, Stevens GC, Sardinha-Silva A, Hu LZ, Brand V, Fusca DD, Wan C, Xiong X, Boyle JP, Grigg ME, Emili A, Parkinson J. ToxoNet: A high confidence map of protein-protein interactions in Toxoplasma gondii. PLoS Comput Biol 2024; 20:e1012208. [PMID: 38900844 PMCID: PMC11219001 DOI: 10.1371/journal.pcbi.1012208] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/18/2023] [Revised: 07/02/2024] [Accepted: 05/28/2024] [Indexed: 06/22/2024] Open
Abstract
The apicomplexan intracellular parasite Toxoplasma gondii is a major food borne pathogen that is highly prevalent in the global population. The majority of the T. gondii proteome remains uncharacterized and the organization of proteins into complexes is unclear. To overcome this knowledge gap, we used a biochemical fractionation strategy to predict interactions by correlation profiling. To overcome the deficit of high-quality training data in non-model organisms, we complemented a supervised machine learning strategy, with an unsupervised approach, based on similarity network fusion. The resulting combined high confidence network, ToxoNet, comprises 2,063 interactions connecting 652 proteins. Clustering identifies 93 protein complexes. We identified clusters enriched in mitochondrial machinery that include previously uncharacterized proteins that likely represent novel adaptations to oxidative phosphorylation. Furthermore, complexes enriched in proteins localized to secretory organelles and the inner membrane complex, predict additional novel components representing novel targets for detailed functional characterization. We present ToxoNet as a publicly available resource with the expectation that it will help drive future hypotheses within the research community.
Collapse
Affiliation(s)
| | - Grant C. Stevens
- Program in Molecular Medicine, Hospital for Sick Children, Toronto, Ontario, Canada
- Department of Molecular Genetics, University of Toronto, Toronto, Ontario, Canada
| | - Aline Sardinha-Silva
- Molecular Parasitology Section, Laboratory of Parasitic Diseases, NIAID, National Institutes of Health, Bethesda, Maryland, United States of America
| | - Lucas Zhongming Hu
- Department of Molecular Genetics, University of Toronto, Toronto, Ontario, Canada
| | - Verena Brand
- Program in Molecular Medicine, Hospital for Sick Children, Toronto, Ontario, Canada
| | - Daniel D. Fusca
- Program in Molecular Medicine, Hospital for Sick Children, Toronto, Ontario, Canada
| | - Cuihong Wan
- Department of Molecular Genetics, University of Toronto, Toronto, Ontario, Canada
| | - Xuejian Xiong
- Program in Molecular Medicine, Hospital for Sick Children, Toronto, Ontario, Canada
| | - Jon P. Boyle
- Department of Biological Sciences, Dietrich School of Arts and Sciences, University of Pittsburgh, Pittsburgh, Pennsylvania, United States of America
| | - Michael E. Grigg
- Molecular Parasitology Section, Laboratory of Parasitic Diseases, NIAID, National Institutes of Health, Bethesda, Maryland, United States of America
| | - Andrew Emili
- Department of Molecular Genetics, University of Toronto, Toronto, Ontario, Canada
- Department of Biology and Biochemistry, Boston University, Boston, Massachusetts, United States of America
| | - John Parkinson
- Program in Molecular Medicine, Hospital for Sick Children, Toronto, Ontario, Canada
- Department of Molecular Genetics, University of Toronto, Toronto, Ontario, Canada
- Department of Biochemistry, University of Toronto, Toronto, Ontario, Canada
| |
Collapse
|
4
|
Zeng IS. Integrating omics atlas in health informatics system design-an opinion article. Front Digit Health 2024; 6:1374359. [PMID: 38784702 PMCID: PMC11111845 DOI: 10.3389/fdgth.2024.1374359] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/22/2024] [Accepted: 04/22/2024] [Indexed: 05/25/2024] Open
Affiliation(s)
- Irene Suilan Zeng
- Department of Biostatistics and Epidemiology, Auckland University of Technology, Auckland, New Zealand
- School of Clinical Science, Faculty of Health and Environmental Sciences, Auckland University of Technology, Auckland, New Zealand
| |
Collapse
|
5
|
Rouanet A, Johnson R, Strauss M, Richardson S, Tom BD, White SR, Kirk PDW. Bayesian profile regression for clustering analysis involving a longitudinal response and explanatory variables. METHODOLOGY-EUROPEAN JOURNAL OF RESEARCH METHODS FOR THE BEHAVIORAL AND SOCIAL SCIENCES 2024; 73:314-339. [PMID: 38577633 PMCID: PMC7615733 DOI: 10.1093/jrsssc/qlad097] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 04/06/2024] Open
Abstract
The identification of sets of co-regulated genes that share a common function is a key question of modern genomics. Bayesian profile regression is a semi-supervised mixture modelling approach that makes use of a response to guide inference toward relevant clusterings. Previous applications of profile regression have considered univariate continuous, categorical, and count outcomes. In this work, we extend Bayesian profile regression to cases where the outcome is longitudinal (or multivariate continuous) and provide PReMiuMlongi, an updated version of PReMiuM, the R package for profile regression. We consider multivariate normal and Gaussian process regression response models and provide proof of principle applications to four simulation studies. The model is applied on budding yeast data to identify groups of genes co-regulated during the Saccharomyces cerevisiae cell cycle. We identify 4 distinct groups of genes associated with specific patterns of gene expression trajectories, along with the bound transcriptional factors, likely involved in their co-regulation process.
Collapse
Affiliation(s)
- Anaïs Rouanet
- MRC Biostatistics Unit, School of Clinical Medicine, University of Cambridge, U.K
| | - Rob Johnson
- MRC Biostatistics Unit, School of Clinical Medicine, University of Cambridge, U.K
| | - Magdalena Strauss
- MRC Biostatistics Unit, School of Clinical Medicine, University of Cambridge, U.K
- EMBL-EBI, Wellcome Genome Campus, Hinxton, Cambridgeshire, CB10 1SD, UK
| | - Sylvia Richardson
- MRC Biostatistics Unit, School of Clinical Medicine, University of Cambridge, U.K
| | - Brian D Tom
- MRC Biostatistics Unit, School of Clinical Medicine, University of Cambridge, U.K
| | - Simon R White
- MRC Biostatistics Unit, School of Clinical Medicine, University of Cambridge, U.K
- Department of Psychiatry, University of Cambridge, Cambridge, CB2 3EB, UK
| | - Paul D. W. Kirk
- MRC Biostatistics Unit, School of Clinical Medicine, University of Cambridge, U.K
- Cambridge Institute of Therapeutic Immunology & Infectious Disease (CITIID), Jeffrey Cheah Biomedical Centre, Cambridge Biomedical Campus, University of Cambridge, U.K
| |
Collapse
|
6
|
Bhargava M, Crouser ED. Application of laboratory models for sarcoidosis research. J Autoimmun 2024:103184. [PMID: 38443221 DOI: 10.1016/j.jaut.2024.103184] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/17/2024] [Revised: 02/12/2024] [Accepted: 02/15/2024] [Indexed: 03/07/2024]
Abstract
This manuscript will review the implications and applications of sarcoidosis models towards advancing our understanding of sarcoidosis disease mechanisms, identification of biomarkers, and preclinical testing of novel therapies. Emerging disease models and innovative research tools will also be considered.
Collapse
Affiliation(s)
- Maneesh Bhargava
- University of Minnesota Medical Center, Division of Pulmonary, Allergy, Critical Care and Sleep Medicine, 420 Delaware Street SE, MMC 276. Minneapolis, MN 55455, USA
| | - Elliott D Crouser
- Ohio State University Wexner Medicine Center, Division of Pulmonary, Allergy, Critical Care and Sleep Medicine, 241 W. 11th Street, Suite 5000, Columbus, OH 43201, USA.
| |
Collapse
|
7
|
Shapiro BD, Battle A. Bayesian Multi-View Clustering given complex inter-view structure. F1000Res 2024; 11:1460. [PMID: 38495778 PMCID: PMC10940850 DOI: 10.12688/f1000research.126215.2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Figures] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Accepted: 02/02/2024] [Indexed: 03/19/2024] Open
Abstract
Multi-view datasets are becoming increasingly prevalent. These datasets consist of different modalities that provide complementary characterizations of the same underlying system. They can include heterogeneous types of information with complex relationships, varying degrees of missingness, and assorted sample sizes, as is often the case in multi-omic biological studies. Clustering multi-view data allows us to leverage different modalities to infer underlying systematic structure, but most existing approaches are limited to contexts in which entities are the same across views or have clear one-to-one relationships across data types with a common sample size. Many methods also make strong assumptions about the similarities of clusterings across views. We propose a Bayesian multi-view clustering approach (BMVC) which can handle the realities of multi-view datasets that often have complex relationships and diverse structure. BMVC incorporates known and complex many-to-many relationships between entities via a probabilistic graphical model that enables the joint inference of clusterings specific to each view, but where each view informs the others. Additionally, BMVC estimates the strength of the relationships between each pair of views, thus moderating the degree to which it imposes dependence constraints. We benchmarked BMVC on simulated data to show that it accurately estimates varying degrees of inter-view dependence when inter-view relationships are not limited to one-to-one correspondence. Next, we demonstrated its ability to capture visually interpretable inter-view structure in a public health survey of individuals and households in Puerto Rico following Hurricane Maria. Finally, we showed that BMVC clusters integrate the complex relationships between multi-omic profiles of breast cancer patient data, improving the biological homogeneity of clusters and elucidating hypotheses for functional biological mechanisms. We found that BMVC leverages complex inter-view structure to produce higher quality clusters than those generated by standard approaches. We also showed that BMVC is a valuable tool for real-world discovery and hypothesis generation.
Collapse
Affiliation(s)
- Benjamin D. Shapiro
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, 21218, USA
| | - Alexis Battle
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, 21218, USA
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, 21218, USA
| |
Collapse
|
8
|
Shakyawar SK, Sajja BR, Patel JC, Guda C. iCluF: an unsupervised iterative cluster-fusion method for patient stratification using multiomics data. BIOINFORMATICS ADVANCES 2024; 4:vbae015. [PMID: 38698887 PMCID: PMC11063539 DOI: 10.1093/bioadv/vbae015] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 10/11/2023] [Revised: 12/10/2023] [Accepted: 01/26/2024] [Indexed: 05/05/2024]
Abstract
Motivation Patient stratification is crucial for the effective treatment or management of heterogeneous diseases, including cancers. Multiomic technologies facilitate molecular characterization of human diseases; however, the complexity of data warrants the need for the development of robust data integration tools for patient stratification using machine-learning approaches. Results iCluF iteratively integrates three types of multiomic data (mRNA, miRNA, and DNA methylation) using pairwise patient similarity matrices built from each omic data. The intermediate omic-specific neighborhood matrices implement iterative matrix fusion and message passing among the similarity matrices to derive a final integrated matrix representing all the omics profiles of a patient, which is used to further cluster patients into subtypes. iCluF outperforms other methods with significant differences in the survival profiles of 8581 patients belonging to 30 different cancers in TCGA. iCluF also predicted the four intrinsic subtypes of Breast Invasive Carcinomas with adjusted rand index and Fowlkes-Mallows scores of 0.72 and 0.83, respectively. The Gini importance score showed that methylation features were the primary decisive players, followed by mRNA and miRNA to identify disease subtypes. iCluF can be applied to stratify patients with any disease containing multiomic datasets. Availability and implementation Source code and datasets are available at https://github.com/GudaLab/iCluF_core.
Collapse
Affiliation(s)
- Sushil K Shakyawar
- Department of Genetics, Cell Biology and Anatomy, University of Nebraska Medical Center, Omaha, NE 68198, United States
| | - Balasrinivasa R Sajja
- Department of Radiology, University of Nebraska Medical Center, Omaha, NE 68198, United States
| | - Jai Chand Patel
- Department of Genetics, Cell Biology and Anatomy, University of Nebraska Medical Center, Omaha, NE 68198, United States
| | - Chittibabu Guda
- Department of Genetics, Cell Biology and Anatomy, University of Nebraska Medical Center, Omaha, NE 68198, United States
- Department of Genetics, Cell Biology and Anatomy, Center for Biomedical Informatics Research and Innovation, University of Nebraska Medical Center, Omaha, NE 68198-5805, United States
| |
Collapse
|
9
|
Lu Z, Ahmadiankalati M, Tan Z. Joint clustering multiple longitudinal features: A comparison of methods and software packages with practical guidance. Stat Med 2023; 42:5513-5540. [PMID: 37789706 DOI: 10.1002/sim.9917] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/15/2022] [Revised: 06/07/2023] [Accepted: 09/13/2023] [Indexed: 10/05/2023]
Abstract
Clustering longitudinal features is a common goal in medical studies to identify distinct disease developmental trajectories. Compared to clustering a single longitudinal feature, integrating multiple longitudinal features allows additional information to be incorporated into the clustering process, which may reveal co-existing longitudinal patterns and generate deeper biological insight. Despite its increasing importance and popularity, there is limited practical guidance for implementing cluster analysis approaches for multiple longitudinal features and evaluating their comparative performance in medical datasets. In this paper, we provide an overview of several commonly used approaches to clustering multiple longitudinal features, with an emphasis on application and implementation through R software. These methods can be broadly categorized into two categories, namely model-based (including frequentist and Bayesian) approaches and algorithm-based approaches. To evaluate their performance, we compare these approaches using real-life and simulated datasets. These results provide practical guidance to applied researchers who are interested in applying these approaches for clustering multiple longitudinal features. Recommendations for applied researchers and suggestions for future research in this area are also discussed.
Collapse
Affiliation(s)
- Zihang Lu
- Department of Public Health Sciences, Queen's University, Kingston, Ontario, Canada
- Department of Mathematics and Statistics, Queen's University, Kingston, Ontario, Canada
| | | | - Zhiwen Tan
- Department of Public Health Sciences, Queen's University, Kingston, Ontario, Canada
| |
Collapse
|
10
|
Downing T, Angelopoulos N. A primer on correlation-based dimension reduction methods for multi-omics analysis. J R Soc Interface 2023; 20:20230344. [PMID: 37817584 PMCID: PMC10565429 DOI: 10.1098/rsif.2023.0344] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/15/2023] [Accepted: 09/19/2023] [Indexed: 10/12/2023] Open
Abstract
The continuing advances of omic technologies mean that it is now more tangible to measure the numerous features collectively reflecting the molecular properties of a sample. When multiple omic methods are used, statistical and computational approaches can exploit these large, connected profiles. Multi-omics is the integration of different omic data sources from the same biological sample. In this review, we focus on correlation-based dimension reduction approaches for single omic datasets, followed by methods for pairs of omics datasets, before detailing further techniques for three or more omic datasets. We also briefly detail network methods when three or more omic datasets are available and which complement correlation-oriented tools. To aid readers new to this area, these are all linked to relevant R packages that can implement these procedures. Finally, we discuss scenarios of experimental design and present road maps that simplify the selection of appropriate analysis methods. This review will help researchers navigate emerging methods for multi-omics and integrating diverse omic datasets appropriately. This raises the opportunity of implementing population multi-omics with large sample sizes as omics technologies and our understanding improve.
Collapse
Affiliation(s)
- Tim Downing
- Pirbright Institute, Pirbright, Surrey, UK
- Department of Biotechnology, Dublin City University, Dublin, Ireland
| | | |
Collapse
|
11
|
Hao Y, Jing XY, Sun Q. Cancer survival prediction by learning comprehensive deep feature representation for multiple types of genetic data. BMC Bioinformatics 2023; 24:267. [PMID: 37380946 DOI: 10.1186/s12859-023-05392-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/07/2023] [Accepted: 06/19/2023] [Indexed: 06/30/2023] Open
Abstract
BACKGROUND Cancer is one of the leading death causes around the world. Accurate prediction of its survival time is significant, which can help clinicians make appropriate therapeutic schemes. Cancer data can be characterized by varied molecular features, clinical behaviors and morphological appearances. However, the cancer heterogeneity problem usually makes patient samples with different risks (i.e., short and long survival time) inseparable, thereby causing unsatisfactory prediction results. Clinical studies have shown that genetic data tends to contain more molecular biomarkers associated with cancer, and hence integrating multi-type genetic data may be a feasible way to deal with cancer heterogeneity. Although multi-type gene data have been used in the existing work, how to learn more effective features for cancer survival prediction has not been well studied. RESULTS To this end, we propose a deep learning approach to reduce the negative impact of cancer heterogeneity and improve the cancer survival prediction effect. It represents each type of genetic data as the shared and specific features, which can capture the consensus and complementary information among all types of data. We collect mRNA expression, DNA methylation and microRNA expression data for four cancers to conduct experiments. CONCLUSIONS Experimental results demonstrate that our approach substantially outperforms established integrative methods and is effective for cancer survival prediction. AVAILABILITY AND IMPLEMENTATION https://github.com/githyr/ComprehensiveSurvival .
Collapse
Affiliation(s)
- Yaru Hao
- School of Computer Science, Wuhan University, Wuhan, China.
| | - Xiao-Yuan Jing
- School of Computer Science, Wuhan University, Wuhan, China.
- School of Computer, Guangdong University of Petrochemical Technology, Maoming, China.
- State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China.
| | - Qixing Sun
- School of Computer Science, Wuhan University, Wuhan, China
| |
Collapse
|
12
|
Rahnenführer J, De Bin R, Benner A, Ambrogi F, Lusa L, Boulesteix AL, Migliavacca E, Binder H, Michiels S, Sauerbrei W, McShane L. Statistical analysis of high-dimensional biomedical data: a gentle introduction to analytical goals, common approaches and challenges. BMC Med 2023; 21:182. [PMID: 37189125 DOI: 10.1186/s12916-023-02858-y] [Citation(s) in RCA: 10] [Impact Index Per Article: 10.0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 12/28/2022] [Accepted: 04/03/2023] [Indexed: 05/17/2023] Open
Abstract
BACKGROUND In high-dimensional data (HDD) settings, the number of variables associated with each observation is very large. Prominent examples of HDD in biomedical research include omics data with a large number of variables such as many measurements across the genome, proteome, or metabolome, as well as electronic health records data that have large numbers of variables recorded for each patient. The statistical analysis of such data requires knowledge and experience, sometimes of complex methods adapted to the respective research questions. METHODS Advances in statistical methodology and machine learning methods offer new opportunities for innovative analyses of HDD, but at the same time require a deeper understanding of some fundamental statistical concepts. Topic group TG9 "High-dimensional data" of the STRATOS (STRengthening Analytical Thinking for Observational Studies) initiative provides guidance for the analysis of observational studies, addressing particular statistical challenges and opportunities for the analysis of studies involving HDD. In this overview, we discuss key aspects of HDD analysis to provide a gentle introduction for non-statisticians and for classically trained statisticians with little experience specific to HDD. RESULTS The paper is organized with respect to subtopics that are most relevant for the analysis of HDD, in particular initial data analysis, exploratory data analysis, multiple testing, and prediction. For each subtopic, main analytical goals in HDD settings are outlined. For each of these goals, basic explanations for some commonly used analysis methods are provided. Situations are identified where traditional statistical methods cannot, or should not, be used in the HDD setting, or where adequate analytic tools are still lacking. Many key references are provided. CONCLUSIONS This review aims to provide a solid statistical foundation for researchers, including statisticians and non-statisticians, who are new to research with HDD or simply want to better evaluate and understand the results of HDD analyses.
Collapse
Affiliation(s)
| | | | - Axel Benner
- Division of Biostatistics, German Cancer Research Center (DKFZ), Heidelberg, Germany
| | - Federico Ambrogi
- Department of Clinical Sciences and Community Health, University of Milan, Milan, Italy
- Scientific Directorate, IRCCS Policlinico San Donato, San Donato Milanese, Italy
| | - Lara Lusa
- Department of Mathematics, Faculty of Mathematics, Natural Sciences and Information Technology, University of Primorksa, Koper, Slovenia
- Institute of Biostatistics and Medical Informatics, University of Ljubljana, Ljubljana, Slovenia
| | - Anne-Laure Boulesteix
- Institute for Medical Information Processing, Biometry and Epidemiology, Ludwig Maximilian University of Munich, Munich, Germany
| | | | - Harald Binder
- Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center, University of Freiburg, Freiburg, Germany
| | - Stefan Michiels
- Service de Biostatistique et d'Épidémiologie, Gustave Roussy, Université Paris-Saclay, Villejuif, France
- Oncostat U1018, Inserm, Université Paris-Saclay, Labeled Ligue Contre le Cancer, Villejuif, France
| | - Willi Sauerbrei
- Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center, University of Freiburg, Freiburg, Germany
| | - Lisa McShane
- Biometric Research Program, Division of Cancer Treatment and Diagnosis, National Cancer Institute, Bethesda, MD, USA.
| |
Collapse
|
13
|
Wade S. Bayesian cluster analysis. PHILOSOPHICAL TRANSACTIONS. SERIES A, MATHEMATICAL, PHYSICAL, AND ENGINEERING SCIENCES 2023; 381:20220149. [PMID: 36970819 PMCID: PMC10041359 DOI: 10.1098/rsta.2022.0149] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 07/22/2022] [Accepted: 01/03/2023] [Indexed: 06/18/2023]
Abstract
Bayesian cluster analysis offers substantial benefits over algorithmic approaches by providing not only point estimates but also uncertainty in the clustering structure and patterns within each cluster. An overview of Bayesian cluster analysis is provided, including both model-based and loss-based approaches, along with a discussion on the importance of the kernel or loss selected and prior specification. Advantages are demonstrated in an application to cluster cells and discover latent cell types in single-cell RNA sequencing data to study embryonic cellular development. Lastly, we focus on the ongoing debate between finite and infinite mixtures in a model-based approach and robustness to model misspecification. While much of the debate and asymptotic theory focuses on the marginal posterior of the number of clusters, we empirically show that quite a different behaviour is obtained when estimating the full clustering structure. This article is part of the theme issue 'Bayesian inference: challenges, perspectives, and prospects'.
Collapse
Affiliation(s)
- S. Wade
- School of Mathematics and Maxwell Institute for Mathematical Sciences, University of Edinburgh, James Clerk Maxwell Building, Edinburgh, UK
| |
Collapse
|
14
|
Unlu Yazici M, Marron JS, Bakir-Gungor B, Zou F, Yousef M. Invention of 3Mint for feature grouping and scoring in multi-omics. Front Genet 2023; 14:1093326. [PMID: 37007972 PMCID: PMC10050723 DOI: 10.3389/fgene.2023.1093326] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/08/2022] [Accepted: 02/27/2023] [Indexed: 03/17/2023] Open
Abstract
Advanced genomic and molecular profiling technologies accelerated the enlightenment of the regulatory mechanisms behind cancer development and progression, and the targeted therapies in patients. Along this line, intense studies with immense amounts of biological information have boosted the discovery of molecular biomarkers. Cancer is one of the leading causes of death around the world in recent years. Elucidation of genomic and epigenetic factors in Breast Cancer (BRCA) can provide a roadmap to uncover the disease mechanisms. Accordingly, unraveling the possible systematic connections between-omics data types and their contribution to BRCA tumor progression is crucial. In this study, we have developed a novel machine learning (ML) based integrative approach for multi-omics data analysis. This integrative approach combines information from gene expression (mRNA), microRNA (miRNA) and methylation data. Due to the complexity of cancer, this integrated data is expected to improve the prediction, diagnosis and treatment of disease through patterns only available from the 3-way interactions between these 3-omics datasets. In addition, the proposed method bridges the interpretation gap between the disease mechanisms that drive onset and progression. Our fundamental contribution is the 3 Multi-omics integrative tool (3Mint). This tool aims to perform grouping and scoring of groups using biological knowledge. Another major goal is improved gene selection via detection of novel groups of cross-omics biomarkers. Performance of 3Mint is assessed using different metrics. Our computational performance evaluations showed that the 3Mint classifies the BRCA molecular subtypes with lower number of genes when compared to the miRcorrNet tool which uses miRNA and mRNA gene expression profiles in terms of similar performance metrics (95% Accuracy). The incorporation of methylation data in 3Mint yields a much more focused analysis. The 3Mint tool and all other supplementary files are available at https://github.com/malikyousef/3Mint/.
Collapse
Affiliation(s)
- Miray Unlu Yazici
- Department of Bioengineering, Abdullah Gül University, Kayseri, Türkiye
| | - J. S. Marron
- Department of Statistics and Operations Research, University of North Carolina, Chapel Hill, NC, United States
| | - Burcu Bakir-Gungor
- Department of Bioengineering, Abdullah Gül University, Kayseri, Türkiye
- Department of Computer Engineering, Abdullah Gul University, Kayseri, Türkiye
| | - Fei Zou
- Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC, United States
- Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, NC, United States
| | - Malik Yousef
- Department of Information Systems, Zefat Academic College, Zefat, Israel
- Galilee Digital Health Research Center, Zefat Academic College, Zefat, Israel
- *Correspondence: Malik Yousef,
| |
Collapse
|
15
|
Akdemir D, Somo M, Isidro-Sanchéz J. An Expectation-Maximization Algorithm for Combining a Sample of Partially Overlapping Covariance Matrices. AXIOMS 2023; 12:161. [PMID: 37284612 PMCID: PMC10243021 DOI: 10.3390/axioms12020161] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/08/2023]
Abstract
The generation of unprecedented amounts of data brings new challenges in data management, but also an opportunity to accelerate the identification of processes of multiple science disciplines. One of these challenges is the harmonization of high-dimensional unbalanced and heterogeneous data. In this manuscript, we propose a statistical approach to combine incomplete and partially-overlapping pieces of covariance matrices that come from independent experiments. We assume that the data are a random sample of partial covariance matrices sampled from Wishart distributions and we derive an expectation-maximization algorithm for parameter estimation. We demonstrate the properties of our method by (i) using simulation studies and (ii) using empirical datasets. In general, being able to make inferences about the covariance of variables not observed in the same experiment is a valuable tool for data analysis since covariance estimation is an important step in many statistical applications, such as multivariate analysis, principal component analysis, factor analysis, and structural equation modeling.
Collapse
Affiliation(s)
- Deniz Akdemir
- Center of International Bone Marrow Transplantation Research, Minneapolis, MN 55401-1206, USA
| | | | - Julio Isidro-Sanchéz
- Centro de Biotecnologia y Genómica de Plantas, Instituto Nacional de Investigación y Tecnologia Agraria y Alimentaria, Universidad Politécnica de Madrid, 28223, Madrid, Spain
| |
Collapse
|
16
|
Chen X, Han M, Li Y, Li X, Zhang J, Zhu Y. Identification of functional gene modules by integrating multi-omics data and known molecular interactions. Front Genet 2023; 14:1082032. [PMID: 36760999 PMCID: PMC9902936 DOI: 10.3389/fgene.2023.1082032] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/27/2022] [Accepted: 01/11/2023] [Indexed: 01/25/2023] Open
Abstract
Multi-omics data integration has emerged as a promising approach to identify patient subgroups. However, in terms of grouping genes (or gene products) into co-expression modules, data integration methods suffer from two main drawbacks. First, most existing methods only consider genes or samples measured in all different datasets. Second, known molecular interactions (e.g., transcriptional regulatory interactions, protein-protein interactions and biological pathways) cannot be utilized to assist in module detection. Herein, we present a novel data integration framework, Correlation-based Local Approximation of Membership (CLAM), which provides two methodological innovations to address these limitations: 1) constructing a trans-omics neighborhood matrix by integrating multi-omics datasets and known molecular interactions, and 2) using a local approximation procedure to define gene modules from the matrix. Applying Correlation-based Local Approximation of Membership to human colorectal cancer (CRC) and mouse B-cell differentiation multi-omics data obtained from The Cancer Genome Atlas (TCGA), Clinical Proteomics Tumor Analysis Consortium (CPTAC), Gene Expression Omnibus (GEO) and ProteomeXchange database, we demonstrated its superior ability to recover biologically relevant modules and gene ontology (GO) terms. Further investigation of the colorectal cancer modules revealed numerous transcription factors and KEGG pathways that played crucial roles in colorectal cancer progression. Module-based survival analysis constructed four survival-related networks in which pairwise gene correlations were significantly correlated with colorectal cancer patient survival. Overall, the series of evaluations demonstrated the great potential of Correlation-based Local Approximation of Membership for identifying modular biomarkers for complex diseases. We implemented Correlation-based Local Approximation of Membership as a user-friendly application available at https://github.com/free1234hm/CLAM.
Collapse
Affiliation(s)
- Xiaoqing Chen
- Basic Medical School, Anhui Medical University, Hefei, China,National Center for Protein Sciences (Beijing), Beijing Proteome Research Center, Beijing Institute of Lifeomics, Beijing, China
| | - Mingfei Han
- National Center for Protein Sciences (Beijing), Beijing Proteome Research Center, Beijing Institute of Lifeomics, Beijing, China
| | - Yingxing Li
- Central Research Laboratory, Peking Union Medical College Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China
| | - Xiao Li
- National Center for Protein Sciences (Beijing), Beijing Proteome Research Center, Beijing Institute of Lifeomics, Beijing, China
| | - Jiaqi Zhang
- National Center for Protein Sciences (Beijing), Beijing Proteome Research Center, Beijing Institute of Lifeomics, Beijing, China
| | - Yunping Zhu
- Basic Medical School, Anhui Medical University, Hefei, China,National Center for Protein Sciences (Beijing), Beijing Proteome Research Center, Beijing Institute of Lifeomics, Beijing, China,*Correspondence: Yunping Zhu,
| |
Collapse
|
17
|
Niranjan V, Uttarkar A, Kaul A, Varghese M. A Machine Learning-Based Approach Using Multi-omics Data to Predict Metabolic Pathways. Methods Mol Biol 2023; 2553:441-452. [PMID: 36227554 DOI: 10.1007/978-1-0716-2617-7_19] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 06/16/2023]
Abstract
The integrative method approaches are continuously evolving to provide accurate insights from the data that is received through experimentation on various biological systems. Multi-omics data can be integrated with predictive machine learning algorithms in order to provide results with high accuracy. This protocol chapter defines the steps required for the ML-multi-omics integration methods that are applied on biological datasets for its analysis and the visual interpretation of the results thus obtained.
Collapse
Affiliation(s)
- Vidya Niranjan
- Department of Biotechnology, R V College of Engineering, Mysuru Road, Kengeri, Bengaluru, India.
| | - Akshay Uttarkar
- Department of Biotechnology, R V College of Engineering, Mysuru Road, Kengeri, Bengaluru, India
| | - Aakaanksha Kaul
- Department of Biotechnology, R V College of Engineering, Mysuru Road, Kengeri, Bengaluru, India
| | - Maryanne Varghese
- Department of Biotechnology, R V College of Engineering, Mysuru Road, Kengeri, Bengaluru, India
| |
Collapse
|
18
|
Chalise P, Kwon D, Fridley BL, Mo Q. Statistical Methods for Integrative Clustering of Multi-omics Data. Methods Mol Biol 2023; 2629:73-93. [PMID: 36929074 PMCID: PMC10950392 DOI: 10.1007/978-1-0716-2986-4_5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 03/18/2023]
Abstract
Cancers are heterogeneous diseases caused by accumulated mutations or abnormal alterations at multi-levels of biological processes including genomics, epigenomics, transcriptomics, and proteomics. There is a great clinical interest in identifying cancer molecular subtypes for disease prognosis and personalized medicine. Integrative clustering is a powerful unsupervised learning method that has been increasingly used to identify cancer molecular subtypes using multi-omics data including somatic mutations, DNA copy numbers, DNA methylation, and gene expression. Integrative clustering methods are generally classified into model-based or nonparametric approaches. In this chapter, we will give an overview of the frequently used model-based methods, including iCluster, iClusterPlus, and iClusterBayes, and the nonparametric method, integrative nonnegative matrix factorization (intNMF). We will use the integrative analyses of uveal melanoma and lower-grade glioma to illustrate these representative methods. Finally, we will discuss the strengths and limitations of these representative methods and give suggestions for performing integrative analyses of cancer multi-omics data in practice.
Collapse
Affiliation(s)
- Prabhakar Chalise
- Department of Biostatistics & Data Science, University of Kansas Medical Center, Kansas City, KS, USA
| | - Deukwoo Kwon
- Department of Population Health Science & Policy, Icahn School of Medicine at Mount Sinai, New York, NY, USA
| | - Brooke L Fridley
- Department of Biostatistics & Bioinformatics, H. Lee Moffitt Cancer Center & Research Institute, Tampa, FL, USA
| | - Qianxing Mo
- Department of Biostatistics & Bioinformatics, H. Lee Moffitt Cancer Center & Research Institute, Tampa, FL, USA.
| |
Collapse
|
19
|
Mokou M, Narayanasamy S, Stroggilos R, Balaur IA, Vlahou A, Mischak H, Frantzi M. A Drug Repurposing Pipeline Based on Bladder Cancer Integrated Proteotranscriptomics Signatures. Methods Mol Biol 2023; 2684:59-99. [PMID: 37410228 DOI: 10.1007/978-1-0716-3291-8_4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 07/07/2023]
Abstract
Delivering better care for patients with bladder cancer (BC) necessitates the development of novel therapeutic strategies that address both the high disease heterogeneity and the limitations of the current therapeutic modalities, such as drug low efficacy and patient resistance acquisition. Drug repurposing is a cost-effective strategy that targets the reuse of existing drugs for new therapeutic purposes. Such a strategy could open new avenues toward more effective BC treatment. BC patients' multi-omics signatures can be used to guide the investigation of existing drugs that show an effective therapeutic potential through drug repurposing. In this book chapter, we present an integrated multilayer approach that includes cross-omics analyses from publicly available transcriptomics and proteomics data derived from BC tissues and cell lines that were investigated for the development of disease-specific signatures. These signatures are subsequently used as input for a signature-based repurposing approach using the Connectivity Map (CMap) tool. We further explain the steps that may be followed to identify and select existing drugs of increased potential for repurposing in BC patients.
Collapse
Affiliation(s)
- Marika Mokou
- Department of Biomarker Research, Mosaiques Diagnostics, Hannover, Germany.
| | - Shaman Narayanasamy
- Luxembourg Centre for Systems Biomedicine (LCSB), University of Luxembourg, Esch-sur-Alzette, Luxembourg
| | - Rafael Stroggilos
- Systems Biology Center, Biomedical Research Foundation, Academy of Athens, Athens, Greece
| | - Irina-Afrodita Balaur
- Luxembourg Centre for Systems Biomedicine (LCSB), University of Luxembourg, Esch-sur-Alzette, Luxembourg
| | - Antonia Vlahou
- Systems Biology Center, Biomedical Research Foundation, Academy of Athens, Athens, Greece
| | - Harald Mischak
- Department of Biomarker Research, Mosaiques Diagnostics, Hannover, Germany
- Institute of Cardiovascular and Medical Sciences, University of Glasgow, Glasgow, UK
| | - Maria Frantzi
- Department of Biomarker Research, Mosaiques Diagnostics, Hannover, Germany
| |
Collapse
|
20
|
Ni Y, He J, Chalise P. Randomized singular value decomposition for integrative subtype analysis of 'omics data' using non-negative matrix factorization. Stat Appl Genet Mol Biol 2023; 22:sagmb-2022-0047. [PMID: 37937887 DOI: 10.1515/sagmb-2022-0047] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/09/2022] [Accepted: 09/25/2023] [Indexed: 11/09/2023]
Abstract
Integration of multiple 'omics datasets for differentiating cancer subtypes is a powerful technic that leverages the consistent and complementary information across multi-omics data. Matrix factorization is a common technique used in integrative clustering for identifying latent subtype structure across multi-omics data. High dimensionality of the omics data and long computation time have been common challenges of clustering methods. In order to address the challenges, we propose randomized singular value decomposition (RSVD) for integrative clustering using Non-negative Matrix Factorization: intNMF-rsvd. The method utilizes RSVD to reduce the dimensionality by projecting the data into eigen vector space with user specified lower rank. Then, clustering analysis is carried out by estimating common basis matrix across the projected multi-omics datasets. The performance of the proposed method was assessed using the simulated datasets and compared with six state-of-the-art integrative clustering methods using real-life datasets from The Cancer Genome Atlas Study. intNMF-rsvd was found working efficiently and competitively as compared to standard intNMF and other multi-omics clustering methods. Most importantly, intNMF-rsvd can handle large number of features and significantly reduce the computation time. The identified subtypes can be utilized for further clinical association studies to understand the etiology of the disease.
Collapse
Affiliation(s)
- Yonghui Ni
- Department of Biostatistics and Data Science, University of Kansas Medical Center, 3901 Rainbow Blvd, Kansas City, KS 66160, USA
| | - Jianghua He
- Department of Biostatistics and Data Science, University of Kansas Medical Center, 3901 Rainbow Blvd, Kansas City, KS 66160, USA
| | - Prabhakar Chalise
- Department of Biostatistics and Data Science, University of Kansas Medical Center, 3901 Rainbow Blvd, Kansas City, KS 66160, USA
| |
Collapse
|
21
|
Hight SK, Clark TN, Kurita KL, McMillan EA, Bray W, Shaikh AF, Khadilkar A, Haeckl FPJ, Carnevale-Neto F, La S, Lohith A, Vaden RM, Lee J, Wei S, Lokey RS, White MA, Linington RG, MacMillan JB. High-throughput functional annotation of natural products by integrated activity profiling. Proc Natl Acad Sci U S A 2022; 119:e2208458119. [PMID: 36449542 PMCID: PMC9894231 DOI: 10.1073/pnas.2208458119] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/19/2022] [Accepted: 10/19/2022] [Indexed: 12/05/2022] Open
Abstract
Determining mechanism of action (MOA) is one of the biggest challenges in natural products discovery. Here, we report a comprehensive platform that uses Similarity Network Fusion (SNF) to improve MOA predictions by integrating data from the cytological profiling high-content imaging platform and the gene expression platform Functional Signature Ontology, and pairs these data with untargeted metabolomics analysis for de novo bioactive compound discovery. The predictive value of the integrative approach was assessed using a library of target-annotated small molecules as benchmarks. Using Kolmogorov-Smirnov (KS) tests to compare in-class to out-of-class similarity, we found that SNF retains the ability to identify significant in-class similarity across a diverse set of target classes, and could find target classes not detectable in either platform alone. This confirmed that integration of expression-based and image-based phenotypes can accurately report on MOA. Furthermore, we integrated untargeted metabolomics of complex natural product fractions with the SNF network to map biological signatures to specific metabolites. Three examples are presented where SNF coupled with metabolomics was used to directly functionally characterize natural products and accelerate identification of bioactive metabolites, including the discovery of the azoxy-containing biaryl compounds parkamycins A and B. Our results support SNF integration of multiple phenotypic screening approaches along with untargeted metabolomics as a powerful approach for advancing natural products drug discovery.
Collapse
Affiliation(s)
- Suzie K Hight
- Department of Cell Biology, University of Texas Southwestern Medical Center, Dallas, TX 75390
| | - Trevor N Clark
- Department of Chemistry, Simon Fraser University, Burnaby, BC V5A 1S6, Canada
| | - Kenji L Kurita
- Department of Chemistry, Simon Fraser University, Burnaby, BC V5A 1S6, Canada
| | - Elizabeth A McMillan
- Department of Cell Biology, University of Texas Southwestern Medical Center, Dallas, TX 75390
| | - Walter Bray
- Department of Chemistry, University of California Santa Cruz, Santa Cruz, CA 95064
| | - Anam F Shaikh
- Department of Biochemistry, University of Texas Southwestern Medical Center, Dallas, TX 75390
| | - Aswad Khadilkar
- Department of Chemistry, University of California Santa Cruz, Santa Cruz, CA 95064
| | - F P Jake Haeckl
- Department of Chemistry, Simon Fraser University, Burnaby, BC V5A 1S6, Canada
| | | | - Scott La
- Department of Chemistry, University of California Santa Cruz, Santa Cruz, CA 95064
| | - Akshar Lohith
- Department of Chemistry, University of California Santa Cruz, Santa Cruz, CA 95064
| | - Rachel M Vaden
- Department of Cell Biology, University of Texas Southwestern Medical Center, Dallas, TX 75390
| | - Jeon Lee
- Department of Bioinformatics, University of Texas Southwestern Medical Center, Dallas, TX 75390
| | - Shuguang Wei
- Department of Biochemistry, University of Texas Southwestern Medical Center, Dallas, TX 75390
| | - R Scott Lokey
- Department of Chemistry, University of California Santa Cruz, Santa Cruz, CA 95064
| | - Michael A White
- Department of Cell Biology, University of Texas Southwestern Medical Center, Dallas, TX 75390
| | - Roger G Linington
- Department of Chemistry, Simon Fraser University, Burnaby, BC V5A 1S6, Canada
| | - John B MacMillan
- Department of Chemistry, University of California Santa Cruz, Santa Cruz, CA 95064
- Department of Biochemistry, University of Texas Southwestern Medical Center, Dallas, TX 75390
| |
Collapse
|
22
|
Tomkins M, Hoerbst F, Gupta S, Apelt F, Kehr J, Kragler F, Morris RJ. Exact Bayesian inference for the detection of graft-mobile transcripts from sequencing data. J R Soc Interface 2022; 19:20220644. [PMID: 36514890 PMCID: PMC9748499 DOI: 10.1098/rsif.2022.0644] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/01/2022] [Accepted: 11/24/2022] [Indexed: 12/15/2022] Open
Abstract
The long-distance transport of messenger RNAs (mRNAs) has been shown to be important for several developmental processes in plants. A popular method for identifying travelling mRNAs is to perform RNA-Seq on grafted plants. This approach depends on the ability to correctly assign sequenced mRNAs to the genetic background from which they originated. The assignment is often based on the identification of single-nucleotide polymorphisms (SNPs) between otherwise identical sequences. A major challenge is therefore to distinguish SNPs from sequencing errors. Here, we show how Bayes factors can be computed analytically using RNA-Seq data over all the SNPs in an mRNA. We used simulations to evaluate the performance of the proposed framework and demonstrate how Bayes factors accurately identify graft-mobile transcripts. The comparison with other detection methods using simulated data shows how not taking the variability in read depth, error rates and multiple SNPs per transcript into account can lead to incorrect classification. Our results suggest experimental design criteria for successful graft-mobile mRNA detection and show the pitfalls of filtering for sequencing errors or focusing on single SNPs within an mRNA.
Collapse
Affiliation(s)
- Melissa Tomkins
- Computational and Systems Biology, John Innes Centre, Norwich Research Park, Norwich NR47UH, UK
| | - Franziska Hoerbst
- Computational and Systems Biology, John Innes Centre, Norwich Research Park, Norwich NR47UH, UK
| | - Saurabh Gupta
- Max Planck Institute of Molecular Plant Physiology, Max Planck Institute, Am Mühlenberg 1, Potsdam-Golm 14476, Germany
| | - Federico Apelt
- Max Planck Institute of Molecular Plant Physiology, Max Planck Institute, Am Mühlenberg 1, Potsdam-Golm 14476, Germany
| | - Julia Kehr
- Institute of Plant Science and Microbiology, Universität Hamburg, Ohnhorststrasse 18, Hamburg 22609, Germany
| | - Friedrich Kragler
- Max Planck Institute of Molecular Plant Physiology, Max Planck Institute, Am Mühlenberg 1, Potsdam-Golm 14476, Germany
| | - Richard J. Morris
- Computational and Systems Biology, John Innes Centre, Norwich Research Park, Norwich NR47UH, UK
| |
Collapse
|
23
|
Crook OM, Lilley KS, Gatto L, Kirk PD. Semi-Supervised Non-Parametric Bayesian Modelling of Spatial Proteomics. Ann Appl Stat 2022; 16:22-aoas1603. [PMID: 36507469 PMCID: PMC7613899 DOI: 10.1214/22-aoas1603] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/02/2023]
Abstract
Understanding sub-cellular protein localisation is an essential component in the analysis of context specific protein function. Recent advances in quantitative mass-spectrometry (MS) have led to high resolution mapping of thousands of proteins to sub-cellular locations within the cell. Novel modelling considerations to capture the complex nature of these data are thus necessary. We approach analysis of spatial proteomics data in a non-parametric Bayesian framework, using K-component mixtures of Gaussian process regression models. The Gaussian process regression model accounts for correlation structure within a sub-cellular niche, with each mixture component capturing the distinct correlation structure observed within each niche. The availability of marker proteins (i.e. proteins with a priori known labelled locations) motivates a semi-supervised learning approach to inform the Gaussian process hyperparameters. We moreover provide an efficient Hamiltonian-within-Gibbs sampler for our model. Furthermore, we reduce the computational burden associated with inversion of covariance matrices by exploiting the structure in the covariance matrix. A tensor decomposition of our covariance matrices allows extended Trench and Durbin algorithms to be applied to reduce the computational complexity of inversion and hence accelerate computation. We provide detailed case-studies on Drosophila embryos and mouse pluripotent embryonic stem cells to illustrate the benefit of semi-supervised functional Bayesian modelling of the data.
Collapse
|
24
|
Li W, Shao C, Zhou H, Du H, Chen H, Wan H, He Y. Multi-omics research strategies in ischemic stroke: A multidimensional perspective. Ageing Res Rev 2022; 81:101730. [PMID: 36087702 DOI: 10.1016/j.arr.2022.101730] [Citation(s) in RCA: 19] [Impact Index Per Article: 9.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/19/2022] [Revised: 08/23/2022] [Accepted: 09/03/2022] [Indexed: 01/31/2023]
Abstract
Ischemic stroke (IS) is a multifactorial and heterogeneous neurological disorder with high rate of death and long-term impairment. Despite years of studies, there are still no stroke biomarkers for clinical practice, and the molecular mechanisms of stroke remain largely unclear. The high-throughput omics approach provides new avenues for discovering biomarkers of IS and explaining its pathological mechanisms. However, single-omics approaches only provide a limited understanding of the biological pathways of diseases. The integration of multiple omics data means the simultaneous analysis of thousands of genes, RNAs, proteins and metabolites, revealing networks of interactions between multiple molecular levels. Integrated analysis of multi-omics approaches will provide helpful insights into stroke pathogenesis, therapeutic target identification and biomarker discovery. Here, we consider advances in genomics, transcriptomics, proteomics and metabolomics and outline their use in discovering the biomarkers and pathological mechanisms of IS. We then delineate strategies for achieving integration at the multi-omics level and discuss how integrative omics and systems biology can contribute to our understanding and management of IS.
Collapse
Affiliation(s)
- Wentao Li
- School of Pharmaceutical Sciences, Zhejiang Chinese Medical University, Hangzhou 310053, China.
| | - Chongyu Shao
- School of Life Sciences, Zhejiang Chinese Medical University, Hangzhou 310053, China.
| | - Huifen Zhou
- School of Life Sciences, Zhejiang Chinese Medical University, Hangzhou 310053, China.
| | - Haixia Du
- School of Life Sciences, Zhejiang Chinese Medical University, Hangzhou 310053, China.
| | - Haiyang Chen
- School of Pharmaceutical Sciences, Zhejiang Chinese Medical University, Hangzhou 310053, China.
| | - Haitong Wan
- School of Life Sciences, Zhejiang Chinese Medical University, Hangzhou 310053, China.
| | - Yu He
- School of Pharmaceutical Sciences, Zhejiang Chinese Medical University, Hangzhou 310053, China.
| |
Collapse
|
25
|
Rong Z, Liu Z, Song J, Cao L, Yu Y, Qiu M, Hou Y. MCluster-VAEs: An end-to-end variational deep learning-based clustering method for subtype discovery using multi-omics data. Comput Biol Med 2022; 150:106085. [PMID: 36162197 DOI: 10.1016/j.compbiomed.2022.106085] [Citation(s) in RCA: 11] [Impact Index Per Article: 5.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/01/2022] [Revised: 07/30/2022] [Accepted: 09/03/2022] [Indexed: 11/03/2022]
Abstract
The discovery of cancer subtypes based on unsupervised clustering helps in providing a precise diagnosis, guide treatment, and improve patients' prognoses. Instead of single-omics data, multi-omics data can improve the clustering performance because it obtains a comprehensive landscape for understanding biological systems and mechanisms. However, heterogeneous data from multiple sources raises high complexity and different kinds of noise, which are detrimental to the extraction of clustering information. We propose an end-to-end deep learning based method, called Multi-omics Clustering Variational Autoencoders (MCluster-VAEs), that can extract cluster-friendly representations on multi-omics data. First, a unified network architecture with an attention mechanism was developed for accurately modeling multi-omics data. Then, using a novel objective function built from the Variational Bayes technique, the model was trained to effectively obtain the posterior estimation of the clustering assignments. Compared with 12 other state-of-the-art multi-omics clustering methods, MCluster-VAEs achieved an outstanding performance on benchmark datasets from the TCGA database. On the Pan Cancer dataset, MCluster-VAEs achieved an adjusted Rand index of approximately 0.78 for cancer category recognition, an increase of more than 18% compared with other methods. Furthermore, a survival analysis and clinical parameter enrichment tests conducted on 10 cancer datasets demonstrated that MCluster-VAEs provides comparable and even better results than many common integrative approaches. These results demonstrate that MCluster-VAEs are a powerful new tool for dissecting complex multi-omics relationships and providing new insights for cancer subtype discovery.
Collapse
Affiliation(s)
- Zhiwei Rong
- Department of Biostatistics Beijing, Peking University School of Public Health, No. 38 Xueyuan Road, Haidian District, Beijing, 100000, China
| | - Zhilin Liu
- Department of Biostatistics Beijing, Peking University School of Public Health, No. 38 Xueyuan Road, Haidian District, Beijing, 100000, China
| | - Jiali Song
- Department of Biostatistics Beijing, Peking University School of Public Health, No. 38 Xueyuan Road, Haidian District, Beijing, 100000, China
| | - Lei Cao
- Department of Epidemiology and Biostatistics Harbin, Harbin Medical University School of Public Health, Harbin, 150000, Heilongjiang, China
| | - Yipe Yu
- Department of Biostatistics Beijing, Peking University School of Public Health, No. 38 Xueyuan Road, Haidian District, Beijing, 100000, China
| | - Mantang Qiu
- Department of Thoracic Surgery Beijing, Peking University People's Hospital, Beijing, 100000, China.
| | - Yan Hou
- Department of Biostatistics Beijing, Peking University School of Public Health, No. 38 Xueyuan Road, Haidian District, Beijing, 100000, China; Peking University Clinical Research Center, No. 38 Xueyuan Road, Haidian District, Beijing, 100000, China.
| |
Collapse
|
26
|
Zhanpeng H, Jiekang W. A Multiview Clustering Method With Low-Rank and Sparsity Constraints for Cancer Subtyping. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:3213-3223. [PMID: 34705654 DOI: 10.1109/tcbb.2021.3122917] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/13/2023]
Abstract
Multiomics data clustering is one of the major challenges in the field of precision medicine. Integration of multiomics data for cancer subtyping can improve the understanding on cancer and reveal systems-level insights. How to integrate multiomics data for accurate cancer subtyping is an interesting and challenging research problem. To capture the global and the local structure of omics data, a novel framework for integrating multiomics data is proposed for cancer subtyping. Multiview clustering with low-rank and sparsity constraints (MVCLRS) can measure the local similarities of samples in each omics data and obtain global consensus structures by integrating the multiomics data. The main insight provided by MVCLRS is that low-rank sparse subspace clustering for the construction of an affinity matrix can best capture the local similarities in omics data. Extensive testing is conducted on 10 real world cancer datasets with multiomics from The Cancer Genome Atlas. Compared with 10 state-of-the-art multiomics clustering algorithms, the MVCLRS performs better in the 10 cancer datasets by providing its clustering results with at least one enriched clinical label in nine of ten cancer subtypes, the most of any method.
Collapse
|
27
|
Bi-EB: Empirical Bayesian Biclustering for Multi-Omics Data Integration Pattern Identification among Species. Genes (Basel) 2022; 13:genes13111982. [DOI: 10.3390/genes13111982] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/25/2022] [Revised: 09/20/2022] [Accepted: 09/23/2022] [Indexed: 11/16/2022] Open
Abstract
Although several biclustering algorithms have been studied, few are used for cross-pattern identification across species using multi-omics data mining. A fast empirical Bayesian biclustering (Bi-EB) algorithm is developed to detect the patterns shared from both integrated omics data and between species. The Bi-EB algorithm addresses the clinical critical translational question using the bioinformatics strategy, which addresses how modules of genotype variation associated with phenotype from cancer cell screening data can be identified and how these findings can be directly translated to a cancer patient subpopulation. Empirical Bayesian probabilistic interpretation and ratio strategy are proposed in Bi-EB for the first time to detect the pairwise regulation patterns among species and variations in multiple omics on a gene level, such as proteins and mRNA. An expectation–maximization (EM) optimal algorithm is used to extract the foreground co-current variations out of its background noise data by adjusting parameters with bicluster membership probability threshold Ac; and the bicluster average probability p. Three simulation experiments and two real biology mRNA and protein data analyses conducted on the well-known Cancer Genomics Atlas (TCGA) and The Cancer Cell Line Encyclopedia (CCLE) verify that the proposed Bi-EB algorithm can significantly improve the clustering recovery and relevance accuracy, outperforming the other seven biclustering methods—Cheng and Church (CC), xMOTIFs, BiMax, Plaid, Spectral, FABIA, and QUBIC—with a recovery score of 0.98 and a relevance score of 0.99. At the same time, the Bi-EB algorithm is used to determine shared the causality patterns of mRNA to the protein between patients and cancer cells in TCGA and CCLE breast cancer. The clinically well-known treatment target protein module estrogen receptor (ER), ER (p118), AR, BCL2, cyclin E1, and IGFBP2 are identified in accordance with their mRNA expression variations in the luminal-like subtype. Ten genes, including CCNB1, CDH1, KDR, RAB25, PRKCA, etc., found which can maintain the high accordance of mRNA–protein for both breast cancer patients and cell lines in basal-like subtypes for the first time. Bi-EB provides a useful biclustering analysis tool to discover the cross patterns hidden both in multiple data matrixes (omics) and species. The implementation of the Bi-EB method in the clinical setting will have a direct impact on administrating translational research based on the cancer cell screening guidance.
Collapse
|
28
|
Coleman S, Kirk PDW, Wallace C. Consensus clustering for Bayesian mixture models. BMC Bioinformatics 2022; 23:290. [PMID: 35864476 PMCID: PMC9306175 DOI: 10.1186/s12859-022-04830-8] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/17/2021] [Accepted: 07/05/2022] [Indexed: 11/16/2022] Open
Abstract
BACKGROUND Cluster analysis is an integral part of precision medicine and systems biology, used to define groups of patients or biomolecules. Consensus clustering is an ensemble approach that is widely used in these areas, which combines the output from multiple runs of a non-deterministic clustering algorithm. Here we consider the application of consensus clustering to a broad class of heuristic clustering algorithms that can be derived from Bayesian mixture models (and extensions thereof) by adopting an early stopping criterion when performing sampling-based inference for these models. While the resulting approach is non-Bayesian, it inherits the usual benefits of consensus clustering, particularly in terms of computational scalability and providing assessments of clustering stability/robustness. RESULTS In simulation studies, we show that our approach can successfully uncover the target clustering structure, while also exploring different plausible clusterings of the data. We show that, when a parallel computation environment is available, our approach offers significant reductions in runtime compared to performing sampling-based Bayesian inference for the underlying model, while retaining many of the practical benefits of the Bayesian approach, such as exploring different numbers of clusters. We propose a heuristic to decide upon ensemble size and the early stopping criterion, and then apply consensus clustering to a clustering algorithm derived from a Bayesian integrative clustering method. We use the resulting approach to perform an integrative analysis of three 'omics datasets for budding yeast and find clusters of co-expressed genes with shared regulatory proteins. We validate these clusters using data external to the analysis. CONCLUSTIONS Our approach can be used as a wrapper for essentially any existing sampling-based Bayesian clustering implementation, and enables meaningful clustering analyses to be performed using such implementations, even when computational Bayesian inference is not feasible, e.g. due to poor exploration of the target density (often as a result of increasing numbers of features) or a limited computational budget that does not along sufficient samples to drawn from a single chain. This enables researchers to straightforwardly extend the applicability of existing software to much larger datasets, including implementations of sophisticated models such as those that jointly model multiple datasets.
Collapse
Affiliation(s)
- Stephen Coleman
- MRC Biostatistics Unit, University of Cambridge, Cambridge, UK
| | - Paul D. W. Kirk
- MRC Biostatistics Unit, University of Cambridge, Cambridge, UK
- Cambridge Institute of Therapeutic Immunology and Infectious Disease, University of Cambridge, Cambridge, UK
| | - Chris Wallace
- MRC Biostatistics Unit, University of Cambridge, Cambridge, UK
- Cambridge Institute of Therapeutic Immunology and Infectious Disease, University of Cambridge, Cambridge, UK
| |
Collapse
|
29
|
Gliozzo J, Mesiti M, Notaro M, Petrini A, Patak A, Puertas-Gallardo A, Paccanaro A, Valentini G, Casiraghi E. Heterogeneous data integration methods for patient similarity networks. Brief Bioinform 2022; 23:6604996. [PMID: 35679533 PMCID: PMC9294435 DOI: 10.1093/bib/bbac207] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/07/2021] [Revised: 04/14/2022] [Accepted: 05/04/2022] [Indexed: 12/29/2022] Open
Abstract
Patient similarity networks (PSNs), where patients are represented as nodes and their similarities as weighted edges, are being increasingly used in clinical research. These networks provide an insightful summary of the relationships among patients and can be exploited by inductive or transductive learning algorithms for the prediction of patient outcome, phenotype and disease risk. PSNs can also be easily visualized, thus offering a natural way to inspect complex heterogeneous patient data and providing some level of explainability of the predictions obtained by machine learning algorithms. The advent of high-throughput technologies, enabling us to acquire high-dimensional views of the same patients (e.g. omics data, laboratory data, imaging data), calls for the development of data fusion techniques for PSNs in order to leverage this rich heterogeneous information. In this article, we review existing methods for integrating multiple biomedical data views to construct PSNs, together with the different patient similarity measures that have been proposed. We also review methods that have appeared in the machine learning literature but have not yet been applied to PSNs, thus providing a resource to navigate the vast machine learning literature existing on this topic. In particular, we focus on methods that could be used to integrate very heterogeneous datasets, including multi-omics data as well as data derived from clinical information and medical imaging.
Collapse
Affiliation(s)
- Jessica Gliozzo
- AnacletoLab - Computer Science Department, Universitá degli Studi di Milano, Via Celoria 18, 20135, Milan, Italy.,European Commission, Joint Research Centre (JRC), Ispra (VA), Italy.,CINI, Infolife National Laboratory, Roma, Italy
| | - Marco Mesiti
- AnacletoLab - Computer Science Department, Universitá degli Studi di Milano, Via Celoria 18, 20135, Milan, Italy.,CINI, Infolife National Laboratory, Roma, Italy
| | - Marco Notaro
- AnacletoLab - Computer Science Department, Universitá degli Studi di Milano, Via Celoria 18, 20135, Milan, Italy.,CINI, Infolife National Laboratory, Roma, Italy
| | - Alessandro Petrini
- AnacletoLab - Computer Science Department, Universitá degli Studi di Milano, Via Celoria 18, 20135, Milan, Italy.,CINI, Infolife National Laboratory, Roma, Italy
| | - Alex Patak
- European Commission, Joint Research Centre (JRC), Ispra (VA), Italy
| | | | - Alberto Paccanaro
- Department of Computer Science, Royal Holloway, University of London, Egham, TW20 0EX UK.,School of Applied Mathematics (EMAp), Fundação Getúlio Vargas, Rio de Janeiro Brazil
| | - Giorgio Valentini
- AnacletoLab - Computer Science Department, Universitá degli Studi di Milano, Via Celoria 18, 20135, Milan, Italy.,CINI, Infolife National Laboratory, Roma, Italy.,DSRC UNIMI, Data Science Research Center, Milano, 20135, Italy.,ELLIS, European Laboratory for Learning and Intelligent Systems, Berlin, Germany
| | - Elena Casiraghi
- AnacletoLab - Computer Science Department, Universitá degli Studi di Milano, Via Celoria 18, 20135, Milan, Italy.,CINI, Infolife National Laboratory, Roma, Italy
| |
Collapse
|
30
|
Hesami M, Alizadeh M, Jones AMP, Torkamaneh D. Machine learning: its challenges and opportunities in plant system biology. Appl Microbiol Biotechnol 2022; 106:3507-3530. [PMID: 35575915 DOI: 10.1007/s00253-022-11963-6] [Citation(s) in RCA: 18] [Impact Index Per Article: 9.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/14/2022] [Revised: 03/14/2022] [Accepted: 05/07/2022] [Indexed: 12/25/2022]
Abstract
Sequencing technologies are evolving at a rapid pace, enabling the generation of massive amounts of data in multiple dimensions (e.g., genomics, epigenomics, transcriptomic, metabolomics, proteomics, and single-cell omics) in plants. To provide comprehensive insights into the complexity of plant biological systems, it is important to integrate different omics datasets. Although recent advances in computational analytical pipelines have enabled efficient and high-quality exploration and exploitation of single omics data, the integration of multidimensional, heterogenous, and large datasets (i.e., multi-omics) remains a challenge. In this regard, machine learning (ML) offers promising approaches to integrate large datasets and to recognize fine-grained patterns and relationships. Nevertheless, they require rigorous optimizations to process multi-omics-derived datasets. In this review, we discuss the main concepts of machine learning as well as the key challenges and solutions related to the big data derived from plant system biology. We also provide in-depth insight into the principles of data integration using ML, as well as challenges and opportunities in different contexts including multi-omics, single-cell omics, protein function, and protein-protein interaction. KEY POINTS: • The key challenges and solutions related to the big data derived from plant system biology have been highlighted. • Different methods of data integration have been discussed. • Challenges and opportunities of the application of machine learning in plant system biology have been highlighted and discussed.
Collapse
Affiliation(s)
- Mohsen Hesami
- Department of Plant Agriculture, University of Guelph, Guelph, ON, N1G 2W1, Canada
| | - Milad Alizadeh
- Department of Botany, University of British Columbia, Vancouver, BC, V6T 1Z4, Canada
| | | | - Davoud Torkamaneh
- Département de Phytologie, Université Laval, Québec City, QC, G1V 0A6, Canada. .,Institut de Biologie Intégrative Et Des Systèmes (IBIS), Université Laval, Québec City, QC, G1V 0A6, Canada.
| |
Collapse
|
31
|
Zhang X, Zhou Z, Xu H, Liu CT. Integrative clustering methods for multi-omics data. WILEY INTERDISCIPLINARY REVIEWS. COMPUTATIONAL STATISTICS 2022; 14. [PMID: 35573155 PMCID: PMC9097984 DOI: 10.1002/wics.1553] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/06/2022]
Abstract
Integrative analysis of multi-omics data has drawn much attention from the scientific community due to the technological advancements which have generated various omics data. Leveraging these multi-omics data potentially provides a more comprehensive view of the disease mechanism or biological processes. Integrative multi-omics clustering is an unsupervised integrative method specifically used to find coherent groups of samples or features by utilizing information across multi-omics data. It aims to better stratify diseases and to suggest biological mechanisms and potential targeted therapies for the diseases. However, applying integrative multi-omics clustering is both statistically and computationally challenging due to various reasons such as high dimensionality and heterogeneity. In this review, we summarized integrative multi-omics clustering methods into three general categories: concatenated clustering, clustering of clusters, and interactive clustering based on when and how the multi-omics data are processed for clustering. We further classified the methods into different approaches under each category based on the main statistical strategy used during clustering. In addition, we have provided recommended practices tailored to four real-life scenarios to help researchers to strategize their selection in integrative multi-omics clustering methods for their future studies.
Collapse
Affiliation(s)
- Xiaoyu Zhang
- Department of Biostatistics, Boston University School of Public Health, Boston, Massachusetts, USA
| | - Zhenwei Zhou
- Department of Biostatistics, Boston University School of Public Health, Boston, Massachusetts, USA
| | - Hanfei Xu
- Department of Biostatistics, Boston University School of Public Health, Boston, Massachusetts, USA
| | - Ching-Ti Liu
- Department of Biostatistics, Boston University School of Public Health, Boston, Massachusetts, USA
| |
Collapse
|
32
|
Vahabi N, Michailidis G. Unsupervised Multi-Omics Data Integration Methods: A Comprehensive Review. Front Genet 2022; 13:854752. [PMID: 35391796 PMCID: PMC8981526 DOI: 10.3389/fgene.2022.854752] [Citation(s) in RCA: 44] [Impact Index Per Article: 22.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/14/2022] [Accepted: 02/28/2022] [Indexed: 12/26/2022] Open
Abstract
Through the developments of Omics technologies and dissemination of large-scale datasets, such as those from The Cancer Genome Atlas, Alzheimer’s Disease Neuroimaging Initiative, and Genotype-Tissue Expression, it is becoming increasingly possible to study complex biological processes and disease mechanisms more holistically. However, to obtain a comprehensive view of these complex systems, it is crucial to integrate data across various Omics modalities, and also leverage external knowledge available in biological databases. This review aims to provide an overview of multi-Omics data integration methods with different statistical approaches, focusing on unsupervised learning tasks, including disease onset prediction, biomarker discovery, disease subtyping, module discovery, and network/pathway analysis. We also briefly review feature selection methods, multi-Omics data sets, and resources/tools that constitute critical components for carrying out the integration.
Collapse
Affiliation(s)
- Nasim Vahabi
- Informatics Institute, University of Florida, Gainesville, FL, United States
| | - George Michailidis
- Informatics Institute, University of Florida, Gainesville, FL, United States
| |
Collapse
|
33
|
ALOHA: Aggregated local extrema splines for high-throughput dose-response analysis. COMPUTATIONAL TOXICOLOGY (AMSTERDAM, NETHERLANDS) 2022; 21:100196. [PMID: 35083394 PMCID: PMC8785973 DOI: 10.1016/j.comtox.2021.100196] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 02/03/2023]
Abstract
Computational methods for genomic dose-response integrate dose-response modeling with bioinformatics tools to evaluate changes in molecular and cellular functions related to pathogenic processes. These methods use parametric models to describe each gene's dose-response, but such models may not adequately capture expression changes. Additionally, current approaches do not consider gene co-expression networks. When assessing co-expression networks, one typically does not consider the dose-response relationship, resulting in 'co-regulated' gene sets containing genes having different dose-response patterns. To avoid these limitations, we develop an analysis pipeline called Aggregated Local Extrema Splines for High-throughput Analysis (ALOHA), which computes individual genomic dose-response functions using a flexible class Bayesian shape constrained splines and clusters gene co-regulation based upon these fits. Using splines, we reduce information loss due to parametric lack-of-fit issues, and because we cluster on dose-response relationships, we better identify co-regulation clusters for genes that have co-expressed dose-response patterns from chemical exposure. The clustered pathways can then be used to estimate a dose associated with a pre-specified biological response, i.e., the benchmark dose (BMD), and approximate a point of departure dose corresponding to minimal adverse response in the whole tissue/organism. We compare our approach to current parametric methods and our biologically enriched gene sets to cluster on normalized expression data. Using this methodology, we can more effectively extract the underlying structure leading to more cohesive estimates of gene set potency.
Collapse
|
34
|
Viaud G, Mayilvahanan P, Cournede PH. Representation Learning for the Clustering of Multi-Omics Data. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:135-145. [PMID: 33600320 DOI: 10.1109/tcbb.2021.3060340] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/12/2023]
Abstract
The integration of several sources of data for the identification of subtypes of diseases has gained attention over the past few years. The heterogeneity and the high dimensions of the data sets calls for an adequate representation of the data. We summarize the field of representation learning for the multi-omics clustering problem and we investigate several techniques to learn relevant combined representations, using methods from group factor analysis (PCA, MFA and extensions) and from machine learning with autoencoders. We highlight the importance of appropriately designing and training the latter, notably with a novel combination of a disjointed deep autoencoder (DDAE) architecture and a layer-wise reconstruction loss. These different representations can then be clustered to identify biologically meaningful clusters of patients. We provide a unifying framework for model comparison between statistical and deep learning approaches with the introduction of a new weighted internal clustering index that evaluates how well the clustering information is retained from each source, favoring contributions from all data sets. We apply our methodology to two case studies for which previous works of integrative clustering exist, TCGA Breast Cancer and TARGET Neuroblastoma, and show how our method can yield good and well-balanced clusters across the different data sources.
Collapse
|
35
|
Tripp BA, Otu HH. Integration of Multi-Omics Data Using Probabilistic Graph Models and
External Knowledge. Curr Bioinform 2022. [DOI: 10.2174/1574893616666210906141545] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/22/2022]
Abstract
Background:
High-throughput sequencing technologies have revolutionized the ability to
perform systems-level biology and elucidate molecular mechanisms of disease through the comprehensive
characterization of different layers of biological information. Integration of these heterogeneous
layers can provide insight into the underlying biology but is challenged by modeling complex interactions.
Objective:
We introduce OBaNK: omics integration using Bayesian networks and external knowledge,
an algorithm to model interactions between heterogeneous high-dimensional biological data to elucidate
complex functional clusters and emergent relationships associated with an observed phenotype.
Method:
Using Bayesian network learning, we modeled the statistical dependencies and interactions
between lipidomics, proteomics, and metabolomics data. The strength of a learned interaction between
molecules was altered based on external knowledge.
Results :
Networks learned from synthetic datasets based on real pathways achieved an average area under
the curve score of ~0.85, an improvement of ~0.23 from baseline methods. When applied to real
multi-omics data collected during pregnancy, five distinct functional networks of heterogeneous biological
data were identified, and the results were compared to other multi-omics integration approaches.
Conclusion:
OBaNK successfully improved the accuracy of learning interaction networks from data integrating
external knowledge, identified heterogeneous functional networks from real data, and suggested
potential novel interactions associated with the phenotype. These findings can guide future hypothesis
generation. OBaNK source code is available at: https://github.com/bridgettripp/OBaNK.git, and a
graphical user interface is available at: http://otulab.unl.edu/OBaNK.
Collapse
Affiliation(s)
- Bridget A. Tripp
- Department of Electrical and Computer Engineering, University of Nebraska-Lincoln, Lincoln, Nebraska, USA
- PhD Program of Complex Biosystems, University of Nebraska-Lincoln, Lincoln, Nebraska, USA
| | - Hasan H. Otu
- Department of Electrical and Computer Engineering, University of Nebraska-Lincoln, Lincoln, Nebraska, USA
| |
Collapse
|
36
|
Wandy J, Daly R. GraphOmics: an interactive platform to explore and integrate multi-omics data. BMC Bioinformatics 2021; 22:603. [PMID: 34922446 PMCID: PMC8684259 DOI: 10.1186/s12859-021-04500-1] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/23/2021] [Accepted: 11/30/2021] [Indexed: 12/26/2022] Open
Abstract
BACKGROUND An increasing number of studies now produce multiple omics measurements that require using sophisticated computational methods for analysis. While each omics data can be examined separately, jointly integrating multiple omics data allows for deeper understanding and insights to be gained from the study. In particular, data integration can be performed horizontally, where biological entities from multiple omics measurements are mapped to common reactions and pathways. However, data integration remains a challenge due to the complexity of the data and the difficulty in interpreting analysis results. RESULTS Here we present GraphOmics, a user-friendly platform to explore and integrate multiple omics datasets and support hypothesis generation. Users can upload transcriptomics, proteomics and metabolomics data to GraphOmics. Relevant entities are connected based on their biochemical relationships, and mapped to reactions and pathways from Reactome. From the Data Browser in GraphOmics, mapped entities and pathways can be ranked, sorted and filtered according to their statistical significance (p values) and fold changes. Context-sensitive panels provide information on the currently selected entities, while interactive heatmaps and clustering functionalities are also available. As a case study, we demonstrated how GraphOmics was used to interactively explore multi-omics data and support hypothesis generation using two complex datasets from existing Zebrafish regeneration and Covid-19 human studies. CONCLUSIONS GraphOmics is fully open-sourced and freely accessible from https://graphomics.glasgowcompbio.org/ . It can be used to integrate multiple omics data horizontally by mapping entities across omics to reactions and pathways. Our demonstration showed that by using interactive explorations from GraphOmics, interesting insights and biological hypotheses could be rapidly revealed.
Collapse
Affiliation(s)
- Joe Wandy
- Glasgow Polyomics, University of Glasgow, Glasgow, G61 1BD, UK
| | - Rónán Daly
- Glasgow Polyomics, University of Glasgow, Glasgow, G61 1BD, UK.
| |
Collapse
|
37
|
Madjar K, Zucknick M, Ickstadt K, Rahnenführer J. Combining heterogeneous subgroups with graph-structured variable selection priors for Cox regression. BMC Bioinformatics 2021; 22:586. [PMID: 34895139 PMCID: PMC8665528 DOI: 10.1186/s12859-021-04483-z] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/30/2021] [Accepted: 11/15/2021] [Indexed: 11/12/2022] Open
Abstract
Background Important objectives in cancer research are the prediction of a patient’s risk based on molecular measurements such as gene expression data and the identification of new prognostic biomarkers (e.g. genes). In clinical practice, this is often challenging because patient cohorts are typically small and can be heterogeneous. In classical subgroup analysis, a separate prediction model is fitted using only the data of one specific cohort. However, this can lead to a loss of power when the sample size is small. Simple pooling of all cohorts, on the other hand, can lead to biased results, especially when the cohorts are heterogeneous. Results We propose a new Bayesian approach suitable for continuous molecular measurements and survival outcome that identifies the important predictors and provides a separate risk prediction model for each cohort. It allows sharing information between cohorts to increase power by assuming a graph linking predictors within and across different cohorts. The graph helps to identify pathways of functionally related genes and genes that are simultaneously prognostic in different cohorts. Conclusions Results demonstrate that our proposed approach is superior to the standard approaches in terms of prediction performance and increased power in variable selection when the sample size is small. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-021-04483-z.
Collapse
Affiliation(s)
- Katrin Madjar
- Department of Statistics, TU Dortmund University, 44221, Dortmund, Germany.
| | - Manuela Zucknick
- Department of Biostatistics, Oslo Centre for Biostatistics and Epidemiology, University of Oslo, 0317, Oslo, Norway
| | - Katja Ickstadt
- Department of Statistics, TU Dortmund University, 44221, Dortmund, Germany
| | - Jörg Rahnenführer
- Department of Statistics, TU Dortmund University, 44221, Dortmund, Germany
| |
Collapse
|
38
|
Madjar K, Rahnenführer J. Weighted Cox regression for the prediction of heterogeneous patient subgroups. BMC Med Inform Decis Mak 2021; 21:342. [PMID: 34876106 PMCID: PMC8650299 DOI: 10.1186/s12911-021-01698-1] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/27/2021] [Accepted: 11/23/2021] [Indexed: 01/07/2023] Open
Abstract
BACKGROUND An important task in clinical medicine is the construction of risk prediction models for specific subgroups of patients based on high-dimensional molecular measurements such as gene expression data. Major objectives in modeling high-dimensional data are good prediction performance and feature selection to find a subset of predictors that are truly associated with a clinical outcome such as a time-to-event endpoint. In clinical practice, this task is challenging since patient cohorts are typically small and can be heterogeneous with regard to their relationship between predictors and outcome. When data of several subgroups of patients with the same or similar disease are available, it is tempting to combine them to increase sample size, such as in multicenter studies. However, heterogeneity between subgroups can lead to biased results and subgroup-specific effects may remain undetected. METHODS For this situation, we propose a penalized Cox regression model with a weighted version of the Cox partial likelihood that includes patients of all subgroups but assigns them individual weights based on their subgroup affiliation. The weights are estimated from the data such that patients who are likely to belong to the subgroup of interest obtain higher weights in the subgroup-specific model. RESULTS Our proposed approach is evaluated through simulations and application to real lung cancer cohorts, and compared to existing approaches. Simulation results demonstrate that our proposed model is superior to standard approaches in terms of prediction performance and variable selection accuracy when the sample size is small. CONCLUSIONS The results suggest that sharing information between subgroups by incorporating appropriate weights into the likelihood can increase power to identify the prognostic covariates and improve risk prediction.
Collapse
Affiliation(s)
- Katrin Madjar
- Department of Statistics, TU Dortmund University, 44221, Dortmund, Germany.
| | - Jörg Rahnenführer
- Department of Statistics, TU Dortmund University, 44221, Dortmund, Germany
| |
Collapse
|
39
|
|
40
|
Nguyen H, Tran D, Tran B, Roy M, Cassell A, Dascalu S, Draghici S, Nguyen T. SMRT: Randomized Data Transformation for Cancer Subtyping and Big Data Analysis. Front Oncol 2021; 11:725133. [PMID: 34745946 PMCID: PMC8563705 DOI: 10.3389/fonc.2021.725133] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/14/2021] [Accepted: 09/28/2021] [Indexed: 12/25/2022] Open
Abstract
Cancer is an umbrella term that includes a range of disorders, from those that are fast-growing and lethal to indolent lesions with low or delayed potential for progression to death. The treatment options, as well as treatment success, are highly dependent on the correct subtyping of individual patients. With the advancement of high-throughput platforms, we have the opportunity to differentiate among cancer subtypes from a holistic perspective that takes into consideration phenomena at different molecular levels (mRNA, methylation, etc.). This demands powerful integrative methods to leverage large multi-omics datasets for a better subtyping. Here we introduce Subtyping Multi-omics using a Randomized Transformation (SMRT), a new method for multi-omics integration and cancer subtyping. SMRT offers the following advantages over existing approaches: (i) the scalable analysis pipeline allows researchers to integrate multi-omics data and analyze hundreds of thousands of samples in minutes, (ii) the ability to integrate data types with different numbers of patients, (iii) the ability to analyze un-matched data of different types, and (iv) the ability to offer users a convenient data analysis pipeline through a web application. We also improve the efficiency of our ensemble-based, perturbation clustering to support analysis on machines with memory constraints. In an extensive analysis, we compare SMRT with eight state-of-the-art subtyping methods using 37 TCGA and two METABRIC datasets comprising a total of almost 12,000 patient samples from 28 different types of cancer. We also performed a number of simulation studies. We demonstrate that SMRT outperforms other methods in identifying subtypes with significantly different survival profiles. In addition, SMRT is extremely fast, being able to analyze hundreds of thousands of samples in minutes. The web application is available at http://SMRT.tinnguyen-lab.com. The R package will be deposited to CRAN as part of our PINSPlus software suite.
Collapse
Affiliation(s)
- Hung Nguyen
- Department of Computer Science and Engineering, University of Nevada Reno, Reno, NV, United States
| | - Duc Tran
- Department of Computer Science and Engineering, University of Nevada Reno, Reno, NV, United States
| | - Bang Tran
- Department of Computer Science and Engineering, University of Nevada Reno, Reno, NV, United States
| | - Monikrishna Roy
- Department of Computer Science and Engineering, University of Nevada Reno, Reno, NV, United States
| | - Adam Cassell
- Department of Computer Science and Engineering, University of Nevada Reno, Reno, NV, United States
| | - Sergiu Dascalu
- Department of Computer Science and Engineering, University of Nevada Reno, Reno, NV, United States
| | - Sorin Draghici
- Department of Computer Science, Wayne State University, Detroit, MI, United States
| | - Tin Nguyen
- Department of Computer Science and Engineering, University of Nevada Reno, Reno, NV, United States
| |
Collapse
|
41
|
Lu Z, Lou W. Bayesian consensus clustering for multivariate longitudinal data. Stat Med 2021; 41:108-127. [PMID: 34672001 DOI: 10.1002/sim.9225] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/24/2021] [Revised: 09/26/2021] [Accepted: 09/27/2021] [Indexed: 11/06/2022]
Abstract
In clinical and epidemiological studies, there is a growing interest in studying the heterogeneity among patients based on longitudinal characteristics to identify subtypes of the study population. Compared to clustering a single longitudinal marker, simultaneously clustering multiple longitudinal markers allow additional information to be incorporated into the clustering process, which reveals co-existing longitudinal patterns and generates deeper biological insight. In the current study, we propose a Bayesian consensus clustering (BCC) model for multivariate longitudinal data. Instead of arriving at a single overall clustering, the proposed model allows each marker to follow marker-specific local clustering and these local clusterings are aggregated to find a global (consensus) clustering. To estimate the posterior distribution of model parameters, a Gibbs sampling algorithm is proposed. We apply our proposed model to the primary biliary cirrhosis study to identify patient subtypes that may be associated with their prognosis. We also perform simulation studies to compare the clustering performance between the proposed model and existing models under several scenarios. The results demonstrate that the proposed BCC model serves as a useful tool for clustering multivariate longitudinal data.
Collapse
Affiliation(s)
- Zihang Lu
- Department of Public Health Sciences, Queen's University, Kingston, Ontario, Canada
| | - Wendy Lou
- Dalla Lana School of Public Health, University of Toronto, Toronto, Ontario, Canada
| |
Collapse
|
42
|
Mulvey CM, Breckels LM, Crook OM, Sanders DJ, Ribeiro ALR, Geladaki A, Christoforou A, Britovšek NK, Hurrell T, Deery MJ, Gatto L, Smith AM, Lilley KS. Spatiotemporal proteomic profiling of the pro-inflammatory response to lipopolysaccharide in the THP-1 human leukaemia cell line. Nat Commun 2021; 12:5773. [PMID: 34599159 PMCID: PMC8486773 DOI: 10.1038/s41467-021-26000-9] [Citation(s) in RCA: 32] [Impact Index Per Article: 10.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/02/2021] [Accepted: 09/07/2021] [Indexed: 02/07/2023] Open
Abstract
Protein localisation and translocation between intracellular compartments underlie almost all physiological processes. The hyperLOPIT proteomics platform combines mass spectrometry with state-of-the-art machine learning to map the subcellular location of thousands of proteins simultaneously. We combine global proteome analysis with hyperLOPIT in a fully Bayesian framework to elucidate spatiotemporal proteomic changes during a lipopolysaccharide (LPS)-induced inflammatory response. We report a highly dynamic proteome in terms of both protein abundance and subcellular localisation, with alterations in the interferon response, endo-lysosomal system, plasma membrane reorganisation and cell migration. Proteins not previously associated with an LPS response were found to relocalise upon stimulation, the functional consequences of which are still unclear. By quantifying proteome-wide uncertainty through Bayesian modelling, a necessary role for protein relocalisation and the importance of taking a holistic overview of the LPS-driven immune response has been revealed. The data are showcased as an interactive application freely available for the scientific community.
Collapse
Affiliation(s)
- Claire M Mulvey
- Cambridge Centre for Proteomics, Cambridge Systems Biology Centre and Department of Biochemistry, University of Cambridge, Cambridge, CB2 1QR, UK
- Cancer Research UK Cambridge Institute, University of Cambridge, Li Ka Shing Centre, Cambridge, CB2 0RE, UK
| | - Lisa M Breckels
- Cambridge Centre for Proteomics, Cambridge Systems Biology Centre and Department of Biochemistry, University of Cambridge, Cambridge, CB2 1QR, UK
| | - Oliver M Crook
- Cambridge Centre for Proteomics, Cambridge Systems Biology Centre and Department of Biochemistry, University of Cambridge, Cambridge, CB2 1QR, UK
- MRC Biostatistics Unit, Cambridge Institute for Public Health, Forvie Site, Robinson Way, Cambridge, CB2 0SR, UK
| | - David J Sanders
- Department of Microbial Diseases, Eastman Dental Institute, University College London, Royal Free Campus, Rowland Hill Street, London, NW3 2PF, UK
| | - Andre L R Ribeiro
- Department of Microbial Diseases, Eastman Dental Institute, University College London, Royal Free Campus, Rowland Hill Street, London, NW3 2PF, UK
| | - Aikaterini Geladaki
- Cambridge Centre for Proteomics, Cambridge Systems Biology Centre and Department of Biochemistry, University of Cambridge, Cambridge, CB2 1QR, UK
| | | | - Nina Kočevar Britovšek
- Cambridge Centre for Proteomics, Cambridge Systems Biology Centre and Department of Biochemistry, University of Cambridge, Cambridge, CB2 1QR, UK
- Lek d.d., Kolodvorska 27, Mengeš, 1234, Slovenia
| | - Tracey Hurrell
- Cambridge Centre for Proteomics, Cambridge Systems Biology Centre and Department of Biochemistry, University of Cambridge, Cambridge, CB2 1QR, UK
| | - Michael J Deery
- Cambridge Centre for Proteomics, Cambridge Systems Biology Centre and Department of Biochemistry, University of Cambridge, Cambridge, CB2 1QR, UK
| | - Laurent Gatto
- Cambridge Centre for Proteomics, Cambridge Systems Biology Centre and Department of Biochemistry, University of Cambridge, Cambridge, CB2 1QR, UK
- de Duve Institute, UCLouvain, Avenue Hippocrate 75, Brussels, 1200, Belgium
| | - Andrew M Smith
- Department of Microbial Diseases, Eastman Dental Institute, University College London, Royal Free Campus, Rowland Hill Street, London, NW3 2PF, UK.
| | - Kathryn S Lilley
- Cambridge Centre for Proteomics, Cambridge Systems Biology Centre and Department of Biochemistry, University of Cambridge, Cambridge, CB2 1QR, UK.
| |
Collapse
|
43
|
Tendon and multiomics: advantages, advances, and opportunities. NPJ Regen Med 2021; 6:61. [PMID: 34599188 PMCID: PMC8486786 DOI: 10.1038/s41536-021-00168-6] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/20/2020] [Accepted: 09/01/2021] [Indexed: 02/08/2023] Open
Abstract
Tendons heal by fibrosis, which hinders function and increases re-injury risk. Yet the biology that leads to degeneration and regeneration of tendons is not completely understood. Improved understanding of the metabolic nuances that cause diverse outcomes in tendinopathies is required to solve these problems. 'Omics methods are increasingly used to characterize phenotypes in tissues. Multiomics integrates 'omic datasets to identify coherent relationships and provide insight into differences in molecular and metabolic pathways between anatomic locations, and disease stages. This work reviews the current literature pertaining to multiomics in tendon and the potential of these platforms to improve tendon regeneration. We assessed the literature and identified areas where 'omics platforms contribute to the field: (1) Tendon biology where their hierarchical complexity and demographic factors are studied. (2) Tendon degeneration and healing, where comparisons across tendon pathologies are analyzed. (3) The in vitro engineered tendon phenotype, where we compare the engineered phenotype to relevant native tissues. (4) Finally, we review regenerative and therapeutic approaches. We identified gaps in current knowledge and opportunities for future study: (1) The need to increase the diversity of human subjects and cell sources. (2) Opportunities to improve understanding of tendon heterogeneity. (3) The need to use these improvements to inform new engineered and regenerative therapeutic approaches. (4) The need to increase understanding of the development of tendon pathology. Together, the expanding use of various 'omics platforms and data analysis resulting from these platforms could substantially contribute to major advances in the tendon tissue engineering and regenerative medicine field.
Collapse
|
44
|
Duan R, Gao L, Gao Y, Hu Y, Xu H, Huang M, Song K, Wang H, Dong Y, Jiang C, Zhang C, Jia S. Evaluation and comparison of multi-omics data integration methods for cancer subtyping. PLoS Comput Biol 2021; 17:e1009224. [PMID: 34383739 PMCID: PMC8384175 DOI: 10.1371/journal.pcbi.1009224] [Citation(s) in RCA: 34] [Impact Index Per Article: 11.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/27/2021] [Revised: 08/24/2021] [Accepted: 06/28/2021] [Indexed: 11/18/2022] Open
Abstract
Computational integrative analysis has become a significant approach in the data-driven exploration of biological problems. Many integration methods for cancer subtyping have been proposed, but evaluating these methods has become a complicated problem due to the lack of gold standards. Moreover, questions of practical importance remain to be addressed regarding the impact of selecting appropriate data types and combinations on the performance of integrative studies. Here, we constructed three classes of benchmarking datasets of nine cancers in TCGA by considering all the eleven combinations of four multi-omics data types. Using these datasets, we conducted a comprehensive evaluation of ten representative integration methods for cancer subtyping in terms of accuracy measured by combining both clustering accuracy and clinical significance, robustness, and computational efficiency. We subsequently investigated the influence of different omics data on cancer subtyping and the effectiveness of their combinations. Refuting the widely held intuition that incorporating more types of omics data always produces better results, our analyses showed that there are situations where integrating more omics data negatively impacts the performance of integration methods. Our analyses also suggested several effective combinations for most cancers under our studies, which may be of particular interest to researchers in omics data analysis. Cancer is one of the most heterogeneous diseases, characterized by diverse morphological, phenotypic, and genomic profiles between tumors and their subtypes. Identifying cancer subtypes can help patients receive precise treatments. With the development of high-throughput technologies, genomics, epigenomics, and transcriptomics data have been generated for large cancer patient cohorts. It is believed that the more omics data we use, the more accurate identification of cancer subtypes. To examine this assumption, we first constructed three classes of benchmarking datasets to conduct a comprehensive evaluation and comparison of ten representative multi-omics data integration methods for cancer subtyping by considering their accuracy, robustness, and computational efficiency. Then, we investigated the influence of different omics data and their various combinations on the effectiveness of cancer subtyping. Our analyses showed that there are situations where integrating more omics data negatively impacts the performance of integration methods. We hope that our work may help researchers choose a proper method and an effective data combination when identifying cancer subtypes using data integration methods.
Collapse
Affiliation(s)
- Ran Duan
- School of Computer Science and Technology, Xidian University, Xi’an, China
| | - Lin Gao
- School of Computer Science and Technology, Xidian University, Xi’an, China
- * E-mail:
| | - Yong Gao
- Department of Computer Science, The University of British Columbia Okanagan, Kelowna, British Columbia, Canada
| | - Yuxuan Hu
- School of Computer Science and Technology, Xidian University, Xi’an, China
| | - Han Xu
- School of Computer Science and Technology, Xidian University, Xi’an, China
| | - Mingfeng Huang
- School of Computer Science and Technology, Xidian University, Xi’an, China
| | - Kuo Song
- School of Computer Science and Technology, Xidian University, Xi’an, China
| | - Hongda Wang
- School of Computer Science and Technology, Xidian University, Xi’an, China
| | - Yongqiang Dong
- School of Computer Science and Technology, Xidian University, Xi’an, China
| | - Chaoqun Jiang
- School of Computer Science and Technology, Xidian University, Xi’an, China
| | - Chenxing Zhang
- School of Computer Science and Technology, Xidian University, Xi’an, China
| | - Songwei Jia
- School of Computer Science and Technology, Xidian University, Xi’an, China
| |
Collapse
|
45
|
Heo YJ, Hwa C, Lee GH, Park JM, An JY. Integrative Multi-Omics Approaches in Cancer Research: From Biological Networks to Clinical Subtypes. Mol Cells 2021; 44:433-443. [PMID: 34238766 PMCID: PMC8334347 DOI: 10.14348/molcells.2021.0042] [Citation(s) in RCA: 56] [Impact Index Per Article: 18.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/20/2021] [Revised: 04/09/2021] [Accepted: 05/12/2021] [Indexed: 11/27/2022] Open
Abstract
Multi-omics approaches are novel frameworks that integrate multiple omics datasets generated from the same patients to better understand the molecular and clinical features of cancers. A wide range of emerging omics and multi-view clustering algorithms now provide unprecedented opportunities to further classify cancers into subtypes, improve the survival prediction and therapeutic outcome of these subtypes, and understand key pathophysiological processes through different molecular layers. In this review, we overview the concept and rationale of multi-omics approaches in cancer research. We also introduce recent advances in the development of multi-omics algorithms and integration methods for multiple-layered datasets from cancer patients. Finally, we summarize the latest findings from large-scale multi-omics studies of various cancers and their implications for patient subtyping and drug development.
Collapse
Affiliation(s)
- Yong Jin Heo
- School of Biosystem and Biomedical Science, College of Health Science, Korea University, Seoul 02841, Korea
- Department of Integrated Biomedical and Life Science, Korea University, Seoul 02841, Korea
| | - Chanwoong Hwa
- School of Biosystem and Biomedical Science, College of Health Science, Korea University, Seoul 02841, Korea
| | - Gang-Hee Lee
- School of Biosystem and Biomedical Science, College of Health Science, Korea University, Seoul 02841, Korea
| | - Jae-Min Park
- School of Biosystem and Biomedical Science, College of Health Science, Korea University, Seoul 02841, Korea
| | - Joon-Yong An
- School of Biosystem and Biomedical Science, College of Health Science, Korea University, Seoul 02841, Korea
- Department of Integrated Biomedical and Life Science, Korea University, Seoul 02841, Korea
| |
Collapse
|
46
|
Qi L, Wang W, Wu T, Zhu L, He L, Wang X. Multi-Omics Data Fusion for Cancer Molecular Subtyping Using Sparse Canonical Correlation Analysis. Front Genet 2021; 12:607817. [PMID: 34367231 PMCID: PMC8341864 DOI: 10.3389/fgene.2021.607817] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/18/2020] [Accepted: 07/05/2021] [Indexed: 12/24/2022] Open
Abstract
It is now clear that major malignancies are heterogeneous diseases associated with diverse molecular properties and clinical outcomes, posing a great challenge for more individualized therapy. In the last decade, cancer molecular subtyping studies were mostly based on transcriptomic profiles, ignoring heterogeneity at other (epi-)genetic levels of gene regulation. Integrating multiple types of (epi)genomic data generates a more comprehensive landscape of biological processes, providing an opportunity to better dissect cancer heterogeneity. Here, we propose sparse canonical correlation analysis for cancer classification (SCCA-CC), which projects each type of single-omics data onto a unified space for data fusion, followed by clustering and classification analysis. Without loss of generality, as case studies, we integrated two types of omics data, mRNA and miRNA profiles, for molecular classification of ovarian cancer (n = 462), and breast cancer (n = 451). The two types of omics data were projected onto a unified space using SCCA, followed by data fusion to identify cancer subtypes. The subtypes we identified recapitulated subtypes previously recognized by other groups (all P- values < 0.001), but display more significant clinical associations. Especially in ovarian cancer, the four subtypes we identified were significantly associated with overall survival, while the taxonomy previously established by TCGA did not (P- values: 0.039 vs. 0.12). The multi-omics classifiers we established can not only classify individual types of data but also demonstrated higher accuracies on the fused data. Compared with iCluster, SCCA-CC demonstrated its superiority by identifying subtypes of higher coherence, clinical relevance, and time efficiency. In conclusion, we developed an integrated bioinformatic framework SCCA-CC for cancer molecular subtyping. Using two case studies in breast and ovarian cancer, we demonstrated its effectiveness in identifying biologically meaningful and clinically relevant subtypes. SCCA-CC presented a unique advantage in its ability to classify both single-omics data and multi-omics data, which significantly extends the applicability to various data types, and making more efficient use of published omics resources.
Collapse
Affiliation(s)
- Lin Qi
- Department of Biomedical Sciences, City University of Hong Kong, Shenzhen, China
| | - Wei Wang
- Department of Biomedical Sciences, City University of Hong Kong, Shenzhen, China
| | - Tan Wu
- Department of Biomedical Sciences, City University of Hong Kong, Shenzhen, China
| | - Lina Zhu
- Department of Biomedical Sciences, City University of Hong Kong, Shenzhen, China
| | - Lingli He
- Department of Biomedical Sciences, City University of Hong Kong, Shenzhen, China
| | - Xin Wang
- Department of Biomedical Sciences, City University of Hong Kong, Shenzhen, China.,Key Laboratory of Biochip Technology, Biotech and Health Centre, Shenzhen Research Institute, City University of Hong Kong, Shenzhen, China
| |
Collapse
|
47
|
Reel PS, Reel S, Pearson E, Trucco E, Jefferson E. Using machine learning approaches for multi-omics data analysis: A review. Biotechnol Adv 2021; 49:107739. [PMID: 33794304 DOI: 10.1016/j.biotechadv.2021.107739] [Citation(s) in RCA: 287] [Impact Index Per Article: 95.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/15/2020] [Revised: 03/01/2021] [Accepted: 03/25/2021] [Indexed: 02/06/2023]
Abstract
With the development of modern high-throughput omic measurement platforms, it has become essential for biomedical studies to undertake an integrative (combined) approach to fully utilise these data to gain insights into biological systems. Data from various omics sources such as genetics, proteomics, and metabolomics can be integrated to unravel the intricate working of systems biology using machine learning-based predictive algorithms. Machine learning methods offer novel techniques to integrate and analyse the various omics data enabling the discovery of new biomarkers. These biomarkers have the potential to help in accurate disease prediction, patient stratification and delivery of precision medicine. This review paper explores different integrative machine learning methods which have been used to provide an in-depth understanding of biological systems during normal physiological functioning and in the presence of a disease. It provides insight and recommendations for interdisciplinary professionals who envisage employing machine learning skills in multi-omics studies.
Collapse
Affiliation(s)
- Parminder S Reel
- Division of Population Health and Genomics, School of Medicine, University of Dundee, Dundee, United Kingdom
| | - Smarti Reel
- Division of Population Health and Genomics, School of Medicine, University of Dundee, Dundee, United Kingdom
| | - Ewan Pearson
- Division of Population Health and Genomics, School of Medicine, University of Dundee, Dundee, United Kingdom
| | - Emanuele Trucco
- VAMPIRE project, Computing, School of Science and Engineering, University of Dundee, Dundee, United Kingdom
| | - Emily Jefferson
- Division of Population Health and Genomics, School of Medicine, University of Dundee, Dundee, United Kingdom.
| |
Collapse
|
48
|
Fang S, Kirk PDW, Bantscheff M, Lilley KS, Crook OM. A Bayesian semi-parametric model for thermal proteome profiling. Commun Biol 2021; 4:810. [PMID: 34188175 PMCID: PMC8241860 DOI: 10.1038/s42003-021-02306-8] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/07/2020] [Accepted: 06/07/2021] [Indexed: 02/06/2023] Open
Abstract
The thermal stability of proteins can be altered when they interact with small molecules, other biomolecules or are subject to post-translation modifications. Thus monitoring the thermal stability of proteins under various cellular perturbations can provide insights into protein function, as well as potentially determine drug targets and off-targets. Thermal proteome profiling is a highly multiplexed mass-spectrommetry method for monitoring the melting behaviour of thousands of proteins in a single experiment. In essence, thermal proteome profiling assumes that proteins denature upon heating and hence become insoluble. Thus, by tracking the relative solubility of proteins at sequentially increasing temperatures, one can report on the thermal stability of a protein. Standard thermodynamics predicts a sigmoidal relationship between temperature and relative solubility and this is the basis of current robust statistical procedures. However, current methods do not model deviations from this behaviour and they do not quantify uncertainty in the melting profiles. To overcome these challenges, we propose the application of Bayesian functional data analysis tools which allow complex temperature-solubility behaviours. Our methods have improved sensitivity over the state-of-the art, identify new drug-protein associations and have less restrictive assumptions than current approaches. Our methods allows for comprehensive analysis of proteins that deviate from the predicted sigmoid behaviour and we uncover potentially biphasic phenomena with a series of published datasets.
Collapse
Affiliation(s)
- Siqi Fang
- Cambridge Centre for Proteomics, Department of Biochemistry, University of Cambridge, Cambridge, UK
- Milner Therapeutics Institute, Jeffrey Cheah Biomedical Centre, University of Cambridge, Cambridge, UK
| | - Paul D W Kirk
- MRC Biostatistics Unit, School of Clinical Medicine, University of Cambridge, Cambridge, UK
- Cambridge Institute of Therapeutic Immunology & Infectious Disease (CITIID), Jeffrey Cheah Biomedical Centre, Cambridge Biomedical Campus, University of Cambridge, Cambridge, UK
| | | | - Kathryn S Lilley
- Cambridge Centre for Proteomics, Department of Biochemistry, University of Cambridge, Cambridge, UK.
- Milner Therapeutics Institute, Jeffrey Cheah Biomedical Centre, University of Cambridge, Cambridge, UK.
| | - Oliver M Crook
- Cambridge Centre for Proteomics, Department of Biochemistry, University of Cambridge, Cambridge, UK.
- MRC Biostatistics Unit, School of Clinical Medicine, University of Cambridge, Cambridge, UK.
- Milner Therapeutics Institute, Jeffrey Cheah Biomedical Centre, University of Cambridge, Cambridge, UK.
| |
Collapse
|
49
|
Adossa N, Khan S, Rytkönen KT, Elo LL. Computational strategies for single-cell multi-omics integration. Comput Struct Biotechnol J 2021; 19:2588-2596. [PMID: 34025945 PMCID: PMC8114078 DOI: 10.1016/j.csbj.2021.04.060] [Citation(s) in RCA: 35] [Impact Index Per Article: 11.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/19/2021] [Revised: 04/23/2021] [Accepted: 04/24/2021] [Indexed: 02/06/2023] Open
Abstract
Single-cell omics technologies are currently solving biological and medical problems that earlier have remained elusive, such as discovery of new cell types, cellular differentiation trajectories and communication networks across cells and tissues. Current advances especially in single-cell multi-omics hold high potential for breakthroughs by integration of multiple different omics layers. To pair with the recent biotechnological developments, many computational approaches to process and analyze single-cell multi-omics data have been proposed. In this review, we first introduce recent developments in single-cell multi-omics in general and then focus on the available data integration strategies. The integration approaches are divided into three categories: early, intermediate, and late data integration. For each category, we describe the underlying conceptual principles and main characteristics, as well as provide examples of currently available tools and how they have been applied to analyze single-cell multi-omics data. Finally, we explore the challenges and prospective future directions of single-cell multi-omics data integration, including examples of adopting multi-view analysis approaches used in other disciplines to single-cell multi-omics.
Collapse
Affiliation(s)
- Nigatu Adossa
- Turku Bioscience Centre, University of Turku and Åbo Akademi University, 20520 Turku, Finland
| | - Sofia Khan
- Turku Bioscience Centre, University of Turku and Åbo Akademi University, 20520 Turku, Finland
| | - Kalle T. Rytkönen
- Turku Bioscience Centre, University of Turku and Åbo Akademi University, 20520 Turku, Finland
- Institute of Biomedicine, University of Turku, 20520 Turku, Finland
| | - Laura L. Elo
- Turku Bioscience Centre, University of Turku and Åbo Akademi University, 20520 Turku, Finland
- Institute of Biomedicine, University of Turku, 20520 Turku, Finland
| |
Collapse
|
50
|
Wang W, Zhang X, Dai DQ. DeFusion: a denoised network regularization framework for multi-omics integration. Brief Bioinform 2021; 22:6210063. [PMID: 33822879 DOI: 10.1093/bib/bbab057] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/31/2020] [Revised: 02/03/2021] [Accepted: 01/14/2020] [Indexed: 11/13/2022] Open
Abstract
With diverse types of omics data widely available, many computational methods have been recently developed to integrate these heterogeneous data, providing a comprehensive understanding of diseases and biological mechanisms. But most of them hardly take noise effects into account. Data-specific patterns unique to data types also make it challenging to uncover the consistent patterns and learn a compact representation of multi-omics data. Here we present a multi-omics integration method considering these issues. We explicitly model the error term in data reconstruction and simultaneously consider noise effects and data-specific patterns. We utilize a denoised network regularization in which we build a fused network using a denoising procedure to suppress noise effects and data-specific patterns. The error term collaborates with the denoised network regularization to capture data-specific patterns. We solve the optimization problem via an inexact alternating minimization algorithm. A comparative simulation study shows the method's superiority at discovering common patterns among data types at three noise levels. Transcriptomics-and-epigenomics integration, in seven cancer cohorts from The Cancer Genome Atlas, demonstrates that the learned integrative representation extracted in an unsupervised manner can depict survival information. Specially in liver hepatocellular carcinoma, the learned integrative representation attains average Harrell's C-index of 0.78 in 10 times 3-fold cross-validation for survival prediction, which far exceeds competing methods, and we discover an aggressive subtype in liver hepatocellular carcinoma with this latent representation, which is validated by an external dataset GSE14520. We also show that DeFusion is applicable to the integration of other omics types.
Collapse
Affiliation(s)
- Weiwen Wang
- Intelligent Data Center, School of Mathematics, Sun Yat-Sen University, Guangzhou, 510275, China
| | - Xiwen Zhang
- Intelligent Data Center, School of Mathematics, Sun Yat-Sen University, Guangzhou, 510275, China
| | - Dao-Qing Dai
- Intelligent Data Center, School of Mathematics, Sun Yat-Sen University, Guangzhou, 510275, China
| |
Collapse
|