1
|
Zhu B, Zhang Z, Leung SY, Fan X. NetMIM: network-based multi-omics integration with block missingness for biomarker selection and disease outcome prediction. Brief Bioinform 2024; 25:bbae454. [PMID: 39288230 PMCID: PMC11407451 DOI: 10.1093/bib/bbae454] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/16/2023] [Revised: 07/24/2024] [Accepted: 08/30/2024] [Indexed: 09/19/2024] Open
Abstract
Compared with analyzing omics data from a single platform, an integrative analysis of multi-omics data provides a more comprehensive understanding of the regulatory relationships among biological features associated with complex diseases. However, most existing frameworks for integrative analysis overlook two crucial aspects of multi-omics data. Firstly, they neglect the known dependencies among biological features that exist in highly credible biological databases. Secondly, most existing integrative frameworks just simply remove the subjects without full omics data to handle block missingness, resulting in decreasing statistical power. To overcome these issues, we propose a network-based integrative Bayesian framework for biomarker selection and disease outcome prediction based on multi-omics data. Our framework utilizes Dirac spike-and-slab variable selection prior to identifying a small subset of biomarkers. The incorporation of gene pathway information improves the interpretability of feature selection. Furthermore, with the strategy in the FBM (stand for "full Bayesian model with missingness") model where missing omics data are augmented via a mechanistic model, our framework handles block missingness in multi-omics data via a data augmentation approach. The real application illustrates that our approach, which incorporates existing gene pathway information and includes subjects without DNA methylation data, results in more interpretable feature selection results and more accurate predictions.
Collapse
Affiliation(s)
- Bencong Zhu
- Department of Statistics, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong SAR, China
| | - Zhen Zhang
- Department of Statistics, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong SAR, China
| | - Suet Yi Leung
- Department of Pathology, School of Clinical Medicine, LKS Faculty of Medicine, The University of Hong Kong, Queen Mary Hospital, Hong Kong SAR, China
| | - Xiaodan Fan
- Department of Statistics, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong SAR, China
| |
Collapse
|
2
|
Cai P, Liu S, Zhang D, Xing H, Han M, Liu D, Gong L, Hu QN. SynBioTools: a one-stop facility for searching and selecting synthetic biology tools. BMC Bioinformatics 2023; 24:152. [PMID: 37069545 PMCID: PMC10111727 DOI: 10.1186/s12859-023-05281-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/09/2023] [Accepted: 04/11/2023] [Indexed: 04/19/2023] Open
Abstract
BACKGROUND The rapid development of synthetic biology relies heavily on the use of databases and computational tools, which are also developing rapidly. While many tool registries have been created to facilitate tool retrieval, sharing, and reuse, no relatively comprehensive tool registry or catalog addresses all aspects of synthetic biology. RESULTS We constructed SynBioTools, a comprehensive collection of synthetic biology databases, computational tools, and experimental methods, as a one-stop facility for searching and selecting synthetic biology tools. SynBioTools includes databases, computational tools, and methods extracted from reviews via SCIentific Table Extraction, a scientific table-extraction tool that we built. Approximately 57% of the resources that we located and included in SynBioTools are not mentioned in bio.tools, the dominant tool registry. To improve users' understanding of the tools and to enable them to make better choices, the tools are grouped into nine modules (each with subdivisions) based on their potential biosynthetic applications. Detailed comparisons of similar tools in every classification are included. The URLs, descriptions, source references, and the number of citations of the tools are also integrated into the system. CONCLUSIONS SynBioTools is freely available at https://synbiotools.lifesynther.com/ . It provides end-users and developers with a useful resource of categorized synthetic biology databases, tools, and methods to facilitate tool retrieval and selection.
Collapse
Affiliation(s)
- Pengli Cai
- CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, 200031, China
| | - Sheng Liu
- CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, 200031, China
| | - Dachuan Zhang
- Ecological Systems Design, Institute of Environmental Engineering, ETH Zurich, 8093, Zurich, Switzerland
| | - Huadong Xing
- CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, 200031, China
| | - Mengying Han
- CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, 200031, China
| | - Dongliang Liu
- CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, 200031, China
| | - Linlin Gong
- CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, 200031, China
| | - Qian-Nan Hu
- CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, 200031, China.
| |
Collapse
|
3
|
Bakr S, Brennan K, Mukherjee P, Argemi J, Hernaez M, Gevaert O. Identifying key multifunctional components shared by critical cancer and normal liver pathways via SparseGMM. CELL REPORTS METHODS 2023; 3:100392. [PMID: 36814838 PMCID: PMC9939431 DOI: 10.1016/j.crmeth.2022.100392] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 05/25/2022] [Revised: 09/16/2022] [Accepted: 12/21/2022] [Indexed: 01/19/2023]
Abstract
Despite the abundance of multimodal data, suitable statistical models that can improve our understanding of diseases with genetic underpinnings are challenging to develop. Here, we present SparseGMM, a statistical approach for gene regulatory network discovery. SparseGMM uses latent variable modeling with sparsity constraints to learn Gaussian mixtures from multiomic data. By combining coexpression patterns with a Bayesian framework, SparseGMM quantitatively measures confidence in regulators and uncertainty in target gene assignment by computing gene entropy. We apply SparseGMM to liver cancer and normal liver tissue data and evaluate discovered gene modules in an independent single-cell RNA sequencing (scRNA-seq) dataset. SparseGMM identifies PROCR as a regulator of angiogenesis and PDCD1LG2 and HNF4A as regulators of immune response and blood coagulation in cancer. Furthermore, we show that more genes have significantly higher entropy in cancer compared with normal liver. Among high-entropy genes are key multifunctional components shared by critical pathways, including p53 and estrogen signaling.
Collapse
Affiliation(s)
- Shaimaa Bakr
- Department of Electrical Engineering, Stanford University, Stanford, CA 94305, USA
- Stanford Center for Biomedical Informatics Research, Department of Medicine and Biomedical Data Science, Stanford University, Stanford, CA 94305, USA
- Department of Radiology, Stanford University, Stanford, CA 94305, USA
| | - Kevin Brennan
- Stanford Center for Biomedical Informatics Research, Department of Medicine and Biomedical Data Science, Stanford University, Stanford, CA 94305, USA
| | - Pritam Mukherjee
- Stanford Center for Biomedical Informatics Research, Department of Medicine and Biomedical Data Science, Stanford University, Stanford, CA 94305, USA
| | - Josepmaria Argemi
- Liver Unit, Clinica Universidad de Navarra, Hepatology Program, Center for Applied Medical Research, 31008 Pamplona, Navarra, Spain
| | - Mikel Hernaez
- Center for Applied Medical Research, University of Navarra, 31009 Pamplona, Navarra, Spain
| | - Olivier Gevaert
- Stanford Center for Biomedical Informatics Research, Department of Medicine and Biomedical Data Science, Stanford University, Stanford, CA 94305, USA
| |
Collapse
|
4
|
Taguchi YH, Turki T. A tensor decomposition-based integrated analysis applicable to multiple gene expression profiles without sample matching. Sci Rep 2022; 12:21242. [PMID: 36481877 PMCID: PMC9732005 DOI: 10.1038/s41598-022-25524-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/31/2021] [Accepted: 11/30/2022] [Indexed: 12/13/2022] Open
Abstract
The integrated analysis of multiple gene expression profiles previously measured in distinct studies is problematic since missing both sample matches and common labels prevent their integration in fully data-driven, unsupervised training. In this study, we propose a strategy to enable the integration of multiple gene expression profiles among multiple independent studies with neither labeling nor sample matching using tensor decomposition unsupervised feature extraction. We apply this strategy to Alzheimer's disease (AD)-related gene expression profiles that lack precise correspondence among samples, including AD single-cell RNA sequence (scRNA-seq) data. We were able to select biologically reasonable genes using the integrated analysis. Overall, integrated gene expression profiles can function analogously to prior- and/or transfer-learning strategies in other machine-learning applications. For scRNA-seq, the proposed approach significantly reduces the required computational memory.
Collapse
Affiliation(s)
- Y-h. Taguchi
- grid.443595.a0000 0001 2323 0843Department of Physics, Chuo University, Tokyo, 112-8551 Japan
| | - Turki Turki
- grid.412125.10000 0001 0619 1117Department of Computer Science, King Abdulaziz University, Jeddah, 21589 Saudi Arabia
| |
Collapse
|
5
|
Serra A, Saarimäki LA, Pavel A, del Giudice G, Fratello M, Cattelani L, Federico A, Laurino O, Marwah VS, Fortino V, Scala G, Sofia Kinaret PA, Greco D. Nextcast: A software suite to analyse and model toxicogenomics data. Comput Struct Biotechnol J 2022; 20:1413-1426. [PMID: 35386103 PMCID: PMC8956870 DOI: 10.1016/j.csbj.2022.03.014] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/25/2021] [Revised: 03/16/2022] [Accepted: 03/16/2022] [Indexed: 11/28/2022] Open
Abstract
The recent advancements in toxicogenomics have led to the availability of large omics data sets, representing the starting point for studying the exposure mechanism of action and identifying candidate biomarkers for toxicity prediction. The current lack of standard methods in data generation and analysis hampers the full exploitation of toxicogenomics-based evidence in regulatory risk assessment. Moreover, the pipelines for the preprocessing and downstream analyses of toxicogenomic data sets can be quite challenging to implement. During the years, we have developed a number of software packages to address specific questions related to multiple steps of toxicogenomics data analysis and modelling. In this review we present the Nextcast software collection and discuss how its individual tools can be combined into efficient pipelines to answer specific biological questions. Nextcast components are of great support to the scientific community for analysing and interpreting large data sets for the toxicity evaluation of compounds in an unbiased, straightforward, and reliable manner. The Nextcast software suite is available at: ( https://github.com/fhaive/nextcast).
Collapse
Affiliation(s)
- Angela Serra
- Faculty of Medicine and Health Technology, Tampere University, Tampere, Finland
- BioMediTech Institute, Tampere University, Tampere University, Tampere, Finland
- Finnish Hub for Development and Validation of Integrated Approaches (FHAIVE), Tampere, Finland
| | - Laura Aliisa Saarimäki
- Faculty of Medicine and Health Technology, Tampere University, Tampere, Finland
- BioMediTech Institute, Tampere University, Tampere University, Tampere, Finland
- Finnish Hub for Development and Validation of Integrated Approaches (FHAIVE), Tampere, Finland
| | - Alisa Pavel
- Faculty of Medicine and Health Technology, Tampere University, Tampere, Finland
- BioMediTech Institute, Tampere University, Tampere University, Tampere, Finland
- Finnish Hub for Development and Validation of Integrated Approaches (FHAIVE), Tampere, Finland
| | - Giusy del Giudice
- Faculty of Medicine and Health Technology, Tampere University, Tampere, Finland
- BioMediTech Institute, Tampere University, Tampere University, Tampere, Finland
- Finnish Hub for Development and Validation of Integrated Approaches (FHAIVE), Tampere, Finland
| | - Michele Fratello
- Faculty of Medicine and Health Technology, Tampere University, Tampere, Finland
- BioMediTech Institute, Tampere University, Tampere University, Tampere, Finland
- Finnish Hub for Development and Validation of Integrated Approaches (FHAIVE), Tampere, Finland
| | - Luca Cattelani
- Faculty of Medicine and Health Technology, Tampere University, Tampere, Finland
- BioMediTech Institute, Tampere University, Tampere University, Tampere, Finland
- Finnish Hub for Development and Validation of Integrated Approaches (FHAIVE), Tampere, Finland
| | - Antonio Federico
- Faculty of Medicine and Health Technology, Tampere University, Tampere, Finland
- BioMediTech Institute, Tampere University, Tampere University, Tampere, Finland
- Finnish Hub for Development and Validation of Integrated Approaches (FHAIVE), Tampere, Finland
| | | | - Veer Singh Marwah
- Faculty of Medicine and Health Technology, Tampere University, Tampere, Finland
- BioMediTech Institute, Tampere University, Tampere University, Tampere, Finland
| | - Vittorio Fortino
- Institute of Biomedicine, University of Eastern Finland, Kuopio, Finland
| | - Giovanni Scala
- Faculty of Medicine and Health Technology, Tampere University, Tampere, Finland
- BioMediTech Institute, Tampere University, Tampere University, Tampere, Finland
- Department of Biology, University of Naples Federico II, Naples, Italy
| | - Pia Anneli Sofia Kinaret
- Faculty of Medicine and Health Technology, Tampere University, Tampere, Finland
- BioMediTech Institute, Tampere University, Tampere University, Tampere, Finland
- Finnish Hub for Development and Validation of Integrated Approaches (FHAIVE), Tampere, Finland
- Institute of Biotechnology, University of Helsinki, Helsinki, Finland
| | - Dario Greco
- Faculty of Medicine and Health Technology, Tampere University, Tampere, Finland
- BioMediTech Institute, Tampere University, Tampere University, Tampere, Finland
- Finnish Hub for Development and Validation of Integrated Approaches (FHAIVE), Tampere, Finland
- Institute of Biotechnology, University of Helsinki, Helsinki, Finland
| |
Collapse
|
6
|
Towards the Genetic Architecture of Complex Gene Expression Traits: Challenges and Prospects for eQTL Mapping in Humans. Genes (Basel) 2022; 13:genes13020235. [PMID: 35205280 PMCID: PMC8871770 DOI: 10.3390/genes13020235] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/20/2021] [Revised: 01/21/2022] [Accepted: 01/25/2022] [Indexed: 12/10/2022] Open
Abstract
The discovery of expression quantitative trait loci (eQTLs) and their target genes (eGenes) has not only compensated for the limitations of genome-wide association studies for complex phenotypes but has also provided a basis for predicting gene expression. Efforts have been made to develop analytical methods in statistical genetics, a key discipline in eQTL analysis. In particular, mixed model– and deep learning–based analytical methods have been extremely beneficial in mapping eQTLs and predicting gene expression. Nevertheless, we still face many challenges associated with eQTL discovery. Here, we discuss two key aspects of these challenges: 1, the complexity of eTraits with various factors such as polygenicity and epistasis and 2, the voluminous work required for various types of eQTL profiles. The properties and prospects of statistical methods, including the mixed model method, Bayesian inference, the deep learning method, and the integration method, are presented as future directions for eQTL discovery. This review will help expedite the design and use of efficient methods for eQTL discovery and eTrait prediction.
Collapse
|
7
|
Ma C, Wu M, Ma S. Analysis of cancer omics data: a selective review of statistical techniques. Brief Bioinform 2022; 23:6510158. [PMID: 35039832 DOI: 10.1093/bib/bbab585] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/20/2021] [Revised: 12/19/2021] [Accepted: 12/20/2021] [Indexed: 11/13/2022] Open
Abstract
Cancer is an omics disease. The development in high-throughput profiling has fundamentally changed cancer research and clinical practice. Compared with clinical, demographic and environmental data, the analysis of omics data-which has higher dimensionality, weaker signals and more complex distributional properties-is much more challenging. Developments in the literature are often 'scattered', with individual studies focused on one or a few closely related methods. The goal of this review is to assist cancer researchers with limited statistical expertise in establishing the 'overall framework' of cancer omics data analysis. To facilitate understanding, we mainly focus on intuition, concepts and key steps, and refer readers to the original publications for mathematical details. This review broadly covers unsupervised and supervised analysis, as well as individual-gene-based, gene-set-based and gene-network-based analysis. We also briefly discuss 'special topics' including interaction analysis, multi-datasets analysis and multi-omics analysis.
Collapse
Affiliation(s)
- Chenjin Ma
- College of Statistics and Data Science, Faculty of Science, Beijing University of Technology, Beijing, China
| | - Mengyun Wu
- School of Statistics and Management, Shanghai University of Finance and Economics, Shanghai, China
| | - Shuangge Ma
- Department of Biostatistics, Yale School of Public Health, New Haven, CT, USA
| |
Collapse
|