1
Lu Z, Chen X, Yang J, Ding Y. RSC-based differential model with correlation removal for improving multi-omics clustering. J Theor Biol 2023; 556:111328. [PMID: 36273593 DOI: 10.1016/j.jtbi.2022.111328] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/14/2022] [Revised: 09/21/2022] [Accepted: 10/17/2022] [Indexed: 11/06/2022]
Abstract
Multi-omics clustering plays an important role in cancer subtyping. However, data from different omics are often correlated, and these correlations may reduce clustering performance. It is crucial to eliminate the redundant information that these cross-omics correlations introduce. We propose an RSC-based differential model with correlation removal for improving multi-omics clustering (RSC-MCR). The method first uses RSC to calculate the pairwise correlations of all features, then decomposes the result to obtain the pairwise correlations between features of different omics, which link the omics to one another. Then, to remove the redundant correlation, a differential model calculates the degree of difference between the original feature matrix and the correlation matrix, which contains the most relevant information shared between omics. We compared RSC-MCR with other decorrelation methods across several clustering methods (CC, FCM, SNF, NMF, LRAcluster). Experimental results on five cancer datasets show the efficiency of RSC-MCR as well as improvements over the other decorrelation methods.
Affiliation(s)
- Zhengshu Lu
- School of Science, Jiangnan University, Wuxi, Jiangsu 214122, PR China; Laboratory of Media Design and Software Technology, Jiangnan University, Wuxi, Jiangsu 214122, PR China
- Xu Chen
- School of Science, Jiangnan University, Wuxi, Jiangsu 214122, PR China; Laboratory of Media Design and Software Technology, Jiangnan University, Wuxi, Jiangsu 214122, PR China
- Jing Yang
- School of Science, Jiangnan University, Wuxi, Jiangsu 214122, PR China; Laboratory of Media Design and Software Technology, Jiangnan University, Wuxi, Jiangsu 214122, PR China
- Yanrui Ding
- School of Science, Jiangnan University, Wuxi, Jiangsu 214122, PR China; Key Laboratory of Industrial Biotechnology, Jiangnan University, Wuxi, Jiangsu 214122, PR China.
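The decomposition step in the abstract above (take the correlations of all features, then keep only the pairs that span two different omics) can be illustrated with a minimal sketch. The abstract does not define RSC, so plain Pearson correlation stands in for it here; that substitution, the function names, and the toy data are all assumptions for illustration, and the differential model itself is not reproduced.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (stand-in for RSC)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def cross_omics_block(omics_a, omics_b):
    """Correlations between every feature of one omics and every feature of another.

    Each argument is a list of features; each feature is a list of per-sample values.
    The result is the off-diagonal block of the full feature-feature correlation matrix.
    """
    return [[pearson(fa, fb) for fb in omics_b] for fa in omics_a]

# Hypothetical toy data: 2 expression features and 1 methylation feature, 4 samples.
expression = [[1.0, 2.0, 3.0, 4.0], [2.0, 1.0, 4.0, 3.0]]
methylation = [[8.0, 6.0, 4.0, 2.0]]
block = cross_omics_block(expression, methylation)
```

Here `block[i][j]` links expression feature `i` to methylation feature `j`; large absolute values mark the cross-omics redundancy that a decorrelation step would target.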
2
Zhang W, Wendt C, Bowler R, Hersh CP, Safo SE. Robust integrative biclustering for multi-view data. Stat Methods Med Res 2022; 31:2201-2216. [PMID: 36113157 PMCID: PMC10153449 DOI: 10.1177/09622802221122427] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 11/16/2022]
Abstract
In many biomedical studies, multiple views of data (e.g. genomics, proteomics) are available, and a particular interest might be the detection of sample subgroups characterized by specific groups of variables. Biclustering methods are well-suited for this problem as they assume that specific groups of variables might be relevant only to specific groups of samples. Many biclustering methods exist for detecting row-column clusters in a view but few methods exist for data from multiple views. The few existing algorithms depend heavily on regularization parameters for obtaining row-column clusters, and they impose an unnecessary burden on users, which limits their use in practice. We extend an existing biclustering method based on sparse singular value decomposition for single-view data to data from multiple views. Our method, integrative sparse singular value decomposition (iSSVD), incorporates stability selection to control Type I error rates, estimates the probability of samples and variables to belong to a bicluster, finds stable biclusters, and results in interpretable row-column associations. Simulations and real data analyses show that integrative sparse singular value decomposition outperforms several other single- and multi-view biclustering methods and is able to detect meaningful biclusters. iSSVD is a user-friendly, computationally efficient algorithm that will be useful in many disease subtyping applications.
Affiliation(s)
- Weijie Zhang
- Division of Biostatistics, University of Minnesota, MN, USA
- Christine Wendt
- Division of Pulmonary, Allergy and Critical Care, University of Minnesota, MN, USA
- Russel Bowler
- Division of Pulmonary, Critical Care and Sleep Medicine, Department of Medicine, National Jewish Health, Denver, USA
- Craig P Hersh
- Channing Division of Network Medicine, Brigham and Women's Hospital, Harvard Medical School, USA
- Sandra E Safo
- Division of Biostatistics, University of Minnesota, MN, USA
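To make the idea of biclustering through sparse singular value decomposition concrete, the following is a minimal single-view sketch: a rank-1 SVD computed by alternating power iterations with soft-thresholding, so that the nonzero entries of the left and right singular vectors select the rows and columns of one bicluster. It is not iSSVD: the multi-view extension and the stability selection described in the abstract are omitted, and the penalty values, function names, and toy matrix are assumptions chosen for illustration.

```python
from math import sqrt

def soft(v, lam):
    """Soft-threshold each entry of v by lam (shrinks small entries to exactly 0)."""
    return [(abs(a) - lam) * (1 if a > 0 else -1) if abs(a) > lam else 0.0 for a in v]

def normalize(v):
    """Scale v to unit Euclidean length; an all-zero vector is returned unchanged."""
    norm = sqrt(sum(a * a for a in v))
    return [a / norm for a in v] if norm > 0 else v

def sparse_rank1(X, lam_u, lam_v, iters=50):
    """Rank-1 sparse SVD of X by alternating thresholded power iterations."""
    n, p = len(X), len(X[0])
    v = normalize([1.0] * p)
    u = [0.0] * n
    for _ in range(iters):
        u = normalize(soft([sum(X[i][j] * v[j] for j in range(p)) for i in range(n)], lam_u))
        v = normalize(soft([sum(X[i][j] * u[i] for i in range(n)) for j in range(p)], lam_v))
    return u, v

# Hypothetical toy matrix: a strong 2x2 block (rows 0-1, columns 0-1) on a weak background.
X = [
    [5.0, 5.0, 0.1, 0.1],
    [5.0, 5.0, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.1],
]
u, v = sparse_rank1(X, lam_u=1.0, lam_v=1.0)
rows = [i for i, a in enumerate(u) if a != 0]  # samples in the bicluster
cols = [j for j, a in enumerate(v) if a != 0]  # variables in the bicluster
```

The result depends strongly on `lam_u` and `lam_v`, which is the sensitivity to regularization parameters that the abstract says stability selection is meant to reduce.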
3
Quazi S. Artificial intelligence and machine learning in precision and genomic medicine. Med Oncol 2022; 39:120. [PMID: 35704152 PMCID: PMC9198206 DOI: 10.1007/s12032-022-01711-1] [Citation(s) in RCA: 34] [Impact Index Per Article: 17.0] [Received: 03/06/2022] [Accepted: 03/14/2022] [Indexed: 10/28/2022]
Abstract
The advancement of precision medicine in medical care has left behind the conventional symptom-driven treatment process by allowing early risk prediction of disease through improved diagnostics and customization of more effective treatments. It is necessary to scrutinize overall patient data alongside broad factors to observe and differentiate between ill and relatively healthy people to take the most appropriate path toward precision medicine, resulting in an improved vision of biological indicators that can signal health changes. Precision and genomic medicine combined with artificial intelligence have the potential to improve patient healthcare. Patients with less common therapeutic responses or unique healthcare demands are using genomic medicine technologies. AI provides insights through advanced computation and inference, enabling the system to reason and learn while enhancing physician decision making. Many cell characteristics, including gene up-regulation, proteins binding to nucleic acids, and splicing, can be measured at high throughput and used as training objectives for predictive models. Researchers can create a new era of effective genomic medicine with the improved availability of a broad range of datasets and modern computer techniques such as machine learning. This review article elucidates the contributions of ML algorithms in precision and genomic medicine.
Affiliation(s)
- Sameer Quazi
- GenLab Biosolutions Private Limited, Bangalore, Karnataka, 560043, India.
- Department of Biomedical Sciences, School of Life Sciences, Anglia Ruskin University, Cambridge, UK.
5
Abstract
This review surveys the literature on drug discovery with ML tools and techniques, which are applied in every phase of drug development to accelerate research and reduce the risk and expenditure of clinical trials. Machine learning techniques improve decision-making on pharmaceutical data across applications such as QSAR analysis, hit discovery, and de novo drug design. Target validation, prognostic biomarkers, and digital pathology are among the problem statements considered in this review. A main challenge for ML is the limited interpretability of its outcomes, which may restrict its applications in drug discovery. In clinical trials, complete and methodical data must be generated to address open questions in validating ML techniques, improving decision-making, promoting awareness of ML approaches, and reducing the risk of failure in drug discovery.
Affiliation(s)
- Suresh Dara
- Department of Computer Science and Engineering, B V Raju Institute of Technology, Narsapur, Medak, 502313 Telangana India
- Swetha Dhamercherla
- Department of Computer Science and Engineering, B V Raju Institute of Technology, Narsapur, Medak, 502313 Telangana India
- Surender Singh Jadav
- Centre for Molecular Cancer Research (CMCR) and Vishnu Institute of Pharmaceutical Education and Research (VIPER), Narsapur, Medak, 502313 Telangana India
- CH Madhu Babu
- Department of Computer Science and Engineering, B V Raju Institute of Technology, Narsapur, Medak, 502313 Telangana India
- Mohamed Jawed Ahsan
- Department of Pharmaceutical Chemistry, Maharishi Arvind College of Pharmacy, Jaipur, 302023 Rajasthan India
6
Sudhakar P, Verstockt B, Cremer J, Verstockt S, Sabino J, Ferrante M, Vermeire S. Understanding the Molecular Drivers of Disease Heterogeneity in Crohn's Disease Using Multi-omic Data Integration and Network Analysis. Inflamm Bowel Dis 2021; 27:870-886. [PMID: 33313682 PMCID: PMC8128416 DOI: 10.1093/ibd/izaa281] [Citation(s) in RCA: 16] [Impact Index Per Article: 5.3] [Received: 03/10/2020] [Indexed: 12/12/2022]
Abstract
Crohn's disease (CD), a form of inflammatory bowel disease (IBD), is characterized by heterogeneity along multiple clinical axes, which in turn impacts disease progression and treatment modalities. Using advanced data integration approaches and systems biology tools, we studied the contribution of CD susceptibility variants and gene expression in distinct peripheral immune cell subsets (CD14+ monocytes and CD4+ T cells) to relevant clinical traits. Our analyses revealed that most clinical traits capturing CD heterogeneity could be associated with CD14+ and CD4+ gene expression rather than disease susceptibility variants. By disentangling the sources of variation, we identified molecular features that could potentially be driving the heterogeneity of various clinical traits of CD patients. Further downstream analyses identified contextual hub proteins such as genes encoding barrier functions, antimicrobial peptides, chemokines, and their receptors, which are either targeted by drugs used in CD or other inflammatory diseases or are relevant to the biological functions implicated in disease pathology. These hubs could be used as cell type-specific targets to treat specific subtypes of CD patients in a more individualized approach based on the underlying biology driving their disease subtypes. Our study highlights the importance of data integration and systems approaches to investigate complex and heterogeneous diseases such as IBD.
Affiliation(s)
- Padhmanand Sudhakar
- Department of Chronic Diseases, Metabolism and Ageing, Translational Research Center for Gastrointestinal Disorders (TARGID)
- Bram Verstockt
- Department of Chronic Diseases, Metabolism and Ageing, Translational Research Center for Gastrointestinal Disorders (TARGID)
- University Hospitals Leuven, Department of Gastroenterology and Hepatology
- Jonathan Cremer
- Department of Microbiology and Immunology, Laboratory of Clinical Immunology, Katholieke Universiteit Leuven, Leuven, Belgium
- Sare Verstockt
- Department of Chronic Diseases, Metabolism and Ageing, Translational Research Center for Gastrointestinal Disorders (TARGID)
- João Sabino
- Department of Chronic Diseases, Metabolism and Ageing, Translational Research Center for Gastrointestinal Disorders (TARGID)
- University Hospitals Leuven, Department of Gastroenterology and Hepatology
- Marc Ferrante
- Department of Chronic Diseases, Metabolism and Ageing, Translational Research Center for Gastrointestinal Disorders (TARGID)
- University Hospitals Leuven, Department of Gastroenterology and Hepatology
- Séverine Vermeire
- Department of Chronic Diseases, Metabolism and Ageing, Translational Research Center for Gastrointestinal Disorders (TARGID)
- University Hospitals Leuven, Department of Gastroenterology and Hepatology
7
Zhang J, Liu L, Xu T, Zhang W, Zhao C, Li S, Li J, Rao N, Le TD. miRSM: an R package to infer and analyse miRNA sponge modules in heterogeneous data. RNA Biol 2021; 18:2308-2320. [PMID: 33822666 DOI: 10.1080/15476286.2021.1905341] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 12/14/2022]
Abstract
In molecular biology, microRNA (miRNA) sponges are RNA transcripts which compete with other RNA transcripts for binding with miRNAs. Research has shown that miRNA sponges have a fundamental impact on tissue development and disease progression. Generally, to achieve a specific biological function, miRNA sponges tend to form modules or communities in a biological system. Until now, however, there is still a lack of tools to aid researchers to infer and analyse miRNA sponge modules from heterogeneous data. To fill this gap, we develop an R/Bioconductor package, miRSM, for facilitating the procedure of inferring and analysing miRNA sponge modules. miRSM provides a collection of 50 co-expression analysis methods to identify gene co-expression modules (which are candidate miRNA sponge modules), four module discovery methods to infer miRNA sponge modules and seven modular analysis methods for investigating miRNA sponge modules. miRSM will enable researchers to quickly apply new datasets to infer and analyse miRNA sponge modules, and will consequently accelerate the research on miRNA sponges.
Affiliation(s)
- Junpeng Zhang
- Center for Informational Biology, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, Sichuan, China; School of Engineering, Dali University, Dali, Yunnan, China
- Lin Liu
- UniSA STEM, University of South Australia, Mawson Lakes, SA, Australia
- Taosheng Xu
- Institute of Intelligent Machines, Hefei Institutes of Physical Science, Chinese Academy of Sciences, Hefei, Anhui, China
- Wu Zhang
- School of Agriculture and Biological Sciences, Dali University, Dali, Yunnan, China
- Chunwen Zhao
- School of Engineering, Dali University, Dali, Yunnan, China
- Sijing Li
- School of Engineering, Dali University, Dali, Yunnan, China
- Jiuyong Li
- UniSA STEM, University of South Australia, Mawson Lakes, SA, Australia
- Nini Rao
- Center for Informational Biology, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, Sichuan, China
- Thuc Duy Le
- UniSA STEM, University of South Australia, Mawson Lakes, SA, Australia
8
Kong XZ, Song Y, Liu JX, Zheng CH, Yuan SS, Wang J, Dai LY. Joint Lp-Norm and L2,1-Norm Constrained Graph Laplacian PCA for Robust Tumor Sample Clustering and Gene Network Module Discovery. Front Genet 2021; 12:621317. [PMID: 33708239 PMCID: PMC7940841 DOI: 10.3389/fgene.2021.621317] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/26/2020] [Accepted: 01/29/2021] [Indexed: 11/17/2022]
Abstract
Dimensionality reduction methods with different norm constraints play an important role in mining useful information from large-scale gene expression data. In this article, a novel method named Lp-norm and L2,1-norm constrained graph Laplacian principal component analysis (PL21GPCA), based on traditional principal component analysis (PCA), is proposed for robust tumor sample clustering and gene network module discovery. Three aspects are highlighted in the PL21GPCA method. First, to reduce the high sensitivity to outliers and noise, the non-convex proximal Lp-norm (0 < p < 1) constraint is applied to the loss function. Second, to enhance the sparsity of gene expression in cancer samples, the L2,1-norm constraint is used on one of the regularization terms. Third, to retain the geometric structure of the data, we introduce a graph Laplacian regularization term into the PL21GPCA optimization model. Extensive experiments on five gene expression datasets, including one benchmark dataset, two single-cancer datasets from The Cancer Genome Atlas (TCGA), and two integrated datasets of multiple cancers from TCGA, are performed to validate the effectiveness of our method. The experimental results demonstrate that the PL21GPCA method performs better than many other methods in terms of tumor sample clustering. Additionally, this method is used to discover gene network modules for the purpose of finding key genes that may be associated with some cancers.
Affiliation(s)
- Jin-Xing Liu
- School of Computer Science, Qufu Normal University, Rizhao, China
- Chun-Hou Zheng
- School of Computer Science, Qufu Normal University, Rizhao, China
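Two of the ingredients named in the abstract above have simple closed forms that a short sketch can make concrete: the L2,1-norm (the sum of the Euclidean norms of the rows of a matrix, which favors whole rows being zero) and the graph Laplacian regularizer (a quadratic form that is small when connected samples receive similar values). The sketch below only computes these two quantities on toy inputs; it is not the PL21GPCA optimization, and the function names and example matrices are assumptions for illustration.

```python
from math import sqrt

def l21_norm(M):
    """L2,1-norm: sum over rows of each row's Euclidean norm."""
    return sum(sqrt(sum(x * x for x in row)) for row in M)

def graph_laplacian(W):
    """Unnormalized graph Laplacian L = D - W for a symmetric weight matrix W."""
    n = len(W)
    degrees = [sum(row) for row in W]
    return [[(degrees[i] if i == j else 0.0) - W[i][j] for j in range(n)] for i in range(n)]

def smoothness(L, y):
    """Quadratic form y^T L y, equal to the sum over edges of w_ij * (y_i - y_j)^2."""
    n = len(y)
    return sum(y[i] * L[i][j] * y[j] for i in range(n) for j in range(n))

# Hypothetical toy inputs.
M = [[3.0, 4.0], [0.0, 0.0], [1.0, 0.0]]  # second row is entirely zero
W = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]     # path graph on three samples
L = graph_laplacian(W)
norm = l21_norm(M)
flat = smoothness(L, [1.0, 1.0, 1.0])     # identical values on all samples
rough = smoothness(L, [0.0, 1.0, 3.0])    # values that differ along the edges
```

Adding a term like `smoothness` to a PCA objective penalizes embeddings that separate neighboring samples, which is the sense in which the graph Laplacian term retains the geometric structure of the data.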
9
Kibble M, Khan SA, Ammad-ud-din M, Bollepalli S, Palviainen T, Kaprio J, Pietiläinen KH, Ollikainen M. An integrative machine learning approach to discovering multi-level molecular mechanisms of obesity using data from monozygotic twin pairs. R Soc Open Sci 2020; 7:200872. [PMID: 33204460 PMCID: PMC7657920 DOI: 10.1098/rsos.200872] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 06/10/2020] [Accepted: 09/29/2020] [Indexed: 05/19/2023]
Abstract
We combined clinical, cytokine, genomic, methylation and dietary data from 43 young adult monozygotic twin pairs (aged 22-36 years, 53% female), where 25 of the twin pairs were substantially weight discordant (delta body mass index > 3 kg m⁻²). These measurements were originally taken as part of the TwinFat study, a substudy of The Finnish Twin Cohort study. These five large multivariate datasets (comprising 42, 71, 1587, 1605 and 63 variables, respectively) were jointly analysed using an integrative machine learning method called group factor analysis (GFA) to offer new hypotheses about the multi-molecular-level interactions associated with the development of obesity. New potential links between cytokines and weight gain are identified, as well as associations between dietary, inflammatory and epigenetic factors. This encouraging case study aims to enthuse the research community to boldly attempt new machine learning approaches which have the potential to yield novel and unintuitive hypotheses. The source code of the GFA method is publicly available as the R package GFA.
Affiliation(s)
- Milla Kibble
- Institute for Molecular Medicine Finland (FIMM), University of Helsinki, Helsinki, Finland
- Department of Applied Mathematics and Theoretical Physics, University of Cambridge, Cambridge, UK
- Author for correspondence: Milla Kibble
- Suleiman A. Khan
- Institute for Molecular Medicine Finland (FIMM), University of Helsinki, Helsinki, Finland
- Muhammad Ammad-ud-din
- Institute for Molecular Medicine Finland (FIMM), University of Helsinki, Helsinki, Finland
- Sailalitha Bollepalli
- Institute for Molecular Medicine Finland (FIMM), University of Helsinki, Helsinki, Finland
- Teemu Palviainen
- Institute for Molecular Medicine Finland (FIMM), University of Helsinki, Helsinki, Finland
- Jaakko Kaprio
- Institute for Molecular Medicine Finland (FIMM), University of Helsinki, Helsinki, Finland
- Department of Public Health, University of Helsinki, Helsinki, Finland
- Kirsi H. Pietiläinen
- Obesity Research Unit, Helsinki University Central Hospital and University of Helsinki, Helsinki, Finland
- Miina Ollikainen
- Institute for Molecular Medicine Finland (FIMM), University of Helsinki, Helsinki, Finland
10
LMSM: A modular approach for identifying lncRNA related miRNA sponge modules in breast cancer. PLoS Comput Biol 2020; 16:e1007851. [PMID: 32324747 PMCID: PMC7200020 DOI: 10.1371/journal.pcbi.1007851] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Received: 11/12/2019] [Revised: 05/05/2020] [Accepted: 04/06/2020] [Indexed: 12/12/2022]
Abstract
Until now, existing methods for identifying lncRNA related miRNA sponge modules mainly rely on lncRNA related miRNA sponge interaction networks, which may not provide a full picture of miRNA sponging activities in biological conditions. Hence there is a strong need of new computational methods to identify lncRNA related miRNA sponge modules. In this work, we propose a framework, LMSM, to identify LncRNA related MiRNA Sponge Modules from heterogeneous data. To understand the miRNA sponging activities in biological conditions, LMSM uses gene expression data to evaluate the influence of the shared miRNAs on the clustered sponge lncRNAs and mRNAs. We have applied LMSM to the human breast cancer (BRCA) dataset from The Cancer Genome Atlas (TCGA). As a result, we have found that the majority of LMSM modules are significantly implicated in BRCA and most of them are BRCA subtype-specific. Most of the mediating miRNAs act as crosslinks across different LMSM modules, and all of LMSM modules are statistically significant. Multi-label classification analysis shows that the performance of LMSM modules is significantly higher than baseline’s performance, indicating the biological meanings of LMSM modules in classifying BRCA subtypes. The consistent results suggest that LMSM is robust in identifying lncRNA related miRNA sponge modules. Moreover, LMSM can be used to predict miRNA targets. Finally, LMSM outperforms a graph clustering-based strategy in identifying BRCA-related modules. Altogether, our study shows that LMSM is a promising method to investigate modular regulatory mechanism of sponge lncRNAs from heterogeneous data. Previous studies have revealed that long non-coding RNAs (lncRNAs), as microRNA (miRNA) sponges or competing endogenous RNAs (ceRNAs), can regulate the expression levels of messenger RNAs (mRNAs) by decreasing the amount of miRNAs interacting with mRNAs. 
In this work, we hypothesize that the “tug-of-war” between RNA transcripts for attracting miRNAs is across groups or modules. Based on the hypothesis, we propose a framework called LMSM, to identify LncRNA related MiRNA Sponge Modules. Based on the two miRNA sponge modular competition principles, significant sharing of miRNAs and high canonical correlation between the sponge lncRNAs and mRNAs, LMSM is also capable of predicting miRNA targets. LMSM not only extends the ceRNA hypothesis, but also provides a novel way to investigate the biological functions and modular mechanism of lncRNAs in breast cancer.
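One of the two competition principles mentioned in the LMSM summary above is significant sharing of miRNAs between a group of sponge lncRNAs and a group of mRNAs. The summary does not say which test is used, so the sketch below assumes a one-sided hypergeometric test, a common choice for set-overlap significance; the function name and the numbers in the example are hypothetical, and the canonical correlation principle is not covered.

```python
from math import comb

def sharing_p_value(total, n_a, n_b, shared):
    """Probability of at least `shared` common miRNAs by chance.

    total:  number of miRNAs in the background set
    n_a:    miRNAs interacting with the lncRNA group
    n_b:    miRNAs interacting with the mRNA group
    shared: miRNAs interacting with both groups
    Assumes the second set is a uniform random draw of n_b miRNAs from the background.
    """
    upper = min(n_a, n_b)
    favourable = sum(comb(n_a, k) * comb(total - n_a, n_b - k) for k in range(shared, upper + 1))
    return favourable / comb(total, n_b)

# Hypothetical example: 10 background miRNAs, groups bind 4 and 3 of them, all 3 shared.
p = sharing_p_value(total=10, n_a=4, n_b=3, shared=3)
```

A small `p` indicates that the overlap is unlikely under random assignment; a module would then still need to pass the correlation criterion before being reported.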
11
Xie J, Ma A, Fennell A, Ma Q, Zhao J. It is time to apply biclustering: a comprehensive review of biclustering applications in biological and biomedical data. Brief Bioinform 2020; 20:1449-1464. [PMID: 29490019 DOI: 10.1093/bib/bby014] [Citation(s) in RCA: 22] [Impact Index Per Article: 5.5] [Received: 11/15/2017] [Revised: 01/16/2018] [Indexed: 12/12/2022]
Abstract
Biclustering is a powerful data mining technique that allows clustering of rows and columns, simultaneously, in a matrix-format data set. It was first applied to gene expression data in 2000, aiming to identify co-expressed genes under a subset of all the conditions/samples. During the past 17 years, tens of biclustering algorithms and tools have been developed to enhance the ability to make sense out of large data sets generated in the wake of high-throughput omics technologies. These algorithms and tools have been applied to a wide variety of data types, including but not limited to, genomes, transcriptomes, exomes, epigenomes, phenomes and pharmacogenomes. However, there is still a considerable gap between biclustering methodology development and comprehensive data interpretation, mainly because of the lack of knowledge for the selection of appropriate biclustering tools and further supporting computational techniques in specific studies. Here, we first deliver a brief introduction to the existing biclustering algorithms and tools in public domain, and then systematically summarize the basic applications of biclustering for biological data and more advanced applications of biclustering for biomedical data. This review will assist researchers to effectively analyze their big data and generate valuable biological knowledge and novel insights with higher efficiency.
12
Vamathevan J, Clark D, Czodrowski P, Dunham I, Ferran E, Lee G, Li B, Madabhushi A, Shah P, Spitzer M, Zhao S. Applications of machine learning in drug discovery and development. Nat Rev Drug Discov 2019; 18:463-477. [PMID: 30976107 DOI: 10.1038/s41573-019-0024-5] [Citation(s) in RCA: 925] [Impact Index Per Article: 185.0] [Indexed: 02/07/2023]
Abstract
Drug discovery and development pipelines are long, complex and depend on numerous factors. Machine learning (ML) approaches provide a set of tools that can improve discovery and decision making for well-specified questions with abundant, high-quality data. Opportunities to apply ML occur in all stages of drug discovery. Examples include target validation, identification of prognostic biomarkers and analysis of digital pathology data in clinical trials. Applications have ranged in context and methodology, with some approaches yielding accurate predictions and insights. The challenges of applying ML lie primarily with the lack of interpretability and repeatability of ML-generated results, which may limit their application. In all areas, systematic and comprehensive high-dimensional data still need to be generated. With ongoing efforts to tackle these issues, as well as increasing awareness of the factors needed to validate ML approaches, the application of ML can promote data-driven decision making and has the potential to speed up the process and reduce failure rates in drug discovery and development.
Affiliation(s)
- Jessica Vamathevan
- European Molecular Biology Laboratory, European Bioinformatics Institute, Cambridge, UK.
- Dominic Clark
- European Molecular Biology Laboratory, European Bioinformatics Institute, Cambridge, UK
- Ian Dunham
- Open Targets and European Molecular Biology Laboratory, European Bioinformatics Institute, Cambridge, UK
- Edgardo Ferran
- European Molecular Biology Laboratory, European Bioinformatics Institute, Cambridge, UK
- George Lee
- Bristol-Myers Squibb, Princeton, NJ, USA
- Bin Li
- Takeda Pharmaceuticals International Co., Cambridge, MA, USA
- Anant Madabhushi
- Case Western Reserve University, Cleveland, OH, USA; Louis Stokes Cleveland Veterans Affair Medical Center, Cleveland, OH, USA
- Michaela Spitzer
- Open Targets and European Molecular Biology Laboratory, European Bioinformatics Institute, Cambridge, UK
- Shanrong Zhao
- Pfizer Worldwide Research and Development, Cambridge, MA, USA
13
Li Y, Wu FX, Ngom A. A review on machine learning principles for multi-view biological data integration. Brief Bioinform 2019; 19:325-340. [PMID: 28011753 DOI: 10.1093/bib/bbw113] [Citation(s) in RCA: 126] [Impact Index Per Article: 25.2] [Received: 07/26/2016] [Indexed: 01/08/2023]
Abstract
Driven by high-throughput sequencing techniques, modern genomic and clinical studies are in a strong need of integrative machine learning models for better use of vast volumes of heterogeneous information in the deep understanding of biological systems and the development of predictive models. How data from multiple sources (called multi-view data) are incorporated in a learning system is a key step for successful analysis. In this article, we provide a comprehensive review on omics and clinical data integration techniques, from a machine learning perspective, for various analyses such as prediction, clustering, dimension reduction and association. We shall show that Bayesian models are able to use prior information and model measurements with various distributions; tree-based methods can either build a tree with all features or collectively make a final decision based on trees learned from each view; kernel methods fuse the similarity matrices learned from individual views together for a final similarity matrix or learning model; network-based fusion methods are capable of inferring direct and indirect associations in a heterogeneous network; matrix factorization models have potential to learn interactions among features from different views; and a range of deep neural networks can be integrated in multi-modal learning for capturing the complex mechanism of biological systems.
Affiliation(s)
- Yifeng Li
- Information and Communications Technologies, National Research Council Canada, Ottawa, Ontario, Canada
- Fang-Xiang Wu
- Department of Mechanical Engineering, University of Saskatchewan, Saskatoon, Saskatchewan, Canada
- Alioune Ngom
- School of Computer Science, University of Windsor, Windsor, Ontario, Canada
14
Argelaguet R, Velten B, Arnol D, Dietrich S, Zenz T, Marioni JC, Buettner F, Huber W, Stegle O. Multi-Omics Factor Analysis-a framework for unsupervised integration of multi-omics data sets. Mol Syst Biol 2018; 14:e8124. [PMID: 29925568 PMCID: PMC6010767 DOI: 10.15252/msb.20178124] [Citation(s) in RCA: 481] [Impact Index Per Article: 80.2] [Received: 11/27/2017] [Revised: 05/28/2018] [Accepted: 05/29/2018] [Indexed: 12/19/2022]
Abstract
Multi-omics studies promise the improved characterization of biological processes across molecular layers. However, methods for the unsupervised integration of the resulting heterogeneous data sets are lacking. We present Multi-Omics Factor Analysis (MOFA), a computational method for discovering the principal sources of variation in multi-omics data sets. MOFA infers a set of (hidden) factors that capture biological and technical sources of variability. It disentangles axes of heterogeneity that are shared across multiple modalities and those specific to individual data modalities. The learnt factors enable a variety of downstream analyses, including identification of sample subgroups, data imputation and the detection of outlier samples. We applied MOFA to a cohort of 200 patient samples of chronic lymphocytic leukaemia, profiled for somatic mutations, RNA expression, DNA methylation and ex vivo drug responses. MOFA identified major dimensions of disease heterogeneity, including immunoglobulin heavy-chain variable region status, trisomy of chromosome 12 and previously underappreciated drivers, such as response to oxidative stress. In a second application, we used MOFA to analyse single-cell multi-omics data, identifying coordinated transcriptional and epigenetic changes along cell differentiation.
Affiliation(s)
- Ricard Argelaguet
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, Cambridge, UK
- Britta Velten
  - European Molecular Biology Laboratory (EMBL), Heidelberg, Germany
- Damien Arnol
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, Cambridge, UK
- Thorsten Zenz
  - Heidelberg University Hospital, Heidelberg, Germany
  - German Cancer Research Center (DKFZ) and National Center for Tumor Diseases (NCT), Heidelberg, Germany
  - Germany & Hematology, University Hospital Zurich and University of Zurich, Zurich, Switzerland
- John C Marioni
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, Cambridge, UK
  - Cancer Research UK Cambridge Institute, University of Cambridge, Cambridge, UK
  - Wellcome Trust Sanger Institute, Hinxton, Cambridge, UK
- Florian Buettner
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, Cambridge, UK
  - Helmholtz Zentrum München-German Research Center for Environmental Health, Institute of Computational Biology, Neuherberg, Germany
- Wolfgang Huber
  - European Molecular Biology Laboratory (EMBL), Heidelberg, Germany
- Oliver Stegle
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, Cambridge, UK
  - European Molecular Biology Laboratory (EMBL), Heidelberg, Germany
15
Hao X, Li C, Yan J, Yao X, Risacher SL, Saykin AJ, Shen L, Zhang D. Identification of associations between genotypes and longitudinal phenotypes via temporally-constrained group sparse canonical correlation analysis. Bioinformatics 2018; 33:i341-i349. [PMID: 28881979 PMCID: PMC5870577 DOI: 10.1093/bioinformatics/btx245] [Citation(s) in RCA: 37] [Impact Index Per Article: 6.2] [Indexed: 01/05/2023]
Abstract
Motivation: Neuroimaging genetics identifies the relationships between genetic variants (i.e., single nucleotide polymorphisms) and brain imaging data to reveal the associations from genotypes to phenotypes. So far, most existing machine-learning approaches detect associations between genetic variants and brain imaging data at a single time-point. However, those associations are based on static phenotypes and ignore the temporal dynamics of the phenotypical changes. The phenotypes across multiple time-points may exhibit temporal patterns that can be used to facilitate the understanding of the degenerative process. In this article, we propose a novel temporally constrained group sparse canonical correlation analysis (TGSCCA) framework to identify genetic associations with longitudinal phenotypic markers. Results: The proposed TGSCCA method is able to capture the temporal changes in the brain from longitudinal phenotypes by incorporating the fused penalty, which requires that the differences between two consecutive canonical weight vectors from adjacent time-points should be small. A new efficient optimization algorithm is designed to solve the objective function. Furthermore, we demonstrate the effectiveness of our algorithm on both synthetic and real data (i.e., the Alzheimer’s Disease Neuroimaging Initiative cohort, including progressive mild cognitive impairment, stable MCI and Normal Control participants). In comparison with conventional SCCA, our proposed method can achieve strong associations and discover phenotypic biomarkers across multiple time-points to guide disease-progressive interpretation. Availability and implementation: The Matlab code is available at https://sourceforge.net/projects/ibrain-cn/files/.
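The fused penalty described in this abstract penalizes differences between canonical weight vectors at adjacent time-points. The sketch below shows one way such a term could be computed. It is an illustration only: the abstract does not state which norm is used, so the L1 norm (as in the fused lasso) is an assumption, and the function name `fused_penalty` and the toy weight vectors are hypothetical.

```python
def fused_penalty(weight_vectors):
    """Sum of L1 distances between weight vectors at consecutive time-points.

    `weight_vectors` is a list with one equal-length weight vector per
    time-point, ordered in time. The L1 norm is an assumption; the
    abstract only says the consecutive differences should be small.
    """
    total = 0.0
    for previous, current in zip(weight_vectors, weight_vectors[1:]):
        total += sum(abs(c - p) for p, c in zip(previous, current))
    return total

# Hypothetical weights for three imaging features at three time-points.
smooth = [[0.5, 0.0, 0.2], [0.5, 0.1, 0.2], [0.6, 0.1, 0.2]]
jumpy = [[0.5, 0.0, 0.2], [-0.5, 0.9, 0.0], [0.6, 0.1, 0.2]]

smooth_cost = fused_penalty(smooth)
jumpy_cost = fused_penalty(jumpy)
```

Adding a multiple of this term to a canonical correlation objective favours the `smooth` trajectory over the `jumpy` one, which is the temporal-smoothness behaviour the abstract attributes to the fused penalty.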
Affiliation(s)
- Xiaoke Hao
  - School of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing, China
- Chanxiu Li
  - School of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing, China
- Jingwen Yan
  - Department of Radiology and Imaging Sciences, School of Medicine, Indiana University, Indianapolis, IN, USA
  - School of Informatics and Computing, Indiana University, Indianapolis, IN, USA
- Xiaohui Yao
  - Department of Radiology and Imaging Sciences, School of Medicine, Indiana University, Indianapolis, IN, USA
  - School of Informatics and Computing, Indiana University, Indianapolis, IN, USA
- Shannon L Risacher
  - Department of Radiology and Imaging Sciences, School of Medicine, Indiana University, Indianapolis, IN, USA
- Andrew J Saykin
  - Department of Radiology and Imaging Sciences, School of Medicine, Indiana University, Indianapolis, IN, USA
- Li Shen
  - Department of Radiology and Imaging Sciences, School of Medicine, Indiana University, Indianapolis, IN, USA
  - School of Informatics and Computing, Indiana University, Indianapolis, IN, USA
- Daoqiang Zhang
  - School of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing, China
16
Multiple co-clustering based on nonparametric mixture models with heterogeneous marginal distributions. PLoS One 2017; 12:e0186566. [PMID: 29049392 PMCID: PMC5648298 DOI: 10.1371/journal.pone.0186566] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.1] [Received: 06/08/2017] [Accepted: 10/03/2017] [Indexed: 11/19/2022]
Abstract
We propose a novel method for multiple clustering, which is useful for analysis of high-dimensional data containing heterogeneous types of features. Our method is based on nonparametric Bayesian mixture models in which features are automatically partitioned (into views) for each clustering solution. This feature partition works as feature selection for a particular clustering solution, which screens out irrelevant features. To make our method applicable to high-dimensional data, a co-clustering structure is newly introduced for each view. Further, the outstanding novelty of our method is that we simultaneously model different distribution families, such as Gaussian, Poisson, and multinomial distributions in each cluster block, which widens areas of application to real data. We apply the proposed method to synthetic and real data, and show that our method outperforms other multiple clustering methods both in recovering true cluster structures and in computation time. Finally, we apply our method to a depression dataset with no true cluster structure available, from which useful inferences are drawn about possible clustering structures of the data.
17
Islam S, Anand S, Hamid J, Thabane L, Beyene J. Comparing the performance of linear and nonlinear principal components in the context of high-dimensional genomic data integration. Stat Appl Genet Mol Biol 2017; 16:199-216. [PMID: 28727569 DOI: 10.1515/sagmb-2016-0066] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/15/2022]
Abstract
Linear principal component analysis (PCA) is a widely used approach to reduce the dimension of gene or miRNA expression data sets. This method relies on the linearity assumption, which often fails to capture the patterns and relationships inherent in the data. Thus, a nonlinear approach such as kernel PCA might be optimal. We develop a copula-based simulation algorithm that takes into account the degree of dependence and nonlinearity observed in these data sets. Using this algorithm, we conduct an extensive simulation to compare the performance of linear and kernel principal component analysis methods towards data integration and death classification. We also compare these methods using a real data set with gene and miRNA expression of lung cancer patients. The first few kernel principal components show poor performance compared to the linear principal components on this occasion. Reducing dimensions using linear PCA and a logistic regression model for classification seems to be adequate for this purpose. Integrating information from multiple data sets using either of these two approaches leads to improved classification accuracy for the outcome.
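As a rough illustration of the linear PCA step referred to in this abstract, the sketch below extracts the first principal component of a small data matrix by power iteration on the sample covariance matrix. It is not the authors' pipeline: the function name `first_principal_component`, the fixed iteration count, and the toy data are assumptions, and the logistic regression step is omitted. The sketch also assumes the data are not constant and that the starting vector is not orthogonal to the leading direction.

```python
import math

def first_principal_component(rows, iterations=200):
    """Return (direction, scores) of the first principal component.

    Uses power iteration on the sample covariance matrix of `rows`,
    a list of equal-length feature vectors (one per sample).
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [
        [sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
        for a in range(d)
    ]
    v = [1.0] * d  # arbitrary starting direction
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Project each centered sample onto the leading direction.
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, scores

# Toy data lying exactly on the line y = 2x.
data = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
direction, scores = first_principal_component(data)
```

For this toy data the leading direction is proportional to (1, 2). In a setting like the one the abstract describes, the resulting scores would serve as low-dimensional inputs to a classifier such as logistic regression.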