1
Petti M, Farina L. Network medicine for patients' stratification: From single-layer to multi-omics. WIREs Mech Dis 2023; 15:e1623. [PMID: 37323106 DOI: 10.1002/wsbm.1623] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/14/2022] [Revised: 03/08/2023] [Accepted: 05/30/2023] [Indexed: 06/17/2023]
Abstract
Precision medicine research increasingly relies on the integrated analysis of multiple types of omics. In the era of big data, the large availability of different health-related information represents a great but still untapped opportunity, with a potentially fundamental role in the prevention, diagnosis and prognosis of diseases. Computational methods are needed to combine these data to create a comprehensive view of a given disease. Network science can model biomedical data in terms of relationships among molecular players of different nature and has been successfully proposed as a new paradigm for studying human diseases. Patient stratification is an open challenge aimed at identifying subtypes with different disease manifestations, severity, and expected survival time. Several stratification approaches based on high-throughput gene expression measurements have been successfully applied. However, few attempts have been made to exploit the integration of various genotypic and phenotypic data to discover novel subtypes or improve the detection of known groupings. This article is categorized under: Cancer > Biomedical Engineering; Cancer > Computational Models; Cancer > Genetics/Genomics/Epigenetics.
Affiliation(s)
- Manuela Petti
  - Department of Computer, Control and Management Engineering, Sapienza University of Rome, Rome, Italy
- Lorenzo Farina
  - Department of Computer, Control and Management Engineering, Sapienza University of Rome, Rome, Italy

2
Gliozzo J, Mesiti M, Notaro M, Petrini A, Patak A, Puertas-Gallardo A, Paccanaro A, Valentini G, Casiraghi E. Heterogeneous data integration methods for patient similarity networks. Brief Bioinform 2022; 23:6604996. [PMID: 35679533 PMCID: PMC9294435 DOI: 10.1093/bib/bbac207] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 03/07/2021] [Revised: 04/14/2022] [Accepted: 05/04/2022] [Indexed: 12/29/2022]
Abstract
Patient similarity networks (PSNs), where patients are represented as nodes and their similarities as weighted edges, are being increasingly used in clinical research. These networks provide an insightful summary of the relationships among patients and can be exploited by inductive or transductive learning algorithms for the prediction of patient outcome, phenotype and disease risk. PSNs can also be easily visualized, thus offering a natural way to inspect complex heterogeneous patient data and providing some level of explainability of the predictions obtained by machine learning algorithms. The advent of high-throughput technologies, enabling us to acquire high-dimensional views of the same patients (e.g. omics data, laboratory data, imaging data), calls for the development of data fusion techniques for PSNs in order to leverage this rich heterogeneous information. In this article, we review existing methods for integrating multiple biomedical data views to construct PSNs, together with the different patient similarity measures that have been proposed. We also review methods that have appeared in the machine learning literature but have not yet been applied to PSNs, thus providing a resource to navigate the vast machine learning literature existing on this topic. In particular, we focus on methods that could be used to integrate very heterogeneous datasets, including multi-omics data as well as data derived from clinical information and medical imaging.
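To make the construction described in this abstract concrete, the following minimal sketch builds one similarity network per data view and then fuses them. It is an illustration only, not a method taken from the review: the choice of cosine similarity, the plain averaging used as the fusion step, the function names, and the toy data are all assumptions made here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0


def similarity_network(view):
    """Dense patient-by-patient similarity matrix for one data view."""
    n = len(view)
    return [[cosine(view[i], view[j]) for j in range(n)] for i in range(n)]


def fuse_by_average(networks):
    """Naive fusion (assumption): element-wise mean of the per-view matrices."""
    n = len(networks[0])
    m = len(networks)
    return [[sum(net[i][j] for net in networks) / m for j in range(n)]
            for i in range(n)]


def top_k_neighbors(network, k):
    """For each patient, the indices of its k most similar other patients."""
    neighbors = {}
    for i, row in enumerate(network):
        others = [j for j in range(len(row)) if j != i]
        ranked = sorted(others, key=lambda j: row[j], reverse=True)
        neighbors[i] = ranked[:k]
    return neighbors


# Toy cohort (made up): 4 patients measured in two views.
expression = [[5.0, 1.0, 0.0], [4.0, 1.5, 0.2], [0.1, 3.0, 4.0], [0.0, 2.5, 5.0]]
labs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.8]]

fused = fuse_by_average([similarity_network(expression), similarity_network(labs)])
nearest = top_k_neighbors(fused, k=1)
```

On this toy cohort, `nearest` pairs patient 0 with patient 1 and patient 2 with patient 3. Averaging treats every view as equally informative, which is the simplest possible design choice; the fusion methods surveyed in the review are considerably more elaborate.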
Affiliation(s)
- Jessica Gliozzo
  - AnacletoLab - Computer Science Department, Università degli Studi di Milano, Via Celoria 18, 20135, Milan, Italy; European Commission, Joint Research Centre (JRC), Ispra (VA), Italy; CINI, Infolife National Laboratory, Roma, Italy
- Marco Mesiti
  - AnacletoLab - Computer Science Department, Università degli Studi di Milano, Via Celoria 18, 20135, Milan, Italy; CINI, Infolife National Laboratory, Roma, Italy
- Marco Notaro
  - AnacletoLab - Computer Science Department, Università degli Studi di Milano, Via Celoria 18, 20135, Milan, Italy; CINI, Infolife National Laboratory, Roma, Italy
- Alessandro Petrini
  - AnacletoLab - Computer Science Department, Università degli Studi di Milano, Via Celoria 18, 20135, Milan, Italy; CINI, Infolife National Laboratory, Roma, Italy
- Alex Patak
  - European Commission, Joint Research Centre (JRC), Ispra (VA), Italy
- Alberto Paccanaro
  - Department of Computer Science, Royal Holloway, University of London, Egham, TW20 0EX, UK; School of Applied Mathematics (EMAp), Fundação Getúlio Vargas, Rio de Janeiro, Brazil
- Giorgio Valentini
  - AnacletoLab - Computer Science Department, Università degli Studi di Milano, Via Celoria 18, 20135, Milan, Italy; CINI, Infolife National Laboratory, Roma, Italy; DSRC UNIMI, Data Science Research Center, Milano, 20135, Italy; ELLIS, European Laboratory for Learning and Intelligent Systems, Berlin, Germany
- Elena Casiraghi
  - AnacletoLab - Computer Science Department, Università degli Studi di Milano, Via Celoria 18, 20135, Milan, Italy; CINI, Infolife National Laboratory, Roma, Italy

3
Marini S, Oliva M, Slizovskiy IB, Das RA, Noyes NR, Kahveci T, Boucher C, Prosperi M. AMR-meta: a k-mer and metafeature approach to classify antimicrobial resistance from high-throughput short-read metagenomics data. Gigascience 2022; 11:6588116. [PMID: 35583675 PMCID: PMC9116207 DOI: 10.1093/gigascience/giac029] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 09/01/2021] [Revised: 01/27/2022] [Indexed: 12/15/2022]
Abstract
BACKGROUND Antimicrobial resistance (AMR) is a global health concern. High-throughput metagenomic sequencing of microbial samples enables profiling of AMR genes through comparison with curated AMR databases. However, the performance of current methods is often hampered by database incompleteness and the presence of homology/homoplasy with other non-AMR genes in sequenced samples. RESULTS We present AMR-meta, a database-free and alignment-free approach, based on k-mers, which combines algebraic matrix factorization into metafeatures with regularized regression. Metafeatures capture multi-level gene diversity across the main antibiotic classes. AMR-meta takes in reads from metagenomic shotgun sequencing and outputs predictions about whether those reads contribute to resistance against specific classes of antibiotics. In addition, AMR-meta uses an augmented training strategy that joins an AMR gene database with non-AMR genes (used as negative examples). We compare AMR-meta with AMRPlusPlus, DeepARG, and Meta-MARC, further testing their ensemble via a voting system. In cross-validation, AMR-meta has a median f-score of 0.7 (interquartile range, 0.2-0.9). On semi-synthetic metagenomic data (external test), AMR-meta yields on average a 1.3-fold hit rate increase over existing methods. In terms of run-time, AMR-meta is 3 times faster than DeepARG, 30 times faster than Meta-MARC, and as fast as AMRPlusPlus. Finally, we note that differences in AMR ontologies and observed variance of all tools in classification outputs call for further development on standardization of benchmarking data and protocols. CONCLUSIONS AMR-meta is a fast, accurate classifier that exploits non-AMR negative sets to improve sensitivity and specificity. The differences in AMR ontologies and the high variance of all tools in classification outputs call for the deployment of standard benchmarking data and protocols, to fairly compare AMR prediction tools.
Affiliation(s)
- Simone Marini
  - Department of Computer and Information Science and Engineering, University of Florida, 2004 Mowry Road, Gainesville, FL 32610, USA
- Marco Oliva
  - Department of Computer and Information Science and Engineering, University of Florida, 432 Newell Dr, Gainesville, FL 32611, USA
- Ilya B Slizovskiy
  - Department of Veterinary Population Medicine, University of Minnesota, 1365 Gortner Avenue 225, St. Paul, MN 55108, USA
- Rishabh A Das
  - Department of Computer and Information Science and Engineering, University of Florida, 2004 Mowry Road, Gainesville, FL 32610, USA
- Noelle Robertson Noyes
  - Department of Veterinary Population Medicine, University of Minnesota, 1365 Gortner Avenue 225, St. Paul, MN 55108, USA
- Tamer Kahveci
  - Department of Computer and Information Science and Engineering, University of Florida, 432 Newell Dr, Gainesville, FL 32611, USA
- Christina Boucher
  - Department of Computer and Information Science and Engineering, University of Florida, 432 Newell Dr, Gainesville, FL 32611, USA
- Mattia Prosperi
  - Department of Computer and Information Science and Engineering, University of Florida, 2004 Mowry Road, Gainesville, FL 32610, USA

4
Arici MK, Tuncbag N. Performance Assessment of the Network Reconstruction Approaches on Various Interactomes. Front Mol Biosci 2021; 8:666705. [PMID: 34676243 PMCID: PMC8523993 DOI: 10.3389/fmolb.2021.666705] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 02/10/2021] [Accepted: 07/14/2021] [Indexed: 01/04/2023]
Abstract
Beyond the list of molecules, there is a necessity to collectively consider multiple sets of omic data and to reconstruct the connections between the molecules. In particular, pathway reconstruction is crucial to understanding disease biology because abnormal cellular signaling may be pathological. The main challenge is how to integrate the data accurately. In this study, we aim to comparatively analyze the performance of a set of network reconstruction algorithms on multiple reference interactomes. We first explored several human protein interactomes, including PathwayCommons, OmniPath, HIPPIE, iRefWeb, STRING, and ConsensusPathDB. The comparison is based on the coverage of each interactome in terms of cancer driver proteins, structural information of protein interactions, and the bias toward well-studied proteins. We next used these interactomes to evaluate the performance of network reconstruction algorithms including all-pair shortest path, heat diffusion with flux, personalized PageRank with flux, and prize-collecting Steiner forest (PCSF) approaches. Each approach has its own merits and weaknesses. Among them, PCSF had the most balanced performance in terms of precision and recall scores when 28 pathways from NetPath were reconstructed using the listed algorithms. Additionally, the reference interactome affects the performance of the network reconstruction approaches. The coverage and disease- or tissue-specificity of each interactome may vary, which may result in differences in the reconstructed networks.
Affiliation(s)
- M Kaan Arici
  - Graduate School of Informatics, Middle East Technical University, Ankara, Turkey; Foot and Mouth Diseases Institute, Ministry of Agriculture and Forestry, Ankara, Turkey
- Nurcan Tuncbag
  - Chemical and Biological Engineering, College of Engineering, Koc University, Istanbul, Turkey; School of Medicine, Koc University, Istanbul, Turkey

5
Salazar DA, Pržulj N, Valencia CF. Multi-project and Multi-profile joint Non-negative Matrix Factorization for cancer omic datasets. Bioinformatics 2021; 37:4801-4809. [PMID: 34375392 DOI: 10.1093/bioinformatics/btab579] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 02/01/2021] [Revised: 07/31/2021] [Accepted: 08/06/2021] [Indexed: 11/12/2022]
Abstract
MOTIVATION The integration of multi-omic data using machine learning methods has focused on solving relevant tasks such as predicting sensitivity to a drug or subtyping patients. Recent integration methods, such as joint Non-negative Matrix Factorization (jNMF), have allowed researchers to exploit the information in the data to unravel the biological processes of multi-omic datasets. RESULTS We present a novel method called Multi-project and Multi-profile joint Non-negative Matrix Factorization (M&M-jNMF) capable of integrating data from different sources, such as experimental and observational multi-omic data. The method can generate co-clusters between observations, predict profiles and relate latent variables. We applied the method to integrate low-grade glioma omic profiles from The Cancer Genome Atlas (TCGA) and Cancer Cell Line Encyclopedia (CCLE) projects. The method allowed us to find gene clusters mainly enriched in cancer-associated terms. We identified groups of patients and cell lines similar to each other by comparing biological processes. We predicted the drug profile for patients, and we identified genetic signatures for tumors resistant and sensitive to a specific drug. AVAILABILITY AND IMPLEMENTATION Source code repository is publicly available at https://bitbucket.org/dsalazarb/mmjnmf/ - Zenodo DOI: 10.5281/zenodo.5150920. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- D A Salazar
  - Industrial Engineering Department, University of los Andes, Bogota, 111711, Colombia; Center for optimization and applied probability, University of los Andes, Bogota, 111711, Colombia
- N Pržulj
  - Barcelona Supercomputing Center (BSC), Barcelona, 08034, Spain; Department of Computer Science, University College London, London, WC1E 6BT, UK; ICREA, Pg. Lluis Companys 23, Barcelona, 08010, Spain
- C F Valencia
  - Industrial Engineering Department, University of los Andes, Bogota, 111711, Colombia; Center for optimization and applied probability, University of los Andes, Bogota, 111711, Colombia

6
Oei RW, Fang HSA, Tan WY, Hsu W, Lee ML, Tan NC. Using Domain Knowledge and Data-Driven Insights for Patient Similarity Analytics. J Pers Med 2021; 11:jpm11080699. [PMID: 34442343 PMCID: PMC8398126 DOI: 10.3390/jpm11080699] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 06/15/2021] [Revised: 07/15/2021] [Accepted: 07/21/2021] [Indexed: 12/23/2022]
Abstract
Patient similarity analytics has emerged as an essential tool to identify cohorts of patients who have similar clinical characteristics to some specific patient of interest. In this study, we propose a patient similarity measure called D3K that incorporates domain knowledge and data-driven insights. Using the electronic health records (EHRs) of 169,434 patients with either diabetes, hypertension or dyslipidaemia (DHL), we construct patient feature vectors containing demographics, vital signs, laboratory test results, and prescribed medications. We discretize the variables of interest into various bins based on domain knowledge and align the patient similarity computation with clinical guidelines. Key findings from this study are: (1) D3K outperforms baseline approaches in all seven sub-cohorts; (2) our domain knowledge-based binning strategy outperforms the traditional percentile-based binning in all seven sub-cohorts; (3) there is substantial agreement between D3K and physicians (κ = 0.746), indicating that D3K can be applied to facilitate shared decision making. This is the first study to use patient similarity analytics on a cardiometabolic syndrome-related dataset sourced from medical institutions in Singapore. We consider patient similarity among patient cohorts with the same medical conditions to develop localized models for personalized decision support to improve the outcomes of a target patient.
Affiliation(s)
- Ronald Wihal Oei
  - Institute of Data Science, National University of Singapore, Singapore 117602, Singapore
- Hao Sen Andrew Fang
  - SingHealth Polyclinics, SingHealth, Singapore 150167, Singapore
- Wei-Ying Tan
  - Institute of Data Science, National University of Singapore, Singapore 117602, Singapore
- Wynne Hsu
  - Institute of Data Science, National University of Singapore, Singapore 117602, Singapore
  - School of Computing, National University of Singapore, Singapore 117417, Singapore
- Mong-Li Lee
  - Institute of Data Science, National University of Singapore, Singapore 117602, Singapore
  - School of Computing, National University of Singapore, Singapore 117417, Singapore
- Ngiap-Chuan Tan
  - SingHealth Polyclinics, SingHealth, Singapore 150167, Singapore

7
Xenos A, Malod-Dognin N, Milinković S, Pržulj N. Linear functional organization of the omic embedding space. Bioinformatics 2021; 37:3839-3847. [PMID: 34213534 PMCID: PMC8570782 DOI: 10.1093/bioinformatics/btab487] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/03/2020] [Revised: 06/21/2021] [Accepted: 06/30/2021] [Indexed: 11/21/2022]
Abstract
Motivation We are increasingly accumulating complex omics data that capture different aspects of cellular functioning. A key challenge is to untangle their complexity and effectively mine them for new biomedical information. To decipher this new information, we introduce algorithms based on network embeddings. Such algorithms represent biological macromolecules as vectors in d-dimensional space, in which topologically similar molecules are embedded close in space and knowledge is extracted directly by vector operations. Recently, it has been shown that neural networks used to obtain vectorial representations (embeddings) are implicitly factorizing a mutual information matrix, called Positive Pointwise Mutual Information (PPMI) matrix. Thus, we propose the use of the PPMI matrix to represent the human protein–protein interaction (PPI) network and also introduce the graphlet degree vector PPMI matrix of the PPI network to capture different topological (structural) similarities of the nodes in the molecular network. Results We generate the embeddings by decomposing these matrices with Nonnegative Matrix Tri-Factorization. We demonstrate that genes that are embedded close in these spaces have similar biological functions, so we can extract new biomedical knowledge directly by doing linear operations on their embedding vector representations. We exploit this property to predict new genes participating in protein complexes and to identify new cancer-related genes based on the cosine similarities between the vector representations of the genes. We validate 80% of our novel cancer-related gene predictions in the literature and also by patient survival curves, demonstrating that 93.3% of them have a potential clinical relevance as biomarkers of cancer. Availability and implementation Code and data are available online at https://gitlab.bsc.es/axenos/embedded-omics-data-geometry/. Supplementary information Supplementary data are available at Bioinformatics online.
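As a rough illustration of the PPMI matrix mentioned in this abstract, the sketch below computes positive pointwise mutual information from a small count matrix. It is a simplification and an assumption of this example: the counts are taken directly from the adjacency matrix of a toy network, which may differ from how the authors derive co-occurrences, and the function name `ppmi_matrix` is hypothetical.

```python
import math


def ppmi_matrix(counts):
    """PPMI of a non-negative count matrix: max(0, log(p(i,j) / (p(i) p(j))))."""
    n_rows, n_cols = len(counts), len(counts[0])
    row_sums = [sum(row) for row in counts]
    col_sums = [sum(counts[i][j] for i in range(n_rows)) for j in range(n_cols)]
    total = sum(row_sums)
    ppmi = [[0.0] * n_cols for _ in range(n_rows)]
    for i in range(n_rows):
        for j in range(n_cols):
            if counts[i][j] > 0:
                # p(i,j) / (p(i) p(j)) = count * total / (row_sum * col_sum)
                pmi = math.log(counts[i][j] * total / (row_sums[i] * col_sums[j]))
                ppmi[i][j] = max(0.0, pmi)  # negative associations are clipped
    return ppmi


# Toy network (made up): a path 0 - 1 - 2 - 3 given as an adjacency matrix.
adjacency = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
ppmi = ppmi_matrix(adjacency)
```

Edges that touch a low-degree node receive a larger PPMI value than edges between well-connected nodes, because the same single edge is more surprising relative to the node degrees.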
Affiliation(s)
- A Xenos
  - Barcelona Supercomputing Center (BSC), 08034 Barcelona, Spain; Universitat Politecnica de Catalunya (UPC), 08034 Barcelona, Spain
- N Malod-Dognin
  - Barcelona Supercomputing Center (BSC), 08034 Barcelona, Spain; Department of Computer Science, University College London, WC1E 6BT London, United Kingdom
- S Milinković
  - RAF School of Computing, Union University, Belgrade, Serbia
- N Pržulj
  - Barcelona Supercomputing Center (BSC), 08034 Barcelona, Spain; Department of Computer Science, University College London, WC1E 6BT London, United Kingdom; ICREA, Pg. Lluís Companys 23, 08010 Barcelona, Spain

8
Nicora G, Moretti F, Sauta E, Della Porta M, Malcovati L, Cazzola M, Quaglini S, Bellazzi R. A continuous-time Markov model approach for modeling myelodysplastic syndromes progression from cross-sectional data. J Biomed Inform 2020; 104:103398. [PMID: 32113003 DOI: 10.1016/j.jbi.2020.103398] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 11/15/2019] [Revised: 01/31/2020] [Accepted: 02/25/2020] [Indexed: 01/27/2023]
Abstract
The integration of both genomics and clinical data to model disease progression is now possible, thanks to the increasing availability of molecular patients' profiles. This may lead to the definition of novel decision support tools, able to tailor therapeutic interventions on the basis of a "precise" patients' risk stratification, given their health status evolution. However, longitudinal analysis requires long-term data collection and curation, which can be time demanding, expensive and sometimes unfeasible. Here we present a clinical decision support framework that combines the simulation of disease progression from cross-sectional data with a Markov model that exploits continuous-time transition probabilities derived from Cox regression. Trajectories between patients at different disease stages are stochastically built according to a measure of patient similarity, computed with a matrix tri-factorization technique. Such trajectories are seen as realizations drawn from the stochastic process driving the transitions between the disease stages. Eventually, Markov models applied to the resulting longitudinal dataset highlight potentially relevant clinical information. We applied our method to cross-sectional genomic and clinical data from a cohort of Myelodysplastic syndromes (MDS) patients. MDS are heterogeneous clonal hematopoietic disorders whose patients are characterized by different risks of Acute Myeloid Leukemia (AML) development, defined by an international score. We computed patients' trajectories across increasing and subsequent levels of risk of developing AML, and we applied a Cox model to the simulated longitudinal dataset to assess whether genomic characteristics could be associated with a higher or lower probability of disease progression. We then used the learned parameters of such Cox model to calculate the transition probabilities of a continuous-time Markov model that describes the patients' evolution across stages. 
Our results are in most cases confirmed by previous studies, thus demonstrating that simulated longitudinal data represent a valuable resource to investigate disease progression of MDS patients.
Affiliation(s)
- G Nicora
  - Department of Electrical, Computer and Biomedical Engineering, University of Pavia, Italy
- F Moretti
  - Department of Electrical, Computer and Biomedical Engineering, University of Pavia, Italy
- E Sauta
  - Department of Electrical, Computer and Biomedical Engineering, University of Pavia, Italy
- M Della Porta
  - Cancer Center, Humanitas Research Hospital and Humanitas University, Milan, Italy
- L Malcovati
  - Department of Hematology and Oncology, IRCCS Policlinico San Matteo, Pavia, Italy
- M Cazzola
  - Department of Hematology and Oncology, IRCCS Policlinico San Matteo, Pavia, Italy
- S Quaglini
  - Department of Electrical, Computer and Biomedical Engineering, University of Pavia, Italy
- R Bellazzi
  - Department of Electrical, Computer and Biomedical Engineering, University of Pavia, Italy

9
Marini S, Vitali F, Rampazzi S, Demartini A, Akutsu T. Protease target prediction via matrix factorization. Bioinformatics 2019; 35:923-929. [PMID: 30169576 DOI: 10.1093/bioinformatics/bty746] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 03/05/2018] [Revised: 08/20/2018] [Accepted: 08/27/2018] [Indexed: 11/14/2022]
Abstract
MOTIVATION Protein cleavage is an important cellular event, involved in a myriad of processes, from apoptosis to immune response. Bioinformatics provides in silico tools, such as machine learning-based models, to guide the discovery of targets for the proteases responsible for protein cleavage. State-of-the-art models have a scope limited to specific protease families (such as Caspases), and do not explicitly include biological or medical knowledge (such as the hierarchical protein domain similarity or gene-gene interactions). To fill this gap, we present a novel approach for protease target prediction based on data integration. RESULTS By representing protease-protein target information in the form of relational matrices, we design a model that (i) is general and not limited to a single protease family, and (ii) leverages the available knowledge, managing extremely sparse data from heterogeneous data sources, including primary sequence, pathways, domains and interactions. When compared with other algorithms on test data, our approach provides a better performance even than models specifically focusing on a single protease family. AVAILABILITY AND IMPLEMENTATION https://gitlab.com/smarini/MaDDA/ (Matlab code and utilized data). SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Simone Marini
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA
- Francesca Vitali
  - Department of Medicine, Center for Biomedical Informatics and Biostatistics, BIO5 Institute, University of Arizona, Tucson, AZ, USA
- Sara Rampazzi
  - Department of Computer Science and Engineering, University of Michigan, Ann Arbor, MI, USA
- Andrea Demartini
  - Department of Electrical Computer and Biomedical Engineering, University of Pavia, Pavia, Italy
- Tatsuya Akutsu
  - Bioinformatics Center, Institute for Chemical Research, Kyoto University, Uji, Kyoto, Japan

10
A patient-similarity-based model for diagnostic prediction. Int J Med Inform 2019; 135:104073. [PMID: 31923816 DOI: 10.1016/j.ijmedinf.2019.104073] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.8] [Received: 10/04/2019] [Revised: 11/26/2019] [Accepted: 12/30/2019] [Indexed: 12/28/2022]
Abstract
OBJECTIVE To simulate the clinical reasoning of doctors, retrieve analogous patients of an index patient automatically and predict diagnoses from the similar/dissimilar patients. METHODS We proposed a novel patient-similarity-based framework for diagnostic prediction, which is inspired by the structure-mapping theory about analogy reasoning in psychology. Patient similarity is defined as the similarity between two patients' diagnosis sets rather than a dichotomous value (absence/presence of just one disease). The multilabel classification problem is converted to a single-value regression problem by integrating the pairwise patients' clinical features into a vector and taking the vector as the input and the patient similarity as the output. In contrast to the common k-NN method, which only considers the nearest neighbors, we not only utilize similar patients (positive analogy) to generate diagnostic hypotheses, but also utilize dissimilar patients (negative analogy) to reject diagnostic hypotheses. RESULTS The patient-similarity-based models perform better than the one-vs-all baseline and traditional k-NN methods. The f-1 score of positive-analogy-based prediction is 0.698, significantly higher than the scores of baselines ranging from 0.368 to 0.661. It increases to 0.703 when the negative analogy method is applied to modify the prediction results of positive analogy. The performance of this method is highly promising for larger datasets. CONCLUSION The patient-similarity-based model provides diagnostic decision support that is more accurate, generalizable, and interpretable than those of previous methods and is based on heterogeneous and incomplete data. The model also serves as a new application for the use of clinical big data through artificial intelligence technology.

11
Čopar A, Zupan B, Zitnik M. Fast optimization of non-negative matrix tri-factorization. PLoS One 2019; 14:e0217994. [PMID: 31185054 PMCID: PMC6559648 DOI: 10.1371/journal.pone.0217994] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 01/22/2019] [Accepted: 05/22/2019] [Indexed: 11/18/2022]
Abstract
Non-negative matrix tri-factorization (NMTF) is a popular technique for learning low-dimensional feature representation of relational data. Currently, NMTF learns a representation of a dataset through an optimization procedure that typically uses multiplicative update rules. This procedure has had limited success, and its failure cases have not been well understood. We here perform an empirical study involving six large datasets comparing multiplicative update rules with three alternative optimization methods, including alternating least squares, projected gradients, and coordinate descent. We find that methods based on projected gradients and coordinate descent converge up to twenty-four times faster than multiplicative update rules. Furthermore, alternating least squares method can quickly train NMTF models on sparse datasets but often fails on dense datasets. Coordinate descent-based NMTF converges up to sixteen times faster compared to well-established methods.
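For readers unfamiliar with the baseline this abstract refers to, the sketch below shows multiplicative update rules for the unconstrained factorization X ≈ U S Vᵀ under the squared Frobenius loss. It is a minimal illustration under stated assumptions, not the implementation benchmarked in the paper: it omits any orthogonality constraints, and the helper names, the toy matrix, the ranks, and the iteration count are all choices made here.

```python
import random


def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def frobenius_error(x, u, s, v):
    """Squared Frobenius norm of X - U S V^T."""
    approx = matmul(matmul(u, s), transpose(v))
    return sum((xi - ai) ** 2 for xr, ar in zip(x, approx) for xi, ai in zip(xr, ar))


def scale(target, numerator, denominator, eps=1e-12):
    """Element-wise multiplicative step: target * numerator / denominator."""
    return [[t * n / (d + eps) for t, n, d in zip(tr, nr, dr)]
            for tr, nr, dr in zip(target, numerator, denominator)]


def nmtf(x, k1, k2, iterations=200, seed=0):
    """Factorize non-negative X (n x m) into U (n x k1), S (k1 x k2), V (m x k2)."""
    rng = random.Random(seed)
    n, m = len(x), len(x[0])
    u = [[rng.random() + 0.1 for _ in range(k1)] for _ in range(n)]
    s = [[rng.random() + 0.1 for _ in range(k2)] for _ in range(k1)]
    v = [[rng.random() + 0.1 for _ in range(k2)] for _ in range(m)]
    xt = transpose(x)
    errors = [frobenius_error(x, u, s, v)]
    for _ in range(iterations):
        # U <- U * (X V S^T) / (U S V^T V S^T)
        vst = matmul(v, transpose(s))
        u = scale(u, matmul(x, vst), matmul(u, matmul(transpose(vst), vst)))
        # V <- V * (X^T U S) / (V S^T U^T U S)
        us = matmul(u, s)
        v = scale(v, matmul(xt, us), matmul(v, matmul(transpose(us), us)))
        # S <- S * (U^T X V) / (U^T U S V^T V)
        ut = transpose(u)
        numerator = matmul(matmul(ut, x), v)
        denominator = matmul(matmul(matmul(ut, u), s), matmul(transpose(v), v))
        s = scale(s, numerator, denominator)
        errors.append(frobenius_error(x, u, s, v))
    return u, s, v, errors


# Toy relational matrix (made up) with two visible row and column blocks.
x = [
    [5, 4, 0, 0, 1],
    [4, 5, 1, 0, 0],
    [5, 5, 0, 1, 0],
    [0, 1, 4, 5, 5],
    [1, 0, 5, 4, 5],
    [0, 0, 5, 5, 4],
]
u, s, v, errors = nmtf(x, k1=2, k2=2)
```

Because every step multiplies by a ratio of non-negative quantities, the factors stay non-negative without any projection, which is the main appeal of this scheme. The `errors` list records the reconstruction error after each sweep and can be used to inspect convergence speed, the quantity the study compares across optimizers.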
Affiliation(s)
- Andrej Čopar
  - Faculty of Computer and Information Science, University of Ljubljana, Ljubljana, Slovenia
- Blaž Zupan
  - Faculty of Computer and Information Science, University of Ljubljana, Ljubljana, Slovenia
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, United States of America
- Marinka Zitnik
  - Faculty of Computer and Information Science, University of Ljubljana, Ljubljana, Slovenia
  - Department of Computer Science, Stanford University, Stanford, CA, United States of America

12
Zhang A, Li A, He J, Wang M. LSCDFS-MKL: A multiple kernel based method for lung squamous cell carcinomas disease-free survival prediction with pathological and genomic data. J Biomed Inform 2019; 94:103194. [PMID: 31048071 DOI: 10.1016/j.jbi.2019.103194] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 10/09/2018] [Revised: 04/14/2019] [Accepted: 04/29/2019] [Indexed: 11/18/2022]
Abstract
Lung squamous cell carcinoma (SCC) is a fatal disease in both males and females, for which current treatments are inadequate. Surgical resection is regarded as the cornerstone of treatment for patients with lung SCC, but even among patients at the same stage, a wide spectrum of disease-free survival (DFS) times exists. Therefore, improving the DFS prediction performance for lung SCC has become a major research area. In this study, we proposed a novel method called LSCDFS-MKL, which is based on multiple kernel learning, to predict DFS of lung SCC. In LSCDFS-MKL, we first efficiently integrated pathological images and genomic data (copy number aberration, gene expression, protein expression) from lung SCC. The results of LSCDFS-MKL across different types of data show that the features extracted from pathological images play an important role in DFS prediction of lung SCC. Then we compared our method LSCDFS-MKL with other existing methods, and performance analysis indicates that LSCDFS-MKL has a significantly better performance than other prediction methods. After that, we applied the proposed method to different stage strata, and the performance demonstrates that LSCDFS-MKL remains efficient in DFS prediction of lung SCC patients. Finally, we applied LSCDFS-MKL to an independent validation dataset, and the accuracy of DFS prediction reaches 100%, which is promising.
Affiliation(s)
- Aoshuang Zhang
- School of Information Science and Technology, University of Science and Technology of China, 443 Huangshan Road, Hefei 230027, China.
| | - Ao Li
- School of Information Science and Technology, University of Science and Technology of China, 443 Huangshan Road, Hefei 230027, China; Research Centers for Biomedical Engineering, University of Science and Technology of China, 443 Huangshan Road, Hefei 230027, China.
| | - Jie He
- Department of Pathology, The First Affiliated Hospital of University of Science and Technology of China, Hefei 230031, China; Department of Pathology, Anhui Provincial Cancer Hospital, Hefei 230031, China.
| | - Minghui Wang
- School of Information Science and Technology, University of Science and Technology of China, 443 Huangshan Road, Hefei 230027, China; Research Centers for Biomedical Engineering, University of Science and Technology of China, 443 Huangshan Road, Hefei 230027, China.
| |
|
13
|
Malod-Dognin N, Petschnigg J, Windels SFL, Povh J, Hemingway H, Ketteler R, Pržulj N. Towards a data-integrated cell. Nat Commun 2019; 10:805. [PMID: 30778056 PMCID: PMC6379402 DOI: 10.1038/s41467-019-08797-8] [Citation(s) in RCA: 26] [Impact Index Per Article: 5.2] [Received: 09/05/2018] [Revised: 01/18/2019] [Accepted: 01/25/2019] [Indexed: 01/01/2023] Open
Abstract
We are increasingly accumulating molecular data about a cell. The challenge is how to integrate them within a unified conceptual and computational framework enabling new discoveries. Hence, we propose a novel, data-driven concept of an integrated cell, iCell. Also, we introduce a computational prototype of an iCell, which integrates three omics, tissue-specific molecular interaction network types. We construct iCells of four cancers and the corresponding tissue controls and identify the most rewired genes in cancer. Many of them are of unknown function and cannot be identified as different in cancer in any specific molecular network. We biologically validate that they have a role in cancer by knockdown experiments followed by cell viability assays. We find additional support through Kaplan-Meier survival curves of thousands of patients. Finally, we extend this analysis to uncover pan-cancer genes. Our methodology is universal and enables integrative comparisons of diverse omics data over cells and tissues.
Affiliation(s)
- Noël Malod-Dognin
- Department of Computer Science, University College London, London, WC1E 6BT, UK
- Department of Life Science, Barcelona Supercomputing Center (BSC), Barcelona, 08034, Spain
| | - Julia Petschnigg
- Department of Computer Science, University College London, London, WC1E 6BT, UK
| | - Sam F L Windels
- Department of Computer Science, University College London, London, WC1E 6BT, UK
| | - Janez Povh
- Faculty of Mechanical Engineering, University of Ljubljana, Ljubljana, 1000, Slovenia
| | - Harry Hemingway
- Health Data Research UK London, University College London, London, WC1E 6BT, UK
- Institute of Health Informatics, University College London, London, WC1E 6BT, UK
- The National Institute for Health Research University College London Hospitals Biomedical Research Centre, University College London, London, W1T 7DN, UK
| | - Robin Ketteler
- MRC Laboratory for Molecular Cell Biology, University College London, London, WC1E 6BT, UK
| | - Nataša Pržulj
- Department of Computer Science, University College London, London, WC1E 6BT, UK
- Department of Life Science, Barcelona Supercomputing Center (BSC), Barcelona, 08034, Spain
- ICREA, Pg. Lluís Companys 23, 08010, Barcelona, Spain
| |
|